A chosen excerpt from a highly instructive article about Bitcoin.
Bitcoin has entered exceptionally dangerous waters. Previous crises, like the bankruptcy of Mt Gox, were all to do with the services and companies that sprung up around the ecosystem. But this one is different: it is a crisis of the core system, the block chain itself. More fundamentally, it is a crisis that reflects deep philosophical differences in how people view the world: either as one that should be ruled by a “consensus of experts”, or through ordinary people picking whatever policies make sense to them.
Source: The resolution of the Bitcoin experiment — Medium
The fundamental question raised by the article’s author is relevant in many contexts beyond Bitcoin. It is well worth reflecting on.
It is clearly very hard to eliminate infrastructure outages, even for some of the biggest players in the industry. However, we are heading into an era where Cloud infrastructure may be ‘too big to fail’. Are companies going to make sure they are ready for this? Ultimately, it is an issue of the economic value of risk. Those with sound risk management practices in place have less to fear, though I am not sure many do.
The Cloud is becoming so essential to so many companies that there comes a point where a provider’s infrastructure outage could create serious liabilities. Every few months now, a large Cloud provider experiences a technical incident that takes down many popular startup websites for several hours. These are not obscure amateur providers; we are talking about Amazon, Microsoft and Google, the biggest players in this game. Such outages used to be the lot of Facebook or Twitter. Those companies seem to have remarkably improved their infrastructure availability, and now it is the turn of smaller startups, by way of their cloud providers.
It’s obviously very hard, if not impossible, to eliminate outages completely, but what surprises me is how long these outages take to recover from, given that the infrastructure serves hundreds of companies (more, if you consider the ripple effects).
A naive way to look at it would be to imagine that cloud providers run a specially crafted test lab that continually plays out failure scenarios, teaches the operations teams how to detect them, and hopefully leads to remediations being put in place before those failures are ever experienced in real life. This may sound costly, but it wouldn’t be for companies like Amazon or Google. Perhaps they actually do something like this. Yet in this year alone, every time such a Cloud incident has occurred and been fully investigated, it turned out that the root cause could have been anticipated, if not prevented. Admittedly, it’s very hard to stress-test and crash-test a large server cluster, but these companies have the resources and know-how to model incident scenarios and run simulations. It may be that their growth rate far outstrips the rate of serious infrastructure incidents, making incident prevention a lower priority for providers. I wonder, then: should it be up to the users to plan for and protect themselves against such incidents?
I don’t want to oversimplify, but I imagine it is economical for those with a high stake in the game to set up safety harnesses. The issue at hand is really the economic value of risk: easily determined for a business that trades by the hour, not so trivial for companies that make no money but are valued on the user traffic they attract. Those with sound risk management practices in place have less to fear, though I am not sure many startups have them.
If a company’s valuation is determined by the traffic it generates, with no associated monetary transactions, then an infrastructure outage (one that can be blamed on someone else) may not have such a high economic impact. However, online advertising is a big source of income for many startups, and some sell goods and services online. For these companies an untimely outage means fewer visitors and therefore missed income, so it may be critical for them to put in place some form of safety harness against cloud outages.
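To make “the economic value of risk” a little more concrete for a business that trades by the hour, here is a minimal Python sketch. It assumes that downtime cost grows linearly with outage hours, and the names (`OutageRisk`, `harness_pays_off`) and all figures are hypothetical, chosen purely for illustration rather than drawn from any provider’s data.

```python
from dataclasses import dataclass


@dataclass
class OutageRisk:
    """Hypothetical model: downtime cost grows linearly with outage hours."""
    outages_per_year: float   # assumed expected number of provider outages per year
    hours_per_outage: float   # assumed mean time to recover, in hours
    revenue_per_hour: float   # income lost per hour the site is down

    def expected_annual_loss(self) -> float:
        # Expected yearly loss = frequency x duration x hourly revenue.
        return self.outages_per_year * self.hours_per_outage * self.revenue_per_hour


def harness_pays_off(risk: OutageRisk, harness_cost_per_year: float,
                     residual_fraction: float) -> bool:
    """Return True if the loss a safety harness avoids exceeds what it costs.

    residual_fraction is the share of downtime still suffered with the
    harness in place (0.0 = every outage avoided, 1.0 = no benefit).
    """
    if not 0.0 <= residual_fraction <= 1.0:
        raise ValueError("residual_fraction must be between 0 and 1")
    avoided_loss = risk.expected_annual_loss() * (1.0 - residual_fraction)
    return avoided_loss > harness_cost_per_year


if __name__ == "__main__":
    # Illustrative figures only.
    shop = OutageRisk(outages_per_year=4, hours_per_outage=3, revenue_per_hour=1000)
    print(shop.expected_annual_loss())         # 12000
    print(harness_pays_off(shop, 5000, 0.25))  # True
```

With these made-up figures, a harness costing 5,000 a year that removes three quarters of the downtime avoids 9,000 in expected losses, so it pays for itself. Set `revenue_per_hour` to zero, as for a company valued purely on traffic, and no harness with a positive cost passes the test, which is the distinction drawn above.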