The Cloud is becoming so essential to so many companies that a provider’s infrastructure outage can now create serious liabilities. Every few months, a large Cloud provider experiences a technical incident that takes down many popular startup web sites for several hours. These are not obscure amateur providers; we are talking about Amazon, Microsoft and Google, the biggest players in this game. Such outages used to be the lot of Facebook or Twitter. Those companies seem to have remarkably improved their infrastructure availability, and now it is the turn of smaller startups, by way of their cloud providers.
It’s obviously very hard, if not impossible, to completely eliminate outages. What surprises me is how long these outages take to recover from, given that the infrastructure serves hundreds of companies (if you consider the ripple effects).
A naive way to look at it would be to imagine that cloud providers run a specially crafted test lab that continually exercises failure scenarios, teaches the operations teams how to detect them, and hopefully leads to remediations being put in place before the failures are ever experienced in real life. This may sound costly, but it wouldn’t be for companies like Amazon or Google. Perhaps they actually do something like this. This year alone, every time such a Cloud incident has occurred and been fully investigated, it turned out that the root cause could have been anticipated, if not prevented. Admittedly, it’s very hard to stress and crash test a large server cluster, but these companies have the resources and know-how to model incident scenarios and run simulations. It may be that growth is so much faster than the rate of serious infrastructure incidents that providers treat incident prevention as a lower priority. I wonder, then: should it be up to the users to plan for and protect themselves against such incidents?
I don’t want to oversimplify, but I imagine it would be economical for those with a high stake in the game to set up safety harnesses. The issue at hand is really the economic value of risk: easily determined for a business that trades by the hour, not so trivial for companies that make no money but are valued based on the user traffic they get. Those with sound risk management practices in place have less to fear; I am not sure many startups do, though.
If a company’s valuation is determined by the traffic it generates, with no associated monetary transaction, then an infrastructure outage (that can be blamed on someone else) may not have such a high economic impact. However, online advertising is a big source of income for many startups, and some sell goods and services online. For these companies, an untimely outage means less visitor traffic, which means missed income, so it may be critical to put in place some form of cloud outage safety harness.
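To make the "economic value of risk" a little more concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is a made-up assumption for illustration (hourly revenue, outage frequency, outage duration, harness cost, and the fraction of loss a harness would avoid), and the function names are hypothetical. It only shows the shape of the comparison a company that trades by the hour could make, not a real risk model.

```python
def expected_annual_outage_loss(revenue_per_hour, outages_per_year, hours_per_outage):
    """Expected yearly income missed because of provider outages.

    All three inputs are estimates the business has to supply itself.
    """
    return revenue_per_hour * outages_per_year * hours_per_outage


def harness_pays_off(expected_loss, harness_cost_per_year, loss_avoided_fraction):
    """True if the income a safety harness is assumed to save exceeds its yearly cost.

    loss_avoided_fraction is an assumption between 0 and 1: the share of
    outage losses the harness (for example, a standby at a second provider)
    would actually prevent.
    """
    return expected_loss * loss_avoided_fraction > harness_cost_per_year


if __name__ == "__main__":
    # Hypothetical startup: 2000 per hour in ad and sales income,
    # three provider outages a year, five hours each.
    loss = expected_annual_outage_loss(2000.0, 3, 5.0)
    print("Expected annual outage loss:", loss)
    # Hypothetical harness costing 12000 a year that avoids 80% of the loss.
    print("Harness worth it:", harness_pays_off(loss, 12000.0, 0.8))
```

For a company valued purely on traffic, `revenue_per_hour` is close to zero and the comparison rarely favors the harness, which is the point made above.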