Complexity, Failure, Systemic Risk, and Collapse

Our modern world is too complex to understand and run from the center.

One morning last month, many denizens of the Internet rose from their beds, padded to their PCs in their pajamas, and were surprised to find that many of their favorite online haunts simply weren't there.  From Reddit to patient-monitoring systems, every website running on one section of Amazon's Cloud Services had vanished.

And stayed vanished - in some cases, for days.

In the event, nothing really vital seems to have been harmed, except the pocketbooks and careers of companies and professionals who depended on Amazon.  After considerable effort, the cloud's back up and running for now.

In Amazon's very detailed public mea culpa lies a very frightening landmine of truth:

We will be making a number of changes... We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures.  [emphasis added]

Amazon's server business did not grow on a tree.  It was built from scratch by highly trained engineers, using hardware designed by other highly trained engineers and with software written by experienced coders.

Its purpose was to guarantee that Amazon could respond promptly to customers even during the frantic pre-Christmas buying season.  It not only handled Amazon's e-ordering peaks, it shrugged off an attack by "Anonymous" with hardly a blip.

The software, hardware, and system as a whole were tested thoroughly before Amazon decided to sell extra capacity to the public.  It's run quite well since then, with continued research, testing, and improvement.

Nevertheless, despite all the effort and intelligence that went into it, Cloud Services failed catastrophically in a totally unpredicted way.  Oh, Amazon's learned a lesson - "We now understand..." - but their system's complexity exceeded their ability to understand it even though they built it in the first place.

As Ars Technica points out:

The proof of the pudding is in the eating—the company won't know for certain if the problem is solved unless it suffers a similar failure in the future, and even if this particular problem is solved there may well be similar issues lying latent.

Put another way: not only does Amazon not know if they've properly fixed things, they cannot know it.  Their cloud is just too complex.

Our Complex, Fragile World

Modern technology has immeasurably improved human lives, but it's created new problems that we are totally unfamiliar with and, in many cases, may not even realize exist - and its very complexity is possibly the most threatening.

Until a hundred years ago, there were very few non-astronomical disasters that could affect the entire world, and none of them happened without warning.  A military genius like Genghis Khan might conquer a large chunk of the globe, but it would take many years with ample warning of his coming every step of the way.  Others might ignore the impending threat or be overconfident, but the paranoid at least had a chance to run away.

Likewise, the Black Plague spread across Eurasia and there was nothing the medicine of the day could do to stop it, but those with resources did have warning and the ability to flee or to isolate themselves.

There certainly were unforeseen disasters, but they were all limited in scope.  The Lisbon earthquake and Great London Fire destroyed entire cities in moments, but the rest of each country was mostly undamaged, much less the world.

Even in the early technological era, the reach of any one disaster wasn't too great.  A railway bridge collapse could cut off a town for a few weeks, or a failed telegraph cable disconnect Europe and America from instant communications, but there were other ways around.  Famines were purely local and were made less severe with improved transportation and better farming technology.

Today, however, "the world is flat" and everything is interconnected.  The American housing bubble spread economic havoc over the entire world.  Nobody knows why it happened, so there's no guarantee that the recent changes in laws and regulations will do any good at all.

Save for the most backward places, there's no such thing as famine anymore in the sense of food not being available.  Instead, we have an arguably worse problem: when food runs short due to bad weather in Russia or Americans turning too much corn into gasoline, food prices rise everywhere.  All the world's poor are priced out of eating at the same time.

There is a direct relationship between rising food prices and political revolutions; the more widespread the increases, the more revolts there will be.  Where once a famine might have led to revolution only in the immediate country, modern transport systems spread the problem across an entire region.

Alas, we are totally dependent upon our technology.  New England stood still for days in the Northeast Blackout of 2003; a century ago this wouldn't have been possible since the various city grids weren't connected.  Good news: plans are in place to tie the national grid closer together, so we can take down the whole country all at once.

Grids and interconnected networks appear all over the place where you'd never expect them.  The recent Japanese earthquake disasters wreaked havoc on Toyota and Honda's manufacturing supply chain.  No surprise there; they're Japanese companies.

Time for American car makers to rake in the dough, right?  Nope: GM had to shut down American plants because they buy parts from Japan, and GM can't make American cars without Japanese parts.

As the world ties closer and closer together, we become more vulnerable to failures on the other side of the globe that we can't control or even see.

In past times, there were potential disasters that could destroy an individual, town, or country, but at least people knew what they were and could pray to their God for protection from famine, pestilence, or whatever.  Now, totally unimagined technological failures can foul up or, conceivably, take down our entire global society.  Our technology is so complicated, so interconnected, and so hidden that we don't even know what to pray for protection from.

We'll have to upgrade the traditional Scottish prayer:

From ghoulies and ghosties
And long-leggedy beasties
And things that go glitch in the night,
Good Lord, deliver us!

Petrarch is a contributing editor for Scragged.
Reader Comments

I agree that the world is far more complex and integrated than ever before, but I think you're cherry-picking some of the facts to reach your conclusion.

You said:

"The Lisbon earthquake and Great London Fire destroyed entire cities in moments, but the rest of each country was mostly undamaged, much less the world"

Why not point out that entire cities no longer burn to the ground precisely because of technology and integrated systems? Nowadays, it would be impossible for something like the Great Chicago Fire to spread across an entire city because our irrigation and communication systems are far too good. If the fire was REALLY bad, one whole block might go down but nothing more.

You said:

"New England stood still for days in the Northeast Blackout of 2003; a century ago this wouldn't have been possible since the various city grids weren't connected."

Sure, but portions of New England *didn't have* any power a hundred years ago and the power that did exist was nothing like the durable, clean power that exists today. Americans forget what power was like before the interconnected/regulated grid system came about. Power went up and down all the time. Extra demand could not be met at a moment's notice and there was far less capacity to route. Brown-outs were much more commonplace.

You said:

"As the world ties closer and closer together, we become more vulnerable to failures on the other side of the globe that we can't control or even see."

Perhaps, but those same failures have become much more predictable such that we can avoid them where before no one could have even seen them coming. Furthermore, an integrated world often means multiple vendors so that the failure can be quickly routed around with a few adjustments.

We can now rely on things "just being there" that never existed in the past. Things like international delivery service, for instance. Could a problem at a Chinese airport affect FedEx deliveries in the US? Sure, but 30 years ago you had NO OPTION AT ALL to get something overnighted from halfway around the world for just a few bucks.

You said:

"The recent Japanese earthquake disasters wreaked havoc on Toyota and Honda's manufacturing supply chain. No surprise there; they're Japanese companies."

The problems in Japan are a sign that an integrated world is BETTER, not worse. 10 or 20 years ago, an earthquake near their factories would have WIPED OUT all production entirely, possibly for years until they rebuilt. Now that they have factories and supply chains in other countries/regions, they can quickly route around the problems back home and keep production moving. An interconnected world is precisely why Japan can recover as quickly as it has. Toyota employs 200k people in the US, most of whom have been building cars same as ever during the cleanup.

It's a mistake to treat the occasional problem caused by integration as a sign that systems would be better off non-integrated.

May 27, 2011 9:42 AM

It appears to me that both lfon and the article have very strong points. lfon correctly points out that many local issues can be solved far more quickly and effectively than they could have been before the world became so interconnected. A worldwide food market does increase the chance of a global issue but markedly decreases the chance of local issues, at least within the industrialized world.

I do not wish to speak for the article author, but it appears to me that the article is not arguing that globalization is entirely bad without any benefits. It argues simply that globalization creates, as never before, a situation in which all of humanity faces a problem at once instead of small populations facing it individually.

As with so many other things, the bad comes with the good. To me, this really just points out the urgent need for humanity to establish multiple self-sustaining colonies off world that would, once again, create distinct areas that would not be affected, or at least not quite so severely, by a global catastrophic event. All of humanity's proverbial eggs are once again in one basket and I for one am decidedly not a fan of that.

May 28, 2011 10:20 PM

Well said, jonyfries! I really haven't anything to add to your observation or your conclusion.

May 29, 2011 12:32 AM

There are problems with over-complex systems but the argument in this article is undermined by a number of false premises about the Amazon failure.

1.) Amazon does not understand their systems. ("[T]heir system's complexity exceeded their ability to understand it even though they built it in the first place.")

If you read through Amazon's remarkably open post-mortem it's clear that they were able to get the system back up and running with minimal damage to user data because they *did* understand how it worked. Not only that but they were pretty clear about how to fix the problems. We can be 100% certain that Amazon will have failures in the future but they will most likely be different from this one because the root causes will be fixed.

2.) The system failed catastrophically. ("Nevertheless, despite all the effort and intelligence that went into it, Cloud Services failed catastrophically in a totally unpredicted way.")

This was a storage failure. You can quibble about the meaning of the word "catastrophic" but for most engineers who work with storage technology it means your data go away and don't come back. In fact the scope of the failure was limited in two important ways by the system design.

a.) It was largely restricted to a single location, known as an "Availability Zone." Amazon is pretty open about the fact that applications that use a single location are vulnerable to exactly the sort of outage that occurred. Applications that followed Amazon recommendations for switching processing to another location had minimal problems.

b.) In the failed location the storage volumes chose to become unavailable rather than risk data loss. This is a standard design trade-off that prevented data corruption. It's made on the assumption that users would rather trade being down for a while against losing all their data. As far as I can tell from the post-mortem this strategy was largely successful.

3.) We can reason from Amazon's failure to broader lessons about the dangers of technology to society at large. Well maybe, and then only with great care. lfon addressed the centralization problem nicely but I would like to take issue with the idea that complex systems are necessarily bad.

Modern passenger aircraft are extremely complex machines that occasionally fail in surprising ways that result in loss of life. Even the Boeing 777, an otherwise incredibly safe aircraft, has suffered crashes due to conditions that the designers did not anticipate, such as the Heathrow crash of BA flight 38 due to icing in the fuel lines. We still have no explanation for the disappearance of MH370, which presumably resulted in the death of everyone on board.

Despite these problems people still board flights on a regular basis. In the US alone this happened over 700M times in 2016 according to the US Bureau of Transportation Statistics.

So if we want to complain about complexity we also need to answer the question: what's the alternative? In the case of aircraft it means people simply wouldn't go to as many places as they do today or as safely (e.g., because they drive cars instead). Or we could choose to have aircraft that burn more fuel, crash more often, or carry fewer people. Aircraft are complex in part because they solve these problems.

The same question is relevant for Amazon: what's the alternative? Yes, it has occasional failures but without Amazon many of the businesses that people complained about losing would not exist at all. And don't get me started on running things in your own data center. Most corporate IT departments suffer far greater failure rates than Amazon does. That's one reason why Amazon is so popular.

May 27, 2017 4:24 PM