Complexity, Failure, Systemic Risk, and Collapse

Our modern world is too complex to understand and run from the center.

By Petrarch  |  May 27, 2011

One morning last month, many denizens of the Internet rose from their beds, padded to their PCs in their pajamas, and were surprised to find that many of their favorite online haunts simply weren't there.  From Reddit to patient-monitoring systems, every website running on one section of Amazon's Cloud Services had vanished.

And stayed vanished - in some cases, for days.

In the event, nothing really vital seems to have been harmed, except the pocketbooks and careers of companies and professionals who depended on Amazon.  After considerable effort, the cloud's back up and running for now.

In Amazon's detailed public mea culpa lies a frightening landmine of truth:

We will be making a number of changes... We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures.  [emphasis added]

Amazon's server business did not grow on a tree.  It was built from scratch by highly trained engineers, using hardware designed by other highly trained engineers and with software written by experienced coders.

Its purpose was to guarantee that Amazon could respond promptly to customers even during the frantic pre-Christmas buying season.  It not only handled Amazon's e-ordering peaks, it shrugged off an attack by "Anonymous" with hardly a blip.

The software, the hardware, and the system as a whole were tested thoroughly before Amazon decided to sell extra capacity to the public.  It has run quite well since then, with continued research, testing, and improvement.

Nevertheless, despite all the effort and intelligence that went into it, Cloud Services failed catastrophically in a totally unpredicted way.  Oh, Amazon's learned a lesson - "We now understand..." - but their system's complexity exceeded their ability to understand it even though they built it in the first place.

As Ars Technica points out:

The proof of the pudding is in the eating—the company won't know for certain if the problem is solved unless it suffers a similar failure in the future, and even if this particular problem is solved there may well be similar issues lying latent.

Put another way: not only does Amazon not know if they've properly fixed things, they cannot know it.  Their cloud is just too complex.

Our Complex, Fragile World

Modern technology has immeasurably improved human lives, but it's created new problems that we are totally unfamiliar with and, in many cases, may not even realize exist - and its very complexity is possibly the most threatening.

Until about a hundred years ago, there were very few non-astronomical disasters that could affect the entire world, and none of them happened without warning.  A military genius like Genghis Khan might conquer a large chunk of the globe, but it would take many years, with ample warning of his coming every step of the way.  Others might ignore the impending threat or be overconfident, but the paranoid at least had a chance to run away.

Likewise, the Black Plague spread across Eurasia and there was nothing the medicine of the day could do to stop it, but those with resources did have warning and the ability to flee or to isolate themselves.

There certainly were unforeseen disasters, but they were all limited in scope.  The Lisbon earthquake and the Great Fire of London destroyed entire cities with little or no warning, but the rest of each country, let alone the world, was mostly undamaged.

Even in the early technological era, the reach of any one disaster wasn't too great.  A railway bridge collapse could cut off a town for a few weeks, or a failed telegraph cable disconnect Europe and America from instant communications, but there were other ways around.  Famines were purely local and were made less severe with improved transportation and better farming technology.

Today, however, "the world is flat" and everything is interconnected.  The American housing bubble spread economic havoc over the entire world.  Nobody knows why it happened, so there's no guarantee that the recent changes in laws and regulations will do any good at all.

Save for the most backward places, there's no such thing as famine anymore in the sense of food not being available.  Instead, we have an arguably worse problem: when food runs short due to bad weather in Russia or Americans turning too much corn into fuel, food prices rise everywhere.  All the world's poor are priced out of eating at the same time.

There is a direct relationship between rising food prices and political revolutions; the more widespread the increases, the more revolts there will be.  Where once a famine might have led to revolution only in the immediate country, modern transport systems spread the problem across an entire region.

Alas, we are totally dependent upon our technology.  Much of the Northeast stood still in the Northeast Blackout of 2003, in some places for days; a century ago this wouldn't have been possible since the various city grids weren't connected.  Good news: plans are in place to tie the national grid closer together, so we can take down the whole country all at once.

Grids and interconnected networks appear all over the place where you'd never expect them.  The recent Japanese earthquake disasters wreaked havoc on Toyota's and Honda's manufacturing supply chains.  No surprise there; they're Japanese companies.

Time for American car makers to rake in the dough, right?  Nope: GM had to shut down American plants because they buy parts from Japan, and GM can't make American cars without Japanese parts.

As the world ties closer and closer together, we become more vulnerable to failures on the other side of the globe that we can't control or even see.

In past times, there were potential disasters that could destroy an individual, town, or country, but at least people knew what they were and could pray to their God for protection from famine, pestilence, or whatever.  Now, totally unimagined technological failures can foul up or, conceivably, take down our entire global society.  Our technology is so complicated, so interconnected, and so hidden that we don't even know what to pray for protection from.

We'll have to upgrade the traditional Scottish prayer:

From ghoulies and ghosties
And long-leggedy beasties
And things that go glitch in the night,
Good Lord, deliver us!