Amazon mega-outage caused by single command-line error

This week’s great Amazon Web Services (AWS) S3 outage, which kiboshed up to 150,000 websites including Netflix, Spotify, Pinterest and Buzzfeed, was caused by an engineer mistyping a single command, the company has admitted.

Diving into Amazon’s mea culpa in more detail, it seems the engineer was trying to temporarily take down servers used by the S3 billing subsystem when a command-line mishap removed rather more capacity than intended, causing a cascading problem that downed two critical subsystems.

The first was the index subsystem, which manages metadata for objects in the US-East-1 region and is needed to serve customers’ GET, LIST, PUT, and DELETE requests. The other, which depends on the index working correctly, was the placement subsystem used to provision new storage. Explained Amazon:

Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests.

To put it mildly. Doubtless, the engineer or engineers must have broken into a cold sweat as Amazon’s Lego of services started to topple over, taking with it new Amazon Elastic Compute Cloud (EC2) instance launches, the S3 console, Amazon Elastic Block Store (EBS) volumes, AWS Lambda and even the S3 API itself.
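
For anyone hazy on what those GET, LIST, PUT and DELETE requests actually look like, here’s a minimal sketch from the customer’s side, written in Python with the boto3 SDK; the language choice and the bucket and object names are ours, purely illustrative, and during the outage calls like these simply failed while the index subsystem restarted.

```python
# A minimal sketch of the four S3 request types served via the index
# subsystem. The bucket and object names are hypothetical, purely for
# illustration.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
BUCKET = "example-bucket"

# PUT: store an object (its metadata is tracked by the index subsystem)
s3.put_object(Bucket=BUCKET, Key="hello.txt", Body=b"hello world")

# GET: read the object back
body = s3.get_object(Bucket=BUCKET, Key="hello.txt")["Body"].read()

# LIST: enumerate objects in the bucket
for obj in s3.list_objects_v2(Bucket=BUCKET).get("Contents", []):
    print(obj["Key"], obj["Size"])

# DELETE: remove the object
s3.delete_object(Bucket=BUCKET, Key="hello.txt")
```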

If this is starting to sound a tad dry, stay with us, because there is an interesting admission buried inside Amazon’s apology that is worthy of a figurative highlighter pen.

S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact.

All well and good, but Amazon then coughs up the tidbit that the two critical subsystems inadvertently taken offline this week hadn’t been “completely restarted” in years. Why? In the rush of S3’s “massive growth”, apparently, there just wasn’t a convenient moment.

One might speculate that if you never take servers out of service, the full effects of doing so might not be apparent. Or perhaps they were all too apparent but nobody dared touch them. Either way, bringing them back up proved more time-consuming than expected, stretching from 9:37AM PST to 1:54PM PST if you count all affected services.

Amazon said it will make changes to “refactor parts of the service into smaller cells to reduce blast radius” (translation: stop a single mistake taking out quite so much at once). But why did Amazon decide to document its cock-up in such gory detail?
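
We’ll come back to that question in a moment. Amazon’s note doesn’t spell out what a “cell” looks like in practice, but the gist is to partition capacity so that any one command, or one mistake, can only touch a slice of it. Here’s a hypothetical sketch of the idea; the hash-based routing and the cell count are our illustration, not Amazon’s design:

```python
# A hypothetical sketch of cell-based partitioning to limit blast radius.
# The cell count and hash-based routing are made up for this illustration.
import hashlib

NUM_CELLS = 8

def cell_for(customer_id: str) -> int:
    """Deterministically pin each customer to one cell."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return int(digest, 16) % NUM_CELLS

# If an errant command takes down cell 3, only the customers routed to
# cell 3 (roughly an eighth of them) feel it, not the entire region.
for customer in ["netflix", "spotify", "pinterest", "buzzfeed"]:
    print(customer, "-> cell", cell_for(customer))
```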

Providers have become adept at apologising when things go awry without offering much more. The sands are now shifting. In an age when DDoS, hacking, sabotage and nation-state attacks are front of mind in so many disruptions, it’s become reassuring to be told that old-fashioned human error lies behind something big.

Amazon, then, looks a bit less like a faceless, humming warehouse-cum-datacentre and rather more fallible and human, like the rest of us. We also now realise in no uncertain terms how staggeringly important it is to thousands of business customers and the muggle millions who lie beyond them.