This week’s great Amazon Web Services (AWS) S3 outage, which kiboshed up to 150,000 websites including Netflix, Spotify, Pinterest and Buzzfeed, was caused by an engineer mistyping a single command, the company has admitted.
Diving into Amazon’s mea culpa in more detail, it seems the engineer was trying to temporarily take a small number of S3 billing sub-system servers offline when the command-line mishap removed far more capacity than intended, causing a cascading failure that downed two critical subsystems.
The first was the index subsystem, which handles metadata for the US-East-1 region and serves customers’ GET, LIST, PUT, and DELETE requests. The other, which depended on the first working correctly, was the placement subsystem used to provision new storage. Explained Amazon:
Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests.
To put it mildly. The engineer or engineers must surely have broken into a cold sweat as Amazon’s Lego tower of services started to topple, taking with it Amazon Elastic Compute Cloud (EC2), the S3 console, Amazon Elastic Block Store (EBS) volumes, AWS Lambda and even the S3 API.
If this is starting to sound a tad dry, stay with us, because there is an interesting admission buried inside Amazon’s apology that is worthy of a figurative highlighter pen.
S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact.
All well and good, but Amazon then coughs up the tidbit that it had never before “completely re-started” the two critical subsystems inadvertently knocked offline this week. Why? In the rush of S3’s “massive growth”, apparently, there just wasn’t a convenient moment.
One might speculate that if you never take servers out of service, the full effects of doing so might not be apparent. Or perhaps they were all too apparent, but nobody dared touch them. Either way, bringing them back up took longer than expected: from 9:37AM PST to 1:54PM PST, if you count all affected services.
Amazon said it will make changes to “refactor parts of the service into smaller cells to reduce blast radius” (translation: limit how much breaks when something does go wrong), but why did Amazon decide to document its cock-up in such gory detail?
Providers have become adept at apologising when things go awry without offering much more. The sands are now shifting. In an age when DDoS, hacking, sabotage and nation states are front of mind in so many disruptions, it’s become reassuring to be told that old-fashioned human error lies behind something big.
Amazon, then, looks a bit less like a faceless, humming warehouse-cum-datacentre and rather more fallible and human, like the rest of us. We also now realise, in no uncertain terms, that S3 is staggeringly important to thousands of business customers and the muggle millions who lie beyond them.
I’m sure others have similar tales. When I was a neophyte UNIX operator, the system had slowed because a log file had filled the drive. I figured I could just rm the file and it would be recreated. I didn’t know about /dev/null. Fortunately we had tech support for the system; at the time support logged in through a modem, and they recreated the file. I have a couple more stories like that. I like to think that I can count them on one hand.
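For anyone who ends up in the same boat: rm’ing a log file that a process still has open doesn’t actually free the space, because the blocks stay allocated until the process lets go of the file. The trick is to empty the file in place instead. A minimal sketch, with a hypothetical path standing in for the real log:

: > /var/log/app.log               # shell redirection truncates the open file to zero bytes
cat /dev/null > /var/log/app.log   # the /dev/null route the commenter had in mind, same effect
truncate -s 0 /var/log/app.log     # coreutils' explicit way of saying the same thing

The running process keeps its file handle and carries on logging, and the disk space really is released.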
I give Amazon a lot of credit for coming forward so quickly with an explanation that appears to be plausible.
I’m surprised that nobody asked what the ‘typo’ actually was. Anyone who works on this stuff likes to know what happened so they don’t do it themselves… Remember the admin whose script did something like “rm -f $A/$B” when both A and B were null? So inquiring minds want to know…. :O
We wrote about that “rm” SNAFU. It turned out to be a hoax: modern Linux versions of rm assume that “rm -r /” on its own is a mistake and refuse to run it unless you also add “--no-preserve-root”.
And you would need to run that command with elevated privileges; otherwise you would get a pile of permission-denied errors on the majority of system files and directories you don’t own.
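If you want to confirm that failsafe on your own box without betting a filesystem on it, a harmless check (assuming GNU coreutils, which is what most Linux distributions ship) is to look for the override flag in rm’s own help text:

rm --help | grep no-preserve-root   # GNU rm documents the override flag in its help output
rm --version                        # recent coreutils builds have --preserve-root switched on by default

If the flag shows up in the help, your rm has the guard built in.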
I do know of one admin who typed something like this: rm -r . folder_to_delete/
The command was run on a system without backups, and it deleted far more than intended. It didn’t wipe out the whole system, but it might as well have: everything that mattered on that system lived under that “.” directory. It created quite the headache for the company.
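The killer detail there is the stray space: “rm -r . folder_to_delete/” hands rm two separate arguments, “.” and “folder_to_delete/”, so it recurses through the entire current directory as well as the folder the admin actually meant. A few defensive habits, sketched with the same placeholder folder name:

ls -d ./folder_to_delete/      # confirm the single target you actually intend to remove
rm -r -- ./folder_to_delete/   # the ./ prefix makes the target unambiguous; -- stops a name beginning with "-" being read as an option
rm -ri ./folder_to_delete/     # -i prompts before each removal: slower, but much harder to fat-finger

Recent versions of GNU rm also refuse to operate on “.” outright, but that’s a backstop, not something to lean on.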