SEATTLE – Amazon.com’s cloud computing unit said that the outage that shook up a sizable part of the internet Tuesday was caused by human error.
The Amazon Web Services division said in a post-mortem published on its website Thursday that its team was working to fix a problem that slowed down the billing system for S3, a widely used AWS service.
Through S3, companies and individuals can store their data on Amazon’s server farms. S3 also houses the data that underpins a wide array of other AWS services, including some compute functions. It works as a basic building block of Amazon’s cloud, which in turn is a major pillar of the modern internet.
To fix the slowdown, engineers in AWS’ Northern Virginia operation — one of the largest clusters of data centers run by the company — needed to take down a small number of servers.
“Unfortunately,” as AWS put it in its lengthy mea culpa, a technician made a mistake when entering a command, taking out more servers than needed — some of which were critical to the functioning of S3 in the entire region. Thousands of users were affected.
AWS said its system is designed to allow the removal of big chunks of its components “with little or no customer impact.” But the restart took longer than expected, partly because the S3 service has grown enormously since it launched more than a decade ago.
From failure to complete recovery, the outage lasted slightly more than four hours, although other AWS services that accumulated a backlog of work took longer to recover. AWS said the outage was prompting it to make some changes: for example, reducing the amount of server capacity that can be removed at one time.