Oops: An errant keystroke to blame for Amazon’s cloud service disruption this week
In the end, it was human error, a simple keystroke, that knocked hundreds of websites offline on Tuesday.
Those damn humans again. It turns out someone made a keystroke error and removed more servers than intended. The outage of Amazon Simple Storage Service (S3) in the US-EAST-1 region left a large number of websites unusable, and it lasted several hours before the issue was resolved.
“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” Amazon said. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region.”
“Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests.”