The Amazon Simple Storage Service (S3) team was debugging an issue
causing the S3 billing system to progress more slowly than
expected. At 9:37AM PST, an authorized S3 team member using an
established playbook executed a command which was intended to
remove a small number of servers for one of the S3 subsystems that
is used by the S3 billing process. Unfortunately, one of the
inputs to the command was entered incorrectly and a larger set of
servers was removed than intended. The servers that were
inadvertently removed supported two other S3 subsystems.
That’s one hell of a typo.
We are making several changes as a result of this operational
event. While removal of capacity is a key operational practice, in
this instance, the tool used allowed too much capacity to be
removed too quickly. We have modified this tool to remove capacity
more slowly and added safeguards to prevent capacity from being
removed when it will take any subsystem below its minimum required
capacity level. This will prevent an incorrect input from
triggering a similar event in the future.
A lot of system administrator tools are written without the equivalent of guardrails. Think about how much collective damage has been done by mistakes with the rm command alone.
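For a sense of what such a guardrail might look like, here is a minimal sketch in Python of the two safeguards Amazon describes: refusing any removal that would take a subsystem below its minimum required capacity, and removing capacity in small batches rather than all at once. This is not Amazon's tool. Every name here (Subsystem, plan_removal, max_batch, CapacityGuardError) is hypothetical, and the batch size is an arbitrary assumption.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would violate a minimum-capacity limit."""


@dataclass
class Subsystem:
    # Hypothetical model: a named subsystem, the servers it runs on,
    # and the fewest servers it needs to keep operating.
    name: str
    servers: set
    min_required: int


def plan_removal(subsystems, to_remove, max_batch=2):
    """Validate a removal request, then split it into small batches.

    Every subsystem is checked, not just the one the operator meant to
    touch, because a server may support more than one subsystem.
    Nothing is removed here; the caller gets back a plan to execute
    one batch at a time.
    """
    to_remove = set(to_remove)

    for sub in subsystems:
        remaining = len(sub.servers - to_remove)
        if remaining < sub.min_required:
            raise CapacityGuardError(
                f"{sub.name} would drop to {remaining} servers "
                f"(minimum {sub.min_required}); refusing to proceed"
            )

    # Sort for a deterministic plan, then slice into batches so the
    # removal happens slowly instead of all at once.
    ordered = sorted(to_remove)
    return [ordered[i:i + max_batch] for i in range(0, len(ordered), max_batch)]
```

With a check like this, a mistyped input that names too many servers fails before anything is removed, and even a valid request goes out a couple of servers at a time.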
★ Thursday, 2 March 2017