Service Incident Postmortem: Breakdown and Root Cause
By Fred Sadaghiani / 30 Aug 2014
We experienced an outage of our APIs from approximately 11:36PM PDT on August 26th, 2014 to 12:53AM PDT on August 27th, 2014. While we’ve outlined what happened and the impact to our customers, we’d like to detail the root cause, how we fixed the issue, and what we’re doing to ensure it doesn’t happen again.
We made the naive assumption, based on SQS documentation, that a queue would always be available given the Redundant Infrastructure guarantee. More specifically, we assumed that for any logical queue, there were many physical queues across Availability Zones providing the queue’s availability. This evidently is not the case. While our primary event processing queue had been alive for over 2 years and had processed tens of billions of events, on the evening of August 26th it simply vanished and reappeared a few hours later.
While our code was written to handle certain errors from SQS like timeouts and invalid messages, we had no graceful handling of the case when a queue disappeared. We simply failed fast and, unfortunately, went down.
Amazon had this to say about the outage:
Thank you for contacting AWS Support.
The SQS service in US-EAST-1 experienced high error rates for a small percentage of queues during the time you have mentioned. During this time, all Send and Receive calls to an impacted queue would fail. No data on the impacted queues was lost during the event.
The root cause is believed to be a defect in the driver for the network cards that caused excessive network connection errors in rare circumstance. The configuration has been updated on the hosts to prevent the event from recurring.
I can now report that the issue has been successfully resolved and the service is operating normally.
Additionally please find our posting on the AWS Service Health Dashboard – http://status.aws.amazon.com/
I apologize for any inconvenience caused.
Should you require further assistance or have any question, please do not hesitate to contact us.
How we fixed the issue
Our entire infrastructure runs on AWS, and while we’ve had a number of issues with its various services, it’s been a valuable platform for us as we have scaled. We strive to build our systems to be fault-tolerant, being mindful of the failure modes of distributed systems, and we use queues (SQS specifically) to help coordinate state between systems and services. In our infrastructure, we have approximately 30 different queues that coordinate work, including processing event data, disseminating ML model changes, queuing outbound notifications to our customers, updating search indices, and much more. Each of these queues was backed by a single SQS queue, until today.
Today, we launched a change to our systems such that for any logical SQS queue, say “incoming-events”, we create N redundant queues to implement it. That is, the logical queue “incoming-events” (for N=3) is implemented by “incoming-events-0”, “incoming-events-1”, and “incoming-events-2”. Clients of the queue consume events round-robin. For any queue whose ratio of failed requests to successful requests exceeds a configured threshold, we fail over to the next available queue and mark the failed queue bad. Using an exponential back-off scheme, queues that are marked bad are removed from regular processing for a short period of time and are returned to the pool of available queues only when they are healthy again.
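To make the scheme concrete, here is a minimal Python sketch of one way such a redundant queue could work. It is an illustration under assumptions, not our production code: the class names, the `send`/`receive` interface on the physical queues, the sliding window of recent outcomes, and the default threshold and delay values are all hypothetical, and the physical queues are treated as opaque objects rather than real SQS clients.

```python
import time
from collections import deque


class QueueUnavailable(Exception):
    """Raised when no physical queue could serve the request."""


class _State:
    """Health bookkeeping for one physical queue (hypothetical helper)."""

    def __init__(self, queue, window):
        self.queue = queue
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.strikes = 0      # times marked bad since the last success
        self.bad_until = 0.0  # clock time before which the queue is skipped


class RedundantQueue:
    """One logical queue backed by N physical queues, used round-robin."""

    def __init__(self, queues, max_failure_ratio=0.5, window=100,
                 base_delay=1.0, max_delay=60.0, clock=time.monotonic):
        self._states = [_State(q, window) for q in queues]
        self._max_ratio = max_failure_ratio
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._clock = clock
        self._next = 0

    def send(self, message):
        return self._call(lambda q: q.send(message))

    def receive(self):
        return self._call(lambda q: q.receive())

    def _call(self, operation):
        # Visit each physical queue at most once, starting at the round-robin cursor.
        for _ in range(len(self._states)):
            state = self._states[self._next]
            self._next = (self._next + 1) % len(self._states)
            now = self._clock()
            if state.bad_until > now:
                continue  # still backing off, skip this queue
            try:
                result = operation(state.queue)
            except Exception:  # assumption: any error counts as a failed request
                self._record(state, False, now)
                continue  # fail over to the next queue
            self._record(state, True, now)
            return result
        raise QueueUnavailable("no physical queue could serve the request")

    def _record(self, state, ok, now):
        state.outcomes.append(ok)
        if ok:
            state.strikes = 0  # healthy again, so reset the back-off
            return
        failures = state.outcomes.count(False)
        successes = len(state.outcomes) - failures
        if failures / max(successes, 1) > self._max_ratio:
            # Mark the queue bad, doubling the back-off on each repeat.
            delay = min(self._base_delay * 2 ** state.strikes, self._max_delay)
            state.strikes += 1
            state.bad_until = now + delay
            state.outcomes.clear()
```

In this sketch a queue returning from back-off starts with an empty window, so a single failed request marks it bad again with a doubled delay, while a single success resets the back-off. Other policies are equally plausible; the point is only that failure tracking, failover, and back-off are kept per physical queue behind one logical interface.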
We are confident that these changes adequately handle the various failure modes of individual SQS queues, but clearly they do not address the possibility that SQS could become unavailable altogether. We are working on a plan to address that greater issue and will post more about that in the future.