Issues with S3 and Job Starts
Incident Report for CircleCI
Postmortem

AWS S3 was offline from around 9:30 AM PST until roughly 2:00 PM PST.

The failure of S3 caused further disruptions to AWS EBS, EC2, and ECR, as S3 stores volumes, container images, and AMIs. This disruption left CircleCI unable to start new servers or new build containers. We were not an isolated incident: our upstream providers were also severely impacted. As one example, we stopped receiving any webhooks from GitHub for the duration of the disruption.

The immediate impact of the outage was an increase in the backlog of new jobs. Because this happened during the time of day when we typically ramp up to meet demand, our server count was under capacity, which slowed our ability to process queued jobs.

While we waited for AWS to restore S3, we worked to ensure we did not lose any existing servers and prepared for the influx of build jobs once webhooks were restored. Processing the backlog of jobs took approximately two hours after AWS restored service, as we ramped up our server capacity within AWS-enforced recovery limits.
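For illustration only, the sketch below shows one way such a rate-limited capacity ramp-up could be implemented against the boto3 EC2 API: launch instances in small batches and back off whenever AWS throttles requests. The AMI ID, instance type, capacity target, and helper names are placeholders, not CircleCI's actual tooling or configuration.

    # Illustrative sketch: gradually ramp EC2 capacity while backing off
    # whenever AWS throttles instance launches (e.g. RequestLimitExceeded).
    # Assumes boto3 credentials and region are already configured.
    import time

    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2")

    AMI_ID = "ami-0123456789abcdef0"   # placeholder build-host image
    INSTANCE_TYPE = "m4.xlarge"        # placeholder instance type
    TARGET_CAPACITY = 200              # servers we ultimately want running
    BATCH_SIZE = 10                    # launch in small batches to stay under limits

    def ramp_up_capacity(target: int = TARGET_CAPACITY) -> int:
        """Launch instances in batches, backing off when AWS throttles us."""
        launched = 0
        delay = 5  # seconds between batches; doubled on throttling
        while launched < target:
            count = min(BATCH_SIZE, target - launched)
            try:
                resp = ec2.run_instances(
                    ImageId=AMI_ID,
                    InstanceType=INSTANCE_TYPE,
                    MinCount=count,
                    MaxCount=count,
                )
                launched += len(resp["Instances"])
                delay = 5  # reset backoff after a successful batch
            except ClientError as err:
                code = err.response["Error"]["Code"]
                if code in ("RequestLimitExceeded", "InsufficientInstanceCapacity"):
                    # AWS is still recovering; wait longer before retrying.
                    delay = min(delay * 2, 300)
                else:
                    raise
            time.sleep(delay)
        return launched

    if __name__ == "__main__":
        print(f"Launched {ramp_up_capacity()} instances")

The key design point is that the launch loop treats throttling responses as a signal to slow down rather than fail, which keeps the ramp-up within whatever recovery limits the provider enforces.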

AWS S3 Service disruption report: https://aws.amazon.com/message/41926/

Posted Mar 06, 2017 - 17:44 UTC

Resolved
We have seen no remaining issues and as such we are marking this incident as resolved. Thank you for your patience and continued support.
Posted Mar 01, 2017 - 01:04 UTC
Update
The backlog has been processed and we are going to monitor the situation for another 20 minutes to ensure that we are in the clear.
Posted Mar 01, 2017 - 00:44 UTC
Update
We continue to process the backlog of queued builds while maintaining a higher-than-normal level of resources. We will update again in 30 minutes.
Posted Mar 01, 2017 - 00:28 UTC
Update
The backlog of queued builds is being processed. We've brought additional resources online to meet demand and will continue to monitor. We anticipate service will return to normal soon. Next (and hopefully last) update in 30 minutes.
Posted Feb 28, 2017 - 23:50 UTC
Monitoring
AWS S3 is operating normally again. We will continue to bring additional resources online to process the backlog of builds and continue to monitor the situation closely. We'll update again in 30 minutes.
Posted Feb 28, 2017 - 23:16 UTC
Update
We are continuing to work on the backlog of builds while monitoring the AWS S3 status. We will update again in 30 minutes.
Posted Feb 28, 2017 - 22:15 UTC
Update
We're continuing to see improvement; however, our systems are still impacted as a result of the S3 issues. Next update in 30 minutes. Thanks for your patience as we continue to monitor the situation.
Posted Feb 28, 2017 - 21:42 UTC
Update
We are starting to see signs of improvement. AWS expects to see lower error rates within the hour. We will update again in 30 minutes.
Posted Feb 28, 2017 - 21:02 UTC
Update
AWS believes it has identified the cause of the S3 issue and is working on implementing a fix. We'll update again in 30 minutes.
Posted Feb 28, 2017 - 20:13 UTC
Update
The AWS S3 availability issue persists. We'll continue to monitor and update again in 30 minutes.
Posted Feb 28, 2017 - 19:31 UTC
Update
We're continuing to experience issues with AWS S3 and are monitoring the situation closely. We'll update again in 20 minutes. Thank you for your patience.
Posted Feb 28, 2017 - 19:14 UTC
Update
The issue with AWS S3 is still ongoing. We are working to keep our fleet ready to respond once the event is over.
Posted Feb 28, 2017 - 18:39 UTC
Identified
We have identified the issue with our upstream providers and are monitoring the situation.
Posted Feb 28, 2017 - 18:19 UTC
Update
We are seeing widespread issues with AWS and GitHub that are impacting our ability to handle builds.
Posted Feb 28, 2017 - 17:53 UTC
Investigating
We are currently investigating this issue.
Posted Feb 28, 2017 - 17:51 UTC