VMs failing to be created

Incident Report for CircleCI

Postmortem

Summary

Customers experienced lengthy delays in job starts and stuck builds from 17:00 UTC October 27 to 01:16 UTC October 28. All jobs were affected, starting with machine jobs and continuing into docker executor as the incident progressed.

We thank our customers for their patience and understanding as we worked to resolve this incident. We know how critical CircleCI is to our customers, and we want to share what happened along with our efforts to ensure this kind of incident doesn’t happen again. Because our investigation is ongoing, we will follow up this report with a more detailed one later in the week.

What happened

Three issues, each necessary but only jointly sufficient, resulted in an outage that affected all workloads and lasted seven hours.

Errors from AWS increased unexpectedly then subsided quickly;
Our VM scaling algorithm scaled more rapidly than expected in response to changes in the system;
An independent failover mechanism that quickly creates VMs in an alternate provider behaved unexpectedly when returning to the primary provider.

Most of our compute workload operates in AWS, with the remaining ~20% in GCP. For some of our AWS workload, we can fail over to GCP in case of emergency. While GCP is the “secondary” provider for this flexible workload, it remains the primary provider for other VM workloads.

As traffic increased on October 27th, we experienced a brief spike in “out of capacity” errors from AWS. This triggered a failover mechanism to move workload to a secondary provider, GCP. Creating VMs in the secondary provider proceeded more rapidly than expected because the automated scaler did not receive messages indicating VMs had been successfully created. As the scaler continued to create VMs, we hit our secondary provider CPU quota. That prevented VMs for other, unrelated workloads from being created.
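
The failure mode described above, a scaler that keeps requesting VMs because it never sees confirmation of the ones it already requested, can be illustrated with a short sketch. This is not CircleCI's scaler: the PoolState fields, the vms_to_request function, and the idea of reserving CPU headroom for other workloads are all assumptions made for illustration. The sketch counts unacknowledged requests toward the target and caps new requests by the remaining CPU quota.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    """Hypothetical snapshot of one VM pool; every field name is illustrative."""
    desired: int        # VMs the scaler wants available
    confirmed: int      # VMs whose creation has been acknowledged
    in_flight: int      # create requests sent but not yet acknowledged
    cpus_per_vm: int    # CPUs consumed by each VM in this pool
    quota_cpus: int     # provider-wide CPU quota
    used_cpus: int      # CPUs already consumed or requested by all workloads
    reserved_cpus: int  # headroom kept free for unrelated workloads


def vms_to_request(state: PoolState) -> int:
    """Return how many new VMs to request in this scaling pass."""
    # Treat unacknowledged requests as if they will succeed, so a missing
    # confirmation message cannot cause the same capacity to be requested twice.
    shortfall = state.desired - state.confirmed - state.in_flight
    if shortfall <= 0:
        return 0

    # Never spend the CPU headroom reserved for other workloads.
    free_cpus = state.quota_cpus - state.used_cpus - state.reserved_cpus
    affordable = max(free_cpus, 0) // state.cpus_per_vm
    return min(shortfall, affordable)
```

With 90 requests still in flight against a target of 100 and 10 confirmed VMs, this sketch requests nothing further, whereas a scaler that counted only confirmed VMs would request another 90 on every pass.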

Once the capacity errors for AWS subsided, the failover mechanism returned flow to its normal location. The VMs created in GCP remained in place, continuing to block our normal GCP workload. Our VM orchestrator responded to the blocked workload by trying to create more instances, which resulted in rate-limiting by both cloud providers. At this point, no work was able to flow through the system.
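
One common way to keep retries from amplifying provider rate limits is capped exponential backoff with jitter. The sketch below shows that general technique only; it is not a description of the change CircleCI made. The RateLimited exception and the create_vm callable are hypothetical stand-ins for a provider client.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised when a provider throttles a request."""


def create_with_backoff(create_vm, max_attempts=5, base=1.0, cap=60.0,
                        sleep=time.sleep, rng=random.random):
    """Call create_vm(), waiting longer after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return create_vm()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller decide what to do
            # Wait a random time up to a ceiling that doubles on each
            # attempt and never exceeds `cap` seconds.
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)
```

The sleep and rng parameters exist only so the delay schedule can be checked without actually waiting.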

To unstick the system, we disabled VM creation and manually purged job queues. We estimate that about 40,000 jobs were dropped during this restart. To users, each of these appeared as a “canceled job,” as if they had canceled it themselves. We then manually deleted the excess VMs in GCP and re-enabled VM creation, restoring normal service for all jobs.

We apologize for this outage. We know that it was a prolonged disruption for our customers, who rely on CircleCI for critical workflows. We are taking active steps to prevent issues of this type in the future.

Future prevention

Immediately after the incident, our engineering teams made changes to mitigate runaway API calls resulting from the three contributing factors. When traffic increased again the next day, the system remained stable, which gives us confidence that this type of incident will not recur.

We’re also implementing ways to recover faster when problems do occur. Some of these are already finished, and some of them are in progress.

Additionally, we’ve implemented an easier way to shut off traffic at an earlier point in the system. A common failure mode in this system involves hitting limits -- API rate limits, CPU limits, or others. Having an easy way of turning off the flow in critical situations will help us recover faster once we’ve reached that state. We anticipate that our customers would experience this as a much shorter-lived service interruption impacting a small number of jobs.
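
As a rough illustration of this kind of shut-off, the sketch below gates job admission behind a single switch. The FlowSwitch class and its method names are assumptions for illustration and do not describe CircleCI's actual mechanism.

```python
import threading


class FlowSwitch:
    """Hypothetical gate checked before a job is admitted to the system."""

    def __init__(self):
        self._open = threading.Event()
        self._open.set()  # traffic flows by default

    def shut_off(self):
        """Stop admitting new work, e.g. once a quota or rate limit is hit."""
        self._open.clear()

    def restore(self):
        """Resume admitting work after the system has recovered."""
        self._open.set()

    def admit(self, job, queue):
        """Append job to queue if the switch is open; report whether it was admitted."""
        if not self._open.is_set():
            return False
        queue.append(job)
        return True
```

Rejecting work at the entry point keeps the backlog from growing while limits are exhausted, so there is less to purge before service is restored.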

Longer-term, we are making larger investments into the systems that handle these workloads. We’ve started small, with a new Mac resource class currently in closed preview, and will continue to build from there.

Posted Nov 02, 2021 - 21:26 UTC

Resolved

Thank you again for your patience and understanding as we worked through this incident. All jobs are processing normally, so we are marking this as Resolved.

Posted Oct 28, 2021 - 01:16 UTC

Monitoring

We are now monitoring the situation. Customers can cancel and restart their stuck jobs to trigger new builds.

Posted Oct 28, 2021 - 01:09 UTC

Update

We are continuing to investigate the issue.

Posted Oct 28, 2021 - 00:18 UTC

Update

Continuing to investigate.

Posted Oct 27, 2021 - 23:21 UTC

Identified

Identified additional issues, continuing to investigate. Jobs will not run at this time.

Posted Oct 27, 2021 - 22:31 UTC

Update

Identified and continuing to resolve.

Posted Oct 27, 2021 - 22:27 UTC

Update

Continuing to monitor.

Posted Oct 27, 2021 - 21:52 UTC

Update

We are continuing to monitor as jobs come back online.

Posted Oct 27, 2021 - 21:23 UTC

Update

We are continuing to monitor as jobs come back online.

Posted Oct 27, 2021 - 20:46 UTC

Update

We are continuing to monitor as jobs come back online.

Posted Oct 27, 2021 - 20:16 UTC

Monitoring

We are currently in a monitoring state. Older jobs should slowly begin to start again while newer jobs may continue to see failures as the system catches up.

Posted Oct 27, 2021 - 19:58 UTC

Update

We are continuing to work on this issue. We are currently implementing a fix.

Posted Oct 27, 2021 - 19:45 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Oct 27, 2021 - 18:29 UTC

Update

We are continuing to investigate this issue.

Posted Oct 27, 2021 - 18:21 UTC

Investigating

We are currently investigating widespread failures with machine jobs and remote docker.

Posted Oct 27, 2021 - 17:50 UTC

This incident affected: Docker Jobs and Machine Jobs.