Delays in workflows

Incident Report for CircleCI

Postmortem

Summary:

On September 27, 2021, from 07:05 to 14:46 UTC, CircleCI customers experienced delays in workflows. Due to our efforts to fix the initial incident, some customers were unable to access the UI throughout the incident. Simultaneously, AWS EBS experienced an outage in US-EAST, which degraded our MongoDB databases, causing further delays. We want to thank our customers for their patience and understanding as we worked to resolve this incident.

Contributing Events:

A third-party configuration event inadvertently triggered an excessive number of workflows in quick succession, which resulted in a spike of auto-cancellations being sent to our system. The large number of auto-cancellations caused slowness as requests flooded our job queues and datastores. The overall impact was that customers experienced delays starting workflows and jobs.

The second event that contributed to this outage was an AWS incident specific to EBS volumes. This left some of our MongoDB volumes in a degraded state, which led to instability within our MongoDB clusters.

Future prevention:

The engineering teams that handled this incident have identified a few areas where we can make immediate improvements, in addition to planning some medium-term work. The immediate improvements are around further enforcing concurrency limits on how we run pipelines, as well as adding further rate limits on how we handle API calls and webhooks.
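To illustrate the kind of rate limiting mentioned above, here is a minimal sketch of a per-source token bucket. It is not our actual implementation: the class names, the `source` key, and the rate and burst values are all hypothetical, chosen only to show how a burst of webhooks from a single integration could be capped without affecting other sources.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float       # tokens refilled per second (hypothetical value)
    capacity: float   # maximum burst size (hypothetical value)
    tokens: float     # tokens currently available
    last: float       # timestamp of the previous check, in seconds

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class WebhookLimiter:
    """Keeps one bucket per webhook source (illustrative only)."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, source: str, now: float) -> bool:
        bucket = self.buckets.get(source)
        if bucket is None:
            # A new source starts with a full bucket.
            bucket = TokenBucket(self.rate, self.capacity, self.capacity, now)
            self.buckets[source] = bucket
        return bucket.allow(now)


limiter = WebhookLimiter(rate=1.0, capacity=3.0)
burst = [limiter.allow("integration-a", now=0.0) for _ in range(5)]
print(burst)
```

In this sketch, the first three requests of the burst are accepted and the rest are rejected until the bucket refills, while a different source keeps its own independent allowance.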

In response to the AWS EBS incident, we have already completed some work to better handle failovers within our MongoDB cluster.

What Happened

(all times UTC)

On September 27th at 6:20 UTC, there was a large spike in “cancel workflows” messages on the workflows queue.

At 6:32 we were alerted that the workflows queue was not draining fast enough. At 7:08, after investigating, we identified that all customer workflows were impacted by delays. At this time, we were still investigating why the queue was backed up.

At 7:24 we identified the spike in cancellation messages. At 7:36, we scaled up the workflows orchestration service to speed up draining the queue. Shortly after, we noticed that we were no longer receiving an elevated volume of cancellation messages.

At 7:47 we started investigating the potential reasons for the initial spike. At 7:55 we identified a third-party integration that mistakenly created a large number of workflows. Our feature for the auto-cancellation of redundant builds tried to cancel these, causing delays for all workflows messages.
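The fan-out described above can be sketched with a simplified model. It assumes that a "redundant" workflow is an older workflow still running on the same branch, and that each newly created workflow produces one cancel message for its predecessor. The function name and event shape are hypothetical and do not reflect the orchestration service's real code.

```python
def cancellations_for(events):
    """Return the workflow ids that would receive a cancel message.

    `events` is an iterable of (workflow_id, branch) pairs in creation
    order. This is a simplified model, not the production logic.
    """
    latest_on_branch = {}
    cancels = []
    for workflow_id, branch in events:
        previous = latest_on_branch.get(branch)
        if previous is not None:
            # The newer workflow makes the previous one redundant.
            cancels.append(previous)
        latest_on_branch[branch] = workflow_id
    return cancels


# 1,000 workflows created in quick succession on one branch.
same_branch = [(i, "main") for i in range(1000)]
print(len(cancellations_for(same_branch)))
```

Under these assumptions, N workflows created on one branch produce N - 1 cancel messages, which is how a burst of workflow creations turns into a comparable burst of cancellation traffic on the queue.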

At 8:13 we began updating the impacted workflows orchestration service Kubernetes pods to stop auto-cancellation. Between 8:13 and 8:44, the update was applied to all pods. Cancellation messages started trending down, but jobs were still not running. Shortly after, the MongoDB replication lag began growing.

At 8:58, all of the impacted Kubernetes pods were updated, but cancellation queries continued to spike. Upon investigation, we found that some pods had restarted and reverted to the original version, without the update. A second update was applied to return the expected data shape. By 9:29, the deployments of some service instances kept failing due to overload and reverted to the original code. We removed liveness probes to stop workflows pods from restarting. This was expected to alleviate pressure on MongoDB too.

However, by 9:45 one of our MongoDB instances was “impaired”, and it was unclear whether this was related to an incident for AWS US-EAST occurring at the same time. The workflows orchestration service was scaled down to alleviate pressure, and a MongoDB replica was elected primary. Because of replication lag, however, this resulted in a data rollback to 8:47. By 10:19 we had scaled the workflows orchestration service back up, and shortly after we confirmed that MongoDB was healthy, with the exception of the data rollback.
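To show why replication lag turns a failover into a rollback, here is a small illustrative model. When a lagging replica is elected primary, writes that the old primary accepted but the replica had not yet applied are not part of the new primary's history. The write timestamps and labels below are invented for illustration; only the 8:47 rollback point comes from the timeline above.

```python
def split_on_failover(primary_writes, replicated_through):
    """Split writes into those kept and those rolled back on failover.

    `primary_writes` is a list of ((hour, minute), label) tuples and
    `replicated_through` is the last (hour, minute) the newly elected
    replica had applied. Simplified model, not MongoDB's real algorithm.
    """
    kept = [w for w in primary_writes if w[0] <= replicated_through]
    rolled_back = [w for w in primary_writes if w[0] > replicated_through]
    return kept, rolled_back


# Hypothetical writes accepted by the old primary.
writes = [
    ((8, 40), "write-1"),
    ((8, 47), "write-2"),
    ((8, 55), "write-3"),
    ((9, 30), "write-4"),
]
kept, rolled_back = split_on_failover(writes, replicated_through=(8, 47))
print([label for _, label in rolled_back])
```

In this model, everything written after 8:47 exists only on the old primary and is rolled back, so the larger the replication lag at the moment of failover, the larger the window of rolled-back data.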

At 10:25 we noticed that messages from the problematic workflows were still being handled past the point where we expected them to be dropped. The team worked on a new update and confirmed by 11:08 that we were able to process all of these cancellation messages. At 11:36 jobs were being processed and the Nomad cluster scaled up to handle the influx of jobs. We continued to scale to better handle the load and updated our status page to “monitoring” at 12:11 as we processed the backlog of workflows.

While monitoring, we noticed some jobs taking nearly 20 seconds to start and found this was due to some unrelated automated checks, which were quickly disabled. By 13:22 the orchestration service’s CPU was back to normal. At 13:44 the workflows queue was drained and at 14:07 we were back to normal operations with some delay in machine jobs and status updates. At 14:46 we declared the incident as resolved.

Posted Oct 05, 2021 - 21:32 UTC

Resolved

Thank you again for your patience and understanding as we worked through this incident. All jobs are processing normally, so we are marking this as Resolved.

Posted Sep 27, 2021 - 14:46 UTC

Update

We have processed the backlog of jobs and there should no longer be delays with jobs starting. However, there is still some delay with GitHub checks so the status of jobs being reported back may be delayed. Thank you for your continued patience while we monitor the progress.

Posted Sep 27, 2021 - 14:22 UTC

Update

We are continuing work on processing the backlog of jobs created by the incident. Docker jobs no longer have delays; however, new machine, macOS, and Windows jobs may still be delayed. Thank you for your continued patience while we monitor the progress.

Posted Sep 27, 2021 - 14:10 UTC

Update

We are continuing work on processing the backlog of jobs created by the incident. New jobs may still be delayed. Thank you for your continued patience while we monitor the progress.

Posted Sep 27, 2021 - 13:25 UTC

Update

We are continuing to process the backlog of jobs created by the incident. We are still experiencing delays with jobs and are continuing to work on reducing the backlog.

Posted Sep 27, 2021 - 12:56 UTC

Monitoring

We are continuing to process the backlog of jobs created by the incident. We are still experiencing delays with jobs and are continuing to monitor.

Posted Sep 27, 2021 - 12:16 UTC

Update

We have taken several actions to restore the service, and jobs are now starting to be processed again.

However, as we are processing the job backlog created by the incident, there are significant delays in job execution.

Posted Sep 27, 2021 - 11:48 UTC

Update

We are continuing to work on a fix for this issue.

Posted Sep 27, 2021 - 11:37 UTC

Update

We are continuing to work on a fix for this issue.

Posted Sep 27, 2021 - 10:57 UTC

Update

We have now contained the collateral issues; you can reach the CircleCI UI again. However, the data is not currently accessible.

We are still working on the initial incident.

Posted Sep 27, 2021 - 10:14 UTC

Update

We are working tirelessly to fix this issue.

Some of the actions we are taking are currently also impacting other components/services.

Right now, you might be unable to access the CircleCI UI. This is unfortunately a side-effect of our effort to fix the initial incident.

We will continue to inform you on our progress.

Posted Sep 27, 2021 - 09:48 UTC

Update

We are still working on a fix that will allow us to fully restore the service in a safe manner.

Please accept our apologies for the disruption.

Thank you for your patience while we are working on this issue.

Posted Sep 27, 2021 - 09:20 UTC

Update

We are continuing our effort to find a fix for this issue.

Posted Sep 27, 2021 - 08:56 UTC

Identified

We have identified the cause of this issue, and we're currently assessing the actions we need to take to safely restore the service.

We realize this is causing significant disruption to our customers' operations, and we're diligently working on a solution.

Posted Sep 27, 2021 - 08:30 UTC

Update

We are continuing to investigate the cause of Workflows delays.

Posted Sep 27, 2021 - 07:46 UTC

Update

We are continuing to investigate the cause of workflows being delayed.

Posted Sep 27, 2021 - 07:34 UTC

Update

We are continuing to investigate this issue.

Posted Sep 27, 2021 - 07:22 UTC

Update

We are continuing to investigate this issue.

Posted Sep 27, 2021 - 07:07 UTC

Investigating

We are currently investigating an issue where workflows are being delayed.

Posted Sep 27, 2021 - 07:05 UTC

This incident affected: Docker Jobs, Machine Jobs, macOS Jobs, Windows Jobs, Pipelines & Workflows, CircleCI UI, and Runner.