Pipelines not loading
Incident Report for CircleCI
Postmortem

Incident Report: 2022-09-14 - Pipelines not loading

Summary:

On September 14, 2022, from approximately 07:45 to 17:05 UTC, customer pipelines were delayed due to high load on an internal database that is central to coordinating and orchestrating work on our platform. During this window, all customer pipelines experienced delays in starting. As part of the remediation effort, parts of the application UI were disabled between 08:21 and 15:34 UTC to reduce load on the impacted system.

The original status page can be found here: Pipelines not loading

We want to thank our customers for your patience and understanding as we worked to resolve this incident.

What Happened

At approximately 07:40 UTC on September 14, there was a sudden increase in disk I/O and CPU utilization on two read-only replicas of a database that is part of the system responsible for coordinating and orchestrating work on the CircleCI platform. By 07:54 UTC, internal circuit breakers had tripped and an incident was declared.
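For readers unfamiliar with the term, the following is a minimal sketch of how a circuit breaker behaves in general. It is not CircleCI's implementation: the class name, the thresholds, and the cool-down value are all assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker; all names and defaults are hypothetical."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None

    @property
    def is_open(self):
        if self.opened_at is None:
            return False
        # After the cool-down, let one trial call through ("half-open").
        return self.clock() - self.opened_at < self.reset_after

    def call(self, fn, *args, **kwargs):
        if self.is_open:
            # Fail fast instead of adding load to a struggling dependency.
            raise RuntimeError("circuit open: call rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # A success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Once tripped, the breaker rejects calls immediately rather than queuing more work against an overloaded database, which is why a tripped breaker is a useful early signal that a dependency is unhealthy.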

During the incident, engineers rolled the pods of the affected services several times in an attempt to terminate slow-running queries and free up connection pools. The team also enabled an “incident mode” flag that disables portions of the application user interface to help reduce the volume of calls into the database.

These measures reduced the load temporarily, but it quickly spiked again as the work queue backed up. After further analysis, the team determined that a number of factors were contributing to the database load:

  • There was a standard background process running that was intended to help clean stale data.
  • A recent change to a database library had enabled SQL comments, and there was a hypothesis that this change had affected the database’s internal statistics engine by increasing the cardinality of query strings, leading to additional load.
  • A database internal maintenance process started right as the load increased, apparently tipping the system over the edge.
  • Finally, the replicas had originally been sized smaller than the primary, and as additional workloads shifted to the replicas, they could no longer support the same load as the primary. This also contributed to replication lag from the primary, consuming resources and further slowing the responsiveness of the affected system.

All of these factors contributed to pushing the affected database into a state where it could not keep up with the workload.
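The SQL-comment hypothesis above can be illustrated with a small sketch. It assumes a statistics engine that keys on the raw query text; whether a given database strips comments before fingerprinting varies by product and version. The query, table name, and comment format are hypothetical and not taken from the affected system.

```python
import re

# Hypothetical query; the table and column names are illustrative only.
BASE_QUERY = "SELECT id FROM workflows WHERE project_id = $1"

# Matches /* ... */ block comments, including multi-line ones.
COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)


def with_comment(query: str, request_id: str) -> str:
    """Append a SQL comment carrying per-request metadata (assumed format)."""
    return f"{query} /* request_id={request_id} */"


def normalize(query: str) -> str:
    """Strip block comments and collapse whitespace."""
    return " ".join(COMMENT_RE.sub(" ", query).split())


# One logical query issued 1000 times, each with a unique comment.
tagged = [with_comment(BASE_QUERY, f"req-{i}") for i in range(1000)]

distinct_raw = len(set(tagged))                               # keyed on raw text
distinct_normalized = len({normalize(q) for q in tagged})     # keyed after stripping
```

Keyed on raw text, every tagged query is a distinct string; keyed on the normalized text, they collapse to a single entry. High-cardinality values in comments are therefore the part most likely to inflate any per-query-string bookkeeping.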

The system was restored to full functionality by scaling up the replicas to match the primary and reverting the change to the database library.

By 17:05 UTC, the remediation work was completed, all pipelines were processing normally, and the UI was re-enabled. Customer builds continued to flow (at reduced capacity) for the entire duration of the incident; in retrospect, however, we did not communicate this effectively at the time.

Future Prevention and Process Improvement:

Our database engineers have initiated a review and audit of our critical systems to evaluate and shift applicable workloads (like reporting and analytics) away from critical production systems. This work has been completed for the workflow system affected by this incident. It is ongoing for the remainder of our critical systems and is targeted for completion by the end of 2022.

The change to enable SQL comments has now been gated behind a configuration flag and is not enabled by default.
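A minimal sketch of what gating such a feature behind an off-by-default flag might look like. The environment variable name `ENABLE_SQL_COMMENTS`, the function names, and the comment format are assumptions for illustration; the source does not describe the actual library or flag.

```python
import os

_TRUTHY = {"1", "true", "yes"}


def sql_comments_enabled(env=None) -> bool:
    """Return True only if the (hypothetical) flag is explicitly set."""
    env = os.environ if env is None else env
    return env.get("ENABLE_SQL_COMMENTS", "false").strip().lower() in _TRUTHY


def annotate(query: str, metadata: dict, env=None) -> str:
    """Append metadata as a SQL comment, but only when the flag is on."""
    if not metadata or not sql_comments_enabled(env):
        return query  # default path: the query text is left untouched
    tags = ",".join(
        # Drop comment delimiters from values so they cannot end the comment early.
        f"{key}={str(value).replace('*/', '').replace('/*', '')}"
        for key, value in sorted(metadata.items())
    )
    return f"{query} /* {tags} */"
```

With the flag unset, `annotate("SELECT 1", {"service": "api"}, env={})` returns the query unchanged, so a service has to opt in before its query strings start varying.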

We are working to revise our incident process to help facilitate more effective communication to our customers during incidents.

Posted Nov 09, 2022 - 18:59 UTC

Resolved
The Pipelines UI is now operational. Builds are continuing to process normally. Thank you for your patience.
Posted Sep 14, 2022 - 17:38 UTC
Monitoring
The Pipelines UI is now operational. Builds are continuing to process normally.
Posted Sep 14, 2022 - 17:05 UTC
Update
We are continuing to investigate. Pipelines in the UI remain disabled but accessible via the API.
Builds are running as normal, including email notifications and VCS updates. Workflows and jobs can be accessed on the UI via their direct links.
Posted Sep 14, 2022 - 16:41 UTC
Update
We are still seeing degraded database performance. Pipelines in the UI remain disabled while we continue to investigate. Thank you for bearing with us!
Posted Sep 14, 2022 - 16:14 UTC
Update
We are continuing to investigate the database issue. The Pipelines page will remain unavailable for the time being, but pipeline data is still accessible via the API. We will provide an update in 1 hour or when we have more news to share.
Posted Sep 14, 2022 - 15:10 UTC
Investigating
We are seeing degraded database performance again. As part of the ongoing investigation, we have temporarily disabled the Pipelines page in our UI once again. Builds continue running as normal.
Posted Sep 14, 2022 - 14:43 UTC
Update
We are still seeing normal database performance and are continuing to monitor.
Posted Sep 14, 2022 - 13:41 UTC
Monitoring
Database performance has returned to normal service and the Pipelines page in the UI remains active. We are continuing to monitor performance.
Posted Sep 14, 2022 - 13:01 UTC
Update
We have re-enabled Pipelines in the UI. We are still seeing degraded database performance and are continuing to investigate. We will update in 1 hour or when we have more information to share.
Posted Sep 14, 2022 - 12:50 UTC
Update
We are continuing to investigate the database issues. Builds are running as normal, but Pipelines in the UI will stay turned off while we investigate. It is possible to view pipelines using the API:

https://circleci.com/docs/api/v2/index.html#operation/listPipelines

We will update in 1 hour or when we have new information to share.
Posted Sep 14, 2022 - 12:22 UTC
Update
We are continuing to investigate and are working with our partners to establish the root cause of the database issues we are experiencing. While we do so, the Pipelines page will remain unavailable, but pipeline data is still accessible via the API. We will update in 1 hour or when we have more news to share.
Posted Sep 14, 2022 - 11:16 UTC
Update
We are still experiencing database issues. We have temporarily disabled the Pipelines page in our UI while we continue to investigate. Querying our API is still possible for those who need to access pipelines data:

https://circleci.com/docs/api/v2/index.html#operation/listPipelines
Posted Sep 14, 2022 - 10:49 UTC
Update
The pipeline page is still affected but builds are being processed. We continue to investigate the issue.
Posted Sep 14, 2022 - 10:25 UTC
Update
We are still investigating and working on identifying the root cause.
Posted Sep 14, 2022 - 09:55 UTC
Update
We are continuing to investigate this issue.
Posted Sep 14, 2022 - 09:32 UTC
Update
We continue to investigate the cause of the issue. Thank you for your patience while we're working on this.
Posted Sep 14, 2022 - 09:08 UTC
Investigating
We are currently investigating an incident in which customers are seeing an empty pipeline page; it is likely that builds are not being processed.
Posted Sep 14, 2022 - 08:44 UTC
This incident affected: Pipelines & Workflows and CircleCI UI.