Problems restoring Workspaces for some executors

Incident Report for CircleCI

Postmortem

Summary:

At 09:53 UTC on April 29th, a code change was deployed that caused all customer jobs that used an attach_workspace step and ran on Google Cloud Platform (GCP) to fail. We rolled back the change at 12:12 UTC, which resolved customer impact by 12:21 UTC. We thank our customers for their patience and understanding during this outage.

The original status page can be found here.

What Happened

All timestamps are UTC.

A code change released at 09:53 attempted to restore workspaces from Google Cloud Storage (GCS) for jobs running on GCP and to fall back to S3 in the case of any errors. Starting at 10:55, we began to receive support tickets from customers whose jobs were failing in the attach_workspace step. At 12:12 we reverted the pull request for the contributing code change and immediately observed attach_workspace errors declining.

We are currently implementing a change in our workspace service to write workspaces to two providers. At the time of the incident, the double-write was only activated for a handful of internal projects, so most workspaces had not yet been written to GCS. The code change on the 29th attempted to download workspaces from GCS and fall back to an alternative provider if the download failed. Almost all calls to download a workspace from GCS therefore returned a Not Found error. A bug in the code caused a failed download attempt to be reported as successful, so the fallback was never triggered.

Therefore, any job running on GCP that included an attach_workspace step would:

attempt to download the workspace from GCS;
receive a Not Found error;
erroneously report success due to the bug;
continue processing the step (without the required workspace present);
fail when something attempted to use the missing workspace.
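
The intended behavior, download from the primary provider and fall back on any error, can be sketched as follows. This is only an illustrative sketch, not CircleCI's actual code: the names InMemoryStore, NotFound, and restore_workspace are hypothetical, and the stores are in-memory stand-ins for GCS and S3.

```python
class NotFound(Exception):
    """Raised when a workspace does not exist in a store (hypothetical)."""


class InMemoryStore:
    """Hypothetical stand-in for a storage provider such as GCS or S3."""

    def __init__(self, name, blobs=None):
        self.name = name
        self.blobs = dict(blobs or {})

    def download(self, key):
        try:
            return self.blobs[key]
        except KeyError:
            raise NotFound(f"{key} not found in {self.name}") from None


def restore_workspace(key, primary, fallback):
    """Try the primary store first; on any error, use the fallback store.

    Returns (data, provider_name). An error from the fallback propagates,
    so a missing workspace fails the step instead of reporting success.
    """
    try:
        return primary.download(key), primary.name
    except Exception:
        # The primary failure must lead to the fallback attempt. Treating
        # it as a success at this point is the class of bug described above.
        return fallback.download(key), fallback.name


# Assumed scenario: double-write is not enabled, so the primary is empty.
gcs = InMemoryStore("gcs")
s3 = InMemoryStore("s3", {"ws-1": b"workspace-bytes"})
data, provider = restore_workspace("ws-1", gcs, s3)
```

In this sketch, a workspace missing from the primary is served from the fallback, and a workspace missing from both providers raises NotFound at the restore step rather than letting the job continue without it.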

Future Prevention and Process Improvement:

The incident revealed several issues in our detection and response processes. We had monitoring for attach_workspace failures but no alerting; we have since added those alerts. We have updated our automated tests to validate the expected behavior of this specific change. We have also fixed a glitch in our rollback script that prevented us from reverting the change faster.

We once again thank our customers for their patience as we worked to resolve this issue.

Posted May 09, 2022 - 15:18 UTC

Resolved

Restoring workspaces has returned to normal operation. Thank you for your patience throughout.

Posted Apr 29, 2022 - 12:45 UTC

Monitoring

We have identified the cause and implemented a fix. We are monitoring ongoing performance.

Posted Apr 29, 2022 - 12:26 UTC

Identified

We are observing some failures when restoring Workspaces on some executors including Machine and Windows. We have identified the cause and are working to resolve the issue.

Posted Apr 29, 2022 - 12:16 UTC

This incident affected: Machine Jobs and Windows Jobs.