Slow spin-up times for Docker jobs using a custom Docker image hosted on GCR
Incident Report for CircleCI
Postmortem

Summary

Between 14:15 UTC on April 25th and 03:00 UTC on April 29th, customers using container images hosted in Google Container Registry (GCR) experienced build delays and failures due to timeouts. During this window, about 2% of Docker jobs experienced problems; a subset of these, about 0.9% of all jobs using GCR, failed. This resulted from a GCR service degradation that increased pull latency. We thank customers for their patience during this period.

The original status page updates are reproduced below. They include guidance and workarounds in the event of a similar outage.

What Happened

All times are in UTC.

At 14:00 on April 25th, GCR pull latency started to spike. Customers reported build failures over the following hours, and at 22:48 we began investigating potential internal causes.

By 01:07 on April 26th, we had confirmed that the slowness was not due to internal code changes, and posted a status page update. We narrowed the issue to private (customer-specified) registries and recommended that customers retry failed builds.

From here, we engaged our support partners and continued to inform customers as we learned more. We recommended alternative container registries as workarounds and monitored pull latency from GCR until the problem was resolved. At 20:47 on April 27th, GCR’s status page reflected the outage, and by 22:30 on April 28th service returned to normal.

Future Prevention and Process Improvement

While we cannot prevent this type of upstream issue, we did identify improvements we can make to our own systems and processes.

We will add registry-specific monitoring for container pull latency. This will allow us to proactively inform customers when degradation like this occurs, and offer workaround advice sooner.
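
As an illustration of the idea (a minimal sketch, not our production tooling; the probe images and the metric sink are placeholders), such a probe can time a pull per registry and tag each measurement with the registry host:

    import subprocess
    import time

    # Hypothetical probe images, one per registry being watched.
    PROBE_IMAGES = [
        "gcr.io/example-project/probe:latest",  # placeholder GCR image
        "docker.io/library/alpine:3.15",
    ]

    def registry_host(image):
        # The registry is everything before the first "/" in the reference.
        return image.split("/", 1)[0]

    def timed_pull(image):
        # Drop any cached copy so the pull actually exercises the registry.
        subprocess.run(["docker", "rmi", "-f", image], capture_output=True)
        start = time.monotonic()
        subprocess.run(["docker", "pull", image], check=True, capture_output=True)
        return time.monotonic() - start

    for image in PROBE_IMAGES:
        seconds = timed_pull(image)
        # In a real system this would feed a metrics pipeline, tagged by
        # registry, so per-registry latency can be alerted on independently.
        print(f"registry={registry_host(image)} pull_seconds={seconds:.1f}")

Tagging measurements by registry host is what makes the alerting registry-specific: a GCR-only latency spike stands out instead of being averaged away across healthy registries.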

Additionally, we discovered a bug in a rollback script that has since been fixed. While this would not have changed the outcome of this incident, catching it now prevents future problems.

Posted May 06, 2022 - 21:38 UTC

Resolved
Google has resolved the GCR issue and image pull times on CircleCI have returned to normal. Please let us know if you see further issues.
Posted Apr 29, 2022 - 03:06 UTC
Update
Google's status page is now live and references these issues: https://status.cloud.google.com/

Google Container Registry is experiencing elevated latencies in the US, Europe, and Asia multi-regions. Affected customers will experience intermittent delays while pulling images.
Posted Apr 27, 2022 - 20:47 UTC
Update
This issue continues to be localized to GCR. We will provide our next update when we see an improvement from GCR. From our previous update, the following still applies:

At the moment, we're still seeing extended durations when pulling GCR-hosted images.

In some cases the operation can time out and cause the build to fail. If this occurs, please try re-running the build, or, if you have that option, consider temporarily using a different container registry.
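
If it helps, one way to make that switch is to mirror the GCR-hosted image to another registry and point your job at the copy. Below is a minimal sketch with hypothetical image names; it assumes you can push to the target registry:

    import subprocess

    SOURCE = "gcr.io/my-project/my-image:1.0"  # hypothetical GCR-hosted original
    MIRROR = "docker.io/my-org/my-image:1.0"   # hypothetical temporary copy

    def run(*args):
        print("+", " ".join(args))
        subprocess.run(args, check=True)

    run("docker", "pull", SOURCE)  # may be slow while GCR is degraded
    run("docker", "tag", SOURCE, MIRROR)
    run("docker", "push", MIRROR)
    # Then reference the mirror in your job's image key
    # (e.g. image: my-org/my-image:1.0) and revert once GCR recovers.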

The underlying cause stems from outside CircleCI; nonetheless, we understand the impact this has on your CircleCI builds, and we will keep you informed of the situation.

Thank you for bearing with us.
Posted Apr 26, 2022 - 19:26 UTC
Update
We are continuing to monitor this issue and communicate with our providers to get updates.

At the moment, we're still seeing extended durations when pulling GCR-hosted images.

In some cases the operation can time out and cause the build to fail. If this occurs, please try re-running the build, or, if you have that option, consider temporarily using a different container registry.

The underlying cause stems from outside CircleCI; nonetheless, we understand the impact this has on your CircleCI builds, and we will keep you informed of the situation.

Thank you for bearing with us.
Posted Apr 26, 2022 - 13:16 UTC
Update
We are still actively working with our providers to identify the root cause of this issue.

Thank you for bearing with us.
Posted Apr 26, 2022 - 10:57 UTC
Monitoring
We have reason to believe the underlying issue stems from outside CircleCI.
We are working with our providers and continuing to monitor the issue.

If you are still seeing timeouts pulling GCR-hosted Docker images, please retry failed builds.
As a possible workaround, other image registries (Docker Hub, quay.io, and AWS ECR) are not currently experiencing degradation.

Thank you for your continued patience and understanding in this matter.
Posted Apr 26, 2022 - 03:35 UTC
Update
We are continuing to investigate intermittent slow download speeds when pulling Docker images hosted on GCR.

If you are still seeing timeouts pulling GCR-hosted Docker images, please retry failed builds.
As a possible workaround, other image registries (Docker Hub, quay.io, and AWS ECR) are not currently experiencing degradation.

Thank you for your continued patience and understanding.
Posted Apr 26, 2022 - 02:58 UTC
Update
Thank you for your patience as we continue to investigate slow download speeds when pulling private Docker images hosted on GCR. Please retry any failed builds. We appreciate your understanding.
Posted Apr 26, 2022 - 01:34 UTC
Update
We have investigated the issue and found that some Docker image pulls have slow download speeds. This seems to occur only with private images and resolves when the job is retried. If you are experiencing this issue, please cancel your build and retry it. Thank you for your patience and understanding.
Posted Apr 26, 2022 - 01:07 UTC
Investigating
Some customers using a custom Docker image hosted on GCR are seeing network slowness when pulling images. This is causing slow spin-up times, timeouts, and job failures.
Posted Apr 26, 2022 - 00:23 UTC
This incident affected: Docker Jobs.