Long Startup Times for macOS Jobs
Incident Report for CircleCI
Postmortem

Unlike the rest of our job execution fleet, our macOS jobs run on a fixed capacity of hardware, so when that capacity was reached, customers experienced increased queuing and longer spin-up times for their jobs.

We had hoped to onboard the latest batch of machines before customers' jobs were affected, but supply chain issues delayed our delivery by several weeks, and so on April 20 we were running with insufficient capacity for our growing demand.

Starting at 20:10 UTC on April 20, we experienced increased demand for macOS jobs running on Mac Gen2 resources. When we were alerted to the additional queuing, the team worked to relieve pressure on the system via four parallel tracks of action:

  1. Optimizing the throughput of the existing fleet

    1. A pre-existing issue made provisioning slower when capacity was low, which increased spin-up times for customer jobs. The team found a fix that restored spin-up times to normal levels even when capacity was low. This let us use our capacity more efficiently, so more jobs flowed through the system. Because we could enable the fix quickly, we saw some recovery on the first day, which helped us handle peak usage on the following days.
  2. Adding capacity to the Gen2 VM pool from our large resource fleet

    1. We configured a number of older machines from our existing fleet to run Gen2 jobs. This required deploying the Gen2 images to each host; the images are around 90 GB each, so the process takes several hours to complete. We chose the two most popular images, since they were seeing the highest load. These additional hosts were ready the next day, and we saw reduced queuing times as a result.
    2. This extra capacity has since been disabled because the newly provisioned Gen2 hosts (see #4) give us enough capacity to handle peak usage with room to spare. We are keeping these older machines configured in reserve so we can redeploy them quickly if capacity issues recur.
  3. Communicating with customers to ensure they were aware that they could move their jobs to the other macOS resource classes, large and medium, where we were not experiencing capacity issues (see the config sketch after this list)

    1. Some customers moved resource classes, which reduced the load on our Gen2 resources and helped us handle peak usage a little better.
  4. Work to add extra capacity to the Gen2 fleet

    1. Our data center provider had received a new batch of machines but had only just started configuring them for addition to the fleet. The configuration step involves changes to both the hardware and software of each host and requires hands-on access, which makes the process relatively slow. We were able to add 40% of that batch on Friday at 15:00 UTC and saw job queuing improve immediately.
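
For reference, the resource class switch described in track 3 is made in a project's .circleci/config.yml. The sketch below is illustrative only: the job name, Xcode version, and build command are assumptions, and the resource class values are the ones named in this report.

```yaml
# Illustrative sketch only: job name, Xcode version, and build command are assumptions.
version: 2.1

jobs:
  build-and-test:
    macos:
      xcode: "13.3.1"                           # assumed Xcode version; keep whatever your project pins
    # resource_class: macos.x86.medium.gen2     # the constrained class during this incident
    resource_class: large                       # temporary workaround; "medium" was also unaffected
    steps:
      - checkout
      - run: xcodebuild -scheme MyScheme test   # placeholder build step

workflows:
  build:
    jobs:
      - build-and-test
```

Moving back to macos.x86.medium.gen2 once capacity recovered was a one-line change to the same key.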

Now that these actions are complete, our Gen2 fleet has plenty of extra capacity at peak times and jobs are starting without delay.

We appreciate everyone’s patience during this time.

Posted Apr 29, 2022 - 19:49 UTC

Resolved
We are observing normal macOS VM provisioning and are moving to resolved. All macOS resource classes are available for use. Thank you for your patience.
Posted Apr 20, 2022 - 23:20 UTC
Monitoring
We are seeing affected customers' VM provisioning trend toward recovery, with spinup times of ~3 minutes. We are continuing to monitor.
Posted Apr 20, 2022 - 23:02 UTC
Update
We are continuing to work on a fix and are observing spinup times of <5 minutes for affected customers. We are seeing recovery in our macOS VM backlog. Thank you for your patience.
Posted Apr 20, 2022 - 22:49 UTC
Update
We are making changes to ensure VMs are being provisioned as efficiently as possible. We expect this change to take effect in the next ~45min.
Posted Apr 20, 2022 - 22:11 UTC
Update
We are seeing spinup times drop to around 6 minutes for affected users. You may also see better performance by using lower parallelism.
Posted Apr 20, 2022 - 21:44 UTC
Update
We are continuing to work on a fix. For affected users, average queue time is hovering between 3 and 15 minutes, with spinup times remaining around 15-20 minutes. We encourage the use of the large resource class as a workaround.
Posted Apr 20, 2022 - 21:30 UTC
Update
Spinup time on the macos.x86.medium.gen2 resource class can be up to ~20 minutes at this time. We are continuing to work on a solution to restore normal service as soon as possible.
Posted Apr 20, 2022 - 21:14 UTC
Identified
We have identified the cause of this disruption and are working on a solution.
Posted Apr 20, 2022 - 20:55 UTC
Update
We are investigating issues due to high usage of our macos.x86.medium.gen2 resource class. We are not seeing any issues for our medium and large resource classes. We are working on configuration changes to ensure that our fleet handles the high usage more gracefully.
Posted Apr 20, 2022 - 20:45 UTC
Update
We are very sorry for the interruption. We are investigating this issue to restore service as soon as possible. Please try the "large" resource class as a workaround.
Posted Apr 20, 2022 - 20:22 UTC
Investigating
We are currently investigating this issue.
Posted Apr 20, 2022 - 20:10 UTC
This incident affected: macOS Jobs.