Trusty fleet build queue
Incident Report for CircleCI
Postmortem

We were alerted by Support that multiple customers' builds were being queued for long periods of time. SRE investigated and found that our Trusty fleet was unbalanced: some jobs were starved of resources even though our monitoring system showed plenty of capacity.
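As a rough illustration of how a fleet-wide capacity figure can hide this kind of imbalance, here is a minimal sketch. It is not our actual monitoring code: the host names, the "free slots" metric, and the `max_starved_fraction` threshold are all hypothetical and chosen only for the example.

```python
def fleet_summary(free_slots_by_host):
    """Return (total_free_slots, starved_hosts) for a host -> free-slot mapping.

    The mapping and the notion of "free slots" are assumptions made for
    this illustration only.
    """
    total_free = sum(free_slots_by_host.values())
    starved = sorted(h for h, free in free_slots_by_host.items() if free == 0)
    return total_free, starved


def is_unbalanced(free_slots_by_host, max_starved_fraction=0.25):
    """Flag a fleet that has spare capacity overall but too many starved hosts.

    max_starved_fraction is a hypothetical threshold, not a tuned value.
    """
    if not free_slots_by_host:
        return False
    total_free, starved = fleet_summary(free_slots_by_host)
    starved_fraction = len(starved) / len(free_slots_by_host)
    return total_free > 0 and starved_fraction > max_starved_fraction


# Aggregate view: 12 free slots, which looks healthy.
# Per-host view: three of four hosts have nothing left to give.
skewed = {"host-a": 0, "host-b": 0, "host-c": 12, "host-d": 0}
even = {"host-a": 3, "host-b": 3, "host-c": 3, "host-d": 3}

print(fleet_summary(skewed), is_unbalanced(skewed))
print(fleet_summary(even), is_unbalanced(even))
```

Both fleets report the same total free capacity, so a check on the aggregate alone would treat them identically. Only the per-host view separates the two cases.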

The root cause was determined to be an alert that triggered during a short period when our on-call coverage was unavailable. The alert remained triggered even after full on-call coverage was restored.

We are going to review our on-call handoff process to remove the chance of this happening in the future.

Posted May 21, 2017 - 19:44 UTC

Resolved
Over the weekend, some Trusty build jobs were starved of resources and unable to be queued.
Posted May 21, 2017 - 19:40 UTC