The hosts serving DALL·E's Web Experience and the text-curie-001 API went offline while their nodes were being cycled as part of a planned Kubernetes version upgrade. The nodes failed to rejoin the cluster because a GPU diagnostics command in the boot sequence exceeded its timeout. We do not control this timeout or the boot script, as both are managed by our service provider. We did not anticipate the failure because the behavior is unique to a particular node type in a particular region.
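To illustrate the failure mode, here is a minimal sketch of a boot-time health gate that refuses to let a node register with the cluster unless GPU diagnostics finish within a fixed timeout. The command, timeout value, and abort behavior are assumptions for illustration; this is not our provider's actual boot script, which we do not have access to.

```python
import subprocess
import sys

# Hypothetical boot-time gate: the node only continues its bootstrap (and
# rejoins the cluster) if GPU diagnostics complete within a fixed timeout.
# The command and the 120-second limit are illustrative assumptions.
GPU_DIAGNOSTICS_CMD = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv"]
DIAGNOSTICS_TIMEOUT_SECONDS = 120

def gpu_diagnostics_pass() -> bool:
    try:
        result = subprocess.run(
            GPU_DIAGNOSTICS_CMD,
            timeout=DIAGNOSTICS_TIMEOUT_SECONDS,
            capture_output=True,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        # If diagnostics run longer than the timeout, the check fails even
        # though the GPUs may be healthy -- the failure mode described above.
        return False

if __name__ == "__main__":
    if not gpu_diagnostics_pass():
        # Aborting here means the node never rejoins the cluster.
        sys.exit("GPU diagnostics failed or timed out; not joining cluster")
    print("GPU diagnostics passed; continuing node bootstrap")
```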
text-curie-001 was quickly moved to an unaffected node and service was restored.
Due to the size of DALL·E's infrastructure and the limited spare capacity available, moving it to healthy nodes was not an option. The resulting loss of capacity degraded DALL·E service: the request queue grew long enough that most requests timed out before their image generations could be served.
During this incident, we introduced several levers for graceful load shedding when DALL·E receives more requests than it can support. To implement one of these levers, we ran a database migration. The migration stalled on unexpected row locks and had to be rolled back and retried. While it was in progress we were unable to serve DALL·E at all, which lengthened our recovery time.
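One common way to keep a schema change from stalling behind row locks is to set an explicit lock timeout and retry in short, bounded attempts rather than blocking indefinitely. The sketch below assumes a Postgres database accessed via psycopg2; the connection string, table, and column names are hypothetical, and this is not the migration we actually ran.

```python
import time
import psycopg2

DSN = "postgresql://user:pass@localhost/dalle"  # hypothetical connection string

# Hypothetical DDL for a load-shedding lever; the table and column names
# are placeholders, not our actual schema.
MIGRATION_SQL = "ALTER TABLE generations ADD COLUMN shed_priority integer DEFAULT 0"

def run_migration(max_attempts: int = 5) -> None:
    """Retry the migration in short, bounded attempts instead of waiting
    indefinitely on locks held by in-flight traffic."""
    for attempt in range(1, max_attempts + 1):
        conn = psycopg2.connect(DSN)
        try:
            with conn, conn.cursor() as cur:
                # Fail fast if the lock cannot be acquired within 5 seconds,
                # rather than stalling behind long-running transactions.
                cur.execute("SET LOCAL lock_timeout = '5s'")
                cur.execute(MIGRATION_SQL)
            print(f"migration succeeded on attempt {attempt}")
            return
        except psycopg2.errors.LockNotAvailable:
            # The failed attempt is rolled back; back off and retry.
            time.sleep(2 ** attempt)
        finally:
            conn.close()
    raise RuntimeError("migration could not acquire locks after retries")
```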
Moving forward, we are implementing additional levers for load shedding and investigating alternative ways of serving more requests within our capacity constraints. One such lever rejects inbound requests once the request queue grows past the length at which a new request would certainly time out before it could be served. Additionally, we are reconfiguring our nodes to give us full control over boot scripts, and adding procedures to check for unexpected inconsistencies before performing full node cycles.
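As a concrete illustration of the queue-length lever, the sketch below rejects a request at admission time when the expected wait already exceeds the client timeout. The threshold, timing values, and function names are assumptions for illustration, not our production implementation.

```python
from collections import deque

# Illustrative values; real thresholds would be tuned operationally.
CLIENT_TIMEOUT_SECONDS = 60.0   # how long a caller will wait for a result
AVG_GENERATION_SECONDS = 3.0    # rough time to serve one queued request

request_queue: deque = deque()

def admit_request(request) -> bool:
    """Shed load at the front door: if the current backlog would already push
    this request past the client timeout, reject it immediately instead of
    letting it queue up and time out later."""
    expected_wait = len(request_queue) * AVG_GENERATION_SECONDS
    if expected_wait >= CLIENT_TIMEOUT_SECONDS:
        return False  # reject: it would certainly time out before returning
    request_queue.append(request)
    return True
```

Rejecting at admission returns an immediate error to the caller instead of a slow timeout, and keeps the backlog short enough that the requests we do accept can still be served.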