Quay.io outage
Incident Report for Aptible
Postmortem

What happened?

  • Starting at 18:20 UTC, Quay.io became increasingly slow to respond to Docker Registry queries.
  • 18:20 UTC: A Redis Database provisioning operation failed. We started investigating immediately. The instance that was pulling the image was running an older version of Docker (1.6) than our current baseline (1.11), so we originally assumed that the issue might be on our end.
  • 18:36 UTC: A second Redis Database provisioning operation failed, on a different instance, running a different version of Docker. We turned our investigation towards the registry (the only shared component in this context) and identified that Quay was indeed unresponsive.
  • At this stage, Quay was successfully returning metadata when queried by the Docker daemon running on our instances, but timing out when returning layers. This is why Database provisioning operations would fail (Database image layers tend to not be cached locally because we have a large number of different database images), whereas App operations continued to succeed (those operations also need to pull from Quay, but there is a much smaller number of images, which are used a lot more frequently, so the layers tend to always be in cache).
  • 18:36 UTC: We decided to fail over to the Docker Hub, where all our Database images are replicated.
  • 18:46 UTC: We completed the failover to the Docker Hub for all our Database images.
  • 18:46 UTC: We turned our attention to other images hosted on Quay. We offer Aptible-managed base Docker images via Quay (e.g. quay.io/aptible/ruby), which some customers depend on for their builds (in other words, they use them in their Dockerfile FROM line). Most of our base images are mirrored on the Docker Hub, so customers could switch to them to let their builds succeed, but two of them (which are seldom used) were missing, so we copied them over.
  • 18:55 UTC: We identified that Quay was starting to time out on metadata queries as well. This was going to affect App operations. As a result, we decided to fail over to the Docker Hub for all our registry activity.
  • 19:07 UTC: We completed the failover to the Docker Hub for all registry activity.
  • 19:07 UTC - 21:35 UTC: We proactively reached out to all customers that were depending on a Quay-hosted base image to let them know they could switch to a mirror on the Docker Hub.
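
The failover described in the timeline above can be illustrated with a minimal sketch. This is not Aptible's orchestrator code: the function name pull_with_failover, the PullError exception, and the pull callback are all hypothetical, and the sketch assumes that the same image name exists on every registry in the list.

```python
from typing import Callable, List, Sequence


class PullError(Exception):
    """Raised when an image cannot be pulled from any configured registry."""


def pull_with_failover(
    image: str,
    registries: Sequence[str],
    pull: Callable[[str], None],
) -> str:
    """Try each registry in order and return the first reference that pulls.

    `pull` is a hypothetical callback that pulls one image reference and
    raises on failure (for example, on a metadata or layer timeout).
    """
    errors: List[str] = []
    for registry in registries:
        ref = f"{registry}/{image}"
        try:
            pull(ref)
            return ref
        except Exception as exc:  # any failure moves on to the next registry
            errors.append(f"{ref}: {exc}")
    raise PullError("; ".join(errors))
```

With registries set to ["quay.io", "docker.io"], a timeout on quay.io/aptible/ruby would cause the sketch to retry the pull as docker.io/aptible/ruby. Reordering the list switches the preferred registry in either direction.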

By the numbers:

  • Database Operations failed (due to the outage affecting Quay registry layers): 2.
    • They were both retried successfully by our team within 30 minutes.
  • App Operations failed (due to the outage affecting Quay registry metadata): 3.
    • These were retried by the customers a few minutes later.
  • Customers proactively contacted when their build failed: 16.

Next steps:

  • We've made improvements to our image replication process to ensure all images are replicated to the Docker Hub. This was completed on Friday, July 28, 2017.
  • We've made improvements to our orchestrator code so that switching over from Quay to the Docker Hub (and vice versa) is even faster.
Posted Jul 31, 2017 - 15:51 EDT

Resolved
Quay.io reports their registry is back up.

Note: Enclave itself has been fully operational since we completed our failover to the Docker Hub several hours ago.
Posted Jul 26, 2017 - 20:32 EDT
Monitoring
We have completed failover to the Docker Hub for all our Database and Proxy Images. Database provisioning should complete successfully going forward.

If you are still experiencing an issue deploying, the likely cause is that your build depends on a base image hosted on Quay. Note that most Aptible base images are mirrored on the Docker Hub, so if you are using one of those, you can update your Dockerfile to reference the mirror (just remove the quay.io/ prefix).

Check here for the list of mirrored Aptible base images: https://hub.docker.com/u/aptible/
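
For customers with several Dockerfiles, removing the quay.io/ prefix can be scripted. The following is a minimal sketch, not an Aptible-provided tool: the function name use_docker_hub_mirror is hypothetical, and it only rewrites FROM lines that point at quay.io/aptible/ images. It does not check that a mirror actually exists on the Docker Hub, so verify against the list linked above before relying on the result.

```python
import re

# Matches a Dockerfile FROM instruction (with optional --flags) whose image
# lives under quay.io/aptible/. Other registries and namespaces are untouched.
_FROM_QUAY_APTIBLE = re.compile(
    r"^([ \t]*FROM[ \t]+(?:--\S+[ \t]+)*)quay\.io/(aptible/\S+)",
    re.IGNORECASE | re.MULTILINE,
)


def use_docker_hub_mirror(dockerfile_text: str) -> str:
    """Return the Dockerfile text with the quay.io/ prefix dropped from
    Aptible base images, so they resolve to the Docker Hub instead."""
    return _FROM_QUAY_APTIBLE.sub(r"\1\2", dockerfile_text)


if __name__ == "__main__":
    # The image tag below is a hypothetical example.
    print(use_docker_hub_mirror("FROM quay.io/aptible/ruby:2.3\n"), end="")
```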
Posted Jul 26, 2017 - 15:19 EDT
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 26, 2017 - 14:46 EDT
Investigating
Quay.io (a Docker registry) is experiencing issues: http://status.quay.io/

As a result, builds that depend on images hosted on Quay may take a long time or fail. Note that most Aptible base images are mirrored on the Docker Hub, so if you are using one of those, you can update your Dockerfile to reference the mirror (just remove the quay.io/ prefix). Check here for the list of mirrored images: https://hub.docker.com/u/aptible/

We also rely on Quay for hosting our Database and Proxy images. We are failing over to the Docker Hub now, but provisioning new databases or restarting apps may fail while we complete the failover (however, this will *not* cause downtime on your apps whatsoever).
Posted Jul 26, 2017 - 14:46 EDT