EC2 Host Failure

Incident Report for Aptible

Resolved

This issue is completely resolved except for 4 customer databases whose EBS volumes continue to be inaccessible to any API action (e.g., CreateSnapshot, DetachVolume, AttachVolume). We have reached out to each of these customers with recommended steps to get back online as quickly as possible. If you have not been directly contacted by Aptible, you should no longer be affected by this incident.

Posted Aug 31, 2019 - 19:40 EDT

Update

At this point, the only remaining affected resources are a handful of databases whose EBS volumes remain inaccessible due to the underlying AWS failure. We will continue working to restore these databases and will update this page once the incident is completely resolved.

Posted Aug 31, 2019 - 14:22 EDT

Update

At this point, all apps have been successfully recovered. AWS has deemed some databases' EBS volumes safe to detach and re-attach, and we have successfully restarted those databases. For the remaining databases, we continue to wait until it is safe to touch the associated EBS volumes before completing recovery.

Posted Aug 31, 2019 - 11:57 EDT

Update

We are in the process of restarting affected apps, as well as replacing affected bastion (`aptible ssh`) instances. We are unable to proceed with restarting databases at this point, due to the way the underlying AWS issue affects the associated EBS volumes. As soon as AWS confirms that it is safe to detach these EBS volumes, we will proceed with recovering affected databases.

Posted Aug 31, 2019 - 10:45 EDT

Identified

AWS has confirmed that some instances in a single AZ in us-east-1 are down. We are working on migrating impacted apps and databases to healthy hosts.

Posted Aug 31, 2019 - 10:30 EDT

Investigating

We are investigating an EC2 dedicated host failure affecting a small number of apps and databases. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more are automatically distributed across availability zones and will automatically fail over).

Posted Aug 31, 2019 - 08:52 EDT