EC2 Host Failure
Incident Report for Aptible
Resolved
This issue is completely resolved except for 4 customer databases whose EBS volumes continue to be inaccessible (for any API action, e.g. CreateSnapshot, DetachVolume, AttachVolume). We have reached out to each of these customers with recommended steps to get back online as quickly as possible. If you have not been directly contacted by Aptible, you should no longer be affected by this incident.
Posted 21 days ago. Aug 31, 2019 - 19:40 EDT
Update
At this point, the only remaining affected resources are a handful of databases whose EBS volumes remain inaccessible due to the underlying AWS failure. We are continuing to work to restore these databases, and will update this page once the incident is completely resolved.
Posted 22 days ago. Aug 31, 2019 - 14:22 EDT
Update
At this point, all apps have been successfully recovered. Some databases' EBS volumes have been deemed safe to detach and re-attach by AWS, and we have successfully restarted those databases. For the remaining databases, we continue to wait until it is safe to touch the associated EBS volumes before completing recovery.
Posted 22 days ago. Aug 31, 2019 - 11:57 EDT
Update
We are in the process of restarting affected apps, as well as replacing affected bastion (`aptible ssh`) instances. We are unable to proceed with restarting databases at this point, due to the nature of how the underlying AWS issue affects the associated EBS volumes. As soon as we receive confirmation from AWS that it is safe to detach these EBS volumes, we will proceed with recovering affected databases.
Posted 22 days ago. Aug 31, 2019 - 10:45 EDT
Identified
AWS has confirmed that some instances in a single AZ in us-east-1 are down. We are working on migrating impacted apps and databases to healthy hosts.
Posted 22 days ago. Aug 31, 2019 - 10:30 EDT
Investigating
We are investigating an EC2 dedicated host failure affecting a small number of apps and databases. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more containers are automatically distributed across availability zones and will automatically fail over).
Posted 22 days ago. Aug 31, 2019 - 08:52 EDT