EC2 Host Failure

Incident Report for Aptible

Resolved

This incident has been resolved.

Posted Dec 23, 2021 - 09:17 EST

Monitoring

At this time, only 3 databases remain impacted by the outage. The EBS volume associated with each of these databases does not currently support `DetachVolume` or `CreateSnapshot`, which limits our ability to restore them. We have reached out and will continue coordinating 1:1 with the affected customers on the possible options:

- Restore from a prior backup (< 24 hours old)
- Switch to a replica database
- Wait for AWS to fully resolve the outage so that these volumes become operational again

Posted Dec 22, 2021 - 13:22 EST

Update

At this time, the majority of affected resources have been restored. For databases that have not yet been restored:

* The outage prevents us from stopping or starting the EC2 instance on which the database is running, or (in most cases) creating a new instance.
* The outage prevents us from accessing the EBS volume associated with the database.
* We have just become able to take backups of the affected databases (via EBS volume snapshot), and we are now doing so for all of them. If the backups complete before AWS makes further progress resolving the issue, we will attempt to restore each database from its up-to-date backup onto an existing EC2 instance.

If you have a replica for an affected PostgreSQL or Redis database, you can promote that replica to accept writes, and update your apps (via `aptible config:set`) to connect to the replica instead of the primary.

* On PostgreSQL 9.4 through 11: Run `COPY (SELECT 'fast') TO '/var/db/pgsql.trigger';` in a psql session connected to the database (via `aptible db:tunnel`)
* On PostgreSQL 12+: Run `SELECT pg_promote();` in a psql session connected to the database (via `aptible db:tunnel`)
* On Redis: Run `SLAVEOF NO ONE` in a redis-cli session connected to the database (via `aptible db:tunnel`)
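
The choice of promotion command in the list above can be summarized as a small lookup. This is only a sketch: `parse_version` and `promotion_command` are hypothetical helper names and are not part of the Aptible CLI, and the command strings are copied verbatim from the list above. It assumes the version is given as a string such as `9.6` or `12.3`.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Turn a version string such as '9.6' or '12' into a (major, minor) tuple."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def promotion_command(engine: str, version: str = "") -> str:
    """Return the command to run on a replica to promote it (hypothetical helper)."""
    engine = engine.lower()
    if engine == "redis":
        # Run inside a redis-cli session opened via `aptible db:tunnel`.
        return "SLAVEOF NO ONE"
    if engine == "postgresql":
        parsed = parse_version(version)
        if parsed >= (12, 0):
            # PostgreSQL 12+ exposes a built-in promotion function.
            return "SELECT pg_promote();"
        if parsed >= (9, 4):
            # PostgreSQL 9.4 through 11: write the trigger file instead.
            return "COPY (SELECT 'fast') TO '/var/db/pgsql.trigger';"
        raise ValueError(f"no promotion instructions for PostgreSQL {version}")
    raise ValueError(f"no promotion instructions for engine {engine!r}")


if __name__ == "__main__":
    # Example: print the command for a PostgreSQL 11 replica.
    print(promotion_command("postgresql", "11"))
```

After promoting the replica, the apps still need to be pointed at it with `aptible config:set`, as described above.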

If you have an affected database for which you don't need the data (e.g. a Redis database used for caching), we recommend creating a _new_ database and updating your apps to connect to this new database.

Posted Dec 22, 2021 - 10:12 EST

Identified

The root cause of the failure is a network connectivity issue at AWS. You can also visit the AWS status page for additional updates and information: https://status.aws.amazon.com/

Posted Dec 22, 2021 - 07:53 EST

Update

We are continuing to investigate this issue.

Posted Dec 22, 2021 - 07:16 EST

Investigating

We are investigating an EC2 dedicated host failure affecting a small number of apps and databases. Affected apps are currently being restarted on healthy instances (apps scaled to 2 or more containers are automatically distributed across availability zones and will fail over automatically).

Posted Dec 22, 2021 - 07:16 EST

This incident affected: Aptible Deploy.