AWS Outage

Incident Report for Aptible

Postmortem

AWS has identified the root cause of the Endpoint unavailability:

Between 2:26 PM and 3:04 PM PDT (9:26 PM to 10:04 PM UTC) we experienced increased packet loss for traffic destined to public endpoints in the US-EAST-1 Region, which affected Internet and public Direct Connect connectivity for endpoints in the US-EAST-1 Region.

This is, unfortunately, essentially the same impact we've seen in two previous incidents, although AWS's description of the cause is slightly different:

October 15th, 2022: https://status.aptible.com/incidents/grf6gdrrszf9

Between 12:20 AM and 11:28 AM PDT, we experienced intermittent failures in Route53 Health Checks impacting Target Health evaluation in US-EAST-1. The issue has been resolved and the service is operating normally.

September 27th, 2021: (Only a couple of Endpoints were impacted, so no incident was created)

On September 27, 2021, between 8:45 AM and 2:09 PM PDT, Route53 experienced increased change propagation times for Health Check edits where unexpected failover to their secondary application load balancer (ALB) occurred despite their primary ALB targets being healthy. The issue has been resolved and the service is operating normally.

While AWS describes these incidents as "increased change propagation times", "intermittent failures", and "increased packet loss", and apparently does not consider them significant enough to post to https://status.aws.amazon.com, the observed impact to our customers is very clear: the impacted Endpoints are totally unreachable for a period.

As such, we will make permanent the "temporary" change we made on October 15th: we will disable the Route53 health checks (and the associated custom error page) for all Endpoints, as these health checks have been the root cause of these availability incidents.

As we indicated to customers during the Oct 15th and Nov 2nd incidents, you may restart any App in order to immediately disable the Route53 health check. Any App which has been deployed, restarted, or scaled since October 15th will already have it disabled, and we will make another announcement when we intend to disable it globally on all Apps for which it remains enabled.

Posted Nov 03, 2022 - 16:37 EDT

Resolved

We no longer see any impact, and will continue investigating for an RCA.

Posted Nov 02, 2022 - 18:28 EDT

Monitoring

Based on a random sampling of Endpoints, and on the Endpoints that were reported as affected, we are no longer seeing any impact. We will continue to monitor the situation.

Posted Nov 02, 2022 - 18:18 EDT

Update

We're seeing many Endpoints recover without action being taken, so we're looking into ways to identify Endpoints that remain impacted so that we can efficiently fix them.

Restarting known impacted Apps remains the quickest solution that we know of.

Posted Nov 02, 2022 - 18:08 EDT

Identified

We've observed that running `aptible restart --app $handle` can resolve the underlying issue with the ELB, and recommend restarting any of your impacted Apps at this time.
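If several Apps are impacted, the restart command above can be wrapped in a small loop. The following is only a sketch: the App handles passed on the command line are placeholders, and `APTIBLE_BIN` is a hypothetical variable introduced here so the loop can be dry-run (for example with `echo`) before touching real Apps. It is not part of the Aptible CLI.

```shell
#!/bin/sh
# Sketch: restart each App handle given as an argument, one at a time.
# APTIBLE_BIN is a hypothetical override for this script only; it defaults
# to the real `aptible` CLI.
restart_apps() {
  for handle in "$@"; do
    echo "Restarting $handle"
    # A failed restart is reported on stderr, and the loop moves on to the next App.
    "${APTIBLE_BIN:-aptible}" restart --app "$handle" \
      || echo "Restart failed for $handle" >&2
  done
}

# Restart whatever handles were passed to the script (none means no action).
restart_apps "$@"
```

Assuming the script is saved as `restart-impacted.sh` (a name chosen for illustration), `APTIBLE_BIN=echo sh restart-impacted.sh app-one app-two` prints the commands that would run without restarting anything.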

Posted Nov 02, 2022 - 17:54 EDT

Investigating

We are currently investigating a large number of unreachable ELBs in AWS's us-east-1 region. We are waiting for acknowledgement from AWS and trying to narrow the scope of the failures in order to provide failover or workarounds if possible.

Posted Nov 02, 2022 - 17:45 EDT

This incident affected: Aptible Deploy.