AWS Outage: 10 Incidents That Affected the us-east-1 Region

Lose a single cloud region, and the whole internet feels deserted…

The image is probably exaggerated, but it illustrates the degree of centralization that still characterizes AWS. Its flagship region, us-east-1, which is also its oldest, stands out as a potential weak point, if only because of the number of services whose control planes live there. Earlier this week, a worldwide incident rooted in an internal DNS issue brought this into sharp relief.

This incident is by no means the first to center on us-east-1. AWS itself maintains a public catalog of post-mortems for problems it regards as having had “a significant customer impact.”

April 2011: EBS spirals into an infinite loop

EC2 experienced disruptions largely tied to a subset of EBS volumes within a single Availability Zone (AZ), which stopped accepting read and write operations.

The root cause lay in a change intended to increase the capacity of the primary network — the one that carries communication between EBS nodes, EC2 instances, and the control plane.

The traffic shift was carried out incorrectly. Instead of failing over to a secondary router on the same network, all traffic was routed to the secondary network, which is dedicated to replication and has lower capacity. It quickly became saturated.

Some nodes ended up cut off from both networks and lost contact with their replicas. Upon reestablishing connectivity, they triggered mass remirroring, quickly overwhelming the cluster and driving it into an infinite loop.

The cluster could no longer respond to API requests to create volumes. Since that API had a long timeout, calls piled up and exhausted the control plane threads. The control plane eventually had to shift processing to another AZ.
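
To see why the long timeout matters, here is a minimal sketch of the mechanism with entirely hypothetical numbers and names (POOL_SIZE, create_volume): a fixed-size pool of worker threads serving calls that block until a distant timeout fills up long before the backlog stops growing.

```python
import concurrent.futures
import threading
import time

# Hypothetical figures: a small control-plane thread pool serving an API whose
# calls block until a long timeout expires because the EBS cluster never answers.
POOL_SIZE = 8
CALL_TIMEOUT_S = 2.0       # stands in for a much longer create-volume timeout
ARRIVAL_INTERVAL_S = 0.05  # new calls arrive far faster than old ones finish

busy = 0
done = 0
lock = threading.Lock()

def create_volume(request_id: int) -> None:
    """Simulates a create-volume call that hangs until its timeout expires."""
    global busy, done
    with lock:
        busy += 1
    time.sleep(CALL_TIMEOUT_S)   # the degraded cluster never replies in time
    with lock:
        busy -= 1
        done += 1

if __name__ == "__main__":
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=POOL_SIZE)
    for request_id in range(30):
        pool.submit(create_volume, request_id)
        time.sleep(ARRIVAL_INTERVAL_S)
        with lock:
            running, waiting = busy, request_id + 1 - busy - done
        print(f"submitted={request_id + 1:2d}  busy_threads={running}  waiting={waiting}")
    pool.shutdown()
```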

There was also a race condition in the EBS node code that made failures more likely when a large number of replication requests were shut down concurrently.

Against this backdrop, the negotiation traffic between EC2 instances and EBS volumes to determine the primary replica exploded, further straining the control plane.

June 2012: EC2, EBS, ELB, and RDS rocked by a thunderstorm

Following a power surge caused by a thunderstorm, two data centers serving the same AZ switched to generator power. The procedure failed in one of them because voltage stabilization could not be achieved. Servers ran on uninterruptible power until reserves were exhausted. The sequence occurred twice, as power briefly returned in between.

The troubled data center housed only a small portion of the region’s resources (roughly 7% of EC2 instances, for example). Yet the impact was significant for some customers, on two fronts: the unavailability of instances and volumes (restricted to the affected AZ), and degraded access to control planes across the us-east-1 region.

The EC2 recovery was slowed by a bottleneck at server startup. EBS faced a similar delay due to the difficulty in shifting rapidly to a new primary datastore.

The ELB service was also disrupted. After power returned, a bug led the control plane to try to scale the underlying instances. The resulting surge of requests overwhelmed the control plane, compounded by EC2 instance launch requests from customers and by the fact that the control plane relied on a shared regional queue.
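
The role of that shared queue can be pictured with a rough sketch (item names, counts, and processing rates are invented): once a flood of scaling work lands in a single regional queue, unrelated customer requests simply wait behind it.

```python
from collections import deque

# A single shared work queue: ELB scaling work created by the bug sits in front
# of ordinary customer requests, which wait even though they are unrelated.
WORKERS_PER_TICK = 50                        # control-plane capacity per tick

queue: deque[str] = deque()
queue.extend(["elb-scale"] * 2_000)          # backlog created by the scaling bug
queue.extend(["customer-run-instances"] * 200)

tick = 0
while any(item == "customer-run-instances" for item in queue):
    for _ in range(min(WORKERS_PER_TICK, len(queue))):
        queue.popleft()
    tick += 1

print(f"last customer request served after {tick} ticks "
      f"(about {200 // WORKERS_PER_TICK} ticks if the queue had been empty)")
```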

RDS was heavily dependent on the recovery of EBS. A software bug also hindered failover on certain multi-AZ configurations.

October 2012: the EBS collection agent hits the wrong address

EBS nodes started encountering problems in an AZ due to a bug in the data-collection agent used for maintenance.

The week before, one of the servers hosting this data had been replaced after a hardware failure, and the corresponding DNS record was updated. The update did not propagate correctly, so some agent instances kept trying to reach the old address.

The collection service tolerates missing data, so the issue was not immediately visible, until the agents’ repeated connection attempts began saturating the memory of the EBS servers. Eventually those servers could no longer respond to client requests, and because so many failed at once, there were not enough spares to take their place.
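
The dynamic can be reduced to a minimal sketch (the host name, report size, and intervals are invented): an agent that keeps buffering reports for an address that no longer answers, and never drops or caps its backlog, grows its memory footprint with every collection interval.

```python
import time
from collections import deque

# Hypothetical names and sizes: a collection agent buffers reports for a stale
# address and retries forever instead of dropping data or capping its backlog.
STALE_COLLECTOR = "collector-old.internal.example"   # DNS name of the replaced server
REPORT_SIZE_BYTES = 64 * 1024                        # one report per interval

pending: deque[bytes] = deque()                      # grows while retries fail

def send_to_collector(host: str, report: bytes) -> None:
    """Stands in for the real network call; the old host never answers."""
    raise ConnectionError(f"no route to {host}")

def try_flush() -> None:
    """Attempts to deliver every pending report; keeps them all on failure."""
    try:
        while pending:
            send_to_collector(STALE_COLLECTOR, pending[0])
            pending.popleft()
    except ConnectionError:
        pass                                         # delivery failed: everything stays queued

if __name__ == "__main__":
    for tick in range(20):
        pending.append(b"\0" * REPORT_SIZE_BYTES)
        try_flush()
        print(f"tick={tick:2d}  buffered={len(pending) * REPORT_SIZE_BYTES // 1024} KiB")
        time.sleep(0.05)
```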

The incident made it difficult to use management APIs. Moreover, the rate-limiting policy AWS put in place to ensure stability during the recovery phase was too aggressive.

June 2014: SimpleDB (almost) unreachable

For two hours, almost no API calls to this distributed datastore succeeded. It took additional time for everything to return to normal, particularly for creating and deleting domains.

The trigger was a power outage: several storage nodes became unavailable, and when they rebooted, the load on SimpleDB’s internal locking service surged.

This service determines which node set is responsible for a given domain. Each node regularly checks in to confirm it retains its responsibilities.

With the surge in load, latency increased and nodes could not complete the handshake before timeouts. They ended up ejecting themselves from the cluster. They could rejoin only with authorization from the metadata nodes — which were, unfortunately, also unavailable.
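
The check-in pattern described above can be illustrated with a minimal sketch (all timings and function names are hypothetical): a node that does not get its confirmation back before the deadline treats the silence as a lost claim and leaves the cluster on its own.

```python
import random
import time

# Hypothetical timings for the lock-service check-in described above.
CHECKIN_DEADLINE_S = 0.05     # how long a node waits for its confirmation
NORMAL_LATENCY_S = 0.01
OVERLOADED_LATENCY_S = 0.20   # latency once the failed-over load hits the service

def lock_service_confirm(overloaded: bool) -> float:
    """Returns the time the lock service takes to confirm a check-in."""
    base = OVERLOADED_LATENCY_S if overloaded else NORMAL_LATENCY_S
    return base * random.uniform(0.8, 1.2)

def check_in(node_id: int, overloaded: bool) -> bool:
    """Renews the node's responsibilities; returns False if the node ejects itself."""
    latency = lock_service_confirm(overloaded)
    time.sleep(latency)
    if latency > CHECKIN_DEADLINE_S:
        print(f"node {node_id}: confirmation took {latency * 1000:.0f} ms, "
              f"deadline is {CHECKIN_DEADLINE_S * 1000:.0f} ms -> leaving the cluster")
        return False
    return True

if __name__ == "__main__":
    members = list(range(5))
    for phase, overloaded in (("normal load", False), ("post-outage surge", True)):
        members = [n for n in members if check_in(n, overloaded)]
        print(f"{phase}: {len(members)} node(s) still in the cluster")
```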

September 2015: DynamoDB overwhelmed by global secondary indexes

In September 2015, global secondary indexes (GSI) were still relatively new on DynamoDB. They enable access to tables via alternative keys.

DynamoDB tables are partitioned and distributed across servers. The assignment of a group of partitions to a server is called a membership. It is managed by an internal metadata service that the storage servers periodically query — for instance after a network interruption.

One such interruption occurred during the incident. When the storage servers subsequently queried the metadata service, however, a portion of its responses exceeded the acceptable time limits. Consequently, the affected servers stopped accepting requests.

The adoption of GSIs added substantial load. They have their own partition sets, increasing the size of membership information. As a result, processing times for some queries exceeded timeouts. The effect worsened as a large number of data servers simultaneously issued requests, sustaining a high load.
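
A back-of-envelope sketch shows how this tips over (every number below is invented, not AWS’s): the size of a membership answer grows with each GSI’s own partition set, while the storage servers’ retrieval deadline stays fixed.

```python
# Invented figures: membership-response size versus a fixed retrieval deadline.
RETRIEVAL_DEADLINE_S = 2.5          # hypothetical timeout on the storage servers
SERVICE_RATE_BYTES_PER_S = 400_000  # hypothetical metadata-service throughput

def membership_bytes(table_partitions: int, gsi_count: int) -> int:
    """Size of the membership answer: base partitions plus one set per GSI."""
    per_partition = 200                          # hypothetical bytes per entry
    return table_partitions * (1 + gsi_count) * per_partition

for gsi_count in range(6):
    size = membership_bytes(table_partitions=2_000, gsi_count=gsi_count)
    response_time = size / SERVICE_RATE_BYTES_PER_S
    verdict = "ok" if response_time <= RETRIEVAL_DEADLINE_S else "TIMEOUT"
    print(f"GSIs={gsi_count}  membership={size / 1e6:.1f} MB  "
          f"response={response_time:.2f} s  -> {verdict}")
```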

AWS had to pause requests to the metadata service to buy time to add capacity (the sustained load was preventing administrative requests from being processed).

SQS, which relies on an internal DynamoDB table describing queues, was affected by the inability to refresh its cache properly. API errors also proliferated on EC2 Auto Scaling, which likewise uses DynamoDB (for group information and launch configurations). CloudWatch was not spared, particularly on the metrics side.

February 2017: S3 brought down by a debugging error

The problem began with a debugging operation aimed at diagnosing a slowdown in S3’s billing subsystem.

The command was meant to remove a handful of servers from one subsystem, but a faulty input removed far more servers than intended and affected two other subsystems. The first, the index subsystem, managed the metadata and location information for every object in the us-east-1 region. The second, the placement subsystem, handled the allocation of new storage space.
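
The failure mode fits in a few lines (subsystem names, fleet sizes, and minimum capacities are all invented): a removal command with no sanity check will happily take a subsystem below the capacity it needs to keep serving requests.

```python
# Invented fleets and thresholds: a capacity-removal command with no guard rail.
MIN_HEALTHY = {"billing": 20, "index": 300, "placement": 150}

fleet = {
    "billing":   [f"billing-{i}" for i in range(40)],
    "index":     [f"index-{i}" for i in range(400)],
    "placement": [f"placement-{i}" for i in range(200)],
}

def remove_servers(subsystem: str, count: int) -> None:
    """Removes `count` servers from a subsystem without any sanity check."""
    for _ in range(min(count, len(fleet[subsystem]))):
        fleet[subsystem].pop()
    left = len(fleet[subsystem])
    status = "OK" if left >= MIN_HEALTHY[subsystem] else "BELOW MINIMUM CAPACITY"
    print(f"{subsystem}: {left} server(s) left -> {status}")

# Intended operation: take a handful of billing servers out of service.
remove_servers("billing", 4)

# Mistyped input: the removal lands on a much larger set of index servers.
remove_servers("index", 350)
```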

Recovery required restarting both subsystems, which had not been rebooted in years, so the integrity checks took longer than expected.

During the restart, S3 could not respond to requests, and other regional services were affected while its API was down, among them the launching of new EC2 instances and the creation of EBS volumes from snapshots.

November 2020: Kinesis hits its thread ceiling

The trigger was adding capacity on the front-end fleet.

This fleet of servers routes traffic to the back-end. It handles authentication and throttling, while maintaining membership information (shard-map), of which each front-end server stores a copy.

These details are retrieved via calls to a microservice and by reading data from a DynamoDB table. They are also continuously updated by processing messages from the other front-end servers, with one dedicated thread per peer server.

The errors were not caused by the added capacity as such. The root cause was that the larger fleet pushed every front-end server past the maximum number of threads allowed by the operating system. Caches could no longer be built and shard maps went stale, preventing requests from being routed correctly.
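
The arithmetic behind the ceiling is easy to sketch (the limit and overhead figures are invented): with one propagation thread per peer, each host’s thread count grows linearly with the size of the fleet, so adding servers can push every host past the operating-system limit at the same moment.

```python
# Invented figures: per-host thread count as a function of fleet size.
OS_THREAD_LIMIT = 4096        # hypothetical per-process limit
OTHER_THREADS_PER_HOST = 600  # request handling, cache building, etc.

def threads_needed(fleet_size: int) -> int:
    """Threads one front-end host needs: one per peer, plus everything else."""
    return (fleet_size - 1) + OTHER_THREADS_PER_HOST

for fleet_size in (3_000, 3_400, 3_497, 3_498, 3_600):
    needed = threads_needed(fleet_size)
    verdict = "ok" if needed <= OS_THREAD_LIMIT else "over the OS limit"
    print(f"fleet={fleet_size:5d}  threads per host={needed:5d}  -> {verdict}")
```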

Due to concurrency constraints, only a few hundred servers could be restarted per hour, while the fleet numbered in the thousands. Resource contention arose between the processes populating the shard map and those handling incoming requests; restarting too quickly risked conflicts and further failures.

Among the affected services was Cognito, which uses Kinesis Data Streams to collect and analyze API access patterns. CloudWatch, for logs and metrics, was also impacted, with cascading effects on Lambda (invoking functions relied on metrics published to CloudWatch).

December 2021: the internal network overheats

AWS relies on an internal network to host essential services like monitoring, DNS, authorization, and part of the EC2 control plane. Gateways connect to the main network where most of Amazon’s cloud services and customer applications operate.

Auto-scaling of a service on the main network triggered turbulence on the internal network. A spike in connections overburdened the gateways, increasing latency and errors. Repeated connection attempts compounded the congestion.
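
A toy congestion model (all numbers invented) shows how immediate retries compound the problem: every failed attempt comes back on the next tick, so the load offered to the gateways keeps climbing even though the underlying spike has stopped growing.

```python
# Invented figures: retried failures pile on top of new connection attempts.
GATEWAY_CAPACITY = 10_000   # connections the gateways can absorb per tick
BASELINE_NEW = 6_000        # normal new connections per tick
SPIKE_NEW = 12_000          # new connections per tick during the scaling event

pending_retries = 0
for tick in range(8):
    new = SPIKE_NEW if tick >= 2 else BASELINE_NEW
    offered = new + pending_retries
    served = min(offered, GATEWAY_CAPACITY)
    failed = offered - served
    pending_retries = failed   # every failure is retried on the next tick
    print(f"tick={tick}  offered={offered:6d}  served={served:6d}  failed={failed:6d}")
```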

This congestion limited monitoring capabilities and thus the ability to resolve the issue. DNS traffic redirection helped, but did not solve everything. The delay grew as AWS’s internal deployment systems were also affected.

Among the affected services were the EC2 APIs used to launch and describe instances. RDS, EMR, and WorkSpaces were impacted by extension. Route 53 APIs were also hit (inability to modify records). On STS, latency increased for token federation with third-party identity providers. Access to S3 buckets and DynamoDB tables via VPC endpoints was also disrupted.

June 2023: the Lambda front-end scales into uncharted territory

Lambda operates as cells composed of subsystems. In each cell, the front-end handles invocations and routing while a handler provides an execution environment.

In June 2023, the front-end scaled up to absorb a traffic spike and reached a capacity level never seen before in a single cell. That was enough to trigger a software issue: the execution environments being allocated were not fully utilized by the front-end, causing invocation errors to multiply.

The incident affected the AWS Console in us-east-1. It also impacted STS (notably SAML federation error rates), EKS (cluster provisioning), and EventBridge (routing to Lambda exhibited up to about 801 seconds of latency).

July 2024: Kinesis disrupted by a workload profile

Kinesis Data Streams uses a cell-based architecture similar to Lambda’s. The problem struck a cell dedicated to internal AWS services rather than customer-facing workloads.

As part of an architectural modernization effort, a new management system had been introduced. It failed to cope with the particular workload profile of that cell: a very large number of shards, each operating at very low throughput.

These shards were not well balanced across hosts. The few nodes that inherited the “big chunks” began sending large status messages that could not be processed within the allowed time. The management system, interpreting those messages as signs of potential node faults, started redistributing the shards. This triggered a spike of activity that overwhelmed another component used to establish secure connections to data-plane subsystems, and the processing of Kinesis traffic suffered as a result.
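
The misdiagnosis can be sketched in a few lines (shard counts, per-shard cost, and the deadline are invented): if the time to process a status message grows with the number of shards it covers, the few nodes that inherited the big chunks blow the deadline and look faulty even though they are healthy.

```python
# Invented figures: status-message processing time versus a fixed deadline.
PROCESSING_DEADLINE_S = 1.0
SECONDS_PER_SHARD_ENTRY = 0.0004   # cost of handling one shard's status

# An imbalanced cell: most nodes hold few shards, two inherited huge chunks.
shards_per_node = {f"node-{i}": 500 for i in range(10)}
shards_per_node.update({"node-10": 4_000, "node-11": 5_500})

suspected = []
for node, shard_count in shards_per_node.items():
    processing_time = shard_count * SECONDS_PER_SHARD_ENTRY
    if processing_time > PROCESSING_DEADLINE_S:
        suspected.append(node)
        print(f"{node}: status message takes {processing_time:.2f} s -> flagged as faulty")

print(f"{len(suspected)} healthy node(s) flagged; redistributing their shards adds "
      f"churn to an already loaded cell")
```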

Among the affected services were CloudWatch Logs (which uses Kinesis as a buffer), the delivery of events into S3, and PutRecord and PutRecordBatch API calls via Firehose. Cascading effects were observed in ECS, Lambda, Glue, and Redshift.

Dawn Liphardt

I'm Dawn Liphardt, the founder and lead writer of this publication. With a background in philosophy and a deep interest in the social impact of technology, I started this platform to explore how innovation shapes — and sometimes disrupts — the world we live in. My work focuses on critical, human-centered storytelling at the frontier of artificial intelligence and emerging tech.