Post-Incident Analysis: AWS US-EAST-1 Outage (October 20, 2025)
1. Incident Overview and Scope of Impact
The major AWS service disruption of October 20, 2025, centered on the US-EAST-1 region, began late on the night of October 19 and lasted nearly 15 hours, causing widespread issues for dependent services and applications globally.
2. Root Cause and Cascading Failure
AWS identified the event as a complex, cascading failure chain originating from internal systems within the US-EAST-1 region.
The Chain of Events
Initial Trigger (DNS Failure): The outage was triggered by a DNS resolution failure affecting the regional API endpoint for DynamoDB, which prevented dependent services and applications from locating the DynamoDB service (a simple resolution check is sketched after the summary below).
EC2 Internal Impairment: Even after the DynamoDB DNS issue was resolved, the EC2 internal subsystem responsible for provisioning and launching new virtual machines remained impaired because of its reliance on DynamoDB for essential metadata. This prevented new EC2 instances (and services such as ECS and Fargate that rely on them) from starting.
Network Congestion and NLB Failure: As services retried failed requests, a load storm ensued, causing Network Load Balancer (NLB) health-check failures that further degraded network connectivity across critical services such as Lambda and CloudWatch.
In summary, a seemingly isolated DNS issue for a core database service rapidly caused subsequent failures in compute provisioning and internal networking, paralyzing control plane operations across the region and beyond.
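To make the initial trigger concrete, here is a minimal, hypothetical pre-flight check an application team could run against the regional DynamoDB endpoint: it only asks whether the hostname still resolves, retrying briefly before flagging a problem. The endpoint name follows AWS's standard regional naming; the retry counts and the alerting action are illustrative assumptions, not AWS tooling.

```python
import socket
import time

# Standard regional endpoint name for DynamoDB in US-EAST-1.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def endpoint_resolves(hostname: str, attempts: int = 3, delay: float = 2.0) -> bool:
    """Return True if DNS resolution succeeds within a few retries."""
    for attempt in range(attempts):
        try:
            socket.getaddrinfo(hostname, 443)  # resolve only; no request is sent
            return True
        except socket.gaierror:
            # Resolution failed; back off briefly before the next attempt.
            time.sleep(delay * (attempt + 1))
    return False

if __name__ == "__main__":
    if not endpoint_resolves(ENDPOINT):
        # In a real deployment this would raise an alarm or trigger failover
        # to a replica table in another Region.
        print(f"DNS resolution is failing for {ENDPOINT}")
```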
3. Troubleshooting and Resolution
The resolution involved a multi-stage process of isolation, throttling, and recovery.
DNS Fix: AWS engineers quickly corrected the DynamoDB DNS resolution issue.
Throttling: To keep the cascading failures from worsening and to stabilize the internal network, AWS took the critical step of throttling certain operations, specifically limiting requests for new EC2 instance launches and slowing queue processing for Lambda functions. (A client-side view of absorbing this kind of throttling is sketched after this list.)
Systematic Restoration: Mitigation steps were applied to restore the Network Load Balancer health checks and recover the EC2 internal subsystems.
Gradual Recovery: As internal health improved, AWS gradually relaxed the throttles. Instance launches and other services slowly returned to pre-event levels across the affected Availability Zones, and the backlog of queued requests was then processed in full.
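For callers, this kind of throttling surfaces as throttling error codes on control-plane calls such as RunInstances. The sketch below shows one possible way to absorb it with boto3: it enables botocore's adaptive retry mode and treats a throttled launch as a soft, retry-later failure instead of looping aggressively. The launch template ID is a placeholder.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Adaptive retry mode adds client-side rate limiting, so retries back off
# instead of hammering an already-throttled control plane.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
ec2 = boto3.client("ec2", region_name="us-east-1", config=retry_config)

def launch_with_backoff(launch_template_id: str) -> list:
    """Attempt one instance launch; treat throttling as a soft failure."""
    try:
        resp = ec2.run_instances(
            MinCount=1,
            MaxCount=1,
            LaunchTemplate={"LaunchTemplateId": launch_template_id},  # placeholder
        )
        return [i["InstanceId"] for i in resp["Instances"]]
    except ClientError as err:
        code = err.response["Error"]["Code"]
        if code in ("RequestLimitExceeded", "Throttling", "ThrottlingException"):
            # The control plane is shedding load: defer the launch and keep
            # serving from capacity that is already running.
            return []
        raise
```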
4. Best Practices for Customer Resilience (Vendor Users)
The outage highlighted that while Multi-AZ deployment protects against a single data center failure, a Regional-level failure requires more extensive architectural planning. To avoid significant impact from future regional outages, AWS users should adopt the following strategies:
A. Prioritize Multi-Region Architecture
For workloads that cannot tolerate a regional outage, replicate them into a second Region using an active-active or active-passive model sized to your RTO/RPO, and automate traffic failover between Regions (see the sketch below).
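One common implementation of an active-passive model is Route 53 failover routing: a hostname points at a primary load balancer in one Region and a standby in another, and Route 53 answers with the secondary alias only while the primary target is unhealthy. The sketch below assumes boto3; the hosted zone ID, record name, and load balancer values are placeholders.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0EXAMPLE"      # placeholder public hosted zone
RECORD_NAME = "app.example.com."  # placeholder record name

def _alias_change(role: str, set_id: str, alb_dns: str, alb_zone_id: str) -> dict:
    """Build one UPSERT for a failover alias record pointing at a load balancer."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": RECORD_NAME,
            "Type": "A",
            "SetIdentifier": set_id,
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "AliasTarget": {
                "HostedZoneId": alb_zone_id,  # the load balancer's own hosted zone ID
                "DNSName": alb_dns,
                # Route 53 answers with the primary record only while its
                # target is healthy; otherwise it falls back to the secondary.
                "EvaluateTargetHealth": True,
            },
        },
    }

def upsert_failover_pair(primary_dns, primary_zone, secondary_dns, secondary_zone):
    """Create or update a PRIMARY/SECONDARY failover pair for one hostname."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [
            _alias_change("PRIMARY", "primary-us-east-1", primary_dns, primary_zone),
            _alias_change("SECONDARY", "standby-us-west-2", secondary_dns, secondary_zone),
        ]},
    )
```

An active-active variant would instead use weighted or latency-based routing across both Regions.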
B. Decouple and Isolate Dependencies
Decouple Control Plane Operations: Recognize that core services like IAM and STS may rely on US-EAST-1. Design your application to tolerate the loss of the control plane (e.g., being unable to launch new resources) while the data plane (resources that are already running) keeps serving.
Utilize Regional Endpoints: Explicitly configure AWS SDKs and tools to use Regional endpoints instead of global ones where available, reducing reliance on US-EAST-1 for region-specific operations. The sketch after this list combines both ideas.
Cross-AZ Deployments (Tier 1 Baseline): Always ensure that critical services (EC2 Auto Scaling Groups, RDS, Load Balancers) are distributed across a minimum of three Availability Zones within your primary Region.
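A minimal sketch of the decoupling ideas above, assuming boto3, a workload Region of us-west-2, and an Auto Scaling group (name is a placeholder) that already spans three Availability Zones: clients are pinned to Regional endpoints, and a failed control-plane call degrades into "keep serving with current capacity" rather than an outage.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

REGION = "us-west-2"  # hypothetical workload Region outside US-EAST-1

# Pin the SDK to Regional endpoints so region-scoped calls do not depend
# on a global endpoint hosted in US-EAST-1.
sts = boto3.client(
    "sts",
    region_name=REGION,
    endpoint_url=f"https://sts.{REGION}.amazonaws.com",
)
autoscaling = boto3.client("autoscaling", region_name=REGION)

def scale_out_or_degrade(group_name: str, desired: int) -> bool:
    """Try a control-plane action; fall back to existing capacity if it fails."""
    try:
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=group_name,  # placeholder ASG spanning three AZs
            DesiredCapacity=desired,
            HonorCooldown=False,
        )
        return True
    except (ClientError, EndpointConnectionError):
        # Control plane unavailable or throttled: instances that are already
        # running (the data plane) keep serving; alert and retry later.
        return False
```

The same pattern applies to any non-essential control-plane dependency: scheduled scale-ups, configuration pushes, and deployments can be deferred while running capacity continues to serve traffic.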
C. Enhance Observability and Communication
External Status Page: Do not host your incident communication channels (status page, recovery documentation) within the same AWS Region, or even on AWS itself. Use a separate service or provider to guarantee communication during a full regional outage.
Define RTO/RPO: Clearly define your Recovery Time Objective (RTO: how long you can afford to be down) and Recovery Point Objective (RPO: how much data loss is acceptable) to justify the cost and complexity of Multi-Region solutions.
This incident reinforces the lesson that resilience is not the prevention of failure, but the architectural ability to survive it.
The video below discusses the advanced considerations and trade-offs when designing applications to span multiple geographic AWS regions for improved resilience.
The article "Best practices for creating multi-Region architectures on AWS" is a helpful resource for understanding the different models, such as active-active and active-passive deployments.