Amazon Explains DNS Failure That Broke the Internet

Amazon Web Services has released a comprehensive post-mortem explaining the technical failures that caused a massive service disruption on October 19 and 20, 2025. The incident, which affected over 2,000 companies and generated nearly 10 million outage reports worldwide, stemmed from a previously undetected flaw in automated systems designed to prevent exactly this type of failure.
The primary culprit was a race condition within the DynamoDB DNS management system. This automated infrastructure manages hundreds of thousands of DNS records that route traffic to database services across the AWS cloud. The system consists of two independent components: the DNS Planner, which monitors server health and creates routing plans, and the DNS Enactor, which applies these plans to live systems. Both were designed with redundancy and fault tolerance in mind, operating independently across three availability zones.
The failure occurred when unusual processing delays caused one DNS Enactor to fall behind while applying an older routing plan. Meanwhile, a second Enactor completed work on a newer plan and triggered a cleanup process to delete outdated plans. The timing created a critical vulnerability. The delayed Enactor overwrote the current plan with an outdated version just as the cleanup process deleted that same outdated plan from the system. This removed all IP addresses for the regional DynamoDB endpoint and left the system unable to self-correct.
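The ordering described above can be modeled in a few lines. This is a toy illustration of the failure sequence, not AWS's implementation; the class and method names (`PlanStore`, `apply`, `cleanup`) and the generation numbers are invented for the sketch:

```python
# Hypothetical model of the DNS plan race. A delayed Enactor re-applies an
# old plan after a newer one is live, then cleanup deletes the old plan --
# leaving the active endpoint with no IP addresses at all.

class PlanStore:
    """Holds the active DNS plan: a generation number plus endpoint IPs."""
    def __init__(self):
        self.active_generation = 0
        self.ips = []
        self.known_plans = {}          # generation -> list of IPs

    def publish(self, generation, ips):
        self.known_plans[generation] = ips

    def apply(self, generation):
        # No re-check at apply time: an older plan can overwrite a newer one.
        self.active_generation = generation
        self.ips = list(self.known_plans.get(generation, []))

    def cleanup(self, newest_generation):
        # Delete every plan older than the newest -- including the one a
        # delayed Enactor just made active.
        for gen in list(self.known_plans):
            if gen < newest_generation:
                del self.known_plans[gen]
        if self.active_generation < newest_generation:
            self.ips = []              # the active plan's data is gone

store = PlanStore()
store.publish(1, ["10.0.0.1"])        # older plan
store.publish(2, ["10.0.0.2"])        # newer plan
store.apply(2)                        # fast Enactor applies plan 2
store.apply(1)                        # delayed Enactor overwrites with plan 1
store.cleanup(newest_generation=2)    # cleanup deletes plan 1
assert store.ips == []                # endpoint now resolves to nothing
```

Note that each step is individually reasonable; only this particular interleaving empties the endpoint, which is what makes race conditions of this kind hard to catch in testing.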
AWS engineers discovered that safeguards meant to prevent such scenarios had been rendered ineffective by the unusual delays. A freshness check performed before the plan was applied should have blocked the outdated version, but the check had been evaluated so long before the apply completed that its result was no longer valid. The automation entered an inconsistent state that required manual operator intervention, something the resilient design was specifically built to avoid.
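The general defect here is a check-then-act window: validating freshness at one moment and committing at a later one. A common mitigation is to re-check atomically at commit time, compare-and-swap style. The sketch below uses hypothetical names (`Endpoint`, `apply_if_fresh`) and is not AWS's fix, only an illustration of the pattern:

```python
import threading

class Endpoint:
    """Illustrative endpoint record. A plan commits only if it is still
    newer than whatever is active *at commit time*, checked under a lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self.generation = 0
        self.ips = []

    def apply_if_fresh(self, generation, ips):
        with self._lock:
            if generation <= self.generation:   # re-checked atomically here
                return False                    # stale plan rejected
            self.generation = generation
            self.ips = list(ips)
            return True

ep = Endpoint()
assert ep.apply_if_fresh(2, ["10.0.0.2"]) is True    # newer plan commits
assert ep.apply_if_fresh(1, ["10.0.0.1"]) is False   # delayed older plan rejected
assert ep.ips == ["10.0.0.2"]
```

With this shape, a delayed actor's staleness is detected at the only moment that matters: when the write actually lands.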
The DynamoDB failure cascaded rapidly through dependent services. Without functioning database endpoints, systems across AWS infrastructure could not establish connections. The EC2 compute service encountered particularly severe problems when its DropletWorkflow Manager (DWFM) entered congestive collapse. This component maintains leases for the physical servers that host virtual machines. As leases expired during the DynamoDB outage, the system attempted to reestablish all of them simultaneously once database access returned. The overwhelming workload caused the system to fail repeatedly: tasks timed out and were requeued before they could complete.
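Congestive collapse of this kind has a simple signature: per-attempt latency grows with load until it exceeds the timeout, so every attempt fails and requeues, and throughput drops to zero even though the backend is back. The toy simulation below (all names and numbers are invented for illustration) shows how capping concurrent attempts restores forward progress:

```python
def simulate(total_leases, timeout, admit_limit=None, ticks=200):
    """Toy model of lease re-establishment after an outage. Attempt latency
    grows with the number of in-flight attempts; attempts slower than
    `timeout` fail and requeue. `admit_limit` caps concurrency (throttling)."""
    pending = total_leases
    done = 0
    for _ in range(ticks):
        if pending == 0:
            break
        in_flight = pending if admit_limit is None else min(pending, admit_limit)
        latency = in_flight            # latency proportional to concurrent load
        if latency <= timeout:
            done += in_flight          # admitted attempts succeed this tick
            pending -= in_flight
        # else: every attempt times out and is requeued -- zero progress
    return done

# Unthrottled, all 1000 leases retry at once, every attempt exceeds the
# timeout, and the queue never drains: congestive collapse.
assert simulate(1000, timeout=50) == 0
# Throttled to 50 concurrent attempts, the backlog drains completely.
assert simulate(1000, timeout=50, admit_limit=50) == 1000
```

The counterintuitive part, visible in the model, is that admitting *less* work yields *more* throughput once the system is past its saturation point.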
Network Load Balancer services experienced a different but equally problematic failure mode. Health check systems began evaluating newly launched EC2 instances before their network configurations had fully propagated. This caused health checks to alternate between passing and failing states, triggering automated systems to repeatedly add and remove capacity from service. The oscillation increased load on health check infrastructure, causing it to degrade and creating a feedback loop of instability.
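The standard damping for flapping health checks is hysteresis: require several consecutive agreeing results before changing an instance's in-service state. The sketch below is a generic illustration of that technique, not AWS's health-check code; the flapping pattern and threshold values are invented:

```python
def raw_checks(propagation, total):
    """Raw health-check results: flap while network configuration is still
    propagating, then pass steadily once it has converged."""
    return [i % 2 == 0 for i in range(propagation)] + [True] * (total - propagation)

def transitions(results, streak_needed=1):
    """Count in/out-of-service transitions. With streak_needed > 1 the
    controller waits for consecutive agreeing checks before acting."""
    state = False            # instance starts out of service
    streak = 0
    last = None
    flips = 0
    for ok in results:
        streak = streak + 1 if ok == last else 1
        last = ok
        if streak >= streak_needed and ok != state:
            state = ok
            flips += 1
    return flips

checks = raw_checks(propagation=10, total=30)
naive = transitions(checks, streak_needed=1)     # reacts to every result
damped = transitions(checks, streak_needed=3)    # needs 3 agreeing checks
assert naive > damped
assert damped == 1                               # one clean transition into service
```

The damped controller trades a little detection latency for stability, which is exactly the trade that breaks the add-remove feedback loop described above.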
AWS has responded by disabling the problematic DNS automation globally until fixes are implemented. The company plans to correct the race condition, add protections against incorrect DNS plans, implement velocity controls on capacity removal during health check failures, and improve throttling mechanisms to prevent queue overload. Additional testing procedures will verify recovery workflows operate correctly under stress conditions.
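A velocity control of the kind mentioned above typically budgets how much capacity may be removed per decision window, regardless of how many health checks are failing. The following is a minimal sketch under that assumption; the class name `RemovalGovernor` and the 10% figure are illustrative, not taken from AWS's remediation plan:

```python
class RemovalGovernor:
    """Illustrative velocity control: refuse to take more than
    `max_fraction` of current capacity out of service per window."""
    def __init__(self, total_capacity, max_fraction=0.1):
        self.capacity = total_capacity
        self.max_fraction = max_fraction

    def start_window(self):
        # Budget is recomputed from current capacity at each window.
        self._budget = int(self.capacity * self.max_fraction)

    def request_removal(self, n):
        allowed = min(n, self._budget)
        self._budget -= allowed
        self.capacity -= allowed
        return allowed           # caller must keep the remainder in service

gov = RemovalGovernor(total_capacity=100, max_fraction=0.1)
gov.start_window()
assert gov.request_removal(40) == 10   # only 10% may be removed this window
assert gov.request_removal(5) == 0     # window budget exhausted
assert gov.capacity == 90
```

The point of such a governor is that even a pathological health-check signal can only degrade the fleet gradually, leaving time for operators or automation to notice and intervene.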
Amazon acknowledged the severity of impact and committed to using insights from this failure to improve availability across its infrastructure.



