Image: a single weathered steel chain link on a concrete surface, broken cleanly at one point; an editorial illustration of shared-fate risk in cloud infrastructure.

The Outage You Can't Fail Over From

Charles Redding · 15 min read

The Moment You Realize You're Not in Control

At 11:48 PM Pacific on October 19, 2025, a single entry vanished from one of Amazon's internal systems — the kind of entry that tells other Amazon services where to find each other.

No one intended it. An automated cleanup process inside Amazon removed an entry that another automated process had just put there. Think of it as one hand of the housekeeping team throwing away something another hand had just placed on the desk. For fifteen hours, over 3,500 companies across sixty-plus countries — including organizations you and your customers use every day — experienced varying degrees of "offline." More than four million outage reports were submitted in the first two hours alone.

I want to tell you I watched my team execute a beautiful, well-rehearsed failover that night. I can't. What I watched, like most technology leaders I spoke to in the weeks after, was a slow dawning that the very event multi-region architecture was supposed to protect us from had just happened, and that the architecture wasn't going to protect us.

This post isn't an "AWS is bad" hot take. It's an honest one. 2025 was the year four of the five companies that operate the scaffolding of the modern internet each had a day where the scaffolding wobbled. And the lesson — the Monday-morning one, the one that should sit on your desk whether you run a Fortune 100 or a five-person consultancy — is that the failover strategy you think you have is almost certainly not the one you actually have.

What Actually Happened in 2025

Four incidents, in rough chronological order, anchor the year.

Amazon Web Services — October 19–20, 2025. Amazon's biggest U.S. data-center hub lost its internal "phone book," and the disruption that followed lasted about fifteen hours end to end. While the phone book was gone, any Amazon service (or any company built on Amazon) that tried to look up where its own data or its own neighbors lived got no answer. The fix itself created a second problem: when the phone book came back, every service tried to reconnect at the same moment, and the sudden flood of reconnection attempts overwhelmed Amazon's ability to restore them in order. That second-wave problem extended parts of the outage by roughly twelve hours beyond the original repair, which is where most of the fifteen hours went. More than 3,500 companies across sixty-plus countries saw meaningful disruption. More than four million outage reports were submitted in the first two hours. And the tool Amazon's own customers would normally reach for in a crisis, Amazon's web console, where most operational decisions get made, went down along with everything else. The leaders who thought they could "log in and fix it" couldn't log in.
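The reconnection flood described above is usually called a thundering herd, and the common client-side mitigation is exponential backoff with random jitter. The following is a minimal sketch of that general technique, not a description of what Amazon runs; the function name, the `base` and `cap` values, and the client labels are all assumptions made for illustration.

```python
import random


def backoff_delays(attempts, base=0.5, cap=60.0, rng=None):
    """Return one randomized wait (in seconds) per reconnection attempt.

    Uses "full jitter": each wait is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that failed at the
    same moment spread their retries out instead of retrying in lockstep.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # Two hypothetical clients that lost their connection at the same instant.
    for client_id in ("client-a", "client-b"):
        waits = backoff_delays(attempts=6, rng=random.Random(client_id))
        print(client_id, [round(w, 2) for w in waits])
```

Because each client draws its own random waits, the two schedules diverge after the first attempt, which is the property that keeps a recovering service from being hit by every caller at once.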

Microsoft Azure — October 29, 2025. Ten days later, a routine configuration change inside Microsoft's global traffic-routing layer (the part of Azure that sits between websites and the people visiting them) was accidentally pushed in a broken state. A safety system that was supposed to catch exactly this kind of bad change had its own defect and let the change through. For about nine hours, Microsoft's own products — Outlook, Teams, Intune — stopped working normally for millions of users, and customer-facing systems at Costco, Starbucks, and Alaska Airlines went dark or degraded in parallel. Two days of partial disruption.

Cloudflare — November 18, 2025. Cloudflare sits in front of roughly a fifth of the internet, filtering out bots and malicious traffic before they reach the sites you visit. A permission change inside Cloudflare's own systems caused a configuration file — the one that tells Cloudflare which security checks to run on incoming traffic — to balloon to more than three times its normal size, past a limit the rest of the system had been built to respect. When Cloudflare's servers worldwide tried to load the too-big file, they crashed. Error pages everywhere for five to six hours. X, ChatGPT, and Spotify were among the visible casualties. Cloudflare publicly called it their worst outage since 2019.
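A common defense against this class of failure is to validate a new configuration against the consumer's known limit before swapping it in, and to keep serving the last known-good version when the check fails. The sketch below illustrates that idea only; it is not Cloudflare's code, and the function name, the limit of 100 entries, and the `check-N` entry names are hypothetical.

```python
def choose_config(candidate, last_known_good, max_entries):
    """Accept the candidate only if it fits the limit; otherwise keep the old one.

    Returns a (config, status) pair so the caller can alert on rejections
    instead of crashing while loading an oversized file.
    """
    if len(candidate) > max_entries:
        status = f"rejected: {len(candidate)} entries exceeds limit of {max_entries}"
        return last_known_good, status
    return candidate, "accepted"


if __name__ == "__main__":
    # Hypothetical numbers: 60 entries normally, a hard limit of 100.
    current = [f"check-{i}" for i in range(60)]
    oversized = current * 3 + ["check-extra"]  # more than three times normal
    active, status = choose_config(oversized, current, max_entries=100)
    print(status)
    print("entries in use:", len(active))
```

The design choice is to treat an out-of-bounds file as a rejected deployment rather than a fatal error, so the worst case is stale security rules instead of error pages.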

Cloudflare again — December 5, 2025. Two and a half weeks later, a thirty-minute event took out roughly 28% of Cloudflare's traffic. Small, relative to the November event. Still enough to briefly take LinkedIn, Zoom, Canva, and Shopify offline during business hours.

Google Cloud had a meaningful 2025 event of its own — a software defect that cascaded through the platform's core plumbing, visibly affecting Spotify, Gmail, and Fitbit for several hours.

One analysis pegged the combined direct losses from the AWS, Azure, and Cloudflare configuration-related failures alone at roughly $581 million.

None of these were acts of God. Every single one started with routine work: a configuration tweak, a permission update, a software push, or an automated job that people had built and switched on. Every single one was, in the strict sense, preventable. And every single one still happened at companies with world-class engineering teams and essentially unlimited budgets for reliability.

The Uncomfortable Truth: Shared Fate

Here's the part that gets lost in most "build a resilient architecture" blog posts.

Your application might be beautifully set up across multiple Amazon data centers inside the same region. You might have duplicate, synchronized copies of your data in each of them. You might have a practiced plan for when one data center fails. You might have tested it.

When the region itself loses its internal phone book, none of that matters.

It doesn't matter because the event that hit you wasn't a data-center failure. It was a failure of the layer that sits above all the data centers — the layer that tells them how to talk to each other and to the outside world. Your data centers were fine. The thing on top of them wasn't. Most companies' disaster-recovery plans are built for the first kind of event, not the second.

And even if you were fully multi-region, you still share fate with your vendors. Your login service is probably Okta or Auth0. Your payments are Stripe. Your deploys go through GitHub Actions. Your observability is Datadog or New Relic. Your transactional email is SendGrid or Postmark. Your chat is Slack. Your identity provider, your build pipeline, your metrics, and your customer-support inbox are all, statistically, running in us-east-1 or fronted by Cloudflare.

When us-east-1 goes down, a meaningful percentage of your dependency graph goes with it. You don't own that graph. You can't fail it over. You can't even inventory it cleanly in most organizations, because modern SaaS makes hosting-region disclosure optional and most vendors don't volunteer it.
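Even a rough inventory makes the shared-fate problem visible. The sketch below counts how many dependencies sit in each hosting location and what share of the graph goes down with one region. Every entry in `INVENTORY` is invented for illustration and says nothing about where any real vendor runs; `None` stands for a vendor that would not disclose its region.

```python
from collections import Counter

# Hypothetical inventory: dependency role -> hosting location the vendor disclosed.
# None means the vendor would not say. These are not real vendor facts.
INVENTORY = {
    "login": "us-east-1",
    "payments": "us-east-1",
    "ci-cd": None,
    "observability": "us-east-1",
    "email": "eu-west-1",
    "status-page": "us-west-2",
}


def concentration(inventory):
    """Count how many dependencies sit in each disclosed location."""
    return Counter(loc if loc else "undisclosed" for loc in inventory.values())


def share_in(inventory, location):
    """Fraction of dependencies that share fate with the given location."""
    return concentration(inventory)[location] / len(inventory)


if __name__ == "__main__":
    for location, count in concentration(INVENTORY).most_common():
        print(f"{location}: {count}")
    print(f"share exposed to us-east-1: {share_in(INVENTORY, 'us-east-1'):.0%}")
```

With this made-up data, half the dependency graph is tied to a single region, and the "undisclosed" bucket is a finding in its own right: those are the vendors to ask.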

This is what we mean by shared fate. You and your neighbors are on the same boat. When the boat takes on water, it doesn't matter how good your swimming is.

Why "Multi-Region" Isn't a Failover Strategy

Most organizations that tell me they are "multi-region" are actually running inside multiple data centers within a single region. That's not the same thing, and the distinction is the one that matters during events like the October AWS outage.

True multi-region means running live, active infrastructure in two or more fully independent regions of the same cloud provider — for example, Amazon's Virginia region and Amazon's Oregon region — with your data kept in sync in both, traffic automatically switching between them when one fails, and a quarterly practice that proves the switch actually works. It is eye-wateringly expensive. It slows down every transaction that needs data to stay consistent across both coasts. And the organizational discipline required to keep both regions genuinely equivalent — same software versions, same features turned on, same monitoring — is more than most companies sustain past the first year.

True multi-cloud — running in two different cloud providers, for example Amazon and Microsoft — is harder still. Different tools, different ways of handling identity, different ways of pricing. Very few organizations do it well. The ones who do pay for it with headcount, complexity, and a meaningful slice of their engineering capacity redirected away from building product.

And here's the part nobody writes down: even if you successfully go multi-region or multi-cloud, you still depend on a small number of other companies further down the stack — the ones that route internet traffic, issue the security certificates browsers trust, and deliver content from "edge" servers close to your users. The internet is narrower than the diagram on your wall suggests. There are a handful of single points of failure at the layer below your cloud provider, and you don't get to pick them.

So when a CEO or a board member asks, "Are we multi-region?" the honest answer for most organizations is: "We have a disaster-recovery plan that protects us if a single data center goes down inside our region. We have never tested it against the scenario where the entire region's coordinating layer goes down. If Amazon's Virginia region has an outage tomorrow, we're offline for however long Amazon takes to fix it, and our recovery plan is 'wait.'"

That is a fine answer. What's not fine is pretending it's a different answer.

When the Steering Wheel Is Also Broken

One detail from the October 2025 AWS event deserves its own paragraph.

The cloud provider's own web console — the dashboard Amazon's engineers and customers use to look at what's running, adjust settings, and reroute traffic in an emergency — went down during the outage. The tools you would reach for to fix the problem, or even to see what was broken, were themselves broken. Think of it as the dashboard and the engine sharing enough of the same wiring that when the engine fails, the dashboard fails with it. You're sitting in the driver's seat, and the steering wheel won't turn.

This is not unique to Amazon. Microsoft's October 29 event degraded the Azure admin portal. Cloudflare's November 18 event degraded the Cloudflare dashboard. All three times, the companies affected discovered that the management tool they'd mentally assigned to the role of "the thing I use in a crisis" was not usable in a crisis.

If your incident-response plan contains the step "log into the cloud console and adjust the routing," you should know that during the worst kind of event — the kind where you most want to adjust the routing — the console is the least reliable thing in your toolkit. The action most leaders imagine themselves taking during an outage is the action most likely to be impossible to take.

This is why tabletop exercises matter more than failover drills for most organizations. A failover drill tests whether your automated systems can execute a scripted transition. A tabletop exercise tests whether your humans can make decisions when the tools they expected to use are unavailable. Run the scenario honestly: the cloud console won't load, the CLI is returning errors, your observability tool is hosted in the same region that's down, and your customers are asking when you'll be back. Who makes the call? On what information? Through what communication channel? I have sat in rooms where the answer to those questions was "we don't know," even at companies that had passed their SOC 2 audit the week before. Pass the audit and still do the tabletop. They test different things.

The Three-Segment Impact

The conversation I want to have with every CEO, CTO, and risk leader reading this is different depending on the size of the organization. Here's the honest version at three altitudes.

Large Enterprise (1,000+ employees, regulated, board-level risk oversight)

You are the only segment with the budget and staffing to seriously attempt multi-region or multi-cloud. But budget doesn't make it easy. The real levers for you are contractual and governance-oriented:

  • Push your top-ten SaaS vendors to disclose hosting region and commit to cross-region DR in writing. Most won't do so willingly. Make it a procurement requirement.
  • Put cloud concentration risk on your board risk register with the same seriousness as geopolitical or cyber risk. The question "what's our exposure to a us-east-1 event?" should have a named owner and an annual review.
  • Force-test your DR plan against a regional control-plane failure scenario, not just a zone failure. Your existing DR exercises almost certainly don't cover this.
  • Renegotiate SLA clauses to include control-plane events and service-credit language that actually compensates for realized loss rather than token credits. The typical 99.99% cloud SLA pays you back pennies on the dollar of a real outage.
  • If you're in financial services, healthcare, or critical infrastructure, your regulator is already thinking about this. The 2025 events will show up in guidance. Get ahead of it.
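The SLA point in the list above is easier to argue with numbers in hand. The sketch below works through the arithmetic: how much downtime a 99.99% target permits in a 30-day month, what availability a fifteen-hour outage actually yields, and how much loss remains after a service credit. The bill, the credit rate, and the loss figure are hypothetical placeholders, not terms from any provider's contract.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(sla_percent, days=30):
    """Downtime a given availability target permits over a billing period."""
    return days * MINUTES_PER_DAY * (100 - sla_percent) / 100


def achieved_availability(outage_hours, days=30):
    """Availability percentage for a period containing one outage."""
    return 100 * (1 - outage_hours / (days * 24))


def uncovered_loss(monthly_bill, credit_rate, realized_loss):
    """Realized loss left over after the service credit is applied."""
    return realized_loss - monthly_bill * credit_rate


if __name__ == "__main__":
    print(f"99.99% allows {allowed_downtime_minutes(99.99):.2f} minutes per 30 days")
    print(f"a 15-hour outage yields {achieved_availability(15):.2f}% availability")
    # Hypothetical: $20,000 monthly bill, 10% credit, $250,000 realized loss.
    print(f"uncovered loss: ${uncovered_loss(20_000, 0.10, 250_000):,.0f}")
```

A 99.99% target allows a little over four minutes of downtime in a 30-day month, while a fifteen-hour outage lands just under 98%. Under the assumed figures, the credit covers well under one percent of the loss, which is the gap the renegotiation is meant to close.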

Mid-Size (100–999 employees, meaningful engineering team, growing exposure)

You have the engineering talent to architect resiliently but not the budget to go fully multi-region. Your honest strategy is disciplined single-region done right, plus excellent communications:

  • Define your real RTO and RPO, write them down, and tell your leadership team out loud what they are. Not the marketing number. The real one.
  • Build the runbook assuming your cloud console is unreachable. Put critical credentials, contacts, and decision authority in a document that lives outside your cloud.
  • Invest in a customer-communications playbook that can run without your stack. A status page hosted on a different provider, an email list exportable to a secondary tool, a Slack or Teams channel that doesn't share fate with your product.
  • Know your third-party concentration. Pull the list of your top twenty SaaS vendors and research each one's hosting footprint. You won't like what you find. The finding itself is the work.
  • Don't buy multi-cloud until you've gotten disciplined single-region right. It's the less glamorous version of resilience and also the version that actually compounds.

Small & Growing (under 100 employees, lean team, limited DR budget)

You can't afford multi-region. You probably can't afford robust DR testing. That's fine. Your honest strategy is transparency and trust banking:

  • Accept that a regional cloud event will take you offline for as long as it takes the provider to fix it. Write this down. Stop pretending otherwise.
  • Pre-draft the email you will send to customers during an outage. Store it somewhere reachable without your stack. The difference between "we're down and we have no idea" and "we're down, here's what we know, here's when we'll update you" is enormous in customer-trust terms.
  • Bank trust during the good days. Customers forgive outages from vendors who communicate well and have a track record of honesty. They do not forgive vendors who go dark.
  • Be honest in your sales process. If a prospect asks what your DR posture is, tell them the truth. "We run in a single region. A regional outage will affect us. Our commitment is fast, transparent communication." Buyers who demand better than that are not your customers yet.
  • Consider whether your data and code are portable enough to move if you had to. You don't need to move. You need to know you could.

Every segment shares one thing: the failure mode of bad incident communication is worse than the failure mode of the outage itself. Customers will forgive an outage. They will not forgive silence.

What Leadership Should Actually Do on Monday

If you're reading this on Monday morning and wondering what the one thing is, it's this: treat cloud concentration as a board-level, named risk with a named owner. Not an IT risk. A business risk.

Concretely, that means:

  1. Inventory your dependency graph. Not just your cloud provider — your top twenty SaaS vendors and what region they run in. Most of this information you will have to ask for. Ask anyway.
  2. Write down your real RTO and RPO. Your actual ones, not your aspirational ones.
  3. Rebuild your incident-comms plan assuming your cloud is the thing that's down. Your status page, your customer-email tool, your internal Slack — move any of them that currently shares fate with your product.
  4. Put cloud concentration on the risk register with a named owner. Make it a standing agenda item at the quarterly risk review.
  5. Don't buy your way out of the problem. Expensive multi-cloud purchases made in a panic after an outage tend to age badly. Discipline first. Architecture follows.
  6. Run a leadership-level tabletop, once a year, with the console unreachable as a premise. Not a technical drill — a decision-making exercise for the executive team and the communications lead. The technical team should be in the room as advisors, not facilitators. The questions you're testing are "who decides what to tell customers" and "at what threshold do we escalate to board-level communication" — not "can we spin up a secondary region."

If you already have a mature compliance program — SOC 2, ISO 27001, NIST CSF 2.0, HIPAA — your auditors are going to start asking about cloud concentration and third-party concentration in 2026. The 2025 events have changed how regulators and auditors think. Use the audit cycle as a forcing function rather than fighting it.

Where This Connects to DLegendDigital

I won't make this a pitch. Most of the work above is discipline work, and no toolkit replaces discipline.

That said, two of our products are useful as reference documents for leaders starting this work. The NIST CSF 2.0 Readiness Toolkit includes the IDENTIFY and PROTECT function mapping that gives you a structured way to inventory third-party concentration and classify critical dependencies. The SOC 2 Readiness Toolkit's vendor-management controls (CC9.2, CC7.2) give you the formal framework for documenting what we described above — the concentration register, the disclosure expectations, the SLA review cadence. Either toolkit, paired with a couple of afternoons of honest inventory work, will give most mid-sized organizations a defensible starting posture.

If you're further along and want a practitioner to help architect the multi-region or multi-cloud strategy properly, that's what our PBF consulting engagements are for. The Cloud Architecture Review is a scoped engagement that produces a documented current-state concentration map, a target-state architecture with explicit cost and operational trade-offs, and a phased migration plan. It's not cheap. Nothing that protects you from a fifteen-hour us-east-1 event is cheap. But the alternative — which is what most organizations have today — is not free either; it's just invisible until it isn't.

Most readers of this post don't need that engagement. Most readers of this post need to do the first three items on the Monday list and tell the truth to their board. That's the highest-leverage work in cloud resilience right now.

The Honest Close

I'll end where I started.

At 11:48 PM Pacific on October 19, 2025, a race condition in an internal AWS service deleted a DNS record that should not have been deleted. Fifteen hours later, the internet that millions of people depend on was back. Nobody died. Most businesses recovered. The companies that handled it best were the ones that had already accepted, before the outage, that they could not fail over from this class of event — and had built their customer relationships, their communications discipline, and their operational runbooks around that truth.

The failover that doesn't exist is not a reason to panic. It's a reason to stop pretending. The leaders who will come out of 2026 strongest on this front are not the ones who bought their way to multi-cloud this week. They are the ones who are putting cloud concentration on the risk register, telling their board the truth, and investing in the parts of resilience that don't show up on an architecture diagram — the comms plan, the runbook, the trust-banking, the vendor pressure, the honest conversation.

That's the Monday-morning takeaway. See you in two weeks.

— Charles


About the author

Charles Redding

Founder of DLegendDigital. 35+ years of enterprise technology leadership across audit, risk management, cybersecurity, and AI. Former CIO, VP of Technology, and Director at organizations ranging from high-growth startups to $4.3B global enterprises.
