Daring Fireball: Cloudflare CEO Matthew Prince Explains, in Detail, and Apologizes for Yesterday's Global Outage

Cloudflare CEO Matthew Prince:

The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.

After we initially wrongly suspected the symptoms we were seeing were caused by a hyper-scale DDoS attack, we correctly identified the core issue and were able to stop the propagation of the larger-than-expected feature file and replace it with an earlier version of the file. Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.

We are sorry for the impact to our customers and to the Internet in general. Given Cloudflare’s importance in the Internet ecosystem any outage of any of our systems is unacceptable. That there was a period of time where our network was not able to route traffic is deeply painful to every member of our team. We know we let you down today.

This post is an in-depth recount of exactly what happened and what systems and processes failed. It is also the beginning, though not the end, of what we plan to do in order to make sure an outage like this will not happen again.
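The failure Prince describes, a config file that doubles in size and trips a hard limit in the software reading it, is easy to picture in a few lines. What follows is a minimal sketch of that general pattern, not Cloudflare's code: the limit of 4, the one-feature-per-line format, and the names `parse_feature_file` and `BotConfig` are all made up for illustration. It also shows one common defensive choice, which is to reject an oversized file and keep serving with the last one that loaded successfully.

```python
from dataclasses import dataclass, field

MAX_FEATURES = 4  # hypothetical limit, chosen only for illustration


class FeatureFileTooLarge(Exception):
    """Raised when a feature file holds more entries than the limit allows."""


def parse_feature_file(lines, limit=MAX_FEATURES):
    """Parse one feature per line, refusing files above the limit."""
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > limit:
        raise FeatureFileTooLarge(
            f"{len(features)} features exceeds limit of {limit}"
        )
    return features


@dataclass
class BotConfig:
    """Holds the last feature set that loaded successfully."""
    features: list = field(default_factory=list)

    def reload(self, lines):
        """Try to load a new file; keep the previous one if it is rejected."""
        try:
            self.features = parse_feature_file(lines)
            return True
        except FeatureFileTooLarge:
            return False  # keep serving with the last known-good features


good = ["f1", "f2", "f3"]
doubled = good + good  # duplicate rows double the file size

config = BotConfig()
loaded_good = config.reload(good)        # accepted: 3 entries, under the limit
loaded_doubled = config.reload(doubled)  # rejected: 6 entries, over the limit
```

In this sketch the doubled file is refused and `config.features` still holds the three entries from the good file. The alternative, letting the size check abort the whole process, is what turns one bad file into an outage.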

Everything about this incident exemplifies why Cloudflare is one of my favorite companies in the world. Ideally, it wouldn’t have happened, but shit does happen. Among the things to note about Cloudflare’s response:

They identified and fixed the issue quickly.
They issued frequent updates to their status site while the incident remained ongoing.
They published this postmortem within 24 hours. (That’s remarkable, given the technical breadth of the postmortem. Publishing this tomorrow, within 48 hours of the incident, would have been a praiseworthy accomplishment.) Update: Actually, according to Prince, commenting on Hacker News, the postmortem was published less than 12 hours after the incident began. Amazing.
The postmortem starts with a cogent, well-written layperson’s explanation of what happened and why.
The postmortem expands to include very specific technical details, including source code.

Lastly, it’s worth noting that Prince put his own name on the postmortem (and wrote much of it himself, using BBEdit), and closed with this apology, taking personal responsibility:

An outage like today is unacceptable. We’ve architected our systems to be highly resilient to failure to ensure traffic will always continue to flow. When we’ve had outages in the past it’s always led to us building new, more resilient systems.

On behalf of the entire team at Cloudflare, I would like to apologize for the pain we caused the Internet today.

This is how it’s done.

★ Wednesday, 19 November 2025