Cloudflare Outage: What Happened in the System

Cloudflare is clear: on November 18, it suffered its “worst outage since 2019.”

That earlier disruption, in 2019, had been triggered by the deployment of a WAF rule for XSS detection. A regex issue led to excessive CPU usage on the nodes handling HTTP(S) traffic, and both the main proxy and the CDN went down.

The bot-management system knocked out by a configuration change

This time, the incident stemmed from a permissions change on a ClickHouse database. In broad terms, the aim was to make explicit an access that users had previously been granted implicitly when querying system tables.

Without proper filtering, a query began producing duplicate columns. The query originated from one of the main proxy’s modules: the one dedicated to bot management.
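
To illustrate how a missing filter can duplicate columns, here is a minimal sketch in Python. The database names (`default`, `underlying`), the table name and the column names are hypothetical, and the list merely stands in for a system metadata table; it is not Cloudflare's actual query.

```python
# Hypothetical stand-in for a system metadata table: (database, table, column).
# All names here are illustrative assumptions, not taken from Cloudflare's systems.
METADATA = [
    ("default", "request_features", "feature_a"),
    ("default", "request_features", "feature_b"),
    ("underlying", "request_features", "feature_a"),
    ("underlying", "request_features", "feature_b"),
]


def visible_rows(rows, allowed_databases):
    """Simulate permissions: a user only sees metadata for databases it can access."""
    return [row for row in rows if row[0] in allowed_databases]


def column_names(rows, table, database=None):
    """List the columns of `table`; filter by database only if one is given."""
    return [
        column
        for db, tbl, column in rows
        if tbl == table and (database is None or db == database)
    ]


# Before the permissions change: only one database is visible, so no filter is needed.
before = column_names(visible_rows(METADATA, {"default"}), "request_features")

# After the change: a second database becomes visible and the unfiltered result doubles.
after = column_names(
    visible_rows(METADATA, {"default", "underlying"}), "request_features"
)

# Filtering on the database name restores the expected result.
filtered = column_names(
    visible_rows(METADATA, {"default", "underlying"}),
    "request_features",
    database="default",
)

print(before)    # ['feature_a', 'feature_b']
print(after)     # ['feature_a', 'feature_b', 'feature_a', 'feature_b']
print(filtered)  # ['feature_a', 'feature_b']
```

The query itself never changes in this sketch; only what the user is allowed to see does, which is why an implicit assumption about visibility can go unnoticed for a long time.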

This module uses, among other things, a machine learning model that assigns a score to each request. It relies on a configuration file that gathers features (the individual characteristics used to predict whether a request is automated).

This file is regularly refreshed – at intervals of a few minutes – and distributed across the Cloudflare network.
The “duplicated” version exceeded the 200-feature limit configured in the bot-management system to prevent memory overuse. The module therefore failed, affecting all traffic that depended on it.
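
The effect of such a hard limit can be sketched as follows. Only the figure of 200 comes from Cloudflare's account; the file size of 120 entries, the function name and the exception are assumptions made for illustration.

```python
MAX_FEATURES = 200  # limit cited in the article


class FeatureLimitExceeded(Exception):
    """Raised when a features file lists more entries than the module accepts."""


def load_features(names, limit=MAX_FEATURES):
    """Return the feature list, or raise if it exceeds the configured limit."""
    if len(names) > limit:
        raise FeatureLimitExceeded(f"{len(names)} features exceeds limit of {limit}")
    return list(names)


# Assumed size for illustration only; the article does not give the real count.
healthy_file = [f"feature_{i}" for i in range(120)]
duplicated_file = healthy_file + healthy_file  # every entry appears twice

print(len(load_features(healthy_file)))  # 120

try:
    load_features(duplicated_file)
except FeatureLimitExceeded as err:
    print(f"module error: {err}")  # module error: 240 features exceeds limit of 200
```

A limit of this kind protects memory, but if nothing handles the error, the protection itself becomes the point of failure.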

Cascading outages and an inaccessible dashboard

Other services that rely on the main proxy were affected, in particular Workers KV (a key-value store) and Turnstile (an alternative to CAPTCHA).
The unavailability of Turnstile prevented users from logging in to the dashboard, unless they already had an active session.
Cloudflare Access (access control) also experienced authentication problems.
Meanwhile, the CPU load generated by the debugging and observability systems increased CDN latency.

Around 2 p.m., about an hour and a half after the incident began, a fix was deployed on Workers KV to bypass the proxy. Error rates on downstream services fell.

Other difficulties were recorded later, after restoring a healthy version of the features file. The backlog of login attempts, combined with retries, overwhelmed the dashboard.

Cloudflare initially thought it was a cyberattack

Until the fix for Workers KV was applied, the system exhibited a peculiar behavior: it repeatedly recovered for brief periods. The reason: a healthy file would sometimes be generated, depending on which part of the cluster processed the bot-management service’s request.

This behavior complicated diagnosis until, eventually, all ClickHouse nodes began generating the faulty file.
Cloudflare briefly suspected an attack, especially since its status page, which does not rely on its services, had also gone down. But it was a “coincidence”…
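
The intermittent recoveries described above can be pictured with a small simulation. The node names, the partially updated cluster state and the round-robin choice of node are all assumptions; the sketch only shows why the file's quality would depend on which node answers.

```python
from itertools import cycle

# Hypothetical cluster state during a gradual rollout: True means the node
# already has the new permissions and therefore emits the duplicated file.
NODE_UPDATED = {"node-1": True, "node-2": False, "node-3": False, "node-4": True}


def generate_file(node, node_updated):
    """Return 'bad' if the node serving the query has the new permissions, else 'healthy'."""
    return "bad" if node_updated[node] else "healthy"


def simulate(node_updated, refreshes):
    """Serve each periodic refresh from the next node in turn (round-robin is an assumption)."""
    nodes = cycle(sorted(node_updated))
    return [generate_file(next(nodes), node_updated) for _ in range(refreshes)]


partial = simulate(NODE_UPDATED, 6)
print(partial)  # ['bad', 'healthy', 'healthy', 'bad', 'bad', 'healthy']

# Once every node is updated, every refresh yields the bad file.
complete = simulate({node: True for node in NODE_UPDATED}, 6)
print(complete)  # ['bad', 'bad', 'bad', 'bad', 'bad', 'bad']
```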

Traffic routing had largely returned to normal by 3:30 p.m. By 6 p.m., all Cloudflare systems were functioning normally.

As a consequence of this global outage, the company pledged to tighten controls on the ingestion of files that its own systems generate (placing them on a par with user-generated files). It also plans to prevent dumps and other error reports from exhausting system resources, and to review failure modes for error conditions across all modules of its main proxy.
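
As an illustration of what treating internally generated files like user input might involve, here is a hedged sketch that validates a candidate features file and keeps the last good version when validation fails. The class, the function names and the 120-entry file are hypothetical; this is not Cloudflare's implementation.

```python
MAX_FEATURES = 200  # limit cited in the article


def validate_features(names, limit=MAX_FEATURES):
    """Return a list of problems found in a candidate features file (empty if none)."""
    problems = []
    if len(names) != len(set(names)):
        problems.append("duplicate feature names")
    if len(names) > limit:
        problems.append(f"{len(names)} features exceeds limit of {limit}")
    return problems


class FeatureStore:
    """Keeps the last configuration that passed validation."""

    def __init__(self, initial):
        self.active = list(initial)

    def ingest(self, candidate):
        """Adopt `candidate` only if it validates; otherwise keep the current config."""
        problems = validate_features(candidate)
        if problems:
            return problems  # rejected: self.active is left untouched
        self.active = list(candidate)
        return []


good = [f"feature_{i}" for i in range(120)]  # assumed size, for illustration
store = FeatureStore(good)
rejected = store.ingest(good + good)

print(rejected)           # ['duplicate feature names', '240 features exceeds limit of 200']
print(len(store.active))  # 120
```

With this design, a bad file degrades freshness (the module keeps scoring with slightly older features) rather than availability.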