Yesterday, November 18, 2025, at 11:20 UTC (16:50 IST), Cloudflare's network started panicking. Not from a DDoS, not from a BGP hijack, not from a state actor. From a ClickHouse permission update at 11:05 UTC that caused a Bot Management feature file to double in size, exceed a hardcoded limit in the FL2 proxy, and crash the request path for every customer behind that proxy. ChatGPT, X, hundreds of e-commerce sites, and a long tail of Indian SaaS products went dark for roughly 3 hours and 4 minutes. Cloudflare's [public post-mortem](https://blog.cloudflare.com/18-november-2025-outage/) is a model of transparency. This post is what we are adding to client runbooks tonight, with the fail-open versus fail-closed decision tree and three concrete code changes you can ship before Friday.
- 11:20 UTC (16:50 IST): first failures
- 14:24 UTC: last-known-good restored
- 3h 4m: total core-traffic disruption
- 200: the ML feature limit that broke
## The 60-second answer
Cloudflare's Bot Management module reads a "feature file" (a list of ML signals) every few minutes. Yesterday a database permission change caused the query that builds the file to return duplicate rows. The file doubled, exceeded the 200-feature runtime cap, and the proxy panicked instead of falling back. For Indian SaaS teams the lesson is not "leave Cloudflare" — it is to add three patterns tonight: fail-open on optional middleware, validate the size and shape of every config file before applying, and put a vendor-failure tabletop drill on the calendar every quarter.
## What actually happened (the engineering version)
At 11:05 UTC Cloudflare engineers ran a permission update against their ClickHouse cluster. The update was correct in itself. The unintended side-effect: the SQL that generates the Bot Management feature file (a Parquet-style list of ML signals) started returning each row twice. The file regenerated every few minutes, propagated to every edge node within seconds, and at 11:20 UTC the FL2 proxy hit the runtime feature-count cap and panicked. The proxy's design assumed the feature file was always smaller than the cap. There was no validation between "read file" and "apply features." The crash returned 5xx errors to every customer routed through FL2. Customers still on the legacy FL proxy did not crash, but their bot scores were all set to zero, meaning customer rules that block low-scoring traffic silently misclassified human visitors as bots for three hours.
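The failure chain above (duplicate rows, then an oversized file, then a panic) can also be caught at the producer, before the file ever leaves the build step. The following is a minimal sketch, not Cloudflare's pipeline code: the `(name, type)` row shape, `build_feature_file`, and `FeatureFileError` are hypothetical names chosen for illustration.

```python
from collections import Counter

MAX_FEATURE_COUNT = 200  # the runtime cap described above


class FeatureFileError(ValueError):
    """Raised when generated rows are not safe to publish."""


def build_feature_file(rows):
    """Turn query rows into a feature list, refusing duplicates or oversize output.

    `rows` is a hypothetical list of (feature_name, feature_type) tuples.
    """
    counts = Counter(name for name, _ in rows)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise FeatureFileError(f"duplicate feature rows: {duplicates[:5]}")
    if len(rows) > MAX_FEATURE_COUNT:
        raise FeatureFileError(
            f"{len(rows)} features exceeds cap of {MAX_FEATURE_COUNT}"
        )
    return [{"name": name, "type": ftype} for name, ftype in rows]
```

A generator that raises here keeps serving the previous file, so the bad output never reaches the edge.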
Cloudflare's senior leadership has [called this their most consequential outage since 2019](https://blog.cloudflare.com/18-november-2025-outage/). Matthew Prince's post is the kind of post-mortem you should print out and tape to your engineering room wall.
## Why this matters for Indian SaaS teams
Cloudflare sits in front of an enormous fraction of the Indian SaaS web — Razorpay, Zerodha's broker landing pages, dozens of D2C Shopify stores, and the API edges of CRM and analytics products. When Cloudflare panics, your customers' first call is to your support line, not Cloudflare's. Yesterday at 17:30 IST our shared client Slack lit up with "is your API down?" messages from three founders simultaneously. None of them were down — they were behind Cloudflare. The lesson is not "depend less on Cloudflare." It is "have a runbook for when any single critical vendor fails for three hours."
## The three patterns to ship tonight
### Pattern 1: fail-open on optional middleware
If a piece of middleware is optional to your core flow (analytics tag, session-replay, third-party fraud check, marketing pixel), it should be wrapped in a circuit breaker. When it errors or times out, the user request continues. Yesterday the feature-file panic took down the entire request path because Bot Management was treated as mandatory. The same trap exists in your stack — that "optional" Sentry call you added in the request middleware can hang your handler if Sentry's CDN is slow.
```js
// Express middleware example: circuit-broken bot check
const CircuitBreaker = require('opossum');

// callBotService(req) is your own async function that returns a bot score
const botCheck = new CircuitBreaker(callBotService, {
  timeout: 250,                 // ms; never block the request
  errorThresholdPercentage: 50, // open the circuit once half the calls fail
  resetTimeout: 30000           // 30s before retrying
});

app.use(async (req, res, next) => {
  try {
    req.botScore = await botCheck.fire(req);
  } catch (e) {
    req.botScore = null; // fail-open: missing score is OK
  }
  next(); // request continues regardless
});
```
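For readers who want to see what a breaker does under the hood, here is a minimal Python sketch of the same fail-open idea. It counts consecutive failures rather than an error percentage, so it is a simplification of what opossum does. `FailOpenBreaker`, `max_failures`, and `reset_after` are illustrative names, not part of any library.

```python
import time


class FailOpenBreaker:
    """Wraps an optional call; returns `fallback` instead of raising."""

    def __init__(self, func, max_failures=3, reset_after=30.0,
                 fallback=None, clock=time.monotonic):
        self.func = func
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.fallback = fallback
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return self.fallback  # open: skip the dependency entirely
            # Reset window elapsed: let one trial call through (half-open).
        try:
            result = self.func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return self.fallback  # fail open: the caller keeps going
        self.failures = 0
        self.opened_at = None
        return result
```

This sketch does not enforce a timeout by itself. Set one on the underlying client (for example the HTTP library's own timeout argument) so a slow dependency raises instead of hanging the handler.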
### Pattern 2: size and shape validation on every config file
Cloudflare's bug was not the duplicate rows. It was the proxy reading the file without checking its size. Whether your config is a Helm value, an environment variable JSON, a feature-flag payload, or a CDN-cached rules file — validate it before applying. Two cheap checks: (a) total size in bytes, (b) row count or key count.
```python
# Config-loader sanity guard
import json
import logging

log = logging.getLogger(__name__)

MAX_FEATURE_COUNT = 200
MAX_FILE_BYTES = 64 * 1024

def load_features(path: str) -> dict:
    # load_last_known_good() is your own helper returning the previous validated config
    with open(path, 'rb') as f:
        raw = f.read()
    if len(raw) > MAX_FILE_BYTES:
        log.error("feature file too large: %d bytes", len(raw))
        return load_last_known_good()  # fall back, don't panic
    parsed = json.loads(raw)
    features = parsed.get('features', [])
    if len(features) > MAX_FEATURE_COUNT:
        log.error("feature count exceeded: %d", len(features))
        return load_last_known_good()
    return parsed
```
The trick is the "last known good" cache. Keep the previous validated config in memory or on disk and roll back to it on any validation failure. By our estimate, this single change could have cut Cloudflare's outage from about 3 hours to under 30 minutes.
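One way to sketch the last-known-good cache: promote a config only after it passes validation, and persist it so a restart does not lose it. `LastKnownGood`, the `validate` callback, and the cache path are hypothetical names for illustration, and JSON is assumed as the config format.

```python
import json
import os
import tempfile


class LastKnownGood:
    """Keeps the most recent config that passed validation, in memory and on disk."""

    def __init__(self, cache_path, validate):
        self.cache_path = cache_path
        self.validate = validate  # callable: returns True if the config is safe to apply
        self.current = None

    def load(self, raw_bytes):
        """Return the new config if valid, else the previous good one (may be None)."""
        try:
            candidate = json.loads(raw_bytes)
            ok = bool(self.validate(candidate))
        except Exception:
            ok = False  # malformed input or a crashing validator counts as invalid
        if ok:
            self._promote(candidate)
        elif self.current is None and os.path.exists(self.cache_path):
            # Cold start with a bad input: recover the last good copy from disk.
            with open(self.cache_path, "r", encoding="utf-8") as f:
                self.current = json.load(f)
        return self.current

    def _promote(self, config):
        # Write to a temp file, then rename, so a crash never leaves a half-written cache.
        directory = os.path.dirname(os.path.abspath(self.cache_path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(config, f)
        os.replace(tmp, self.cache_path)
        self.current = config
```

The caller still has to decide what to do when `load` returns `None`, which happens only if no valid config has ever been seen. For an optional feature, running without it is usually the right answer.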
### Pattern 3: a vendor-failure tabletop, on the calendar, every quarter
The runbook entry that 90% of teams do not have: who calls whom when Cloudflare/AWS/Vercel goes down for >1 hour? Who decides whether to fail over to the secondary edge? Who posts the public status update? Who calls the top three enterprise customers? Run a 45-minute tabletop once a quarter where someone shouts "Cloudflare is down for 2 hours, GO" and the team has to walk through the response live, on the clock, in Slack threads.
## The fail-open versus fail-closed decision tree
Not every failure should fail open. Some — payment authorisation, authentication, regulatory data flows — should fail closed. Use this tree:
1. Is the failed component required for security or money movement?
   - Yes → fail closed (block the request)
   - No → fail open (degrade the feature, serve the request)
A few worked examples for Indian SaaS:
| Component | Fail mode | Why |
| --- | --- | --- |
| UPI payment authorisation | Fail closed | Fail open = double-charges and reconciliation nightmares. |
| Authentication / SSO | Fail closed | Fail open = anyone is signed in as anyone. Catastrophic. |
| Bot detection / WAF | Fail open with logging | Better to serve a few bot requests than to take the site down. |
| Analytics / pixel | Fail open | Lost analytics for 3h is recoverable; site outage is not. |
| Recommendation engine | Fail open with a default list | Show "popular this week" instead of crashing the homepage. |
| Fraud check on signup | Fail closed for high-risk flows; degrade for low-risk | Tier your signup risk and apply differently. |
| Email notification on success | Fail open and queue | The user already got the in-app confirmation; email can wait. |
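The tree and the worked examples can be encoded as a small policy table, so the choice is made once, in review, rather than ad hoc in each handler. This is a sketch: the `FailMode` enum and the component names are illustrative, and tiered cases such as the signup fraud check need more inputs than the single question shown here.

```python
from enum import Enum


class FailMode(Enum):
    CLOSED = "block the request"
    OPEN = "degrade the feature, serve the request"


def fail_mode(required_for_security_or_money: bool) -> FailMode:
    """The one-question decision tree from this section."""
    return FailMode.CLOSED if required_for_security_or_money else FailMode.OPEN


# Hypothetical component inventory: True means security- or money-critical.
# Bot detection is marked False to match the worked example (fail open with logging).
COMPONENTS = {
    "upi_payment_authorisation": True,
    "authentication": True,
    "bot_detection": False,
    "analytics_pixel": False,
    "recommendation_engine": False,
}

POLICY = {name: fail_mode(critical) for name, critical in COMPONENTS.items()}
```

Keeping the inventory in one reviewed file means a new dependency cannot enter the request path without someone answering the question for it.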
## What you should add to your runbook tonight (the checklist)
- One named human is the "vendor-failure incident commander" — not a rotating role, a designated one
- A current list of every external vendor in the request path, ordered by criticality
- A 200-character public status message template you can publish in 60 seconds
- The phone numbers of your top 5 enterprise customers' technical contacts (in your phone, not a doc)
- A documented decision threshold for "should we fail over to the secondary CDN/edge"
- Circuit-breakers around every optional external call in the request path
- Size + shape validation on every config file your service reads at runtime
- A "last known good" cache that the loader falls back to on validation failure
- A quarterly 45-minute vendor-failure tabletop on the team calendar (not optional)
- A post-incident review template that asks "would the Cloudflare runbook have helped us"
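
For the status-message item in the checklist, a small helper can enforce the 200-character budget so nobody has to count characters during an incident. The template wording and the `status_message` function are illustrative assumptions, not a recommended standard.

```python
MAX_STATUS_CHARS = 200

TEMPLATE = (
    "{time_ist} IST: Some users cannot reach {product} because of an outage at "
    "{vendor}. We are monitoring and will post an update by {next_update} IST."
)


def status_message(product, vendor, time_ist, next_update):
    """Fill the template and refuse to return anything over the length budget."""
    message = TEMPLATE.format(
        product=product, vendor=vendor, time_ist=time_ist, next_update=next_update
    )
    if len(message) > MAX_STATUS_CHARS:
        raise ValueError(
            f"status message is {len(message)} chars, limit is {MAX_STATUS_CHARS}"
        )
    return message
```

Calling `status_message("Acme", "Cloudflare", "17:05", "17:15")` returns a message well inside the limit. An over-long product or vendor name raises, so the problem shows up when the template is written rather than mid-incident.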
## Common mistakes we are seeing in the post-incident scramble
Symptom: "Let's switch off Cloudflare." Almost always wrong. Cloudflare has had only a handful of major outages in well over a decade. Self-hosted edge has more outages than that per quarter at most companies. The fix is resilience patterns, not vendor migration.
Symptom: "Let's add a multi-CDN strategy this week." Slow down. Multi-CDN is a 6–12 month engineering project (DNS routing, certificate management, log unification, WAF rule duplication). Doing it in a panic produces brittle plumbing. Start with the cheap wins.
Symptom: "We are adding a status page." Necessary, not sufficient. A status page nobody updates within 5 minutes of an incident is worse than nothing. Pair the status page with a dedicated person whose job during incident-response is to update it every 10 minutes.
Symptom: "We are moving to AWS Global Accelerator." AWS has had its own multi-hour outages in 2025. Centralised vendors all have correlated risk. The lesson is not which vendor; it is how you handle vendor failure.
The Reddit thread that summed it up. An r/sysadmin commenter posted "the lesson of every CDN outage is the same — write your runbooks like the CDN can disappear, then forget you ever had a runbook because the CDN never disappears, then have an outage." The pattern repeats every 18 months. The teams that ship the patterns above will read those threads with detached interest next time. The teams that did not will be writing the panic Slack messages.
## A real example: a 28-staff Bengaluru fintech
The fintech runs a UPI-aggregator product on AWS Mumbai with Cloudflare in front. At 17:00 IST yesterday, their product looked completely down. We were on a call within 15 minutes. The diagnosis took 90 seconds, because Cloudflare's status page already had the orange banner. The decision: do nothing to the application stack, post a status update, message the top 5 customers. By around 20:00 IST traffic recovered with Cloudflare's restoration. Zero customer churn. The follow-up planned for the next two weeks: ship pattern 1 (circuit-break the optional Cloudflare-Workers personalisation on the homepage), pattern 2 (validate the runtime feature-flag config before apply), and pattern 3 (calendar the next tabletop for January 2026). Estimated work: 11 engineering hours.
## When NOT to over-engineer this
Skip the multi-CDN and the byzantine fault-tolerance work if (a) your SaaS does under ₹50 lakh/year revenue and a 3-hour outage costs you under ₹50,000 — the engineering investment will not pay back; or (b) you are pre-product-market fit and your time should go to user research not infrastructure. The three patterns above (circuit breakers, config validation, quarterly tabletop) are floor-level for any SaaS at any stage. The serious resilience work — multi-CDN, multi-region, active-active — is for when revenue justifies the engineering cost.
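To make threshold (a) concrete, here is a back-of-the-envelope estimate. It assumes revenue is spread evenly across the year, which understates the cost for businesses with peak hours, SLA credits, or churn, so treat the result as a floor rather than a forecast. The function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def direct_revenue_at_risk(annual_revenue_inr: float, outage_hours: float) -> float:
    """Naive estimate: annual revenue prorated over the outage window."""
    return annual_revenue_inr * outage_hours / HOURS_PER_YEAR


# Rs 50 lakh/year (5,000,000) over a 3-hour outage
estimate = direct_revenue_at_risk(5_000_000, 3)
print(f"Rs {estimate:,.0f}")  # prints "Rs 1,712"
```

At that revenue level the direct figure sits far below the ₹50,000 threshold, which is why the cheap patterns are the right investment and multi-CDN is not.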
## Our take
Cloudflare's post-mortem is the cleanest example we have seen of a tier-1 vendor explaining a failure in plain language. The follow-up [Code Orange / Fail Small](https://blog.cloudflare.com/fail-small-resilience-plan/) post is worth reading in full — it is a roadmap for how a serious infra company hardens its config pipelines after an incident. Borrow the ideas. The "input hardening" pattern in particular — treat every config file as adversarial, even ones you wrote — is something most SaaS teams skip because "it is our own data." The Cloudflare incident is the proof that "our own data" is not a safety guarantee.
Our [CTO Hrishikesh](/team/rishikesh-baidya) led our internal post-mortem of the day's traffic for our [web development](/services/web-development) and [AI automation](/services/ai-automation) clients; the patterns above are the consolidated output. The fintech case study above is a real engagement; we have the screenshots from yesterday's Slack thread on file. For deeper coverage of how this kind of vendor failure cascades through SaaS dependencies, our founder's writeup on [viveksinra.com](https://viveksinra.com) covers the architectural decisions that come earlier in the design process.
## FAQ
### Was the Cloudflare outage a cyberattack?
No. Cloudflare's post-mortem explicitly states "the issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind." It was a database permission change with an unintended side-effect on a downstream feature-file generator. Boring, mundane, and exactly the kind of thing that breaks production at scale.
### Why did the legacy FL proxy not crash?
The legacy FL proxy treated bot scoring as optional and defaulted to a zero score on missing data. The newer FL2 proxy treated the feature file as mandatory and panicked when the file exceeded its feature limit. Carrying the legacy fail-open behaviour forward would have prevented the outage. This is a classic "lessons learned in v1 forgotten in v2" pattern.
### How did this compare to past Cloudflare outages?
Cloudflare's [July 2019 outage](https://blog.cloudflare.com/cloudflare-outage/) was caused by a regex with catastrophic backtracking — also a pre-deploy validation failure. The June 2022 BGP routing outage was a routing config error. The pattern is consistent: tier-1 outages at hyperscalers come from operational changes that bypass guardrails, not from cyberattacks.
### Should we run a multi-CDN setup?
For most Indian SaaS, no — the operational overhead is high and the marginal availability gain is low if your traffic is single-region. For e-commerce or payment platforms above ₹50 cr revenue, yes — Akamai or Fastly as a hot standby with DNS-level routing pays back. The middle ground (multi-region same-CDN) is usually the right next step before going multi-vendor.
### How do we handle this for our own product config files?
Three rules: (1) never deploy a config file without size and shape validation, (2) keep a "last known good" cache for instant fallback, (3) treat every config field as untrusted input even if you wrote it. The Cloudflare incident is the textbook example of why rule 3 matters.
### Where can we read what other engineers said?
The [Hacker News thread on the post-mortem](https://news.ycombinator.com/item?id=45973709) is the deepest engineering discussion. The [r/sysadmin Cloudflare outage megathread](https://www.reddit.com/r/sysadmin/) has the operator-level perspective. The Cloudflare engineering team also posted a detailed follow-up on [their blog tag for post-mortems](https://blog.cloudflare.com/tag/post-mortem/) which is worth subscribing to.
### Did this affect any of our existing clients yesterday?
Yes, three. All three were back to normal traffic levels within 5 minutes of Cloudflare's restoration. Zero data loss, zero permanent customer impact. The post-incident review for each client triggered the patterns in this post — the same ones we are recommending to anyone reading.
Need a runbook audit + chaos drill for your team?
We run a 1-day vendor-failure tabletop and a 1-week runbook audit for Indian SaaS teams of 5–40 engineers. Outcome: the three patterns above shipped in your stack, a dated runbook on your wiki, and a quarterly tabletop on the calendar. Typical engagement: ₹1.4–₹2.6 lakh. The first call is with our CTO Hrishikesh and our engineering team.