Daring Fireball: Cloudflare: 'Perplexity Is Using Stealth, Undeclared Crawlers to Evade Website No-Crawl Directives'

Cloudflare: ‘Perplexity Is Using Stealth, Undeclared Crawlers to Evade Website No-Crawl Directives’

The Cloudflare blog:

We are observing stealth crawling behavior from Perplexity, an AI-powered answer engine. Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences. We see continued evidence that Perplexity is repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity, as well as ignoring — or sometimes failing to even fetch — robots.txt files.

The Internet as we have known it for the past three decades is rapidly changing, but one thing remains constant: it is built on trust. There are clear preferences that crawlers should be transparent, serve a clear purpose, perform a specific activity, and, most importantly, follow website directives and preferences. Based on Perplexity’s observed behavior, which is incompatible with those preferences, we have de-listed them as a verified bot and added heuristics to our managed rules that block this stealth crawling. [...]

Our multiple test domains explicitly prohibited all automated access by specifying in robots.txt and had specific WAF rules that blocked crawling from Perplexity’s public crawlers. We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked.

Perplexity has responded, accusing Cloudflare of incompetence and publicity-seeking:

Because Cloudflare has conveniently obfuscated their methodology and declined to answer questions helping our teams understand, we can only narrow this down to two possible explanations.

1. Cloudflare needed a clever publicity moment and we–their own customer–happened to be a useful name to get them one.

2. Cloudflare fundamentally misattributed 3-6M daily requests from BrowserBase’s automated browser service to Perplexity, a basic traffic analysis failure that’s particularly embarrassing for a company whose core business is understanding and categorizing web traffic.

Whichever explanation is the truth, the technical errors in Cloudflare’s analysis aren’t just embarrassing — they’re disqualifying. When you misattribute millions of requests, publish completely inaccurate technical diagrams, and demonstrate a fundamental misunderstanding of how modern AI assistants work, you’ve forfeited any claim to expertise in this space.

Perplexity’s response makes it sound like Cloudflare just doesn’t get how leading-edge AI chatbots work, and what users expect of them. But going back to Cloudflare’s post, they specifically cite OpenAI as an exemplar in respecting the directives of website publishers:

When we ran the same test as outlined above with ChatGPT, we found that ChatGPT-User fetched the robots file and stopped crawling when it was disallowed. We did not observe follow-up crawls from any other user agents or third party bots. When we removed the disallow directive from the robots entry, but presented ChatGPT with a block page, they again stopped crawling, and we saw no additional crawl attempts from other user agents. Both of these demonstrate the appropriate response to website owner preferences.

And nothing in Perplexity’s response attempts to explain Cloudflare’s accusation that Perplexity is adopting a false generic user-agent when their own declared user-agents are disallowed. Seems shifty to me.
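
For readers who haven't looked at the mechanism itself, here is a minimal sketch of the robots.txt check that Cloudflare credits ChatGPT-User with performing, using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` name, the `may_fetch` helper, and the example.com URL are all assumptions made for illustration; none of this is taken from Cloudflare's tests or from any crawler's actual code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that prohibits all automated access,
# in the spirit of the test domains Cloudflare describes.
DISALLOW_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that singles out one declared crawler.
# "ExampleBot" is a made-up name used only for illustration.
DISALLOW_ONE = """\
User-agent: ExampleBot
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/some/page"
    for agent in ("ChatGPT-User", "ExampleBot"):
        print(agent, "vs. disallow-all:", may_fetch(DISALLOW_ALL, agent, url))
        print(agent, "vs. disallow-one:", may_fetch(DISALLOW_ONE, agent, url))
```

The check is entirely voluntary and keyed to whatever name the client chooses to announce. A crawler that swaps its declared name for a generic browser string no longer matches a rule written for that name, which is why a change of user-agent after a block is the crux of Cloudflare's accusation.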

★ Tuesday, 5 August 2025