Cloudflare’s pay-per-crawl is built to fail. Here’s why

When Cloudflare announced its new Pay-Per-Crawl marketplace, some people saw a breakthrough. The idea is that if AI companies want to crawl your website to train their models, they should compensate you for the use of your content. As the CEO of a legal AI company recently sued for scraping public data, I’d love for this to work.

However, it won’t, at least not in this way.

Last year, my company, Caseway, was sued by CanLII, the operator of Canada’s free legal decisions database, for allegedly using publicly available court data without a license. I’ve had a front-row seat to the vagueness of the legal rules surrounding AI scraping. And I’ve watched the wave of litigation since. The New York Times sued OpenAI and Microsoft for using millions of its paywalled articles to train GPT-4.

News Corp went after Perplexity for scraping Wall Street Journal content to generate answer pages. GitHub Copilot faces class actions from developers whose open-source code was ingested without attribution. Even Reddit sued Anthropic for allegedly training Claude on its forums without consent.

Scraping is how the AI industry was built, at least for many AI companies.

At first glance, Cloudflare’s new system appears to be a step forward. The company sits in front of 20% of the internet, so if anyone can enforce crawl permissions at scale, it’s them. Cloudflare states that websites can now block AI crawlers by default and require them to pay for each page request. Instead of an arms race over bot blockers and sneaky scrapers, maybe there’s a chance to align incentives.

However, this marketplace makes two significant mistakes and overlooks one even more substantial issue.

Not All Pages Are Equal

The first issue is pricing. Right now, Pay-Per-Crawl treats every page as a billable unit. But come on, a Pulitzer-winning investigation that lasted six months doesn’t have the same value as a transcript of a traffic court decision already in the public domain, which a website like CanLII didn’t even create (a judge made it).

Publishers that invest millions in original journalism or spend years on documentation and research won’t settle for the same flat crawl fee that applies to a government form or an FAQ page.
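To make the gap concrete, here is a rough sketch that compares a flat per-page fee with a tiered one. Every number and tier name below is an assumption made up for illustration; none of it reflects Cloudflare’s actual prices or any publisher’s real rate card.

```python
# All prices and tier names are hypothetical, chosen only to illustrate the point.
# Amounts are in cents so the arithmetic stays exact.
FLAT_FEE_CENTS = 1  # assumed flat price per crawled page

# Assumed per-page prices a publisher might prefer, by content tier.
TIER_PRICES_CENTS = {
    "investigation": 50,   # original, expensive-to-produce reporting
    "documentation": 5,    # research and reference material
    "public_record": 0,    # court decisions, government forms
}


def flat_revenue_cents(pages):
    """Revenue when every page is billed at the same flat rate."""
    return FLAT_FEE_CENTS * sum(pages.values())


def tiered_revenue_cents(pages):
    """Revenue when each page is billed at its tier's rate."""
    return sum(TIER_PRICES_CENTS[tier] * count for tier, count in pages.items())


# A hypothetical crawl: page counts per tier.
crawl = {"investigation": 100, "documentation": 1_000, "public_record": 10_000}

flat_total = flat_revenue_cents(crawl)
tiered_total = tiered_revenue_cents(crawl)

# Share of each bill that is attributable to the investigative pages.
flat_investigation = FLAT_FEE_CENTS * crawl["investigation"]
tiered_investigation = TIER_PRICES_CENTS["investigation"] * crawl["investigation"]

print(f"flat total:   {flat_total} cents, investigations earn {flat_investigation}")
print(f"tiered total: {tiered_total} cents, investigations earn {tiered_investigation}")
```

Under these made-up numbers the two totals are close, but the flat model collects most of its money from public records and almost nothing from the investigative work, which is the opposite of where the value sits.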

Cloudflare’s system doesn’t account for that nuance. So most AI companies (including my company, Caseway) won’t buy in. Why would we pay premium rates for generic content that we can get elsewhere or have already ingested from Common Crawl? Or, more importantly, for content that a non-profit website is merely hosting on behalf of others?

Meta disclosed that 67% of the training data for its LLaMA 1 model came from Common Crawl, which is raw web content collected without payment or consent. OpenAI’s GPT-3 also used hundreds of billions of tokens from Common Crawl. These datasets are massive, free, and already full of scraped content from across the web. Unless you’re offering something significantly better or are legally required to pay, why would an AI firm suddenly switch to paying by the page?

And that brings us to the second problem.

Enforcement Is a Fantasy

Let’s say you’re a serious artificial intelligence lab or company. You’ve seen the lawsuits, and you want to stay compliant. Cloudflare’s Pay-Per-Crawl system might help you track access and pay for what you use.

But that’s not who Cloudflare needs to stop. The AI companies most likely to abuse your content aren’t going to sign up, add a payment method, and politely negotiate crawl rights. They’ll simply spoof their user agent, rotate IP addresses, or use a third-party proxy (maybe in India or China) to obtain the data anyway. And there’s nothing Cloudflare can do about it once the traffic appears to be from a human browser or a generic scraper.
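Here is a minimal sketch of why a check based on the User-Agent header is so easy to sidestep. The blocklist and the `is_blocked` helper are hypothetical and are not how Cloudflare detects bots; real detection relies on many more signals. The sketch only shows that the header is whatever the client chooses to send.

```python
# Hypothetical blocklist of crawler tokens; a real system would use far more signals.
KNOWN_AI_CRAWLERS = ("GPTBot", "ClaudeBot", "CCBot")


def is_blocked(headers):
    """Return True if the self-reported User-Agent names a listed crawler."""
    user_agent = headers.get("User-Agent", "").lower()
    return any(bot.lower() in user_agent for bot in KNOWN_AI_CRAWLERS)


# A crawler that identifies itself honestly is caught.
honest = {"User-Agent": "Mozilla/5.0 (compatible; GPTBot/1.0)"}

# The same crawler sending a browser-like string is not.
spoofed = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

print(is_blocked(honest))   # True
print(is_blocked(spoofed))  # False
```

The only crawlers this kind of check stops are the ones willing to announce themselves.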

Will a non-profit like CanLII pursue a company in Shanghai? Good luck convincing a judge in China to care about free court decisions in Canada.

According to Digiday, media companies like Skift saw OpenAI’s GPTBot hit their sites over 50,000 times a day despite explicitly disallowing it in their robots.txt files. Ziff Davis (owner of PCMag and Mashable) reported that OpenAI’s crawler increased its activity even after being told to stop. And Wikimedia said AI scrapers caused a 50% surge in bandwidth costs this year alone.
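A robots.txt file shows the same weakness. The sketch below uses Python’s standard `urllib.robotparser` to read a sample file; the rules and the example.com URL are placeholders invented for illustration. Note where the check runs: inside the crawler’s own code, which means a crawler that skips the check faces no technical barrier at all.

```python
from urllib.robotparser import RobotFileParser

# Sample rules a publisher might serve; the contents are illustrative only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(user_agent, url):
    """What a well-behaved crawler would ask itself before fetching a page."""
    return parser.can_fetch(user_agent, url)


url = "https://example.com/article"  # placeholder URL
print(may_crawl("GPTBot", url))        # False: the file asks GPTBot to stay away
print(may_crawl("SomeOtherBot", url))  # True: everyone else is allowed

# Nothing in the protocol stops a crawler from fetching the page anyway.
# Compliance is a decision the crawler's operator makes, not something the file enforces.
```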

So, enforcement depends entirely on good faith. But that’s wishful thinking.

Publishers Need Leverage, Not Just Permission

I get why publishers are excited about Pay-Per-Crawl. I’ve been in this business long enough to see how the value chain’s been flipped. I previously ran a lawyer review platform with over 1.1 million lawyers. Traffic, discovery, and reputation used to drive value. Now, however, AI platforms are building sticky interfaces that pull answers directly from content, eliminating the need for any visitor to return to the source.

Cloudflare’s marketplace attempts to address this, but it remains built on the premise that consent and compensation are optional. If AI companies want to pay to train on your data, they will. If not, they won’t.

What publishers need isn’t a crawler paywall. They need actual leverage, which includes legal clarity, enforceable rules, and collective bargaining power.

Some of that might come through the courts, but I doubt it. The pace of litigation is glacial. More promising are industry coalitions advocating for default protections, such as requiring opt-ins, licensing standards, or even machine-readable “do not train” signals. There are also startups like Tollbit that enable publishers to detect AI bots and serve them alternate versions of content, or tollgates, automatically.

These are blunt solutions. However, they shift power back to the people who are actually creating content. That’s the right direction.

The Bottom Line

Cloudflare’s Pay-Per-Crawl is a clever idea. It’s the first genuine attempt to attach a meter to data before it gets swallowed by the AI engine. And for publishers already using Cloudflare, it’s a step toward asserting control.

But it won’t work at scale.

It fails to distinguish between high-value and low-value content. It relies on the honour system for enforcement. And it assumes that large AI companies, which have trained billion-dollar models on free web data for years, will suddenly start paying for data.

If anything, Pay-Per-Crawl exposes a more profound truth: this fight is about power.

This war’s just getting started.

This article was produced as part of TechRadarPro’s Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro