The AI Scraping Problem
Over the past two years, AI training crawlers have become one of the largest sources of unauthorized content scraping on the web. Where traditional scrapers grabbed product prices or contact data, AI crawlers pull entire articles, original research, how-to guides, and product copy, then feed them into large language model training pipelines without attribution, licensing, or compensation.
The crawlers responsible include GPTBot (OpenAI's training crawler), ClaudeBot (Anthropic), CCBot (Common Crawl — the dataset used by most open-source AI labs), Bytespider (ByteDance, the company behind TikTok), and AhrefsBot when repurposed beyond SEO auditing. These bots are often polite enough to announce themselves via their user agent string — which makes them blockable, if you know where to look.
The problem is scale. Cloudflare's network, which sits in front of roughly 20% of all web traffic, reported in 2024 that AI crawlers were generating traffic equivalent to hundreds of billions of requests per day across its network. For individual site owners, this translates into real bandwidth costs, inflated server load, and content being used to train models that compete with the very businesses that produced the content in the first place.
The good news: Cloudflare has built tools specifically designed to address this. Some of them are free, and they take under 10 minutes to enable.
robots.txt alone is not enough
Many AI crawlers ignore robots.txt directives entirely. According to Cloudflare's bot documentation, a bot that disregards robots.txt is classified as an automated bad bot. You need active blocking at the network layer, not just a polite request in a text file.
What Cloudflare AI Audit Does
In September 2024, Cloudflare launched its AI Audit tool, a dashboard feature that shows site owners exactly which AI crawlers are visiting their site, how frequently, and what pages they are hitting. It sits inside the Cloudflare Security dashboard and works on all plans, including the free tier.
AI Audit goes beyond raw traffic counts. It classifies each bot by its stated purpose (training data collection vs SEO indexing vs price monitoring), shows you the crawler's verified operator (OpenAI, Anthropic, Common Crawl, etc.), and lets you block or allow each one individually with a single click. The underlying bot detection uses Cloudflare's global threat intelligence combined with user-agent verification against known bot operator IP ranges.
Bot Fight Mode (Free)
Heuristic-based detection that blocks simple scrapers, credential stuffers, and many AI crawlers. One toggle. No configuration required. Available on all Cloudflare plans.
Bot Management (Enterprise)
Machine-learning bot scoring per request, Workers integration, detailed traffic attribution, and fine-grained allow/block rules per bot category. Overkill for most sites.
For the majority of MevoHost customers — bloggers, agency sites, small business owners, and SaaS landing pages — Bot Fight Mode plus a handful of custom WAF rules covers 95% of AI scraping activity. You do not need Enterprise Bot Management unless you are running a large content platform with complex bot-allow requirements (e.g., allowing specific data partners while blocking all others).
Check AI Audit before you block
Before enabling Bot Fight Mode, spend two minutes in the Cloudflare AI Audit dashboard. It will show you which AI crawlers are already hitting your site and how much traffic they represent. This helps you prioritize which custom WAF rules to create first.
How to Enable Bot Protection (Free)
Cloudflare's free Bot Fight Mode takes about 60 seconds to enable. Here is the exact path through the dashboard:
Log in to your Cloudflare dashboard
1. Go to dash.cloudflare.com and sign in.
2. Select the domain you want to protect from the account home screen.
Open Security > Bots
1. In the left sidebar, click Security.
2. In the Security submenu, click Bots.
3. You will see the Bot Fight Mode panel at the top of the page.
Enable Bot Fight Mode
1. Toggle Bot Fight Mode to On.
2. No additional configuration is required; it activates immediately.
3. Cloudflare will begin blocking requests from known bad bots at the network edge before they reach your server.
Review AI Audit
1. Scroll down to the AI Audit section.
2. You will see a breakdown of AI crawlers by operator. Click Block next to any crawler you want to stop.
3. If this granular view does not appear on your account, Bot Fight Mode plus WAF rules (covered in the next section) still let you block the same crawlers.
That's it for the baseline. Bot Fight Mode is now active. It will challenge or block requests that match Cloudflare's heuristic bad-bot signatures, including many AI crawlers. But some crawlers will still get through. Self-identifying crawlers that Cloudflare treats as verified bots, such as GPTBot, are not guaranteed to be blocked, and crawlers that rotate IPs and mimic legitimate browser behavior can slip past heuristics. Custom WAF rules close the first gap: they block any crawler that announces itself in its user agent string.
Block Specific AI Scrapers with Custom WAF Rules
Custom Cloudflare WAF rules let you match on specific user agent strings and apply a Block action before the request ever touches your origin server. Because AI crawlers identify themselves in the User-Agent header, this approach is highly effective against crawlers that follow HTTP conventions — even if they ignore robots.txt.
Here is how to create a single WAF rule that blocks the five crawlers named earlier (GPTBot, ClaudeBot, CCBot, Bytespider, and AhrefsBot):
Navigate to Security > WAF > Custom Rules
1. In your Cloudflare dashboard, go to Security > WAF.
2. Click the Custom Rules tab.
3. Click Create Rule.
Name your rule
1. Give the rule a clear name, e.g. "Block AI Training Crawlers".
2. This name appears in your WAF event log so you can track triggered blocks.
Build the expression
1. Switch to the Expression Editor (click "Edit expression" if you see the visual builder).
2. Paste the expression from the code block below exactly as shown.
3. Set the rule's action to Block, then click Save.
(http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "CCBot") or (http.user_agent contains "Bytespider") or (http.user_agent contains "AhrefsBot")
Make sure the rule's action is set to Block before you save it. With that action, Cloudflare returns a 403 Forbidden response to any request carrying one of these user agent strings. The rule runs at the edge, so your origin server never receives the request.
| Bot Name | Operator | Purpose | Respects robots.txt? |
|---|---|---|---|
| GPTBot | OpenAI | LLM training data | Usually |
| ClaudeBot | Anthropic | LLM training data | Usually |
| CCBot | Common Crawl | Open training dataset | Sometimes |
| Bytespider | ByteDance | AI product training | Inconsistent |
| AhrefsBot | Ahrefs | SEO + scraping | Usually |
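One way to confirm the rule is live is to send a request that announces itself as one of the blocked crawlers and look at the status code that comes back. The sketch below is a minimal illustration, not an official Cloudflare tool: the function names `build_probe` and `probe_status` are hypothetical, `https://example.com/` is a placeholder for your own domain, and `GPTBot/1.0` is a stand-in user agent string chosen only because it contains a blocked token.

```python
import urllib.error
import urllib.request


def build_probe(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that announces itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send the probe and return the HTTP status code of the response."""
    try:
        with urllib.request.urlopen(build_probe(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is what we want.
        return err.code


# Example (performs a real network request, so it is left commented out).
# Replace the placeholder URL with your own domain before running:
# print(probe_status("https://example.com/", "GPTBot/1.0"))
```

If the rule is working, the probe should come back as 403. Other security features may also answer scripted requests with a 403, so check the Security Events log to confirm which rule actually triggered.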
Do not block Googlebot or Bingbot
Googlebot and Bingbot are search indexing crawlers, not AI training scrapers. Their user agent strings contain Googlebot and bingbot, and none of the strings in the WAF rule above match either one. Your search rankings are safe.
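To see why search crawlers are unaffected, it can help to replay the rule's logic locally. The sketch below assumes the rule behaves as a case-sensitive substring test on the User-Agent header, which is how the `contains` operator is documented. The helper name `rule_matches` is hypothetical, and the sample user agent strings are illustrative rather than an exhaustive list of what each crawler sends.

```python
# Tokens taken from the WAF expression in this guide.
BLOCKED_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "AhrefsBot"]


def rule_matches(user_agent: str) -> bool:
    """Return True if any blocked token is a case-sensitive substring."""
    return any(token in user_agent for token in BLOCKED_TOKENS)


# Illustrative user agent strings (assumed examples, not a complete list).
SAMPLES = {
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "Googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "bingbot": "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
}

for name, ua in SAMPLES.items():
    verdict = "blocked" if rule_matches(ua) else "allowed"
    print(f"{name}: {verdict}")
```

Because the comparison is case-sensitive, a crawler that sent `gptbot` in lowercase would not match this rule as written.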
Update Your robots.txt
Your Cloudflare WAF rule is the enforcement layer. Your robots.txt file is the declaration layer — a statement of intent that polite crawlers respect and that courts and regulators increasingly reference when determining whether scraping constituted unauthorized access.
Add the following directives to the robots.txt file in your site's root directory (for WordPress sites, this is usually at /public_html/robots.txt):
    # Block OpenAI training crawler
    User-agent: GPTBot
    Disallow: /

    # Block Anthropic training crawler
    User-agent: ClaudeBot
    Disallow: /

    # Block Common Crawl (used by many AI labs)
    User-agent: CCBot
    Disallow: /

    # Block ByteDance / TikTok AI crawler
    User-agent: Bytespider
    Disallow: /

    # Block Ahrefs crawler from training scrapes
    User-agent: AhrefsBot
    Disallow: /

    # Block Google's AI training crawler (separate from Googlebot)
    User-agent: Google-Extended
    Disallow: /

    # Allow all legitimate search engine crawlers
    User-agent: *
    Allow: /
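Note the Google-Extended entry. This is the robots.txt token Google uses to control whether your content is used for its Gemini AI models, and it is completely separate from Googlebot, which handles search indexing. Google-Extended is a control token rather than a distinct crawler: it does not appear as its own User-Agent header in requests, which is why it belongs in robots.txt and not in the WAF rule. Blocking Google-Extended does not affect your Google search rankings. Those are handled by Googlebot, which has no Disallow entry of its own and so falls under the User-agent: * rule at the bottom.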
Belt-and-braces approach: Use both Cloudflare WAF rules AND robots.txt. The WAF rule blocks crawlers at the network edge — they never hit your server. The robots.txt entry creates a documented declaration that these crawlers were explicitly forbidden. Together, they cover both the technical and legal bases.
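Before uploading the file, you can sanity-check the directives with Python's standard-library `urllib.robotparser`. This is a sketch under stated assumptions: the `ROBOTS_TXT` string is an abbreviated copy of the directives from this guide, `is_allowed` is a hypothetical helper name, and `https://example.com/blog/post` is a placeholder URL. It only shows how a compliant parser reads the file. It says nothing about whether a given crawler actually obeys it.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the directives from this guide.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(agent: str, url: str = "https://example.com/blog/post") -> bool:
    """Ask a standards-following parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


for agent in ["GPTBot", "ClaudeBot", "Google-Extended", "Googlebot", "bingbot"]:
    print(f"{agent}: {'allowed' if is_allowed(agent) else 'disallowed'}")
```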
Monitor & Review Bot Traffic
Enabling Bot Fight Mode and adding WAF rules is not a set-and-forget task. AI crawlers evolve. New ones appear. Existing ones change their user agent strings. You need to check your Cloudflare analytics at least monthly to verify your rules are working and catch any new scrapers that have appeared.
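For the monthly check, a small tally of which known crawlers show up can make trends easier to spot. The sketch below assumes you already have a list of user agent strings, for example copied from your security event log. How you export them is outside its scope. The function name `tally_ai_crawlers` and the sample strings are hypothetical.

```python
from collections import Counter

# Tokens covered by the WAF rule in this guide.
KNOWN_AI_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "AhrefsBot"]


def tally_ai_crawlers(user_agents):
    """Count requests per known crawler token (case-sensitive substring match)."""
    counts = Counter()
    for ua in user_agents:
        for token in KNOWN_AI_TOKENS:
            if token in ua:
                counts[token] += 1
                break  # count each request once
    return counts


# Illustrative input; real data would come from your own logs.
sample_user_agents = [
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "CCBot/2.0 (https://commoncrawl.org/faq/)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]

for token, hits in tally_ai_crawlers(sample_user_agents).most_common():
    print(f"{token}: {hits}")
```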
How to read bot traffic in Cloudflare Analytics
When to update your WAF rules
If you see a new user agent string appearing in your Security Events log that looks like an AI crawler but is not yet covered by your rules, add it. The pattern is simple: if the user agent contains a recognizable AI or scraping product name, or if the IP range belongs to a known AI company (OpenAI publishes its crawler IP ranges in its documentation), add a new WAF rule to match it.
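As the list of crawlers grows, hand-editing the expression gets error-prone. One option is to keep the tokens in a list and generate the expression from it. The sketch below is a minimal illustration: `build_expression` is a hypothetical helper, and `ExampleAIBot` is a made-up token used only to show how a newly spotted crawler would be added.

```python
def build_expression(tokens):
    """Join user agent tokens into an 'or' expression for a custom WAF rule."""
    if not tokens:
        raise ValueError("at least one token is required")
    for token in tokens:
        # Refuse characters that would break out of the quoted string.
        if '"' in token or "\\" in token:
            raise ValueError(f"unsupported character in token: {token!r}")
    clauses = [f'(http.user_agent contains "{token}")' for token in tokens]
    return " or ".join(clauses)


current = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "AhrefsBot"]

# "ExampleAIBot" is a made-up name standing in for a newly observed crawler.
updated = current + ["ExampleAIBot"]

print(build_expression(updated))
```

Paste the printed expression into the existing rule's Expression Editor. That keeps all user agent matches in one rule, so you do not need a new rule per crawler.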
One toggle in Security > Bots. No configuration. Immediate effect.
One custom rule with five user agent matches. Copy-paste the expression above.
Check Security Events once a month for new scrapers that need new WAF rules.
Your Site. Your Content.
All MevoHost plans include Cloudflare integration out of the box. Add your domain to Cloudflare for free, enable Bot Fight Mode in 60 seconds, and stop AI scrapers before they hit your server.
FAQ
Is Cloudflare Bot Fight Mode free?
Yes. Cloudflare's Bot Fight Mode is available on the free plan. It provides heuristic-based detection that stops simple scraper bots, credential stuffers, and many AI crawlers with a single toggle. Paid plans (Pro and above) unlock Bot Analytics, which shows you detailed bot score breakdowns and traffic attribution.
Does Cloudflare Bot Fight Mode block GPTBot?
Cloudflare's Bot Fight Mode does not guarantee blocking all GPTBot traffic — OpenAI's crawler is classified as a "verified bot" by Cloudflare because it respects robots.txt. To definitively block GPTBot, you need a custom WAF rule targeting the user agent string "GPTBot". Combine the WAF rule with a robots.txt Disallow directive for belt-and-braces coverage.
What is the difference between Cloudflare Bot Fight Mode and Bot Management?
Bot Fight Mode is the free, heuristic-based layer that blocks clearly automated traffic. Bot Management (available on Enterprise plans) adds machine-learning bot scoring, per-request bot scores via Workers, and detailed traffic attribution for every request. For most websites, Bot Fight Mode plus custom WAF rules is sufficient. Only large-scale platforms with complex bot-traffic patterns need full Bot Management.
Will blocking AI bots hurt my SEO?
No. Googlebot and Bingbot are not AI training crawlers; they are search indexing crawlers and are classified separately. Cloudflare's Bot Fight Mode does not block verified SEO crawlers. Your custom WAF rule targets specific crawler user agents (GPTBot, ClaudeBot, CCBot, Bytespider, AhrefsBot) and not generic crawler strings. Googlebot's user agent contains "Googlebot", not any of the above, so it is unaffected. Blocking AhrefsBot only limits the data Ahrefs collects about your site; it does not change how search engines index it.
What AI bots should I block in robots.txt?
The most active AI training crawlers as of 2026 are: GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl, used by many AI labs), Bytespider (ByteDance/TikTok), Google-Extended (Google Gemini training), and AhrefsBot when used for scraping rather than SEO audits. Add a Disallow: / directive for each in robots.txt. Note that robots.txt is advisory — crawlers that ignore it need a Cloudflare WAF rule to actually stop them.
Sarah Kim
SEO & Security Specialist at MevoHost
Sarah has spent 8+ years at the intersection of technical SEO and web security, helping site owners protect their content from scrapers while maintaining the indexability signals that drive organic traffic. At MevoHost, she covers bot protection, Cloudflare configuration, and the evolving landscape of AI content scraping — translating Cloudflare's documentation into practical steps for non-enterprise site owners.