
How to Block AI Crawlers From Destroying Your Site
April 11, 2026
AI crawlers from Meta, OpenAI, and Anthropic are hitting sites millions of times per month. Meta's crawler hit one site 7.9 million times in 30 days, burning 900+ GB of bandwidth before anyone noticed. robots.txt alone does not stop them. This covers the 4 layers that actually work.
The Scale of the Problem
By early 2025, AI crawlers were generating roughly 50 billion requests per day across sites on Cloudflare's network. That number has only gone up. GPTBot surged from 5% to 30% of AI crawler market share between May 2024 and May 2025, a 305% increase in raw request volume.
The crawl-to-referral ratio tells the real story. Google crawls about 14 times per referral it sends. OpenAI's ratio is 1,700:1. Anthropic's is 73,000:1. These crawlers take everything and send almost nothing back.
Chart: AI Crawler Market Share (% of AI bot requests, May 2025)
Here are the user-agent strings you need to know. These are the ones doing the most damage right now.
- GPTBot/1.0 -- OpenAI's training crawler. 569 million requests/month on Cloudflare's network.
- ClaudeBot and anthropic-ai -- Anthropic's crawlers. 370 million requests/month.
- Meta-ExternalAgent and Meta-ExternalFetcher -- Meta's training bots. Emerged fast, now 19% market share.
- Bytespider -- ByteDance's crawler. Scrapes at 25x the speed of GPTBot.
- CCBot/2.0 -- Common Crawl. Its archive is used by many AI companies as training data.
- PerplexityBot -- Perplexity AI search.
- Google-Extended -- Google's AI training crawler (separate from Googlebot).
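If you script against raw logs, the same list folds into a small shell helper. This is a sketch: is_ai_bot is a made-up function name, and the patterns are plain substring matches against the user agents above.

```shell
# Classify a user-agent string against the AI crawler list above.
# is_ai_bot is a hypothetical helper name, not a standard tool.
is_ai_bot() {
  case "$1" in
    *GPTBot*|*ClaudeBot*|*anthropic-ai*|*Meta-ExternalAgent*|*Meta-ExternalFetcher*|*Bytespider*|*CCBot*|*PerplexityBot*|*Google-Extended*)
      echo yes ;;
    *)
      echo no ;;
  esac
}

is_ai_bot "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"   # -> yes
is_ai_bot "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0 Safari/537.36" # -> no
```

Keep in mind this only catches honest crawlers; a spoofed Chrome string sails straight past a substring check, which is why the later layers exist.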
robots.txt Is a Suggestion, Not a Wall
Most guides tell you to add Disallow rules to your robots.txt. That is a starting point, not a solution. robots.txt is a voluntary protocol. Nothing forces a crawler to obey it.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Meta-ExternalFetcher
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /
Put that in your robots.txt and you have told every major AI crawler to stay out. GPTBot, ClaudeBot, and Google-Extended officially respect these rules. But Bytespider has been documented ignoring Disallow directives entirely. Others use user-agent spoofing, rotating through real Chrome and Safari strings to look like regular visitors.
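To sanity-check which agents your robots.txt actually disallows, a quick awk pass does the job. The snippet below runs against an inline sample copy of the file (two entries from the full list) so it is reproducible; point the awk line at your real /robots.txt in practice.

```shell
# Write a sample robots.txt (hypothetical two-entry excerpt)
cat > /tmp/robots_sample.txt <<'EOF'
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
EOF

# Print every user agent whose entire site is disallowed
awk '/^User-agent:/ {ua=$2} /^Disallow: \/$/ {print ua}' /tmp/robots_sample.txt
```

The awk script remembers the last User-agent line it saw and prints it whenever a bare `Disallow: /` follows, so the output here is GPTBot and ClaudeBot, one per line.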
Among the top 50 news sites blocking ChatGPT-User, 70.6% still appeared in AI citation datasets. Blocking Google-Extended fared even worse: 92.3% of blocking sites still showed up. The data they already scraped is still in the models.
Warning: Blocking AI crawlers via robots.txt does not affect your Google or Bing search rankings; Googlebot and Bingbot are separate user agents. But it also does not remove your content from models that already ingested it.
Cloudflare and Nginx Rules That Actually Block
If you want enforcement, you need server-side blocking. robots.txt asks nicely. Firewall rules slam the door.
Cloudflare WAF Rules
Cloudflare shipped a one-click "Block AI Bots" toggle in July 2025 under Security > Bots. Turn it on. It covers verified AI crawlers and bots that try to disguise themselves. But if you want granular control, write a custom WAF rule.
(cf.verified_bot_category eq "AI Crawler") or
(cf.verified_bot_category eq "Aggregator") or
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "anthropic-ai") or
(http.user_agent contains "Meta-ExternalAgent") or
(http.user_agent contains "Meta-ExternalFetcher") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "PerplexityBot") or
(http.user_agent contains "Google-Extended")
Set the action to Block. The cf.verified_bot_category field catches bots Cloudflare has fingerprinted, even if they rotate user agents. The http.user_agent rules catch the ones that still use honest headers. Together they cover both sides.
Nginx User-Agent Blocking
If you run your own server, nginx handles this cleanly with a map directive. The 444 status is nginx-specific: it closes the connection without sending a response, saving you bandwidth on every blocked request.
map $http_user_agent $is_ai_bot {
    default 0;
    ~*GPTBot 1;
    ~*ClaudeBot 1;
    ~*anthropic-ai 1;
    ~*Meta-ExternalAgent 1;
    ~*Meta-ExternalFetcher 1;
    ~*Bytespider 1;
    ~*CCBot 1;
    ~*PerplexityBot 1;
    ~*Google-Extended 1;
    ~*Applebot-Extended 1;
}

server {
    if ($is_ai_bot) {
        return 444;
    }
}
For rate limiting instead of outright blocking, use limit_req_zone. This is better if you want some AI crawlers to index you (for AI search visibility) but at a pace your server can absorb.
# Both map and limit_req_zone belong in the http {} context
map $http_user_agent $bot_limit_key {
    default "";                       # empty key: regular visitors are not limited
    ~*GPTBot $binary_remote_addr;
    ~*ClaudeBot $binary_remote_addr;
    ~*Bytespider $binary_remote_addr;
}

limit_req_zone $bot_limit_key zone=ai_bots:10m rate=5r/m;

server {
    location / {
        limit_req zone=ai_bots burst=2 nodelay;
    }
}
Tip: Use return 444 instead of return 403 in nginx. A 403 still sends response headers and a body; 444 drops the connection immediately, which saves CPU and bandwidth when you are rejecting thousands of bot requests per hour.
The Nuclear Option: Verified Bot Allowlists
Blocking by user-agent works until crawlers start spoofing. The nuclear option is flipping the model: instead of blocking known bad bots, you only allow known good ones. Everything else gets rejected.
In Cloudflare, create a WAF rule that challenges any request where cf.bot_management.verified_bot is false and the bot score is below 30. This forces unverified automated traffic through a JS challenge that real browsers pass and headless scrapers usually fail.
(not cf.bot_management.verified_bot and cf.bot_management.score lt 30)
Set the action to Managed Challenge. Googlebot, Bingbot, and other verified crawlers pass automatically. Unverified scrapers hit a wall. This approach catches user-agent spoofing, residential proxy rotation, and headless browsers that the simpler rules miss.
On nginx without Cloudflare, you can achieve something similar by validating IPs against published bot IP ranges. Google publishes its IP list as JSON. Cross-reference the requesting IP against it, and reject everything that claims to be a bot but comes from an unverified IP.
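As a sketch of that cross-reference, here is a pure-shell IPv4 CIDR check. The function names (ip_to_int, in_cidr) are made up for illustration, and a real setup would loop over every range in Google's published JSON rather than hardcoding one; 66.249.64.0/19 covers well-known Googlebot address space, but verify against the current list.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer
ip_to_int() {
  echo "$1" | awk -F. '{ print $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 }'
}

# Succeed if IP $1 falls inside CIDR block $2 (e.g. 66.249.64.0/19)
in_cidr() {
  ip=$(ip_to_int "$1")
  base=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  size=$(( 1 << (32 - bits) ))
  net=$(( base / size * size ))     # align the base to the block boundary
  [ "$ip" -ge "$net" ] && [ "$ip" -lt $(( net + size )) ]
}

in_cidr 66.249.66.1 66.249.64.0/19 && echo "in range" || echo "outside"   # -> in range
in_cidr 52.14.88.91 66.249.64.0/19 && echo "in range" || echo "outside"   # -> outside
```

A request whose user agent says Googlebot but whose IP fails every published range is a spoofer and can be dropped outright.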
Note: The verified-bot-only approach can block legitimate tools like uptime monitors, link checkers, and RSS readers. Test in log-only mode first. Cloudflare's "Managed Challenge" action is safer than "Block" here because it gives real browsers a chance to prove themselves.
Monitoring What's Still Getting Through
You will never block 100% of AI crawlers. Some rotate through residential proxies, disguise as Chrome 124, and crawl at human-like intervals. The only way to catch them is monitoring.
If you are on nginx, start by tailing your access logs for known bot strings and high-frequency IPs. This one-liner shows the top 20 user agents by request count in the current access log (add a date filter if the file spans more than a day).
awk -F'"' '{print $6}' /var/log/nginx/access.log \
| sort | uniq -c | sort -rn | head -20
If you see a "Chrome" user agent making 50,000 requests in a day from a datacenter IP range, that is not a person. Cross-reference suspicious IPs against ASN databases. Most AI scrapers run from AWS, GCP, or Azure. A residential IP hitting your RSS feed 10 times a minute is also not a person.
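To isolate just the self-identified AI crawler traffic, chain a grep in front of the same pipeline. The snippet below runs against a small hypothetical sample log so the output is reproducible; in practice, point it at /var/log/nginx/access.log.

```shell
# Hypothetical sample of combined-format access log lines
cat > /tmp/sample_access.log <<'EOF'
52.14.88.91 - - [11/Apr/2026:10:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
52.14.88.91 - - [11/Apr/2026:10:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
203.0.113.7 - - [11/Apr/2026:10:00:03 +0000] "GET /c HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/124.0"
198.51.100.9 - - [11/Apr/2026:10:00:04 +0000] "GET /d HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
EOF

# Count requests per IP, AI crawlers only
grep -Ei 'GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot' /tmp/sample_access.log \
  | awk '{print $1}' | sort | uniq -c | sort -rn
```

Against the sample data this prints the GPTBot IP first with 2 hits, then the ClaudeBot IP with 1; the Chrome line is filtered out. Feed the surviving IPs into the whois check below.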
For Cloudflare users, the Bot Analytics dashboard (Security > Bots > Analytics) shows bot scores, verified vs unverified traffic, and request volume by bot category. Check it weekly. You will be surprised at what is getting through.
# Find the IPs sending the most requests to your site
awk '{print $1}' /var/log/nginx/access.log \
| sort | uniq -c | sort -rn | head -20
# Check if a suspicious IP belongs to a cloud provider
whois 52.14.88.91 | grep -iE "orgname|netname"
I run all four layers on my sites. No single layer is enough. Stack them.
- robots.txt for crawlers that play by the rules.
- Cloudflare WAF or nginx map rules for known bad user agents.
- Verified-bot challenges for spoofed traffic and residential proxies.
- Weekly log reviews to catch what slips through.
Common Questions
If you are self-hosting your own stack, you have full control over nginx configs and firewall rules. If you are behind Cloudflare, the managed bot rules plus one custom WAF expression cover 95% of cases.