
How to Block AI Crawlers From Destroying Your Site
April 11, 2026
AI crawlers from Meta, OpenAI, and Anthropic are hitting sites millions of times per month. Meta's crawler hit one site 7.9 million times in 30 days, burning 900+ GB of bandwidth before anyone noticed. robots.txt alone does not stop them. This covers the 4 layers that actually work.
The Scale of the Problem
By early 2025, AI crawlers were generating roughly 50 billion requests per day across sites on Cloudflare's network. That number has only gone up. GPTBot surged from 5% to 30% of AI crawler market share between May 2024 and May 2025, a 305% increase in raw request volume.
The crawl-to-referral ratio tells the real story. Google crawls about 14 times per referral it sends. OpenAI's ratio is 1,700:1. Anthropic's is 73,000:1. These crawlers take everything and send almost nothing back.
Chart: AI Crawler Market Share (% of AI bot requests, May 2025)
Here are the user-agent strings you need to know. These are the ones doing the most damage right now.
- GPTBot/1.0 -- OpenAI's training crawler. 569 million requests/month on Cloudflare's network.
- ClaudeBot and anthropic-ai -- Anthropic's crawlers. 370 million requests/month.
- Meta-ExternalAgent and Meta-ExternalFetcher -- Meta's training bots. Emerged fast, now 19% market share.
- Bytespider -- ByteDance's crawler. Scrapes at 25x the speed of GPTBot.
- CCBot/2.0 -- Common Crawl. Its archive is used by many AI companies as training data.
- PerplexityBot -- Perplexity AI search.
- Google-Extended -- Google's AI training crawler (separate from Googlebot).
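If you script against raw logs, the same list folds into a small shell helper. This is a sketch: is_ai_bot is a made-up function name, and the patterns are plain substring matches against the user agents above.

```shell
# Classify a user-agent string against the AI crawler list above.
# is_ai_bot is a hypothetical helper name, not a standard tool.
is_ai_bot() {
  case "$1" in
    *GPTBot*|*ClaudeBot*|*anthropic-ai*|*Meta-ExternalAgent*|*Meta-ExternalFetcher*|*Bytespider*|*CCBot*|*PerplexityBot*|*Google-Extended*)
      echo yes ;;
    *)
      echo no ;;
  esac
}

is_ai_bot "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"   # -> yes
is_ai_bot "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0 Safari/537.36" # -> no
```

Keep in mind this only catches honest crawlers; a spoofed Chrome string sails straight past a substring check, which is why the later layers exist.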
robots.txt Is a Suggestion, Not a Wall
Most guides tell you to add Disallow rules to your robots.txt. That is a starting point, not a solution. robots.txt is a voluntary protocol. Nothing forces a crawler to obey it.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Meta-ExternalFetcher
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /
Put that in your robots.txt and you have told every major AI crawler to stay out. GPTBot, ClaudeBot, and Google-Extended officially respect these rules. But Bytespider has been documented ignoring Disallow directives entirely. Others use user-agent spoofing, rotating through real Chrome and Safari strings to look like regular visitors.
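To sanity-check which agents your robots.txt actually disallows, a quick awk pass does the job. The snippet below runs against an inline sample copy of the file (two entries from the full list) so it is reproducible; point the awk line at your real /robots.txt in practice.

```shell
# Write a sample robots.txt (hypothetical two-entry excerpt)
cat > /tmp/robots_sample.txt <<'EOF'
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
EOF

# Print every user agent whose entire site is disallowed
awk '/^User-agent:/ {ua=$2} /^Disallow: \/$/ {print ua}' /tmp/robots_sample.txt
```

The awk script remembers the last User-agent line it saw and prints it whenever a bare `Disallow: /` follows, so the output here is GPTBot and ClaudeBot, one per line.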
Among the top 50 news sites blocking ChatGPT-User, 70.6% still appeared in AI citation datasets. Blocking Google-Extended fared even worse: 92.3% of blocking sites still showed up. The data they already scraped is still in the models.
Warning: Blocking AI crawlers via robots.txt does not affect your Google or Bing search rankings; Googlebot and Bingbot are separate user agents. But it also does not remove your content from models that already ingested it.
Cloudflare and Nginx Rules That Actually Block
If you want enforcement, you need server-side blocking. robots.txt asks nicely. Firewall rules slam the door.
Cloudflare WAF Rules
Cloudflare shipped a one-click "Block AI Bots" toggle in July 2025 under Security > Bots. Turn it on. It covers verified AI crawlers and bots that try to disguise themselves. But if you want granular control, write a custom WAF rule.
(cf.verified_bot_category eq "AI Crawler") or
(cf.verified_bot_category eq "Aggregator") or
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "anthropic-ai") or
(http.user_agent contains "Meta-ExternalAgent") or
(http.user_agent contains "Meta-ExternalFetcher") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "PerplexityBot") or
(http.user_agent contains "Google-Extended")
Set the action to Block. The cf.verified_bot_category field catches bots Cloudflare has fingerprinted, even if they rotate user agents. The http.user_agent rules catch the ones that still use honest headers. Together they cover both sides.
Nginx User-Agent Blocking
If you run your own server, nginx handles this cleanly with a map directive. The 444 status is nginx-specific: it closes the connection without sending a response, saving you bandwidth on every blocked request.
map $http_user_agent $is_ai_bot {
    default 0;
    ~*GPTBot 1;
    ~*ClaudeBot 1;
    ~*anthropic-ai 1;
    ~*Meta-ExternalAgent 1;
    ~*Meta-ExternalFetcher 1;
    ~*Bytespider 1;
    ~*CCBot 1;
    ~*PerplexityBot 1;
    ~*Google-Extended 1;
    ~*Applebot-Extended 1;
}

server {
    if ($is_ai_bot) {
        return 444;
    }
}
For rate limiting instead of outright blocking, use limit_req_zone. This is better if you want some AI crawlers to index you (for AI search visibility) but at a pace your server can absorb.
# Both map and limit_req_zone belong in the http {} context
map $http_user_agent $bot_limit_key {
    default "";                       # empty key: regular visitors are not limited
    ~*GPTBot $binary_remote_addr;
    ~*ClaudeBot $binary_remote_addr;
    ~*Bytespider $binary_remote_addr;
}

limit_req_zone $bot_limit_key zone=ai_bots:10m rate=5r/m;

server {
    location / {
        limit_req zone=ai_bots burst=2 nodelay;
    }
}
Tip: Use return 444 instead of return 403 in nginx. A 403 still sends response headers and a body; 444 drops the connection immediately, which saves CPU and bandwidth when you are rejecting thousands of bot requests per hour.
The Nuclear Option: Verified Bot Allowlists
Blocking by user-agent works until crawlers start spoofing. The nuclear option is flipping the model: instead of blocking known bad bots, you only allow known good ones. Everything else gets rejected.
In Cloudflare, create a WAF rule that challenges any request where cf.bot_management.verified_bot is false and the bot score is below 30. This forces unverified automated traffic through a JS challenge that real browsers pass and headless scrapers usually fail.
(not cf.bot_management.verified_bot and cf.bot_management.score lt 30)
Set the action to Managed Challenge. Googlebot, Bingbot, and other verified crawlers pass automatically. Unverified scrapers hit a wall. This approach catches user-agent spoofing, residential proxy rotation, and headless browsers that the simpler rules miss.
On nginx without Cloudflare, you can achieve something similar by validating IPs against published bot IP ranges. Google publishes its IP list as JSON. Cross-reference the requesting IP against it, and reject everything that claims to be a bot but comes from an unverified IP.
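As a sketch of that cross-reference, here is a pure-shell IPv4 CIDR check. The function names (ip_to_int, in_cidr) are made up for illustration, and a real setup would loop over every range in Google's published JSON rather than hardcoding one; 66.249.64.0/19 covers well-known Googlebot address space, but verify against the current list.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer
ip_to_int() {
  echo "$1" | awk -F. '{ print $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 }'
}

# Succeed if IP $1 falls inside CIDR block $2 (e.g. 66.249.64.0/19)
in_cidr() {
  ip=$(ip_to_int "$1")
  base=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  size=$(( 1 << (32 - bits) ))
  net=$(( base / size * size ))     # align the base to the block boundary
  [ "$ip" -ge "$net" ] && [ "$ip" -lt $(( net + size )) ]
}

in_cidr 66.249.66.1 66.249.64.0/19 && echo "in range" || echo "outside"   # -> in range
in_cidr 52.14.88.91 66.249.64.0/19 && echo "in range" || echo "outside"   # -> outside
```

A request whose user agent says Googlebot but whose IP fails every published range is a spoofer and can be dropped outright.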
Note: The verified-bot-only approach can block legitimate tools like uptime monitors, link checkers, and RSS readers. Test in log-only mode first. Cloudflare's "Managed Challenge" action is safer than "Block" here because it gives real browsers a chance to prove themselves.
Monitoring What's Still Getting Through
You will never block 100% of AI crawlers. Some rotate through residential proxies, disguise as Chrome 124, and crawl at human-like intervals. The only way to catch them is monitoring.
If you are on nginx, start by tailing your access logs for known bot strings and high-frequency IPs. This one-liner shows the top 20 user agents by request count in the current access log (add a date filter if the file spans more than a day).
awk -F'"' '{print $6}' /var/log/nginx/access.log \
| sort | uniq -c | sort -rn | head -20
If you see a "Chrome" user agent making 50,000 requests in a day from a datacenter IP range, that is not a person. Cross-reference suspicious IPs against ASN databases. Most AI scrapers run from AWS, GCP, or Azure. A residential IP hitting your RSS feed 10 times a minute is also not a person.
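To isolate just the self-identified AI crawler traffic, chain a grep in front of the same pipeline. The snippet below runs against a small hypothetical sample log so the output is reproducible; in practice, point it at /var/log/nginx/access.log.

```shell
# Hypothetical sample of combined-format access log lines
cat > /tmp/sample_access.log <<'EOF'
52.14.88.91 - - [11/Apr/2026:10:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
52.14.88.91 - - [11/Apr/2026:10:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
203.0.113.7 - - [11/Apr/2026:10:00:03 +0000] "GET /c HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/124.0"
198.51.100.9 - - [11/Apr/2026:10:00:04 +0000] "GET /d HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
EOF

# Count requests per IP, AI crawlers only
grep -Ei 'GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot' /tmp/sample_access.log \
  | awk '{print $1}' | sort | uniq -c | sort -rn
```

Against the sample data this prints the GPTBot IP first with 2 hits, then the ClaudeBot IP with 1; the Chrome line is filtered out. Feed the surviving IPs into the whois check below.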
For Cloudflare users, the Bot Analytics dashboard (Security > Bots > Analytics) shows bot scores, verified vs unverified traffic, and request volume by bot category. Check it weekly. You will be surprised at what is getting through.
# Find the IPs sending the most requests to your site
awk '{print $1}' /var/log/nginx/access.log \
| sort | uniq -c | sort -rn | head -20
# Check if a suspicious IP belongs to a cloud provider
whois 52.14.88.91 | grep -iE "orgname|netname"
I run all four layers on my sites. No single layer is enough. Stack them.
- robots.txt for crawlers that play by the rules.
- Cloudflare WAF or nginx map rules for known bad user agents.
- Verified-bot challenges for spoofed traffic and residential proxies.
- Weekly log reviews to catch what slips through.
Common Questions
If you are self-hosting your own stack, you have full control over nginx configs and firewall rules. If you are behind Cloudflare, the managed bot rules plus one custom WAF expression cover 95% of cases.