
Web Scraping in the AI Age: What Actually Changed
May 31, 2026
Web scraping in the AI age broke in two places at once: extraction got easy and access got expensive. LLMs killed the brittle CSS selector, and crawlers exploded while referral traffic flatlined. Here is what actually changed and why it matters.
The selector is dead
Most scraping tutorials still teach you to hand-write CSS selectors. You inspect the page, find div.product-card > span.price, and pray the site never ships a redesign. It always ships a redesign.
That old approach was brittle by design. One class rename and your parser returns None. I have lost whole afternoons to a site swapping data-testid attributes overnight.
import requests
from bs4 import BeautifulSoup
html = requests.get("https://example.com/product/42").text
soup = BeautifulSoup(html, "html.parser")
# Breaks the moment the markup shifts
name = soup.select_one("div.product-card > h1.title").text
price = soup.select_one("span.price.current").text.strip("$")
stock = soup.select_one("div.availability > span").text
The new way skips selectors entirely. You convert the page to clean markdown, then hand it to a model with a schema. LLM web scraping means you describe what you want, not where it lives in the DOM.
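from firecrawl import FirecrawlApp
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    in_stock: bool

app = FirecrawlApp(api_key="fc-...")

# Page-to-markdown, then schema extraction. No selectors.
# The option and result key names below match the SDK version this was
# written against; Firecrawl's extraction options have changed between
# releases, so check the current docs before copying.
# On Pydantic v2, Product.model_json_schema() replaces the deprecated .schema().
result = app.scrape_url(
    "https://example.com/product/42",
    params={"extractorOptions": {"extractionSchema": Product.schema()}},
)
product = Product(**result["llm_extraction"])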
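Whatever service does the extraction, the model's reply is still untrusted text until it has been checked against the schema. A minimal, vendor-neutral sketch of that check using only the standard library is below. The `parse_product` helper and its validation rules are illustrative assumptions, not part of Firecrawl or Pydantic.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    name: str
    price: float
    in_stock: bool


def parse_product(raw_json: str) -> Product:
    """Turn a model's JSON reply into a typed Product, or raise ValueError."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    name = data.get("name")
    price = data.get("price")
    in_stock = data.get("in_stock")

    if not isinstance(name, str) or not name.strip():
        raise ValueError("name must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price < 0:
        raise ValueError("price must be a non-negative number")
    if not isinstance(in_stock, bool):
        raise ValueError("in_stock must be true or false")

    return Product(name=name.strip(), price=float(price), in_stock=in_stock)


# Hypothetical model reply, hard-coded here so the sketch runs offline.
reply = '{"name": "Widget", "price": 19.99, "in_stock": true}'
print(parse_product(reply))
```

Failing loudly on a malformed reply is a deliberate choice: a silent `None` is exactly the failure mode that made selector-based parsers painful.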
The mechanics flipped completely. Here is the shift in one view.
- Targeting: old way pinned to exact DOM paths. New way describes the data semantically.
- Fragility: a redesign broke the old parser. The new one shrugs and re-reads the meaning.
- Output: old returned raw strings to clean by hand. New returns typed objects against your schema.
- Tooling: old leaned on BeautifulSoup and lxml. New leans on page-to-markdown layers like Firecrawl and Jina Reader.
I still reach for selectors on high-volume, stable targets. Tokens cost money, and a model call per page adds up. But for the long tail of messy sites, the selector is done.
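To put a rough number on "adds up", here is a back-of-the-envelope estimate. Every input is a placeholder chosen for round arithmetic, not a measured page size or any provider's real price, so substitute your own figures.

```python
def llm_cost_usd(pages: int, tokens_per_page: int, usd_per_million_tokens: float) -> float:
    """Estimated input-token spend for one extraction call per page."""
    return pages * tokens_per_page * usd_per_million_tokens / 1_000_000


# All three inputs are hypothetical placeholders, not real pricing.
daily = llm_cost_usd(pages=100_000, tokens_per_page=3_000, usd_per_million_tokens=0.50)
print(f"~${daily:,.2f} per day")  # prints "~$150.00 per day"
```

A selector-based parser on the same pages costs close to nothing per page, which is why stable, high-volume targets still favor the old way.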
The broken bargain
Scraping used to come with an implied deal. You take my content, you send me visitors. Search worked because Google crawled you, then a human clicked through.
AI broke that bargain. Bots now make up roughly 80% of all web visits, with about 13% from AI bots and 8% from traditional crawlers. The crawl happens, but the click does not.
Look at how lopsided the trade got by July 2025. These are pages crawled per single visitor referred back.
Figure: pages crawled per visitor referred (July 2025)
Google sends back roughly one visitor for every 5 pages it crawls. Anthropic crawls 38,000 pages per referral. That is not a content deal; it is a one-way pipe.
The purpose shifted too. Of AI crawling in July 2025, 79% was for training, 17% for search, and 3.2% for live user actions. Training was 72% a year earlier.
Individual crawlers got hungrier alongside that. GPTBot's crawler share jumped from 4.7% to 11.7%, and ClaudeBot rose from 6% to 9.9%.
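If you want to see those shares on your own site, a first pass is to tally requests by crawler name in the User-Agent header. The sketch below assumes you have already pulled the User-Agent strings out of your access logs. The `count_ai_crawlers` helper and the sample strings are illustrative, and a spoofed user agent will defeat this kind of matching.

```python
from collections import Counter

# Substrings to look for in the User-Agent header. GPTBot and ClaudeBot are
# the crawlers named above; extend the tuple for others you care about.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot")


def count_ai_crawlers(user_agents):
    """Count requests per AI crawler token; everything else lands in 'other'."""
    counts = Counter()
    for ua in user_agents:
        lowered = ua.lower()
        label = next((t for t in AI_CRAWLER_TOKENS if t.lower() in lowered), "other")
        counts[label] += 1
    return counts


# Simplified, made-up User-Agent strings for illustration only.
sample = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0",
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
]
for label, hits in count_ai_crawlers(sample).most_common():
    print(f"{label}: {hits / len(sample):.0%}")
```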
The web fights back
Publishers noticed the meter running with nothing coming back. So the defaults changed. Cloudflare now lets sites block AI crawlers out of the box and meter the rest.
Their pay-per-crawl system charges bots for access at the edge. Want the content for AI training data? Pay the toll or get an HTTP 402 Payment Required response. It is basically robots.txt with a payment processor bolted on.
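From the crawler's side, that turns a 402 into a pricing decision instead of an error. The sketch below shows one way a well-behaved bot might branch on the status code. The `x-example-crawl-price` header, its plain-number format, and the `next_step` helper are hypothetical stand-ins, not Cloudflare's actual protocol, so consult the provider's documentation for the real header names.

```python
# Hypothetical header carrying the quoted price in USD as a plain number.
PRICE_HEADER = "x-example-crawl-price"


def next_step(status: int, headers: dict, budget_usd: float) -> str:
    """Decide what a polite crawler does with a response."""
    if status == 200:
        return "parse"
    if status == 402:
        try:
            price = float(headers.get(PRICE_HEADER))
        except (TypeError, ValueError):
            return "skip"  # no usable price quote, so do not guess
        return "pay" if price <= budget_usd else "skip"
    if status in (401, 403, 429):
        return "back_off"  # blocked or rate limited: slow down, do not evade
    return "skip"


print(next_step(402, {PRICE_HEADER: "0.01"}, budget_usd=0.05))
print(next_step(402, {PRICE_HEADER: "0.50"}, budget_usd=0.05))
print(next_step(403, {}, budget_usd=0.05))
```

The three calls return `pay`, `skip`, and `back_off` respectively. Note what the function never does: retry a blocked request under a different identity.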
Note: robots.txt was always voluntary. It is a polite request, not a wall. Enforcement at the network edge is what gives it teeth.
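Because compliance is voluntary, the check lives in the crawler's own code. Python's standard library ships a parser for it, and the sketch below runs offline against an inline robots.txt. The file contents, the `allowed` helper, and the `SomeOtherBot` name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example policy: block GPTBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, url: str) -> bool:
    """True if the parsed robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


url = "https://example.com/product/42"
print(allowed("GPTBot", url))        # False
print(allowed("SomeOtherBot", url))  # True
```

Nothing here stops a crawler that simply never calls `can_fetch`, which is the whole point of the note above.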
Some crawlers stopped asking politely. Cloudflare de-listed Perplexity as a verified bot on August 4, 2025, accusing it of stealth crawling. The charge: rotating user agents, IPs, and ASNs, plus impersonating a normal Chrome browser to dodge no-crawl directives.
If you run a site and want to control AI bot traffic before it controls you, I wrote a full playbook. See how to block AI crawlers from destroying your site for the practical config.
Worth flagging the irony. Cloudflare sits in front of a huge slice of the web, decrypting and inspecting traffic to do any of this, which I unpacked in why Cloudflare is the biggest man-in-the-middle in history.
The data wall nobody can scrape around
Here is the part that reframes everything. The supply of human-written text is finite, and we are approaching the bottom.
Researchers estimate the public web holds around 300 trillion tokens of high-quality text. At current training appetites, usable human text could run dry between roughly 2026 and 2028. Call it a range, not a doomsday clock.
That scarcity explains the licensing land-grab. Reddit signed with Google for around $60M a year, then with OpenAI for about $70M a year. When the open web dries up, owned text becomes the asset.
The fallbacks are getting attention for a reason.
- Licensing deals lock up high-signal corpora like Reddit, Stack Overflow, and news archives.
- Synthetic data lets models train on model output, with real risk of quality drift over generations.
- Paywalled and private data becomes the next frontier once the public commons is exhausted.
Is web scraping in the AI age legal?
Everyone asks the same thing: is web scraping legal in this new world? The honest answer is that public does not mean free, and the law is messier than either side admits.
The landmark case is hiQ v. LinkedIn, where the Ninth Circuit ruled that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA). Scrapers love to quote that and skip the ending. hiQ lost on contract and terms-of-service grounds, eating a $500K judgment and an injunction.
So scraping public pages may clear the anti-hacking statute and still breach a contract you agreed to. The newer front is copyright. The NYT v OpenAI case moved forward in March 2025, and a November 2025 order let the Times obtain a sample of roughly 20 million ChatGPT logs.
My actionable take: treat scraping as a contract question, not just a technical one. Read the terms, respect robots.txt and pay-per-crawl signals, and prefer licensed or API data when you ship to production. Use ScrapeGraphAI for the easy extraction, but assume the access fight, not the parsing, is now the hard part.