How AI Agents Fetch Pages — and Where the Pipeline Breaks
Last reviewed:
Raw versus rendered is familiar SEO ground. The newer failure is payload depth: the model may never receive enough of the page to use it well.
What changed
SEO practitioners already know raw HTML can differ from the rendered page — that’s not the new problem. The newer risk starts after fetch and, where available, after render: an AI system may convert the page into text or Markdown, trim it, cache it, rank passages from it, or pass only a shortened extraction payload into the model.
The working question is no longer only “is the content in source or in the DOM?” It’s “does the content survive into the payload the model actually reads?” A URL has at least three practical surfaces: the raw HTML returned by the server, the rendered DOM after JavaScript runs, and the extraction payload produced by the retrieval layer. Authors usually design the rendered page. AI systems often operate on something smaller.
- Raw HTML: the first response returned by the server
- Rendered DOM: the page after JavaScript runs and the browser builds the interface
- Extraction payload: the shortened, processed version the retrieval layer passes to the model
Search crawlers, AI training crawlers, and user-triggered AI fetchers are also now separate access paths. OpenAI documents GPTBot for training, OAI-SearchBot for ChatGPT search surfacing, and ChatGPT-User for user-initiated page visits. Anthropic and Perplexity publish similar splits between automated crawling and user-directed retrieval. Each path can receive, process, and keep a different subset of the same URL.
Google remains the important exception to keep straight. Googlebot can render JavaScript with a headless Chromium pipeline, and Google uses rendered HTML for indexing when rendering succeeds. Applebot may render too. However, field data from Vercel and MERJ (published January 2025, drawn from hundreds of millions of crawler fetches across Vercel’s network) found zero JavaScript execution from the OpenAI, Anthropic, Meta, ByteDance, and Perplexity crawlers: in that sample they fetched JavaScript files but never ran them. That doesn’t guarantee every future fetch will behave the same way, but it does show that “Google can render it” is no longer a sufficient AI visibility test.
Why it matters
Raw-source checks and rendered-DOM checks can both pass while the AI-facing payload still fails. A crawler can fetch the page, a renderer can build the interface, and the model can still receive a partial, reordered, or boilerplate-heavy version of the document.
The effect differs across training, search, and live page retrieval. If a training crawler receives a JavaScript shell with no body copy, the training pipeline has little useful content from that URL. If a search crawler receives the same shell, the page may be eligible in policy terms but weak in retrieval terms. If a user-triggered fetcher receives a cookie wall, WAF challenge, or consent interstitial, the model may read the obstacle instead of the article. The author intended the browser experience; the system used the fetch experience.
The second failure is quieter: even a rendered page is not a promise that the whole rendered page reached the model. Most AI platforms don’t publish fixed character limits, wait times, pruning rules, cache rules, or DOM serialization behavior for their fetch tools. Some retrieve from a search index, some visit the page at query time, some convert HTML to text or Markdown before the model sees it. Large inline CSS blocks, repeated navigation, oversized headers, tabbed panels, and late document order can all consume the useful budget before the important claim appears.
That’s the practical break: a page can look correct, pass a browser review, rank in traditional search, and still be a poor source for AI systems because the relevant text was absent, buried, gated, or trimmed before the model read it.
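One rough way to see payload depth is to flatten a page to text in document order and measure how far into that text a key claim first appears. The sketch below is an approximation under stated assumptions: `extract_text`, `claim_offset`, and the `budget` value are hypothetical names and numbers chosen for illustration, since platforms generally do not publish their limits or their extraction rules.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text in document order, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalize whitespace so offsets reflect readable text, not markup layout.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(" ".join(data.split()))


def extract_text(html: str) -> str:
    """Naive stand-in for a retrieval layer's HTML-to-text step (assumption)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def claim_offset(html: str, claim: str, budget: int = 2000):
    """Return (offset, fits): where the claim starts and whether it ends within
    a hypothetical character budget. Offset is -1 when the claim is absent."""
    text = extract_text(html)
    offset = text.find(claim)
    return offset, (0 <= offset and offset + len(claim) <= budget)
```

Running `claim_offset` on the same URL before and after moving navigation below the article gives a simple before/after number. The budget is only a yardstick for comparison, not a documented limit of any platform.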
What’s still true
- Raw HTML is the safest common denominator. Put the title, canonical, robots directives, primary heading, main answer, important internal links, publication dates, and JSON-LD structured data in the first response whenever the page needs to be found, cited, or reused.
- Server-side rendering, static site generation, and partial hydration are still the reliable fixes for critical content. Client-side JavaScript is fine for interface enhancement; it’s fragile as the only carrier of crawl-worthy text.
- A successful Google render does not prove successful OpenAI, Anthropic, Perplexity, or Common Crawl ingestion — these are different systems with different cost models and different product incentives.
- Robots controls decide access, not comprehension. Allowing OAI-SearchBot, ClaudeBot, or PerplexityBot does not make JavaScript-only content readable.
- Infrastructure still wins before policy. CDN rules, bot protection, rate limits, geographic blocks, and session requirements can override a permissive robots.txt by serving a different response.
- Document order matters. A sidebar that appears visually on the right can still serialize before the article; a long navigation block can push the first useful sentence thousands of characters down the payload.
- Long tabbed pages are an information-architecture decision, not just an interface pattern. When each tab carries standalone retrieval value, separate crawlable URLs usually give agents a cleaner payload than one crowded parent page.
What to do now
Compare the surfaces
- Fetch the raw response with curl -s -L https://example.com/page and confirm the main content appears without JavaScript.
- Repeat the test without cookies and behind the same CDN or WAF rules real crawlers hit. A clean logged-in browser session is not evidence of crawler access.
- Compare raw source, rendered DOM, and text extraction; see Audit Content Extraction for that audit in full.
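The raw-response check above can also be scripted so it runs against many URLs. The following is a minimal sketch using only the Python standard library. `fetch_raw`, `missing_phrases`, and the `raw-surface-check/0.1` user agent string are hypothetical names for illustration, and the phrase list is whatever you consider the page's essential content.

```python
import urllib.request


def fetch_raw(url: str, user_agent: str = "raw-surface-check/0.1",
              timeout: float = 10.0) -> str:
    """Fetch the first response with no cookies and no JavaScript.
    urlopen follows redirects by default, similar to curl -L."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def missing_phrases(raw_html: str, phrases: list[str]) -> list[str]:
    """Return the phrases that do not appear in the raw HTML.
    Matching is case-insensitive and ignores whitespace differences."""
    haystack = " ".join(raw_html.split()).lower()
    return [p for p in phrases if " ".join(p.split()).lower() not in haystack]


if __name__ == "__main__":
    html = fetch_raw("https://example.com/page")
    print(missing_phrases(html, ["Example Domain"]))
```

This is a plain substring match, so a phrase that spans inline tags such as `<strong>` will be reported as missing even when it is present. Treat a hit in the missing list as a prompt to inspect the source, not as proof.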
Make the first response useful
- Serve the core answer, article body, product facts, citations, internal links, and JSON-LD in HTML before client-side hydration.
- Remove avoidable inline CSS and script bloat above the content — every repeated block before the article competes with the text you want extracted.
- Keep key claims near the start of the document; don’t make the only concise answer a final paragraph after navigation, related links, ads, or tab variants.
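To confirm that JSON-LD is actually in the first response rather than injected later by JavaScript, the raw HTML can be scanned for `application/ld+json` script blocks. This is a small sketch under simple assumptions: `json_ld_types` is a hypothetical helper, it reads only top-level objects or arrays, and it does not descend into `@graph`.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list of chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def json_ld_types(raw_html: str) -> list:
    """Return the @type of each parseable top-level JSON-LD item in raw HTML."""
    collector = JsonLdCollector()
    collector.feed(raw_html)
    collector.close()
    types = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed block: skipped here, worth flagging in a real audit
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and "@type" in item:
                types.append(item["@type"])
    return types
```

An empty result on the raw response, combined with structured data that shows up in a browser's rendered DOM, suggests the markup is being added client-side.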
Route long tab sets as real pages
- Break long tabbed content into separate URLs when each tab has enough standalone value to rank, be cited, be shared, or be audited independently.
- Keep the tab interface if it helps users, but make each tab a real <a href> route with server-rendered content, not a button-only JavaScript panel.
- Don’t split small UI conveniences into thin near-duplicate pages. Short filters, pricing toggles, and comparison panels can stay on one URL when the content only makes sense in the parent context.
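A quick way to spot button-only tabs is to count how many tab controls in the raw HTML are real links. The sketch below assumes tabs are marked with the WAI-ARIA `role="tab"` attribute, which not every site uses. `audit_tabs` is a hypothetical helper name, and pages that mark tabs differently would need a different selector.

```python
from html.parser import HTMLParser


class TabAudit(HTMLParser):
    """Split role="tab" elements into routed links and script-only controls."""

    def __init__(self):
        super().__init__()
        self.routed = []      # href values of tabs that are real links
        self.script_only = 0  # tabs with no crawlable URL

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (attributes.get("role") or "").lower() != "tab":
            return
        href = attributes.get("href") or ""
        # Fragment and javascript: links do not lead to a separate document.
        if tag == "a" and href and not href.startswith(("#", "javascript:")):
            self.routed.append(href)
        else:
            self.script_only += 1


def audit_tabs(raw_html: str):
    """Return (routed_hrefs, script_only_count) for the given HTML."""
    audit = TabAudit()
    audit.feed(raw_html)
    audit.close()
    return audit.routed, audit.script_only
```

A high script-only count is not automatically a problem. It matters when those panels carry content that should be found or cited on its own.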
Separate the pipelines
- Map training, search indexing, and user-triggered fetches separately for each platform — one robots decision rarely controls all three.
- Verify actual visits in server logs against published IP ranges where the platform provides them — see Parse Log Files for AI Bot Behavior; policy intent is not the same as network reality.
- Treat “not cited” as an access and extraction problem first, not a vague content-quality problem — quality matters after the system can actually read the page.
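Log verification comes down to two checks per request: does the user agent claim to be an AI fetcher, and does the client IP fall inside a range the platform has published. The sketch below shows that pairing with the standard `ipaddress` module. The CIDR blocks are placeholder documentation ranges (RFC 5737), not real platform ranges, and `classify_hit` is a hypothetical helper. Substitute the ranges each platform actually publishes.

```python
import ipaddress

# User agent tokens named in this article.
AI_AGENT_TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")

# PLACEHOLDER ranges from the RFC 5737 documentation blocks, NOT real data.
# Replace with the ranges each platform publishes, where it publishes any.
PUBLISHED_RANGES = {
    "GPTBot": [ipaddress.ip_network("192.0.2.0/24")],
    "OAI-SearchBot": [ipaddress.ip_network("198.51.100.0/24")],
}


def classify_hit(client_ip: str, user_agent: str, ranges=PUBLISHED_RANGES):
    """Return (token, verified). token is the claimed AI agent or None;
    verified is True only when the IP sits inside a listed range for it."""
    lowered = user_agent.lower()
    token = next((t for t in AI_AGENT_TOKENS if t.lower() in lowered), None)
    if token is None:
        return None, False
    address = ipaddress.ip_address(client_ip)
    verified = any(address in network for network in ranges.get(token, []))
    return token, verified
```

A claimed token with `verified` set to False can mean a spoofed user agent, a stale range list, or a platform that publishes no ranges at all, so it is a reason to investigate rather than a verdict.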
Training-pipeline access and retrieval-pipeline access are controlled separately; see LLM Training vs. Real-Time Indexing for how to tell them apart before diagnosing a fetch failure.
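Because each access path uses its own user agent token, a single robots.txt can give different answers to each one. The sketch below uses Python's standard `urllib.robotparser` to evaluate one URL against several tokens. The robots.txt content and `access_by_agent` are hypothetical, and each platform's own parser may match rules somewhat differently from Python's.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block the training crawler, allow the search crawler,
# and leave everything else on the default rule.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /private/
"""


def access_by_agent(robots_txt: str, url: str, agents: list[str]) -> dict:
    """Return {agent: allowed} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]
    print(access_by_agent(ROBOTS_TXT, "https://example.com/guide", agents))
```

Here ChatGPT-User has no group of its own, so it falls through to the wildcard rule. The result describes permission only and says nothing about whether the fetched page is readable.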