Audit Content Extraction
Last reviewed:
A page can look perfect in a browser and still fail in indexing, extraction, or snippet quality — because the browser is the least interesting reader your page has. This audit answers what a crawler or AI system actually receives, in two questions asked in order:
- Presence — does the answer-relevant content survive the fetch-and-render pipeline at all, or does it depend on JavaScript, APIs, or interaction behaving perfectly?
- Legibility — once the content is present, is it structured so a retrieval system extracts the right claim, rather than the first sentence it happens to find?
Skipping straight to legibility is the common mistake: no amount of heading discipline helps a claim that never made it into the fetched HTML.
Preconditions: a priority URL or template set; access to raw source inspection and a headless render tool or Search Console URL Inspection; and direct query access to ChatGPT, Bing Copilot, and Perplexity for extraction testing.
Select priority URLs and baseline the current state
Rendering and extraction audits are expensive — run them where a divergence would actually cost visibility, and establish the floor before you change anything.
- Start with revenue pages, citation targets, or templates already showing weak indexing or snippet behavior; test representative URLs per template, not one lucky page, and include a well-performing page as a control if the set allows
- Prompt ChatGPT, Bing Copilot, and Perplexity with the exact query each page targets — record which claims are extracted, which are omitted, and what each platform attributes, verbatim
- Note whether the failure shows in Google only, AI systems only, or both — the major AI crawlers do not execute JavaScript (see How AI Agents Fetch Pages), so an AI-only failure points at raw HTML
- Tag each page with its primary claim: the single most important fact a practitioner should be able to extract
Capture raw HTML before anything executes
The raw source is the first truth surface — it is what a non-rendering fetch receives.
- Save the HTML exactly as the server sends it under a crawler user agent
- Check whether title, meta robots, canonical, structured data, H1, and primary body copy exist in that source
- Look for placeholder shells, empty containers, or dependency markers where content should be
- If the answer-relevant copy is absent in source, write that down before you open a browser tab and talk yourself out of it
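The source checks above can be sketched with Python's standard library. This is a minimal sketch, not a definitive tool: the user agent string, `fetch_raw`, `SourceSignals`, and `inspect_source` are illustrative names introduced here, and the claim check is a plain substring test that will miss a claim split across inline tags.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Assumed user agent for illustration; substitute the crawler you are auditing for.
CRAWLER_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def fetch_raw(url: str, user_agent: str = CRAWLER_UA, timeout: float = 15.0) -> str:
    """Return the HTML as the server sends it; no JavaScript runs here."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class SourceSignals(HTMLParser):
    """Collect the signals this audit looks for in raw source."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.signals = {"title": "", "h1": "", "meta_robots": None,
                        "canonical": None, "json_ld_blocks": 0}
        self._capture = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("title", "h1") and not self.signals[tag]:
            self._capture = tag
        elif tag == "meta" and (a.get("name") or "").lower() == "robots":
            self.signals["meta_robots"] = a.get("content")
        elif tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.signals["canonical"] = a.get("href")
        elif tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self.signals["json_ld_blocks"] += 1

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._capture = None

    def handle_data(self, data):
        if self._capture:
            self.signals[self._capture] += data


def inspect_source(html: str, primary_claim: str) -> dict:
    """Report which signals exist in the source and whether the claim is there."""
    parser = SourceSignals()
    parser.feed(html)
    parser.close()
    report = dict(parser.signals)
    for key in ("title", "h1"):
        report[key] = " ".join(report[key].split())  # collapse whitespace
    report["claim_in_source"] = primary_claim.lower() in html.lower()
    return report
```

A typical run would be `inspect_source(fetch_raw(url), primary_claim)`, saved next to the raw HTML so the finding is written down before any browser check.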
Capture rendered output and its network dependencies
A rendered page is only useful if you know what had to go right for it to exist.
- Render in a headless tool and export the final DOM (or a screenshot) plus network logs
- Note blocked resources, failed API calls, timing issues, and deferred components
- Check whether tabs, accordions, carousels, or lazy-load hide critical copy until interaction
- Compare a real browser render against the headless render if the page depends on fragile client conditions
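If the headless tool can export its network log as a HAR file, the failed and blocked requests can be listed mechanically. A rough sketch under that assumption: `failed_requests` and `load_har` are hypothetical helper names, and `_resourceType` is a custom field that Chrome-based exports add, so it may be absent from other tools.

```python
import json


def load_har(path: str) -> dict:
    """Read a HAR export from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def failed_requests(har: dict) -> list[dict]:
    """List requests that were blocked, aborted, or answered with a 4xx/5xx."""
    problems = []
    for entry in har.get("log", {}).get("entries", []):
        status = entry.get("response", {}).get("status", 0)
        # HAR writers commonly record status 0 for blocked or aborted requests.
        if status == 0 or status >= 400:
            problems.append({
                "url": entry.get("request", {}).get("url", ""),
                "status": status,
                "resource_type": entry.get("_resourceType", "unknown"),
            })
    return problems
```

Each entry in the result is one thing that had to go right and did not; carry the URLs forward into the dependency trace.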
Compare critical signals across both surfaces
Signal divergence between source and rendered DOM is often the root cause, not a footnote — search systems do not reward contradictory instructions.
- Compare title, canonical, meta robots, hreflang, and structured headings between source and rendered output
- Check whether any framework rewrites canonical or noindex after load
- Verify structured data appears once, correctly, and survives rendering rather than duplicating or disappearing
Decision point: signals differ between source and final DOM → fix the signal conflict first; you cannot diagnose a content gap while the page is issuing contradictory directives.
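The comparison and its decision point reduce to a small diff once each surface has been summarized as a dictionary of signals. The sketch below assumes such dictionaries already exist (for example, from whatever extractor you ran on source and rendered DOM); the key names and both function names are illustrative.

```python
# Signal names are assumptions for this sketch; use whatever your extractor emits.
SIGNALS = ("title", "canonical", "meta_robots", "hreflang", "h1")


def _norm(value):
    """Collapse whitespace in strings so cosmetic differences are ignored."""
    return " ".join(value.split()) if isinstance(value, str) else value


def signal_conflicts(source: dict, rendered: dict) -> dict:
    """Map each diverging signal to its (source, rendered) pair of values."""
    conflicts = {}
    for key in SIGNALS:
        before, after = _norm(source.get(key)), _norm(rendered.get(key))
        if before != after:
            conflicts[key] = (before, after)
    return conflicts


def next_step(conflicts: dict) -> str:
    """Apply the decision point: signal conflicts are fixed before content gaps."""
    if conflicts:
        return "fix signal conflict first: " + ", ".join(sorted(conflicts))
    return "signals agree: proceed to content comparison"
```

Recording both values for each conflict makes it easier to see later which layer, server or client, wrote the wrong one.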
Compare the main content and extraction surface
Visibility problems are more often missing paragraphs, links, or tables than broken tags.
- Confirm the page’s primary answer, summary, tables, and internal links exist in raw HTML as well as in rendered output
- Compare paragraph order — some frameworks move useful copy below generic interface chrome after hydration
- Flag any fact that lives only inside an image, widget, or API-loaded module
Decision point: critical content missing in raw HTML but present after render → treat it as a genuine rendering risk and move the content earlier in the pipeline. Rendered output is also incomplete → the problem is an application or dependency failure, not just crawler behavior.
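One way to apply this decision point is to extract visible text from both surfaces and look for each critical claim in each. A minimal sketch using only the standard library; `BlockText`, `text_blocks`, and `presence_gap` are hypothetical names, the tag sets are assumptions, and the "lost after render" verdict covers a case the decision point above does not name.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "li", "td", "th", "h1", "h2", "h3"}    # assumed block boundaries
SKIP_TAGS = {"script", "style", "noscript", "template"}   # never visible copy


class BlockText(HTMLParser):
    """Collect visible text, split at block-level tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)

    def flush(self):
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []


def text_blocks(html: str) -> list[str]:
    parser = BlockText()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text outside a block tag
    return parser.blocks


def presence_gap(raw_html: str, rendered_html: str, critical: list[str]) -> dict:
    """Classify each critical claim by which surface contains it."""
    raw = " ".join(text_blocks(raw_html)).lower()
    rendered = " ".join(text_blocks(rendered_html)).lower()
    verdicts = {}
    for claim in critical:
        in_raw, in_rendered = claim.lower() in raw, claim.lower() in rendered
        if in_raw and in_rendered:
            verdicts[claim] = "present in both"
        elif in_rendered:
            verdicts[claim] = "rendering risk: move content earlier in the pipeline"
        elif in_raw:
            verdicts[claim] = "in source but lost after render"
        else:
            verdicts[claim] = "application or dependency failure"
    return verdicts
```

Because `text_blocks` preserves document order, comparing the two block lists side by side also shows whether hydration pushed useful copy below interface chrome.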
Trace the dependency causing any presence gap
“JavaScript issue” is not a diagnosis.
- Identify whether the missing content depends on a client-side fetch, hydration step, consent gate, geo decision, or viewport interaction
- Check whether blocked scripts, CSP rules, bot defenses, or API throttling affect the rendered result
- Look for race conditions where the DOM briefly holds one value and finishes with another
- Record the specific component or resource owner so the fix lands in the right system
Audit heading structure and claim positioning
With presence confirmed, the audit shifts to legibility. Retrieval systems tend to select the opening of a passage, not the best part of it.
- Every high-value page needs an H1 that states the topic and H2/H3 subheadings that make discrete, citable claims — not vague labels like “Overview” or “More Information”
- The key fact must appear in the first 50–100 words of the section where it lives, not after contextual preamble
- Use URL Inspection to compare crawled page text against expected content — rendering can reposition or remove text after crawl
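The heading and positioning checks can be approximated on plain text. In this sketch the vague-label set extends the two examples above with "introduction" and "details" as assumptions, the default limit is the upper end of the 50–100 word range, and the function names are illustrative.

```python
import re
from typing import Optional

# "overview" and "more information" come from the checklist; the rest are assumed.
VAGUE_HEADINGS = {"overview", "more information", "introduction", "details"}


def is_vague_heading(heading: str) -> bool:
    """True when a heading is a generic label rather than a citable claim."""
    return heading.strip().rstrip(":").strip().lower() in VAGUE_HEADINGS


def claim_word_offset(section_text: str, claim: str) -> Optional[int]:
    """Return the 0-based word index where the claim starts, or None if absent."""
    words = re.findall(r"\w+", section_text.lower())
    target = re.findall(r"\w+", claim.lower())
    if not target:
        return None
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            return i
    return None


def positioned_early(section_text: str, claim: str, limit: int = 100) -> bool:
    """True when the claim begins within the first `limit` words of its section."""
    offset = claim_word_offset(section_text, claim)
    return offset is not None and offset < limit
```

The match is on exact word sequence, so a paraphrased claim returns `None`; treat that as a prompt to read the section, not as proof the claim is missing.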
Audit list and table formatting
Structure is the extraction primitive for retrieval systems.
- Rewrite prose paragraphs that carry multiple distinct facts as bulleted lists, one claim per bullet
- Give tables explicit row and column headers — headerless tables are treated as opaque blocks
- Flatten nested lists deeper than two levels; retrieval systems frequently collapse or drop them
- Begin each list item with the claim, not a qualifier: “Direct match wins over broad match in the same ad group,” not “It’s important to note that…”
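Two of these checks, headerless tables and over-nested lists, are easy to flag automatically. The sketch below is deliberately coarse: it only asks whether a table contains any `th` cell, not whether both rows and columns are labeled, and it ignores tables that are never closed. `StructureAudit` and `audit_structure` are names made up for the example.

```python
from html.parser import HTMLParser


class StructureAudit(HTMLParser):
    """Count tables without header cells and measure list nesting depth."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.tables_total = 0
        self.tables_without_headers = 0
        self.max_list_depth = 0
        self._table_has_th = []  # stack: one flag per currently open table
        self._list_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables_total += 1
            self._table_has_th.append(False)
        elif tag == "th" and self._table_has_th:
            self._table_has_th[-1] = True
        elif tag in ("ul", "ol"):
            self._list_depth += 1
            self.max_list_depth = max(self.max_list_depth, self._list_depth)

    def handle_endtag(self, tag):
        if tag == "table" and self._table_has_th:
            if not self._table_has_th.pop():
                self.tables_without_headers += 1
        elif tag in ("ul", "ol") and self._list_depth:
            self._list_depth -= 1


def audit_structure(html: str, max_depth: int = 2) -> dict:
    """Summarize table and list findings; max_depth follows the two-level rule."""
    parser = StructureAudit()
    parser.feed(html)
    parser.close()
    return {
        "tables_total": parser.tables_total,
        "tables_without_headers": parser.tables_without_headers,
        "max_list_depth": parser.max_list_depth,
        "lists_too_deep": parser.max_list_depth > max_depth,
    }
```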
Validate schema alignment with visible content
Schema that contradicts visible text is a trust failure for indexers and AI systems alike.
- Confirm name, description, and dateModified in page schema match the visible heading, introduction, and publication date exactly
- FAQ schema questions must match queries users actually ask; keyword-stuffed FAQ schema is processed as low-confidence data
- If an Article schema author references a person, that person’s Person schema must be present on the linked author page with a matching sameAs
- Run a rich-result validator after schema changes to confirm no parsing errors before declaring the fix complete
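The first check in this list, schema name against the visible heading, can be sketched as follows. It assumes the schema is embedded as JSON-LD, compares only `name` against the first H1, and limits itself to an assumed set of page-level types; `description` and `dateModified` would follow the same pattern. All class and function names are illustrative, and this does not replace a rich-result validator.

```python
import json
from html.parser import HTMLParser

# Assumed set of page-level types whose name should match the H1.
PAGE_TYPES = {"Article", "BlogPosting", "NewsArticle", "WebPage"}


class SchemaAndHeading(HTMLParser):
    """Collect JSON-LD script bodies and the text of the first H1."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.json_ld, self.h1 = [], ""
        self._mode, self._buffer, self._h1_done = None, [], False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self._mode, self._buffer = "ld", []
        elif tag == "h1" and not self._h1_done:
            self._mode, self._buffer = "h1", []

    def handle_endtag(self, tag):
        if tag == "script" and self._mode == "ld":
            self.json_ld.append("".join(self._buffer))
            self._mode = None
        elif tag == "h1" and self._mode == "h1":
            self.h1 = " ".join("".join(self._buffer).split())
            self._h1_done, self._mode = True, None

    def handle_data(self, data):
        if self._mode:
            self._buffer.append(data)


def _nodes(data) -> list:
    """Flatten a JSON-LD payload, including @graph arrays, into dict nodes."""
    out = []
    for node in data if isinstance(data, list) else [data]:
        if isinstance(node, dict) and isinstance(node.get("@graph"), list):
            out.extend(n for n in node["@graph"] if isinstance(n, dict))
        elif isinstance(node, dict):
            out.append(node)
    return out


def schema_mismatches(html: str) -> list[str]:
    """Report unparseable JSON-LD and page-level names that differ from the H1."""
    parser = SchemaAndHeading()
    parser.feed(html)
    parser.close()
    problems = []
    for raw in parser.json_ld:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            problems.append("JSON-LD block does not parse")
            continue
        for node in _nodes(data):
            types = node.get("@type")
            types = set(types if isinstance(types, list) else [types])
            name = node.get("name")
            if types & PAGE_TYPES and isinstance(name, str):
                if " ".join(name.split()) != parser.h1:
                    problems.append(f"schema name {name!r} differs from H1 {parser.h1!r}")
    return problems
```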
Fix at the lowest-risk layer, presence before legibility
Move essential content earlier in the pipeline rather than hoping crawlers wait for it forever — and apply legibility fixes in order of return.
- Presence: server-render critical copy, links, and SEO signals; promote answer text out of tabs, accordions, or API-only widgets; remove duplicate client-side rewrites of canonical, robots, or schema when the server already sets them
- Legibility: reposition key claims to the opening of their section first (the highest-return change), then convert multi-fact prose to lists, then add claim-first H2s before preamble paragraphs
- Do not mistake a performance optimization for a visibility fix unless the missing-content problem is actually resolved, and do not rewrite accurate content to sound like AI output — the goal is structural clarity, not tonal mimicry
Decision point: hallucination (a claim not on the page) → add the claim explicitly first; structural fixes cannot surface content that does not exist. Truncation (claim present but cut off) → it is not positioned early enough; move it to the first sentence of its section. Platform attributes the claim to the wrong domain → sameAs and entity schema are missing or wrong; fix entity schema first. No extraction after 60+ days in index → check crawl access and rendered output before any formatting work.
Validate with crawler-oriented tools and live extraction
The browser is the least interesting validator for this job.
- Recheck raw HTML and rendered DOM after the fix using the same methods as the baseline
- Run URL Inspection (or another crawl-render tool) on representative URLs, and confirm crawlable links, schema, and answer copy are stable across multiple fetches
- Re-prompt ChatGPT, Bing Copilot, and Perplexity with the target query and compare against the baseline — test extraction, not just presence; if extraction was already accurate at baseline, leave the format alone
- Partial stability is not stability: if only one environment is fixed, keep debugging
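Stability across fetches can be checked by fingerprinting the three things named above (links, schema, answer copy) for each saved response and comparing the hashes. This is a rough sketch: it uses regular expressions rather than a full HTML parser, so unusual markup may be missed, and `fingerprints` and `unstable_parts` are hypothetical names.

```python
import hashlib
import re

LINK_RE = re.compile(r"""(?i)<a\b[^>]*\bhref=["']([^"']+)["']""")
SCHEMA_RE = re.compile(
    r"""(?is)<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>""")
SCRIPT_STYLE_RE = re.compile(r"(?is)<(script|style)\b.*?</\1\s*>")
TAG_RE = re.compile(r"(?s)<[^>]+>")


def fingerprints(html: str) -> dict:
    """Hash the link set, JSON-LD blocks, and visible text of one fetch."""
    links = sorted(set(LINK_RE.findall(html)))
    schema = [" ".join(block.split()) for block in SCHEMA_RE.findall(html)]
    visible = TAG_RE.sub(" ", SCRIPT_STYLE_RE.sub(" ", html))
    parts = {
        "links": "\n".join(links),
        "schema": "\n".join(schema),
        "text": " ".join(visible.split()),
    }
    return {key: hashlib.sha256(value.encode("utf-8")).hexdigest()
            for key, value in parts.items()}


def unstable_parts(fetches: list[str]) -> list[str]:
    """Return which of links, schema, or text differ between any two fetches."""
    prints = [fingerprints(html) for html in fetches]
    return sorted(key for key in ("links", "schema", "text")
                  if len({p[key] for p in prints}) > 1)
```

An empty result across several fetches from each environment is the bar; any non-empty result means the fix is only partial.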
Monitor for recurrence by template
Rendering regressions and extraction drift both come back quietly.
- Re-test extraction manually 7–14 days after changes — do not assess before re-indexing has propagated; request a re-crawl via URL Inspection if a stale cached version persists
- Spot-check the template set after deployments, framework upgrades, or caching changes, and track which components have a history of hiding or rewriting content
- Set a quarterly review for high-value pages — citation accuracy drifts as platform behavior changes independent of your content
- Keep test results in a version-controlled doc; without a record, recurrence cannot be told apart from a first occurrence
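A small record format makes the baseline, the retest, and the waiting period comparable over time. The sketch below is one possible shape, not a prescribed schema: the `ExtractionTest` fields and helper names are assumptions, and the 7-day default mirrors the minimum wait stated above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class ExtractionTest:
    """One manual extraction test for one URL on one platform."""
    url: str
    platform: str
    query: str
    tested_on: str  # ISO date, e.g. "2024-01-08"
    claims_extracted: list[str] = field(default_factory=list)
    attributed_to: str = ""


def ready_to_retest(changed_on: date, today: date, min_days: int = 7) -> bool:
    """True once the minimum wait after a change has passed."""
    return (today - changed_on).days >= min_days


def compare(baseline: ExtractionTest, retest: ExtractionTest) -> dict:
    """Show which claims were gained or lost and whether attribution moved."""
    before, after = set(baseline.claims_extracted), set(retest.claims_extracted)
    return {
        "gained": sorted(after - before),
        "lost": sorted(before - after),
        "attribution_changed": baseline.attributed_to != retest.attributed_to,
    }


def to_record(test: ExtractionTest) -> str:
    """Serialize with sorted keys so version-control diffs stay readable."""
    return json.dumps(asdict(test), sort_keys=True)
```

Appending one `to_record` line per test to a tracked file gives the history needed to tell a recurrence from a first occurrence.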
Watch for these failure modes
- Checking only a screenshot and assuming crawlers saw the same thing
- Testing one homepage or hero template while the real failures live deeper in the site
- Optimizing format on a crawl-blocked page — no structural improvement helps an inaccessible page; confirm access first
- Running extraction tests within hours of a change — no indexing pathway is that fast; allow at least 7 days
- Converting all prose to bullets without preserving precision — over-bulleted pages produce shorter, less accurate extractions when claims lose essential context
- Letting client-side code rewrite canonical, robots, or schema on every route change
- Treating a single AI platform’s behavior as representative: each has a different training cutoff and retrieval logic, so test ChatGPT, Bing Copilot, and Perplexity independently