Machine-Readable Infrastructure

Audit Content Extraction

Last reviewed: July 3, 2026

A page can look perfect in a browser and still fail in indexing, extraction, or snippet quality — because the browser is the least interesting reader your page has. This audit answers what a crawler or AI system actually receives, in two questions asked in order:

Presence — does the answer-relevant content survive the fetch-and-render pipeline at all, or does it depend on JavaScript, APIs, or interaction behaving perfectly?
Legibility — once the content is present, is it structured so a retrieval system extracts the right claim, rather than the first sentence it happens to find?

Skipping straight to legibility is the common mistake: no amount of heading discipline helps a claim that never made it into the fetched HTML.

Preconditions: a priority URL or template set; access to raw source inspection and a headless render tool or Search Console URL Inspection; and direct query access to ChatGPT, Bing Copilot, and Perplexity for extraction testing.

Select priority URLs and baseline the current state

Rendering and extraction audits are expensive — run them where a divergence would actually cost visibility, and establish the floor before you change anything.

Start with revenue pages, citation targets, or templates already showing weak indexing or snippet behavior; test representative URLs per template, not one lucky page, and include a well-performing page as a control if the set allows
Prompt ChatGPT, Bing Copilot, and Perplexity with the exact query each page targets — record which claims are extracted, which are omitted, and what each platform attributes, verbatim
Note whether the failure shows in Google only, AI systems only, or both — the major AI crawlers do not execute JavaScript (see How AI Agents Fetch Pages), so an AI-only failure points at raw HTML
Tag each page with its primary claim: the single most important fact a practitioner should be able to extract

Capture raw HTML before anything executes

The raw source is the first truth surface — it is what a non-rendering fetch receives.

Save the HTML exactly as the server sends it under a crawler user agent
Check whether title, meta robots, canonical, structured data, H1, and primary body copy exist in that source
Look for placeholder shells, empty containers, or dependency markers where content should be
If the answer-relevant copy is absent in source, write that down before you open a browser tab and talk yourself out of it
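
The source check above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a complete crawler: the user agent string `ExampleAuditBot/0.1`, the function names, and the plain substring test for the primary claim are all assumptions made for the example, and the substring test will miss copy that is split across tags.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class SignalParser(HTMLParser):
    """Collects title, meta robots, canonical, H1, and JSON-LD count from HTML."""

    def __init__(self):
        super().__init__()
        self.signals = {"title": "", "meta_robots": None, "canonical": None,
                        "h1": "", "json_ld_blocks": 0}
        self._capture = None  # text field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("title", "h1"):
            self._capture = tag
        elif tag == "meta" and (a.get("name") or "").lower() == "robots":
            self.signals["meta_robots"] = a.get("content")
        elif tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.signals["canonical"] = a.get("href")
        elif tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self.signals["json_ld_blocks"] += 1

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._capture = None

    def handle_data(self, data):
        if self._capture:
            self.signals[self._capture] += data.strip()


def fetch_raw_html(url, user_agent="ExampleAuditBot/0.1"):
    """Fetch the server response without executing any JavaScript.

    The default user agent is a placeholder; substitute the crawler UA you
    want to test against.
    """
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_signals(html, primary_claim):
    """Report which signals exist in the given HTML string."""
    parser = SignalParser()
    parser.feed(html)
    parser.close()
    signals = dict(parser.signals)
    # Naive check: case-insensitive substring match on the raw markup.
    signals["primary_claim_in_source"] = primary_claim.lower() in html.lower()
    return signals
```

A typical call would be `extract_signals(fetch_raw_html(url), primary_claim)`, with the result saved alongside the raw HTML. An empty `h1` and a false `primary_claim_in_source` on a page that looks complete in a browser is the placeholder-shell pattern described above.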

Capture rendered output and its network dependencies

A rendered page is only useful if you know what had to go right for it to exist.

Render in a headless tool and export the final DOM (or a screenshot) plus network logs
Note blocked resources, failed API calls, timing issues, and deferred components
Check whether tabs, accordions, carousels, or lazy-load hide critical copy until interaction
Compare a real browser render against the headless render if the page depends on fragile client conditions

Compare critical signals across both surfaces

Signal divergence between source and rendered DOM is often the root cause, not a footnote — search systems do not reward contradictory instructions.

Compare title, canonical, meta robots, hreflang, and structured headings between source and rendered output
Check whether any framework rewrites canonical or noindex after load
Verify structured data appears once, correctly, and survives rendering rather than duplicating or disappearing

Decision point: signals differ between source and final DOM → fix the signal conflict first; you cannot diagnose a content gap while the page is issuing contradictory directives.
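
One way to make the comparison mechanical is to collect the same signals from both surfaces into dictionaries and diff them. The sketch below assumes you already have those two dictionaries (for example, from your own source and rendered-DOM extraction); the function name and the sample values are illustrative only.

```python
def diff_signals(source, rendered):
    """Return {signal: (source_value, rendered_value)} for every mismatch.

    A signal that exists on only one surface is reported with None on the
    other side.
    """
    keys = sorted(set(source) | set(rendered))
    return {
        key: (source.get(key), rendered.get(key))
        for key in keys
        if source.get(key) != rendered.get(key)
    }


# Hypothetical example: a client-side router rewrites the canonical and
# injects a noindex after load.
source_signals = {
    "title": "Pricing",
    "canonical": "https://example.com/pricing",
    "meta_robots": None,
}
rendered_signals = {
    "title": "Pricing",
    "canonical": "https://example.com/",
    "meta_robots": "noindex",
}

conflicts = diff_signals(source_signals, rendered_signals)
for name, (before, after) in conflicts.items():
    print(f"{name}: source={before!r} rendered={after!r}")
```

Any non-empty result corresponds to the decision point above: resolve the conflicting directive before investigating content gaps.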

Compare the main content and extraction surface

Visibility problems are more often missing paragraphs, links, or tables than broken tags.

Confirm the page’s primary answer, summary, tables, and internal links exist in raw HTML as well as in rendered output
Compare paragraph order — some frameworks move useful copy below generic interface chrome after hydration
Flag any fact that lives only inside an image, widget, or API-loaded module

Decision point: critical content missing in raw HTML but present after render → treat it as a genuine rendering risk and move the content earlier in the pipeline. Rendered output is also incomplete → the problem is an application or dependency failure, not just crawler behavior.
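
The decision point above reduces to a small lookup once you have the visible text of each surface. The following is a rough sketch under stated assumptions: `raw_text` and `rendered_text` are plain-text extractions you have already produced, matching is a whitespace-normalized substring test, and the label for content that exists in source but vanishes after render is an added case not named in the decision point.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def classify_presence(claim, raw_text, rendered_text):
    """Map a claim's presence on each surface to a diagnosis label."""
    needle = normalize(claim)
    in_raw = needle in normalize(raw_text)
    in_rendered = needle in normalize(rendered_text)
    if in_raw and in_rendered:
        return "present on both surfaces"
    if in_rendered:
        # Missing in raw HTML, present after render.
        return "rendering risk: move content earlier in the pipeline"
    if in_raw:
        # Assumption: not covered by the decision point; worth flagging separately.
        return "removed after render: inspect client-side code"
    return "application or dependency failure"
```

For instance, `classify_presence("Plans start at $10", raw_text="Loading", rendered_text="Plans start  at $10 per month")` falls into the rendering-risk branch.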

Trace the dependency causing any presence gap

“JavaScript issue” is not a diagnosis.

Identify whether the missing content depends on a client-side fetch, hydration step, consent gate, geo decision, or viewport interaction
Check whether blocked scripts, CSP rules, bot defenses, or API throttling affect the rendered result
Look for race conditions where the DOM briefly holds one value and finishes with another
Record the specific component or resource owner so the fix lands in the right system

Audit heading structure and claim positioning

With presence confirmed, the audit shifts to legibility. Retrieval systems tend to select the opening of a passage, not the best part of it.

Every high-value page needs an H1 that states the topic and H2/H3 subheadings that make discrete, citable claims — not vague labels like “Overview” or “More Information”
The key fact must appear in the first 50–100 words of the section where it lives, not after contextual preamble
Use URL Inspection to compare crawled page text against expected content — rendering can reposition or remove text after crawl
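
The positioning rule can be checked roughly by counting words. This sketch assumes you have already split the page into sections and have each section's text as a string; the function names, the `\w+` tokenization, and the default limit of 100 words (the upper end of the range above) are choices made for the example.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def claim_word_offset(section_text, claim):
    """Return the 0-based word index where the claim starts, or None if absent."""
    words = tokenize(section_text)
    target = tokenize(claim)
    if not target:
        return None
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            return i
    return None


def claim_is_early(section_text, claim, limit=100):
    """True if the whole claim falls within the first `limit` words of the section."""
    offset = claim_word_offset(section_text, claim)
    if offset is None:
        return False
    return offset + len(tokenize(claim)) <= limit
```

The check only finds verbatim wording, so a paraphrased claim reports as absent; treat a `False` result as a prompt to read the section, not as a verdict.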

Audit list and table formatting

Structure is the extraction primitive for retrieval systems.

Rewrite prose paragraphs that carry multiple distinct facts as bulleted lists, one claim per bullet
Give tables explicit row and column headers — headerless tables are treated as opaque blocks
Flatten nested lists deeper than two levels; retrieval systems frequently collapse or drop them
Begin each list item with the claim, not a qualifier: “Exact match wins over broad match in the same ad group,” not “It’s important to note that…”
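
Two of the checks above, headerless tables and deep list nesting, are easy to automate. The sketch below uses Python's standard `html.parser`; it is a simplification in that it only tests whether a table contains any `<th>` at all, not whether both row and column headers are present, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class StructureAudit(HTMLParser):
    """Counts tables without any <th> and tracks the deepest list nesting."""

    def __init__(self):
        super().__init__()
        self.tables = []         # one bool per table: does it contain a <th>?
        self._open_tables = []   # stack of indexes into self.tables
        self._list_depth = 0
        self.max_list_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append(False)
            self._open_tables.append(len(self.tables) - 1)
        elif tag == "th" and self._open_tables:
            self.tables[self._open_tables[-1]] = True
        elif tag in ("ul", "ol"):
            self._list_depth += 1
            self.max_list_depth = max(self.max_list_depth, self._list_depth)

    def handle_endtag(self, tag):
        if tag == "table" and self._open_tables:
            self._open_tables.pop()
        elif tag in ("ul", "ol") and self._list_depth:
            self._list_depth -= 1


def audit_structure(html):
    """Summarize table-header and list-nesting findings for an HTML string."""
    parser = StructureAudit()
    parser.feed(html)
    parser.close()
    return {
        "headerless_tables": parser.tables.count(False),
        "max_list_depth": parser.max_list_depth,
    }
```

A `max_list_depth` above 2 or a non-zero `headerless_tables` count marks the page for the rewrites listed above.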

Validate schema alignment with visible content

Schema that contradicts visible text is a trust failure for indexers and AI systems alike.

Confirm name, description, and dateModified in page schema match the visible heading, introduction, and last-updated date exactly; if datePublished is present, it should match the visible publication date
FAQ schema questions must match queries users actually ask — keyword-stuffed FAQ schema is processed as low-confidence data
If an Article schema author references a person, that person’s Person schema must be present on the linked author page with a matching sameAs
Run a rich-result validator after schema changes to confirm no parsing errors before declaring the fix complete
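
Before reaching for a validator, a quick local check can catch the most common mismatch: a schema `name` that no longer matches the visible H1. The following is a minimal sketch, assuming JSON-LD markup with top-level objects (it does not walk `@graph` containers or microdata) and an exact string comparison; the names are illustrative, and it does not replace a rich-result validator.

```python
import json
from html.parser import HTMLParser


class JsonLdAndH1(HTMLParser):
    """Collects JSON-LD script bodies and the text of the H1."""

    def __init__(self):
        super().__init__()
        self.json_ld = []
        self.h1 = ""
        self._mode = None  # "ld" while inside JSON-LD, "h1" while inside <h1>

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self._mode = "ld"
            self.json_ld.append("")
        elif tag == "h1":
            self._mode = "h1"

    def handle_endtag(self, tag):
        if (tag == "script" and self._mode == "ld") or (tag == "h1" and self._mode == "h1"):
            self._mode = None

    def handle_data(self, data):
        if self._mode == "ld":
            self.json_ld[-1] += data
        elif self._mode == "h1":
            self.h1 += data


def schema_mismatches(html):
    """Return a list of human-readable problems; an empty list means none found."""
    parser = JsonLdAndH1()
    parser.feed(html)
    parser.close()
    visible_h1 = parser.h1.strip()
    problems = []
    for raw in parser.json_ld:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            problems.append("unparseable JSON-LD block")
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            name = item.get("name")
            if isinstance(name, str) and name.strip() != visible_h1:
                problems.append(f"schema name {name!r} does not match H1 {visible_h1!r}")
    return problems
```

The same pattern extends to `description` and `dateModified` once you decide which visible elements they should be compared against on your templates.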

Fix at the lowest-risk layer, presence before legibility

Move essential content earlier in the pipeline rather than hoping crawlers wait for it forever — and apply legibility fixes in order of return.

Presence: server-render critical copy, links, and SEO signals; promote answer text out of tabs, accordions, or API-only widgets; remove duplicate client-side rewrites of canonical, robots, or schema when the server already sets them
Legibility: reposition key claims to the opening of their section first (the highest-return change), then convert multi-fact prose to lists, then add claim-first H2s before preamble paragraphs
Do not mistake a performance optimization for a visibility fix unless the missing-content problem is actually resolved, and do not rewrite accurate content to sound like AI output — the goal is structural clarity, not tonal mimicry

Decision point: hallucination (a claim not on the page) → add the claim explicitly first; structural fixes cannot surface content that does not exist.
Decision point: truncation (claim present but cut off) → the claim is not positioned early enough; move it to the first sentence of its section.
Decision point: platform attributes the claim to the wrong domain → sameAs and entity schema are missing or wrong; fix entity schema first.
Decision point: no extraction after 60+ days in index → check crawl access and rendered output before any formatting work.

Validate with crawler-oriented tools and live extraction

The browser is the least interesting validator for this job.

Recheck raw HTML and rendered DOM after the fix using the same methods as the baseline
Run URL Inspection (or another crawl-render tool) on representative URLs, and confirm crawlable links, schema, and answer copy are stable across multiple fetches
Re-prompt ChatGPT, Bing Copilot, and Perplexity with the target query and compare against the baseline — test extraction, not just presence; if extraction was already accurate at baseline, leave the format alone
Partial stability is not stability: if only one environment is fixed, keep debugging
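
If baseline and retest observations are recorded in a consistent shape, the comparison can be scripted. This sketch assumes each run is stored as a mapping from platform name to the list of claims that platform extracted, transcribed by hand from the responses; the data shape, function names, and sample claims are assumptions for illustration, not an output format any platform provides.

```python
def extraction_delta(baseline, retest):
    """Per platform, list the claims gained and lost between two test runs."""
    report = {}
    for platform in sorted(set(baseline) | set(retest)):
        before = set(baseline.get(platform, ()))
        after = set(retest.get(platform, ()))
        report[platform] = {
            "gained": sorted(after - before),
            "lost": sorted(before - after),
        }
    return report


def is_stable(retest, primary_claim, platforms):
    """True only if every listed platform extracted the primary claim."""
    return all(primary_claim in retest.get(p, ()) for p in platforms)


# Hypothetical observations for one page.
platforms = ["ChatGPT", "Bing Copilot", "Perplexity"]
baseline = {"ChatGPT": ["intro sentence"], "Bing Copilot": [], "Perplexity": []}
retest = {
    "ChatGPT": ["primary claim"],
    "Bing Copilot": ["primary claim"],
    "Perplexity": [],
}

delta = extraction_delta(baseline, retest)
stable = is_stable(retest, "primary claim", platforms)
```

In this made-up run two platforms improved and one did not, so `stable` is false, which matches the rule above that a partial fix is not a fix.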

Monitor for recurrence by template

Rendering regressions and extraction drift both come back quietly.

Re-test extraction manually 7–14 days after changes — do not assess before re-indexing has propagated; request a re-crawl via URL Inspection if a stale cached version persists
Spot-check the template set after deployments, framework upgrades, or caching changes, and track which components have a history of hiding or rewriting content
Set a quarterly review for high-value pages — citation accuracy drifts as platform behavior changes independent of your content
Keep test results in a version-controlled doc; without a record, recurrence cannot be told apart from a first occurrence

Watch for these failure modes

Checking only a screenshot and assuming crawlers saw the same thing
Testing one homepage or hero template while the real failures live deeper in the site
Optimizing format on a crawl-blocked page — no structural improvement helps an inaccessible page; confirm access first
Running extraction tests within hours of a change — no indexing pathway is that fast; allow at least 7 days
Converting all prose to bullets without preserving precision — over-bulleted pages produce shorter, less accurate extractions when claims lose essential context
Letting client-side code rewrite canonical, robots, or schema on every route change
Treating a single AI platform’s behavior as representative: each has a different training cutoff and retrieval logic, so test ChatGPT, Bing Copilot, and Perplexity independently