Machine-Readable Infrastructure
Last reviewed:
Schema and structured data
- All product schema is generated from the PIM or commerce platform — no hand-maintained JSON-LD on product templates.
- Schema on every page type validates without errors or warnings.
- evidence: validator run on one live URL per template, not on source snippets
- Schema-declared price and availability match the live commerce API for every priority product in every market. (Structured Data Remediation)
- Every schema value is visible on the rendered page — no invisible, speculative, or template-defaulted markup.
- Each catalog category uses its correct schema type (Product with Offer for goods, subscription Offer/PriceSpecification for plans), not one generic template forced onto everything.
- Schema validation runs as a build-breaking CI gate, not a periodic manual audit.
- No deprecated or unsupported markup is live. (Schema to Avoid)
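The price and availability parity item above lends itself to a scripted spot check. The following is a minimal sketch, assuming the rendered page carries a Product JSON-LD block with a single Offer and that the commerce API record has already been fetched into a dict. The `api_record` shape and both names (`JsonLdExtractor`, `offer_mismatches`) are hypothetical illustrations, not any platform's API.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of every application/ld+json script block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.blocks.append(json.loads("".join(self._buffer)))


def offer_mismatches(html, api_record):
    """List fields where the page's Product offer disagrees with api_record.

    api_record is an assumed shape, for example:
    {"price": "19.99", "currency": "EUR", "availability": "InStock"}
    """
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    products = [
        block for block in parser.blocks
        if isinstance(block, dict) and block.get("@type") == "Product"
    ]
    if not products:
        return ["no Product JSON-LD found"]
    offer = products[0].get("offers", {})
    if isinstance(offer, list):  # assumption: only the first Offer is checked
        offer = offer[0] if offer else {}
    declared = {
        "price": str(offer.get("price")),
        "currency": offer.get("priceCurrency"),
        # schema.org availability is a URL such as https://schema.org/InStock
        "availability": str(offer.get("availability", "")).rsplit("/", 1)[-1],
    }
    return [
        f"{field}: schema={declared[field]!r} api={api_record[field]!r}"
        for field in declared
        if declared[field] != api_record[field]
    ]
```

Run it against the HTML of a live URL for each priority product and market rather than against template source, in line with the evidence note above. An empty list means the declared offer matches the API record.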
Product feeds and catalog data
- Every catalog shape you sell (physical, digital, subscription) has a feed model that natively represents it — none is forced through a physical-goods template.
- Price and availability in each feed refresh often enough to meet the expectations of the AI shopping surfaces that consume it.
- evidence: feed timestamps show volatile fields refreshed at least hourly, ideally on the 15-minute cadence the surfaces accept
- Feed values, schema values, and on-page values agree for every priority product — conflicting versions of “what this product is” cause silent exclusion.
- Per-market feed data (currency, price, availability, ratings) is correct for each market, including preorder and launch-transition states.
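The freshness and agreement items above can be expressed as two small checks. This is a sketch under stated assumptions: each source (feed, schema, rendered page) has already been normalized into a flat dict per product, and the one-hour threshold comes from the evidence note above. The function names and field names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_stale(refreshed_at, now, max_age=timedelta(hours=1)):
    """True when a volatile feed field was last refreshed longer ago than max_age."""
    return now - refreshed_at > max_age


def disagreements(feed, schema, page, fields=("price", "currency", "availability")):
    """Return {field: {source: value}} for every field where the three sources differ."""
    conflicts = {}
    for field in fields:
        values = {
            "feed": feed.get(field),
            "schema": schema.get(field),
            "page": page.get(field),
        }
        if len(set(values.values())) > 1:
            conflicts[field] = values
    return conflicts


# Usage with made-up values: the page shows a different price than feed and schema.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(hours=2), now))
print(disagreements(
    {"price": "19.99", "currency": "EUR", "availability": "InStock"},
    {"price": "19.99", "currency": "EUR", "availability": "InStock"},
    {"price": "21.99", "currency": "EUR", "availability": "InStock"},
))
```

Running the comparison per market, with that market's own feed and page, also covers the per-market item, since currency and availability are compared alongside price.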
Crawler access and bot policy
- A single documented allow/block matrix exists covering training crawlers, search crawlers, and user-triggered fetchers separately, per bot per property. (Enable AI Search Access)
- The deployed robots.txt on every domain and subdomain matches the central matrix — no divergence between properties.
- No search-visibility crawler that the matrix allows (for example OAI-SearchBot, Bingbot, PerplexityBot) is blocked in practice on any property.
- evidence: server logs show 200s for each allowed user agent
- WAF, CDN, and bot-management layers do not override permissive robots.txt rules. (Edge Worker Bot Management)
- evidence: test-fetch key pages as each AI user agent through the production edge
- Actual bot traffic in server logs matches declared policy. (Parse Logs for AI Bot Behavior)
- Meta robots and X-Robots-Tag states match intent on every template — no inherited noindex leaks.
- Any llms.txt investment is justified by observed fetches in your own server logs, not advocacy. (Schema to Avoid)
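Comparing a deployed robots.txt against the central matrix can be sketched with the standard library's `urllib.robotparser`. The matrix format (bot name mapped to an allowed flag), the sample URL, and the function name `robots_divergence` are assumptions for illustration. GPTBot is included only as an example of a training crawler that a matrix might block.

```python
from urllib.robotparser import RobotFileParser


def robots_divergence(robots_txt, matrix, url):
    """Return the bots whose robots.txt outcome for url differs from the matrix.

    matrix maps a user-agent token to True (should be allowed) or False
    (should be blocked). This shape is an assumption, not a standard.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return sorted(
        bot for bot, allowed in matrix.items()
        if parser.can_fetch(bot, url) != allowed
    )


# Usage with an illustrative policy and robots.txt body.
matrix = {"GPTBot": False, "OAI-SearchBot": True, "PerplexityBot": True}
deployed = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /checkout/\n"
print(robots_divergence(deployed, matrix, "https://www.example.com/products/widget"))
```

This covers declared policy only. A clean result says nothing about WAF, CDN, or bot-management behavior, which still needs the test fetches and log evidence listed above.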
Index freshness and discovery
- IndexNow fires automatically from the publish path on every content change, including after URL moves. (IndexNow Key)
- Sitemaps contain only canonical, indexable URLs, and lastmod values reflect real changes.
- All redirects resolve in exactly one hop, and old URLs route to correct destinations — not the homepage.
- Canonicals are stable, point to indexable targets, and agree with internal links and sitemaps.
- evidence: verify at the HTTP response level, not only page source — edges and middleware can override
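The one-hop redirect rule above can be audited once the observed redirects are collected. The sketch below assumes a hypothetical `redirects` dict that maps each source URL to the `Location` target actually returned at the HTTP level (gathered separately, for example by a crawler that does not auto-follow redirects). The function names are illustrative.

```python
def follow_redirects(start, redirects, max_hops=10):
    """Return the chain of URLs visited from start; max_hops guards against loops."""
    chain = [start]
    while chain[-1] in redirects and len(chain) <= max_hops:
        chain.append(redirects[chain[-1]])
    return chain


def redirect_issues(old_url, redirects, homepage):
    """List problems with how old_url resolves: no redirect, a chain, or a homepage dump."""
    chain = follow_redirects(old_url, redirects)
    hops = len(chain) - 1
    issues = []
    if hops == 0:
        issues.append("no redirect")
    if hops > 1:
        issues.append(f"{hops} hops: " + " -> ".join(chain))
    if hops >= 1 and chain[-1] == homepage:
        issues.append("lands on homepage")
    return issues


# Usage with made-up paths: /old-b chains through an intermediate URL.
observed = {"/old-a": "/new-a", "/old-b": "/tmp-b", "/tmp-b": "/new-b", "/old-c": "/"}
for old in ("/old-a", "/old-b", "/old-c"):
    print(old, redirect_issues(old, observed, homepage="/"))
```

Whether the final destination is the correct one still needs a mapping of expected targets. This check only catches chains and homepage fallbacks.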
Rendering and extractability
- Critical content, links, and directives are present in the raw HTTP response — most AI crawlers do not execute JavaScript. (Audit Content Extraction)
- Spec and comparison tables are semantic HTML with prose summaries, not layout markup or images.
- Support and product documentation meet the same extractability standards as marketing pages, because AI reasoning modes cite official docs more often, not less.
- No important content sits behind lazy-load triggers, tabs, accordions, or interaction gates.
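One way to test the raw-response item above is to look for critical phrases in the HTML as delivered, before any JavaScript runs. This is a minimal sketch, assuming the raw response body has already been fetched as a string and that you supply the list of phrases that must be extractable. `VisibleText` and `missing_from_raw_html` are hypothetical names.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping script and style contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def missing_from_raw_html(raw_html, required_phrases):
    """Return the phrases that do not appear in the markup's own text."""
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [phrase for phrase in required_phrases if phrase.lower() not in text]


# Usage: the battery spec exists only inside a script, so it is reported missing.
raw = (
    "<html><body><h1>Widget Pro</h1><div id='specs'></div>"
    "<script>render({'battery': '12 hours'})</script></body></html>"
)
print(missing_from_raw_html(raw, ["Widget Pro", "12 hours"]))
```

A phrase reported as missing is one that a crawler without JavaScript execution would likely not see. It does not detect content that is present in the markup but hidden behind tabs or accordions, which needs a separate review.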