Machine-Readable Infrastructure

Last reviewed:

Schema and structured data

  • All product schema is generated from the PIM or commerce platform — no hand-maintained JSON-LD on product templates.
  • Schema on every page type validates without errors or warnings.
    • evidence: validator run on one live URL per template, not on source snippets
  • Schema-declared price and availability match the live commerce API for every priority product in every market. (Structured Data Remediation)
  • Every schema value is visible on the rendered page — no invisible, speculative, or template-defaulted markup.
  • Each catalog category uses its correct schema type (Product with Offer for goods, subscription Offer/PriceSpecification for plans), not one generic template forced onto everything.
  • Schema validation runs as a build-breaking CI gate, not a periodic manual audit.
  • No deprecated or unsupported markup is live. (Schema to Avoid)
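
The price-and-availability check above can be automated roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes each product page carries a single Product block with one Offer, and that the commerce API record has already been fetched into a dict with the same field names. The names JsonLdExtractor, offer_mismatches, and api_record are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed <script type="application/ld+json"> blocks from rendered HTML."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append(json.loads("".join(self._buffer)))


def offer_mismatches(html, api_record):
    """Return (field, schema_value, api_value) for each Offer field that disagrees.

    Assumption: api_record uses the same keys and value formats as the schema
    (price, priceCurrency, availability). Real APIs will need a mapping step.
    """
    parser = JsonLdExtractor()
    parser.feed(html)
    mismatches = []
    for block in parser.blocks:
        if not isinstance(block, dict) or block.get("@type") != "Product":
            continue
        offer = block.get("offers", {})
        if isinstance(offer, list):  # simplification: inspect only the first Offer
            offer = offer[0] if offer else {}
        for field in ("price", "priceCurrency", "availability"):
            if str(offer.get(field)) != str(api_record.get(field)):
                mismatches.append((field, offer.get(field), api_record.get(field)))
    return mismatches
```

Run it against the rendered HTML of one live URL per priority product per market, and treat any non-empty result as a failure of the check.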

Product feeds and catalog data

  • Every catalog shape you sell (physical, digital, subscription) has a feed model that natively represents it — none is forced through a physical-goods template.
  • Feed refresh cadence for price and availability meets AI shopping surface expectations.
    • evidence: feed timestamps show volatile fields refreshed at least hourly, ideally on the 15-minute cadence the surfaces accept
  • Feed values, schema values, and on-page values agree for every priority product — conflicting versions of “what this product is” cause silent exclusion.
  • Per-market feed data (currency, price, availability, ratings) is correct for each market, including preorder and launch-transition states.
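
The hourly refresh evidence above could be checked with something like the sketch below. It assumes each feed item exposes an ISO 8601 timestamp with an explicit UTC offset for its volatile fields. The field names id and updated_at and the function stale_items are hypothetical, so adapt them to whatever your feed actually emits.

```python
from datetime import datetime, timedelta, timezone


def stale_items(feed_items, now, max_age=timedelta(hours=1)):
    """Return IDs of feed items whose volatile fields were refreshed more than max_age ago.

    Assumption: item["updated_at"] is an ISO 8601 string with an offset,
    e.g. "2024-01-01T11:50:00+00:00", and now is timezone-aware.
    """
    stale = []
    for item in feed_items:
        updated = datetime.fromisoformat(item["updated_at"])
        if now - updated > max_age:
            stale.append(item["id"])
    return stale


# Example call with the current time; pass a tighter max_age to test the 15-minute target.
# stale_items(items, datetime.now(timezone.utc), max_age=timedelta(minutes=15))
```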

Crawler access and bot policy

  • A single documented allow/block matrix exists covering training crawlers, search crawlers, and user-triggered fetchers separately, per bot per property. (Enable AI Search Access)
  • The deployed robots.txt on every domain and subdomain matches the central matrix — no divergence between properties.
  • No search-visibility crawler the matrix allows (OAI-SearchBot, Bingbot, PerplexityBot) is blocked anywhere in practice.
    • evidence: server logs show 200s for each allowed user agent
  • WAF, CDN, and bot-management layers do not override permissive robots.txt rules. (Edge Worker Bot Management)
    • evidence: test-fetch key pages as each AI user agent through the production edge
  • Actual bot traffic in server logs matches declared policy. (Parse Logs for AI Bot Behavior)
  • Meta robots and X-Robots-Tag states match intent on every template — no inherited noindex leaks.
  • Any llms.txt investment is justified by observed fetches in your own server logs, not by advocacy. (Schema to Avoid)
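
One way to compare a deployed robots.txt against the central matrix is sketched below using Python's standard urllib.robotparser. The matrix contents and the function name robots_divergences are illustrative assumptions, not a recommended policy.

```python
from urllib.robotparser import RobotFileParser


def robots_divergences(robots_txt, matrix, test_url="/"):
    """Return (agent, intended, actual) for each bot whose robots.txt outcome
    differs from the allow/block matrix.

    matrix maps a user-agent token to True (allow) or False (block).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    divergences = []
    for agent, should_allow in matrix.items():
        allowed = parser.can_fetch(agent, test_url)
        if allowed != should_allow:
            divergences.append((agent, should_allow, allowed))
    return divergences


# Hypothetical matrix: block one training crawler, allow search crawlers.
EXAMPLE_MATRIX = {"GPTBot": False, "OAI-SearchBot": True, "PerplexityBot": True}
```

This only verifies the robots.txt layer. It says nothing about WAF, CDN, or bot-management behavior, which still needs the test fetches and log evidence listed above. Run it once per domain and subdomain, with a test_url for each path class the matrix distinguishes.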

Index freshness and discovery

  • IndexNow fires automatically from the publish path on every content change, including after URL moves. (IndexNow Key)
  • Sitemaps contain only canonical, indexable URLs, and lastmod values reflect real changes.
  • All redirects resolve in exactly one hop, and old URLs route to correct destinations — not the homepage.
  • Canonicals are stable, point to indexable targets, and agree with internal links and sitemaps.
    • evidence: verify at the HTTP response level, not only page source — edges and middleware can override
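
The one-hop redirect rule above can be expressed as a small check. The sketch below keeps the HTTP layer out of the logic: fetch is a hypothetical callable you supply that returns (status_code, location_header) for a URL without following redirects, so the same code works against live responses or recorded fixtures. The names redirect_chain and redirect_problems are illustrative.

```python
REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def redirect_chain(start_url, fetch, max_hops=10):
    """Follow redirects from start_url and return every URL visited, in order."""
    chain = [start_url]
    url = start_url
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in REDIRECT_STATUSES or not location:
            break
        url = location
        chain.append(url)
    return chain


def redirect_problems(expected_targets, fetch):
    """Check each old URL reaches its expected destination in exactly one hop.

    expected_targets maps old URL -> intended final URL.
    Returns (old_url, reason) for each failure.
    """
    problems = []
    for old_url, expected in expected_targets.items():
        chain = redirect_chain(old_url, fetch)
        hops = len(chain) - 1
        if hops != 1:
            problems.append((old_url, f"{hops} hops"))
        elif chain[-1] != expected:
            problems.append((old_url, f"lands on {chain[-1]}"))
    return problems
```

Because the expected destination is explicit for each old URL, a blanket redirect to the homepage shows up as a wrong landing URL rather than passing silently.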

Rendering and extractability

  • Critical content, links, and directives are present in the raw HTTP response — most AI crawlers do not execute JavaScript. (Audit Content Extraction)
  • Spec and comparison tables are semantic HTML with prose summaries, not layout markup or images.
  • Support and product documentation meet the same extractability standards as marketing pages — reasoning modes cite official docs more, not less.
  • No important content sits behind lazy-load triggers, tabs, accordions, or interaction gates.
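
A rough way to test the raw-response requirement above is to extract text from the unrendered HTML, ignoring script and style contents, and confirm that each critical phrase appears. The sketch below uses only the standard library. The names VisibleText and missing_from_raw_html are hypothetical, and the list of required phrases is something you would define per template.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes from raw HTML, skipping script, style, and template contents."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def missing_from_raw_html(raw_html, required_phrases):
    """Return the required phrases that do not appear in the raw response's text."""
    parser = VisibleText()
    parser.feed(raw_html)
    # Collapse whitespace and compare case-insensitively.
    text = " ".join(" ".join(parser.parts).split()).lower()
    return [phrase for phrase in required_phrases if phrase.lower() not in text]
```

Feed it the body of a plain HTTP GET with no JavaScript execution. Any phrase it returns is content that exists only after client-side rendering or interaction, which most AI crawlers will never see.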