Sources

Curated links move to the Sources block on the page that actually uses them — every playbook, template, briefing, and news issue carries one. This page is for the genuinely cross-cutting libraries: docs, crawler identity, and tools that don’t belong to any single page.

Official documentation

Search engines

Google Search Central — the canonical Google Search doc tree.
Google Search Essentials — the baseline technical requirements, spam policies, and key best practices for eligibility.
How Google Search Works — the crawl, index, and ranking pipeline at the system level.
Google Search Central Blog — official announcements, deprecations, and clarifications.
Google Search ranking systems guide — the canonical list of active ranking systems; don’t conflate “systems,” “spam policies,” and “core updates.”
Google spam policies — cloaking, doorway pages, link spam, machine-generated content, scraped content.
Google Search update history — confirmed dates for core, spam, and product-review updates.
Google manual actions documentation — how to identify a manual action and file reconsideration.
Google Search Quality Rater Guidelines (PDF) — the human-evaluation framework that defines E-E-A-T and YMYL.
Google Search Status Dashboard — check before treating a site issue as a Google-side incident.
Bing Webmaster Guidelines — Microsoft’s core search rules and quality expectations.
Bing Webmaster Blog — Microsoft’s official announcement channel.

AI features and crawler documentation

Google AI features and your website — AI Overviews, AI Mode, and the site controls that actually apply.
About AI Overviews — Google’s help-center explanation of how AI Overviews work.
Google: creating helpful, reliable, people-first content — the closest official statement of what Google rewards at the content level.
Google: common crawlers, including Google-Extended — Googlebot variants and the AI-training opt-out token, explained.
OpenAI bot documentation — GPTBot, OAI-SearchBot, OAI-AdsBot, and ChatGPT-User, and which pipeline each controls.
Anthropic: does Anthropic crawl data from the web — ClaudeBot policy and opt-out.
PerplexityBot documentation — user-agent, IP ranges, and robots.txt handling.
About Applebot — Apple’s crawler documentation, covering the crawler behind Siri, Spotlight, and Safari features.
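The crawler tokens documented in the links above are controlled through robots.txt groups. The sketch below assumes a hypothetical policy that opts out of the training tokens (GPTBot, Google-Extended) while leaving search-facing crawlers alone. It uses Python's standard urllib.robotparser, whose matching is its own reading of the protocol and may differ from how each vendor's crawler interprets the file. The policy text and the example.com URL are illustrative assumptions, not a recommendation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block training tokens, allow everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""


def crawl_decisions(robots_txt, tokens, url):
    """Return {token: allowed} for each user-agent token against one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


decisions = crawl_decisions(
    ROBOTS_TXT,
    ["GPTBot", "Google-Extended", "OAI-SearchBot", "Googlebot"],
    "https://example.com/guide",  # placeholder URL
)
for token, allowed in decisions.items():
    print(f"{token}: {'allowed' if allowed else 'blocked'}")
```

Running a check like this before deploying a robots.txt change is a cheap way to confirm that blocking a training token did not also block the search crawler that shares a vendor with it.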

Standards and structured data

Schema.org — the root vocabulary for structured data types and properties.
Google: general structured data guidelines — technical and quality requirements for rich-result eligibility.
RFC 9309 — Robots Exclusion Protocol — the formal standard for robots.txt behavior.
Google: robots.txt introduction — Google’s crawler-specific interpretation.
Google: robots meta tag and X-Robots-Tag — page-level and header-level crawl and snippet directives.
Google: JavaScript SEO basics — crawl, render, index as distinct phases.
Sitemaps Protocol — the reference spec for sitemap and sitemap-index XML.
IndexNow documentation — implementation details for keys, endpoints, and submission mechanics.
Google: redirects and Google Search — permanent vs. temporary redirect handling.
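As a concrete reference point for the Sitemaps Protocol entry above, here is a minimal sketch of generating a sitemap with Python's standard library. The function name build_sitemap and the example.com URLs are hypothetical; only the urlset/url/loc/lastmod element names and the namespace URI come from the protocol itself.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (loc, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # ElementTree escapes & for us
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    ("https://example.com/", "2024-01-15"),      # placeholder URLs and date
    ("https://example.com/?a=1&b=2", None),
])
print(sitemap_xml)
```

Building the document through an XML library rather than string concatenation keeps entity escaping correct, which is the most common way hand-rolled sitemaps break.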

Entity and knowledge graph

Google Knowledge Graph API — authentication, query syntax, and response format for kgsearch.googleapis.com.
Wikidata — structured knowledge base feeding Google’s Knowledge Graph and LLM grounding.
Wikidata SPARQL Query Service — full SPARQL endpoint for structured entity queries.
Wikidata notability policy — required before creating a new entity entry.
Google: claim a Knowledge Panel — the correction pathway for verified owners.
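For the Knowledge Graph API entry above, a minimal sketch of building an entities:search request URL and reading the JSON-LD response shape. The helper names, the API key placeholder, and the sample response are hypothetical illustrations; the sample is hand-written to mirror the documented structure and is not real API output.

```python
from urllib.parse import urlencode

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_kg_search_url(query, api_key, limit=5, languages="en"):
    """Assemble a search URL; api_key is whatever key your project issues."""
    params = {"query": query, "key": api_key, "limit": limit,
              "languages": languages}
    return f"{KG_ENDPOINT}?{urlencode(params)}"


def summarize_results(response):
    """Pull (entity id, name, score) out of an entities:search response."""
    rows = []
    for element in response.get("itemListElement", []):
        result = element.get("result", {})
        rows.append((result.get("@id"), result.get("name"),
                     element.get("resultScore")))
    return rows


# Hand-written placeholder in the documented response shape (not real data).
SAMPLE_RESPONSE = {
    "itemListElement": [
        {"result": {"@id": "kg:/m/EXAMPLE", "name": "Example Corp"},
         "resultScore": 1.0},
    ]
}

url = build_kg_search_url("Example Corp", "YOUR_API_KEY", limit=1)
rows = summarize_results(SAMPLE_RESPONSE)
print(url)
print(rows)
```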

Commerce

Google Merchant Center product feed spec — the core spec for Google Shopping and merchant data ingestion.
Schema.org Product — the on-page product vocabulary a feed should agree with.

Crawler user-agents & verified IP ranges

OAI-SearchBot (ChatGPT Search), GPTBot (training), OAI-AdsBot (ad-page verification), ChatGPT-User (user-triggered) — four independent OpenAI tokens; see OpenAI bot documentation for current IP ranges.
Googlebot (Search indexing, drives AI Overviews/AI Mode) vs. Google-Extended (Gemini training opt-out only, no Search effect) — see Google’s common crawlers.
Googlebot IP ranges (JSON) — published, updated by Google.
Bingbot (Bing Search and Copilot, one token for both) — see Bing: which crawlers does Bing use. No comprehensive static IP file; verify via reverse DNS, which should resolve to a hostname under *.search.msn.com.
Verify Bingbot — Microsoft’s authenticity-check tool; don’t trust the user-agent string alone.
PerplexityBot — see PerplexityBot documentation for published IP ranges.
ClaudeBot — Anthropic publishes crawler policy but not a stable standalone IP file; combine UA validation with log review.
Applebot — see About Applebot; Apple publishes an IP file.
ASN reference: Google is AS15169, Microsoft is AS8075. ASN matching is useful for triage, not proof of bot identity — pair it with UA and, where available, reverse-DNS or published-IP verification. IP ranges move; a WAF rule with no review cadence becomes a silent outage.
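The verification methods named in this section (published IP ranges and reverse DNS) can be sketched as follows. This is a minimal illustration under stated assumptions, not a production verifier: the range document mirrors the "prefixes" shape of Google's published JSON file, the sample addresses come from documentation-reserved ranges rather than real crawler IPs, and the function names are hypothetical. The DNS lookups are injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket


def ip_in_published_ranges(ip, ranges_doc):
    """Check an IP against a {"prefixes": [{"ipv4Prefix"|"ipv6Prefix": cidr}]} doc."""
    address = ipaddress.ip_address(ip)
    for prefix in ranges_doc.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr and address in ipaddress.ip_network(cidr):
            return True
    return False


def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse=socket.gethostbyaddr,
                           forward=socket.getaddrinfo):
    """Reverse-resolve the IP, check the hostname suffix, then resolve the
    hostname forward and confirm it maps back to the same IP."""
    address = ipaddress.ip_address(ip)
    try:
        hostname = reverse(ip)[0].lower().rstrip(".")
    except OSError:
        return False
    if not hostname.endswith(tuple(allowed_suffixes)):
        return False
    try:
        infos = forward(hostname, None)
    except OSError:
        return False
    return any(ipaddress.ip_address(info[4][0]) == address for info in infos)


# Placeholder ranges from documentation-reserved blocks, not real crawler IPs.
SAMPLE_RANGES = {"prefixes": [{"ipv4Prefix": "192.0.2.0/28"},
                              {"ipv6Prefix": "2001:db8::/64"}]}

print(ip_in_published_ranges("192.0.2.10", SAMPLE_RANGES))
print(ip_in_published_ranges("192.0.2.200", SAMPLE_RANGES))
```

With live DNS, a call such as forward_confirmed_rdns(client_ip, [".search.msn.com"]) applies the Bingbot check described above. The forward-confirmation step matters because a reverse DNS record alone is controlled by whoever owns the IP block and can claim any hostname.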

Tool landscape

First-party reporting

Google Search Console — Google reports AI-feature activity inside the standard web reporting, not in a standalone AI dashboard.
Google Search Console: Performance report documentation — what impressions, clicks, and position include and exclude.
Google Analytics — for judging whether a visibility change turned into actual visits or business behavior.
Bing Webmaster Tools — currently the strongest first-party AI citation reporting surface (AI Performance).

AI visibility monitoring (third-party, directional — not ground truth)

Profound — repeated brand and citation monitoring across AI answer environments.
seoClarity — AI visibility inside a broader enterprise SEO workflow.
Ahrefs Brand Radar — brand mention and visibility-pattern tracking.
Semrush — AI visibility alongside conventional competitive and keyword workflows.
OtterlyAI — lighter prompt tracking and recurring answer checks.
Peec AI — narrower monitoring around AI answer visibility and citation movement.
Scrunch — directional monitoring for repeated prompt and answer checks.

Crawling, validation, and monitoring

Screaming Frog SEO Spider — the default crawler for hands-on technical audits and extraction diffing.
Screaming Frog Log File Analyser — inspect crawl behavior from logs.
Botify — enterprise crawl and log-analysis tooling for sites past desktop-crawler scale.
Google Rich Results Test — validator for rich-result eligible markup.
Schema Markup Validator — broader Schema.org syntax and structure validation, independent of Google’s parser.
Google PageSpeed Insights — Core Web Vitals field and lab comparison for a single URL.
Chrome UX Report (CrUX) — public field performance data at origin and URL-pattern level; field data decides Core Web Vitals pass/fail, while lab tools are for debugging.
WebPageTest — request-level detail, filmstrips, repeatable throttled tests.
Lighthouse — repeatable lab audits; not a substitute for field data.
web-vitals JavaScript library — Google’s own real-user-monitoring measurement library.

Research and competitive tools

Ahrefs — backlinks, content-gap work, competitive research.
SISTRIX — particularly useful in some international markets.
Similarweb — market share, traffic-shape, and category movement rather than page-level diagnosis.
Google Trends — directional demand shifts, not absolute keyword volume.

Answer engines and AI search surfaces worth tracking

ChatGPT Search, Perplexity, Microsoft Copilot, Google AI Mode — the primary surfaces most citation and extraction work targets.
Brave Search, DuckDuckGo, You.com — worth checking when independent-index or privacy-oriented behavior is part of the audience mix, not a default priority.
International: Baidu (China), Yandex (Russia and neighboring markets), Naver (South Korea) — relevant only when geography or language makes them real traffic surfaces.

Industry publications (context and synthesis — verify against primary sources above)

Search Engine Land, Search Engine Roundtable, Search Engine Journal — fast reporting; a good first stop, not the final word.
Ahrefs Blog — practical studies and hands-on experimentation.

Rule of readmission: a link removed from this page returns only with evidence — a request, a citation, or usage data — never for catalog completeness.