Technical SEO
Search Foundations and Governance
Core Manuals
- Google Search Essentials: Technical Requirements Overview
- Indexability Requirements: HTTP Status, Robots Rules, and Content Accessibility
- Google Search Spam Policies and Quality Guidelines
- How Google Search Works: Crawling, Rendering, Indexing, and Serving
- Google Search Console: Setup, Verification, and Baseline Monitoring
- Google Search Console URL Inspection Tool: Debugging Crawl and Indexing Issues
- Bing Webmaster Guidelines and Indexing Requirements
- Bing Webmaster Tools: Setup, Verification, and Baseline Monitoring
- Bing Webmaster Tools URL Inspection and Submission Tool
Crawling, Discovery, and URL Control
XML Sitemaps and Discovery
- XML Sitemap Creation and Submission
- Sitemap Index Files for Large Sites
- Image Sitemaps for Content Discovery
- News Sitemaps for Google News Eligibility
Robots.txt and Crawl Control
- robots.txt directives: User-agent, Disallow, and Allow are standardized in RFC 9309; Sitemap and Crawl-delay are non-standard extensions — Crawl-delay is honored by Bing but ignored by Google
- Meta robots directives: noindex, nofollow, nosnippet, noarchive, max-snippet:[n], max-image-preview:[setting], max-video-preview:[n] — placed in <meta name="robots"> or per-bot (googlebot, bingbot)
- X-Robots-Tag: HTTP header equivalent of meta robots — use for PDFs, images, and other non-HTML resources
- Crawl budget: Prioritize unique, valuable URLs; reduce crawl waste from parameterized duplicates, infinite scroll traps, and session-ID URLs
- Faceted navigation: Block or noindex low-value filter combinations; canonical parameterized pages to the primary category URL; keep core category paths crawlable
- Internal link crawlability: Use real <a href> links in navigation — JavaScript-only links, onclick handlers, and CSS-rendered menus may not be followed by crawlers
- data-snippet (Bing-specific): HTML attribute applied to any element to designate exactly which text Bing may display or cite in Copilot answers — more precise than NOSNIPPET, which blocks snippets entirely
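The crawl-control directives above can be illustrated with a minimal sketch — paths and the sitemap URL are hypothetical placeholders:

```text
# robots.txt — illustrative directives (RFC 9309 syntax)
User-agent: *
Disallow: /search          # block internal search results
Disallow: /*?sessionid=    # block session-ID URLs (crawl waste)
Allow: /search/help        # a more specific Allow overrides the broader Disallow
Sitemap: https://www.example.com/sitemap.xml

# HTTP header equivalent of meta robots, e.g. for a PDF:
# X-Robots-Tag: noindex, nosnippet
```

Note that robots.txt controls crawling, not indexing — a disallowed URL can still be indexed from external links; use noindex (meta robots or X-Robots-Tag) to keep a URL out of the index.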
IndexNow Protocol
- IndexNow: Real-Time URL Submission to Search Engines
- Submit changed URLs immediately to Bing, Yandex, and other participating search engines
- Faster discovery than waiting for next crawl
- Use accurate lastmod values in XML sitemaps — sitemaps should reflect the last meaningful content change, not the date the sitemap was regenerated
- Support HTTP freshness signals (ETag, Last-Modified) where available; these allow crawlers to detect changes efficiently without full re-fetching
- See also: AI SEO Crawlability for AI-specific submission workflows
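A batch IndexNow submission is a single JSON POST. The sketch below builds the payload and submits it; the host, key, and URL are placeholders — the key must match the key file hosted on your own site:

```python
import json
from urllib import request

def build_indexnow_payload(host, key, urls, key_location=None):
    """Build the JSON body for an IndexNow batch submission."""
    payload = {"host": host, "key": key, "urlList": list(urls)}
    if key_location:
        payload["keyLocation"] = key_location
    return payload

def submit(payload, endpoint="https://api.indexnow.org/indexnow"):
    """POST the payload; a 200/202 response means the batch was accepted."""
    req = request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    with request.urlopen(req) as resp:
        return resp.status

# Placeholder host, key, and URL — substitute your own values.
payload = build_indexnow_payload(
    "www.example.com",
    "your-indexnow-key",
    ["https://www.example.com/updated-page"],
)
```

Submitting to the shared api.indexnow.org endpoint propagates the URLs to all participating engines, so one POST covers Bing, Yandex, and others.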
Canonicalization and URL Consolidation
- Self-referencing canonical: Every indexable page should include <link rel="canonical" href="..."> pointing to itself
- One canonical per page: Multiple conflicting canonicals cause engines to ignore all of them
- HTTPS preferred: Canonical URL should always be the HTTPS version; HTTP should 301 to HTTPS
- Consistent URL format: Match trailing slashes, lowercase, and www/non-www across canonical, internal links, and sitemap
- Don't canonical to redirects: The canonical target must return 200, not a redirect
- Cross-domain canonical: Supported — useful for syndicated content, but the non-canonical URL may still be indexed
- HTTP header canonical: Use Link: <url>; rel="canonical" for PDFs, images, and non-HTML resources
- Stable, consistent canonical URLs support long-term citation continuity in Bing/Copilot and other answer engines — URL instability breaks grounding references even when content is unchanged
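The "consistent URL format" rules can be enforced with a small normalizer. This is a sketch — the specific policy (HTTPS, lowercase host, no trailing slash on non-root paths) is a site-level choice, not a universal standard:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_canonical(url):
    """Normalize a URL to one consistent canonical form:
    force HTTPS, lowercase the host, drop the port and fragment,
    and strip trailing slashes on non-root paths."""
    parts = urlsplit(url)
    host = parts.hostname.lower() if parts.hostname else ""
    path = parts.path or "/"
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")
    return urlunsplit(("https", host, path, parts.query, ""))
```

Running every URL emitted in canonicals, internal links, and sitemaps through one function like this prevents the format drift that splits indexing signals across duplicates.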
HTTP Redirects and Status Codes
- HTTP Redirects for SEO: 301, 302, 307, 308, Meta Refresh, and JavaScript Redirects
- Site Moves and Domain Migrations
- HTTP Status Codes for SEO
- Soft 404 Detection and Prevention
- Removed Content Handling: 404, 410, and noindex
Indexing and Snippet Controls
- noindex placement: Add <meta name="robots" content="noindex"> in HTML <head>, or send X-Robots-Tag: noindex as an HTTP header for non-HTML resources
- noindex vs. canonical: Use noindex to remove a page from the index entirely; use rel="canonical" to consolidate signals toward a preferred duplicate — do not combine both on the same page
- Pagination: Ensure all paginated pages are linked via crawlable <a href> links; rel="next/prev" is deprecated but still treated as a hint by some engines; avoid noindexing deep pages that contain unique products or content
- Snippet-control directives (NOSNIPPET, data-nosnippet, NOCACHE, NOARCHIVE) affect AI answer preview richness and citation depth, especially in Bing/Copilot — review these controls specifically for grounding eligibility, not only for standard snippet suppression
- Google Query Fan-Out: AI Overviews and AI Mode may issue multiple related sub-queries across subtopics; well-interlinked subtopic pages are more discoverable through fan-out than through single-query indexing alone
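The snippet controls above look like this in markup — a hedged sketch with placeholder text:

```html
<!-- Exclude boilerplate from snippets while leaving the page indexable -->
<p data-nosnippet>Legal boilerplate that should not appear in snippets.</p>

<!-- Bing-specific: designate the exact text Bing may quote in Copilot answers -->
<p data-snippet>Concise, citable summary of the page's main answer.</p>

<!-- Page-level snippet limits via meta robots -->
<meta name="robots" content="max-snippet:160, max-image-preview:large">
```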
Implementation and Migration Guides
- New Site Launch SEO Playbook: Pre-Launch Checklist and Go-Live Process
- Site Migration SEO Playbook: Redirects, Canonicals, Crawl, and Monitoring
Crawling and Site Auditing Tools
Rendering, JavaScript, Mobile, and Performance
JavaScript SEO
- Rendering pipeline: Google crawls HTML first, then queues pages for rendering (can take seconds to days) — content requiring JS to appear is not indexed until rendered
- SSR / SSG preferred: Server-side rendering or static-site generation ensures content is in the initial HTML response; client-side rendering delays discovery
- Dynamic rendering: Serve pre-rendered HTML to bots if SSR is not feasible — treat as a temporary workaround, not a long-term solution
- Critical content check: Primary text, links, canonicals, meta robots, and structured data must all appear in the rendered HTML — test with URL Inspection → "View Tested Page"
- Lazy-loading risks: Content behind scroll-triggered or click-triggered lazy loading may never be rendered by Googlebot — use native loading="lazy" on images only, not on primary text content
- JS-generated SEO signals: Canonicals, hreflang, and structured data injected by JavaScript are supported but riskier — render delays and errors can cause them to be missed
- SPA routing: Use History API with real URLs (not hash fragments); each route must return a unique, fully-rendered page with correct status codes
- Testing: Compare raw HTML source vs. rendered HTML in URL Inspection; use Puppeteer, Rendertron, or DevTools to debug render issues locally
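The raw-vs-rendered comparison can be partially automated. This sketch takes the raw HTML response and the rendered DOM (e.g. captured via Puppeteer) as strings and reports which SEO signals only exist after JavaScript runs — the regex patterns are deliberately simplified illustrations:

```python
import re

# SEO-critical signals to look for in both the raw HTML response
# and the browser-rendered DOM (simplified patterns for illustration).
SIGNALS = {
    "canonical": re.compile(r'rel=["\']canonical["\']', re.I),
    "meta_robots": re.compile(r'name=["\']robots["\']', re.I),
    "json_ld": re.compile(r'application/ld\+json', re.I),
}

def js_dependent_signals(raw_html, rendered_html):
    """Return signals present only after JavaScript rendering —
    these risk being missed if rendering fails or is delayed."""
    return sorted(
        name for name, pat in SIGNALS.items()
        if pat.search(rendered_html) and not pat.search(raw_html)
    )
```

Any signal this flags is a candidate for moving into the server-rendered HTML, per the SSR/SSG guidance above.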
HTML5 and Accessibility
- Semantic landmarks: Use <main> (one per page — primary content), <article> (self-contained piece), <section> (thematic grouping with heading), <nav> (navigation blocks), <header>/<footer> (page or section level), <aside> (tangentially related content)
- Why it matters for SEO: Semantic elements help search engines identify primary content vs. boilerplate — <main> signals the core content area, <nav> identifies navigation links, <aside> marks secondary content
- ARIA landmarks: Use role="banner", role="navigation", role="main", role="contentinfo" when semantic HTML elements are not available — avoid redundant ARIA on elements that already have implicit roles
- Heading hierarchy: One <h1> per page, logical nesting (h2 → h3 → h4), no skipped levels — headings are strong relevance signals
- Cookie banner / CMP risks: Consent banners rendered via JavaScript can inject content above <main> that shifts layout (CLS), block rendering of primary content, or add noindex-like behavior if misconfigured — test rendered HTML with and without consent
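The heading-hierarchy rules are mechanical enough to check in an audit script — a minimal sketch using only the standard library:

```python
from html.parser import HTMLParser

class HeadingAudit(HTMLParser):
    """Collect heading levels (h1–h6) in document order."""
    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self.levels.append(int(tag[1]))

def heading_issues(html):
    """Flag multiple (or missing) h1s and skipped levels (e.g. h2 -> h4)."""
    parser = HeadingAudit()
    parser.feed(html)
    issues = []
    if parser.levels.count(1) != 1:
        issues.append("expected exactly one <h1>")
    for prev, cur in zip(parser.levels, parser.levels[1:]):
        if cur > prev + 1:
            issues.append(f"skipped level: h{prev} -> h{cur}")
    return issues
```

Checks like this fit naturally into the CI/CD regression tests described later in this document.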
Mobile-First Indexing
- Default crawl agent: Google uses the mobile Googlebot (smartphone user agent) for all crawling and indexing — the desktop version of your site is not what gets indexed
- Content parity required: All content, links, structured data, meta tags, and alt text must be present on the mobile version — desktop-only content will not be indexed
- Responsive design preferred: Single URL, same HTML, CSS adapts layout — simplest to maintain and least error-prone
- Dynamic serving: Same URL but different HTML per user agent — must use Vary: User-Agent header; higher risk of parity drift
- Separate mobile URLs (m.*): Supported but complex — requires bidirectional annotations (rel="alternate" / rel="canonical"), easy to misconfigure
- Verification: Use URL Inspection in GSC to confirm the mobile render matches expectations — check that no content, links, or structured data are missing from the mobile version
Core Web Vitals and Performance
- LCP (Largest Contentful Paint) — target < 2.5 s: Optimize hero image (compress, use modern format, preload), reduce server response time (TTFB), eliminate render-blocking CSS/JS, use CDN for static assets
- INP (Interaction to Next Paint) — target < 200 ms: Break long JavaScript tasks into smaller chunks, defer non-critical scripts, reduce main-thread work, use requestIdleCallback for low-priority work, minimize DOM size
- CLS (Cumulative Layout Shift) — target < 0.1: Set explicit width / height on images and embeds, reserve space for ads and dynamic content, avoid injecting content above the fold after load, use font-display: swap with size-adjusted fallback fonts
- Field vs. lab data: Field data (CrUX / real users) determines ranking impact — lab data (Lighthouse / WebPageTest) is for debugging only; a page can pass lab tests but fail in the field due to real-world device and network conditions
- Page experience signals: CWV + HTTPS + no intrusive interstitials + mobile-friendly = page experience; these are tie-breaker signals, not dominant ranking factors
- Crawl efficiency: Server response time affects crawl rate — slow TTFB reduces how many pages Googlebot can crawl per session; 5xx errors and timeouts waste crawl budget
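Several of the LCP and CLS tactics above reduce to a few lines of markup — a sketch with hypothetical asset paths:

```html
<!-- LCP: preload the hero image and declare dimensions (also prevents CLS) -->
<link rel="preload" as="image" href="/hero.avif" fetchpriority="high">
<img src="/hero.avif" width="1200" height="630" alt="Product hero">

<!-- Lazy-load below-the-fold images only, never the LCP image -->
<img src="/gallery-1.avif" width="600" height="400" loading="lazy" alt="Gallery">

<!-- CLS: avoid invisible or shifting text during web-font load -->
<style>
  @font-face {
    font-family: "BodyFont";
    src: url("/fonts/body.woff2") format("woff2");
    font-display: swap;
  }
</style>
```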
Performance and Rendering Tools
International and Multi-Regional SEO
Hreflang and International SEO
- Hreflang syntax: <link rel="alternate" hreflang="en-US" href="…" /> — language code (ISO 639-1) optionally followed by region code (ISO 3166-1 Alpha 2)
- x-default: Use hreflang="x-default" for the fallback page (usually language selector or global homepage) — required for complete hreflang sets
- Placement options: HTML <head> (most common), HTTP Link header (for non-HTML like PDFs), or XML sitemap (best for large-scale sites) — pick one method per URL set
- Return links required: Every hreflang annotation must be reciprocal — if page A references page B, page B must reference page A; missing return links cause the annotation to be ignored
- URL structure options: ccTLDs (strongest geo signal, highest cost), subdomains (gTLD + subdomain, moderate signal), subdirectories (simplest, relies on GSC geotargeting) — all are valid, choose based on infrastructure
- Canonical + hreflang alignment: The canonical URL of each page must be the URL used in hreflang annotations — if a page canonicalizes to a different URL, its hreflang is ignored
- Common mistakes: Mixing language-only and language+region codes inconsistently, using country codes where language codes are needed (e.g., hreflang="uk" is wrong — use hreflang="en-GB"), forgetting self-referencing hreflang
- Avoid IP-based redirects: Do not redirect crawlers based on IP geolocation — Googlebot crawls primarily from the US; geo-redirects can prevent non-local versions from being indexed
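The return-link requirement is the most common hreflang failure and is easy to audit. A sketch: given each page's hreflang annotations (as a mapping from page URL to its hreflang→URL set), find annotations with no reciprocal link back:

```python
def missing_return_links(annotations):
    """annotations: {page_url: {hreflang_code: target_url}}.
    Returns (source, target) pairs where the target page does not
    annotate back to the source — engines ignore such annotations."""
    missing = []
    for page, alternates in annotations.items():
        for target in alternates.values():
            if target == page:
                continue  # self-reference needs no return link
            back = annotations.get(target, {})
            if page not in back.values():
                missing.append((page, target))
    return missing
```

In practice the annotations dict would be built from a crawl of the live <head> tags or from the hreflang sitemap, then this check runs over the full set.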
Monitoring, QA, Log Analysis, and SEO Operations
Monitoring and QA Workflows
- GSC Performance report: Monitor clicks, impressions, CTR, and position by query, page, country, device — check weekly for sudden drops; compare date ranges to detect regressions
- GSC Page Indexing report: Review "Not indexed" reasons (crawled but not indexed, discovered but not crawled, noindex, redirect, 404) — prioritize fixing pages that should be indexed
- GSC Rich Results report: Monitor valid/invalid structured data items — errors here mean lost rich result eligibility; check after any schema deployment
- GSC Crawl Stats report: Review total crawl requests, response codes, file types, crawl purpose (discovery vs. refresh) — sudden crawl drops may signal server issues or robots.txt changes
- GSC Sitemaps report: Confirm submitted sitemaps are processed, track discovered vs. indexed URL counts, watch for errors in sitemap parsing
- Bing WMT equivalents: Search Performance (clicks, impressions, CTR), Crawl Information (crawl activity, errors), SEO Reports (site scan for common issues), URL Inspection
- Video Indexing report: Check coverage for pages with VideoObject schema — specific issues include "video not detected" and "thumbnail missing"
- Structured data QA: After deploying schema changes, validate with Rich Results Test, then monitor GSC Rich Results report for 2–4 weeks for new errors
- Recrawl request workflow: Use URL Inspection → "Request Indexing" in GSC for urgent pages; use IndexNow for Bing; submit updated sitemaps for bulk changes
Regression Testing and QA
- Post-deploy checks: After every release, verify: canonical tags intact, meta robots unchanged, structured data still valid, redirects still working, rendered HTML matches expectations
- Staging environment controls: Block staging from crawlers using noindex + robots.txt Disallow + password protection (belt-and-suspenders) — a single method alone is not reliable enough
- CMS template governance: Embed SEO fields (title, meta description, canonical, schema, robots) directly into page templates — do not rely on manual entry per page; template changes should trigger SEO review
- Automated regression tests: Add SEO assertions to CI/CD: check for <title>, canonical, noindex absence on production pages, valid JSON-LD, correct HTTP status codes
- Post-redesign validation: Compare pre/post crawl data (Screaming Frog diff) for canonical changes, missing pages, orphaned URLs, broken internal links, lost structured data
- Component library QA: If structured data is embedded in reusable components, test each component variant in isolation — a bug in a shared component affects every page using it
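The CI/CD assertions above can be sketched as one check function. This uses simplified regexes for illustration — a production check would parse the HTML properly, and it assumes the canonical link writes rel before href:

```python
import json
import re

def seo_regressions(html, expected_canonical):
    """Minimal post-deploy checks for one production page: title present,
    canonical correct, no stray noindex, JSON-LD blocks parse."""
    problems = []
    if not re.search(r"<title>[^<]+</title>", html, re.I):
        problems.append("missing <title>")
    canonical = re.search(
        r'<link[^>]+rel=["\']canonical["\'][^>]*href=["\']([^"\']+)', html, re.I)
    if not canonical or canonical.group(1) != expected_canonical:
        problems.append("canonical missing or wrong")
    if re.search(r'<meta[^>]+name=["\']robots["\'][^>]*noindex', html, re.I):
        problems.append("noindex present on production page")
    for block in re.findall(r'<script[^>]+ld\+json[^>]*>(.*?)</script>',
                            html, re.S | re.I):
        try:
            json.loads(block)
        except ValueError:
            problems.append("invalid JSON-LD")
    return problems
```

Wired into CI, a non-empty return list fails the build before the regression reaches crawlers.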
Log File Analysis
- What to extract: Bot user agent, requested URL, response code, response time, bytes sent, timestamp — filter to known search bot user agents for SEO analysis
- Bot segmentation: Separate verified Googlebot, Bingbot, OAI-SearchBot, Applebot, PerplexityBot from unknown bots and scrapers — use reverse DNS verification (not just user-agent strings) for accurate segmentation
- Crawl frequency analysis: Group crawl hits by URL pattern (template type, directory, depth) — identify which sections get crawled most/least; compare against sitemap priority
- Crawl waste identification: Flag URLs that consume crawl budget without value: parameterized duplicates, paginated archives, soft 404s, trapped faceted navigation, calendar/infinite scroll pages
- Correlate with index coverage: Cross-reference log data with sitemap URLs and GSC Page Indexing report — pages in sitemap but never crawled indicate discovery problems; frequently crawled but not indexed suggests quality issues
- Migration and launch monitoring: During site migrations, monitor logs in real time for: old URL crawl drop-off, new URL crawl pickup, redirect chain hits, unexpected 404/410 spikes, crawl rate changes
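A minimal version of the extraction and grouping steps above, assuming Combined Log Format access logs (the regex is a sketch and may need adjusting for your server's format):

```python
import re
from collections import Counter

# Combined Log Format, e.g.:
# 66.249.66.1 - - [10/Jan/2025:12:00:00 +0000] "GET /shop/widgets HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')

BOT_TOKENS = ("Googlebot", "bingbot", "OAI-SearchBot", "Applebot", "PerplexityBot")

def bot_hits_by_section(log_lines):
    """Count search-bot requests per top-level directory.
    User-agent matching alone is not proof of a real bot —
    pair this with the DNS verification described below."""
    counts = Counter()
    for line in log_lines:
        m = LINE.search(line)
        if not m or not any(tok in m.group("ua") for tok in BOT_TOKENS):
            continue
        section = "/" + m.group("url").lstrip("/").split("/", 1)[0]
        counts[section] += 1
    return counts
```

Grouping by template or directory like this is what surfaces crawl-waste patterns (e.g. a faceted-navigation path absorbing most of the crawl budget).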
Bot Validation and Management
- Googlebot verification: Reverse DNS lookup on the IP must resolve to *.googlebot.com or *.google.com, then forward DNS must resolve back to the original IP
- Bingbot verification: Reverse DNS must resolve to *.search.msn.com; Bing also publishes IP ranges in Bing Webmaster Tools documentation
- AI search bot IPs: OAI-SearchBot publishes ranges in searchbot.json; Applebot publishes ranges in applebot.json — allowlist these in WAF/CDN rules
- Spoofed user-agent detection: Any bot claiming to be Googlebot or Bingbot but failing reverse+forward DNS is spoofed — block or deprioritize
- WAF/CDN allowlisting: Verify that rate limiting, bot mitigation, and CAPTCHA rules do not accidentally block legitimate search and AI crawlers
- Log segmentation: Segment server logs by verified bot (Googlebot, Bingbot, OAI-SearchBot, Applebot, PerplexityBot) vs. unknown bots to distinguish valuable crawl activity from scraping and abuse
- Bing explicitly ties crawl waste (duplicate low-value URLs, thin pages, excessive pagination) to reduced indexing depth and lower grounding eligibility — crawl budget is not only a traditional SEO concern but also an AI visibility concern
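The reverse+forward DNS check can be sketched as follows; the domain suffixes are illustrative — confirm current values against each engine's documentation before relying on them:

```python
import socket

# Expected reverse-DNS suffixes per crawler (illustrative — verify
# against each engine's published documentation).
BOT_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def hostname_matches(hostname, suffixes):
    """Suffix check applied to the reverse-DNS hostname."""
    host = hostname.rstrip(".").lower()
    return host.endswith(tuple(s.lower() for s in suffixes))

def verify_bot_ip(ip, bot):
    """Reverse DNS -> suffix check -> forward DNS back to the same IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not hostname_matches(hostname, BOT_DOMAINS[bot]):
        return False
    return ip in socket.gethostbyname_ex(hostname)[2]
```

The suffix check alone is not sufficient: the forward lookup is what defeats spoofers who control reverse DNS for their own IPs (e.g. `crawl.googlebot.com.evil.net`).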
CDN and Caching for SEO
- Cache-Control: Use Cache-Control: public, max-age= for static assets; for HTML pages, use short TTLs or no-cache with ETag / Last-Modified to balance freshness and performance
- Vary header: Use Vary: User-Agent if serving different HTML per device (dynamic serving) — without it, CDN may serve mobile HTML to Googlebot desktop or vice versa
- ETag / Last-Modified: Enable conditional requests (If-None-Match, If-Modified-Since) so crawlers get 304 Not Modified for unchanged content — reduces crawl load
- Stale content risk: Aggressive caching can serve outdated HTML (old canonicals, removed noindex, stale structured data) to crawlers — purge cache after SEO-critical changes
- Cache invalidation: After publishing content updates, redirects, or schema changes, invalidate CDN cache for affected URLs immediately — delayed purging can cause crawlers to see old content for hours or days
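The conditional-request decision reduces to a small function. This sketch compares validators as plain strings for clarity — a real server parses If-Modified-Since as an HTTP date — but it shows the precedence: ETag wins over Last-Modified:

```python
def conditional_response(request_headers, resource_etag, resource_last_modified):
    """Decide between 200 and 304 for a crawler revalidation request.
    If-None-Match (ETag) takes precedence over If-Modified-Since."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        return 304 if if_none_match == resource_etag else 200
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        return 304 if if_modified_since == resource_last_modified else 200
    return 200  # no validators supplied: full response
```

Every 304 served to a crawler is a full fetch saved, which is exactly the crawl-load reduction the ETag / Last-Modified bullet above describes.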
Edge Routing and Redirects
- Header forwarding: Ensure origin SEO headers (X-Robots-Tag, Link rel="canonical", Vary) pass through edge/CDN layers unchanged — CDN stripping or overwriting headers is a common silent failure
- Canonical preservation across redirects: When edge rules add/remove trailing slashes, force HTTPS, or normalize domains, ensure the final canonical URL matches the redirect destination — mismatches confuse indexing
- Redirect chain management: Edge redirects stacked on origin redirects create chains — audit total hops regularly; aim for single-hop redirects from any old URL to the current canonical
- Geo-based routing risks: IP-based geo-redirects at the edge can block Googlebot (US IPs) from reaching non-US content — use hreflang instead; if geo-routing is required, exempt known bot IPs
- Query parameter normalization: Edge rules that strip, sort, or rewrite query parameters must preserve parameters needed for tracking and canonical consistency — test with crawl tools after deploying new rules
- Edge rewrite validation: URL rewrites at the edge (e.g., path mapping, vanity URLs) must produce correct status codes and consistent canonical signals — test both browser and bot user agents
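A chain audit can run offline against the combined redirect rules. The sketch below takes a flattened {source: target} map (edge plus origin rules merged) and flags any source needing more than one hop, plus detects loops:

```python
def redirect_chains(redirect_map, max_hops=10):
    """redirect_map: {source_url: target_url}, combining edge and origin
    rules. Returns chains longer than one hop as (source, final, hops)."""
    chains = []
    for source in redirect_map:
        current, hops, seen = source, 0, {source}
        while current in redirect_map and hops < max_hops:
            current = redirect_map[current]
            hops += 1
            if current in seen:  # redirect loop — stop following
                break
            seen.add(current)
        if hops > 1:
            chains.append((source, current, hops))
    return chains
```

Each flagged source should be rewritten to point directly at its final target, restoring the single-hop goal stated above.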
Response Debugging for SEO
- Origin vs. edge comparison: Fetch the same URL directly from origin and through CDN — compare response headers (especially X-Robots-Tag, canonical, status code) and HTML body to detect CDN-introduced discrepancies
- 5xx diagnosis: Intermittent 5xx errors in GSC Crawl Stats indicate server or edge instability — correlate with CDN logs, origin health checks, and deployment timestamps; Googlebot may reduce crawl rate after repeated 5xx
- Header and HTML parity: Compare responses across browser, Googlebot user agent, and CDN edge — look for differences in rendered content, meta tags, status codes, and headers that could cause differential indexing
- Render blocking: CDN or WAF rules that block JavaScript, CSS, or font resources from Googlebot cause incomplete rendering — check GSC URL Inspection "Page resources" for blocked resource errors
- Incident response for crawl drops: If GSC shows sudden crawl or indexing drops, check in order: CDN/edge config changes, robots.txt accessibility, DNS resolution, origin server health, recent deploys that may have introduced noindex or redirect loops
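The origin-vs-edge header comparison is a simple diff once both responses are in hand. A sketch, assuming the two header sets have already been fetched (the header shortlist is illustrative):

```python
# SEO-relevant headers to compare (illustrative shortlist)
SEO_HEADERS = ("x-robots-tag", "link", "cache-control", "vary")

def header_discrepancies(origin_headers, edge_headers):
    """Compare SEO-relevant response headers between a direct origin
    fetch and the same URL via the CDN edge.
    Returns {header: (origin_value, edge_value)} for mismatches."""
    o = {k.lower(): v for k, v in origin_headers.items()}
    e = {k.lower(): v for k, v in edge_headers.items()}
    return {
        h: (o.get(h), e.get(h))
        for h in SEO_HEADERS
        if o.get(h) != e.get(h)
    }
```

A header present at origin but None at the edge is the "CDN stripping headers" silent failure described in Edge Routing and Redirects above.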
Quick Actions
URL Not Indexing?
- Run URL Inspection in GSC
- Check HTTP status (must be 200)
- Review robots.txt rules
- Check for noindex tag
- Verify canonical points to itself
- Test rendered HTML
- Request indexing
Site Migration Checklist
- Map all old URLs to new URLs
- Implement 301 redirects
- Update internal links
- Update XML sitemap
- Submit new sitemap to GSC/Bing
- Monitor crawl stats daily
- Track indexation status
Core Web Vitals Issues?
- Run PageSpeed Insights
- Check LCP (image optimization)
- Check INP (JavaScript execution)
- Check CLS (layout shifts)
- Review field data in CrUX
- Test on real devices
- Monitor GSC Core Web Vitals report