Machine-Readable Infrastructure

Parse Log Files for AI Bot Behavior

Robots.txt and Google Search Console tell you what should happen. Server logs tell you what did happen, and for AI bots the gap between the two is often large and consequential. This procedure isolates AI bot behavior from general traffic, interprets response-code patterns, and sets up ongoing monitoring to catch future access failures.

Preconditions: raw server access logs (Apache/Nginx) or CDN access logs (Cloudflare, Fastly) covering at least 7 days and ideally 14 or more; command-line access or a log analysis tool (GoAccess, Screaming Frog Log File Analyser); current robots.txt for cross-reference.

Assemble a complete UA pattern reference

AI crawler UA strings change version numbers; pattern matching on the token beats exact string matching.

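  • Current confirmed tokens (version numbers vary): GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, claude-web, PerplexityBot, Perplexity-User, Bingbot, Applebot, Applebot-Extended, meta-externalagent. Applebot-Extended is a robots.txt control token rather than a separate crawler, so expect it in directives, not as a user agent in log lines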
  • Match on token presence, not exact string — grep -i "GPTBot" matches GPTBot/1.1 and future version strings without modification
  • Distinguish training crawlers from real-time search bots in your pattern list — mixing them conflates unrelated access events
  • Build a versioned reference file (ai-bots.txt), one pattern per line, so future UA additions and deprecations are tracked
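
The reference file described above can be loaded and applied with a few lines of Python. This is a minimal sketch, not a definitive implementation: the two-column layout (token, then class) and the class labels training / search / user are assumptions made for illustration, so check each vendor's published documentation before relying on them.

```python
import re

# Illustrative contents of ai-bots.txt: one "token class" pair per line.
# The class labels are assumptions for this sketch, not vendor-confirmed values.
AI_BOTS_TXT = """\
# token            class
GPTBot             training
OAI-SearchBot      search
ChatGPT-User       user
ClaudeBot          training
PerplexityBot      search
Perplexity-User    user
"""


def load_patterns(text):
    """Return (token, bot_class, compiled_regex) tuples, skipping comments and blanks."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        token, bot_class = line.split()
        # Case-insensitive token match, so version suffixes such as /1.1 do not matter.
        patterns.append((token, bot_class, re.compile(re.escape(token), re.IGNORECASE)))
    # Longest token first, so a more specific token wins over a shorter one it contains.
    patterns.sort(key=lambda p: len(p[0]), reverse=True)
    return patterns


def classify(user_agent, patterns):
    """Return (token, bot_class) for the first matching pattern, or None."""
    for token, bot_class, regex in patterns:
        if regex.search(user_agent):
            return token, bot_class
    return None


patterns = load_patterns(AI_BOTS_TXT)
print(classify("Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)", patterns))
```

Keeping the class next to the token means training crawlers and real-time search bots can be reported separately without a second lookup table.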

Extract a date-scoped sample and filter for AI bots

Unfiltered logs are unworkable at scale; scope and filter before analyzing.

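  • Extract a 14-day minimum window, then filter it on your AI UA patterns. Match case-insensitively, because some bots (Bingbot, for one) log their token in lowercase: grep -iE "(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|claude-web|PerplexityBot|Perplexity-User|Bingbot|Applebot|meta-externalagent)" access.log > ai-bots.log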
  • Verify the filtered file is non-empty before continuing — an empty file could mean no crawlers, or could mean the UA string format in your logs differs from expectation
  • If using CDN logs, confirm they include bot traffic — some CDN configurations cache bot responses without logging the original request; verify against origin logs
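
When grep is not enough, for example when the date window has to be enforced per line, the same scoping and filtering can be sketched in Python. The example assumes the Apache/Nginx "combined" log format; CDN exports usually need a different parser. The sample lines and IP addresses are made up for illustration, and filter_ai_bots is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumes the "combined" log format; adjust the expression for other layouts.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)
AI_UA_RE = re.compile(
    r"GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|claude-web|"
    r"PerplexityBot|Perplexity-User|bingbot|Applebot|meta-externalagent",
    re.IGNORECASE,
)

# Made-up sample lines for illustration only.
SAMPLE = [
    '203.0.113.5 - - [10/Mar/2025:13:55:36 +0000] "GET /guide HTTP/1.1" 200 2326 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"',
    '203.0.113.9 - - [10/Mar/2025:14:01:02 +0000] "GET /guide HTTP/1.1" 200 2326 "-" '
    '"Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
    '203.0.113.7 - - [01/Jan/2025:09:00:00 +0000] "GET /old HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    "not a log line",
]


def filter_ai_bots(lines, end, days=14):
    """Keep AI-bot requests from the `days` before `end`; also count unparseable lines."""
    start = end - timedelta(days=days)
    kept, unparsed = [], 0
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            unparsed += 1
            continue
        ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        if start <= ts <= end and AI_UA_RE.search(m["ua"]):
            kept.append({"time": ts, "path": m["path"],
                         "status": int(m["status"]), "ua": m["ua"]})
    return kept, unparsed


kept, unparsed = filter_ai_bots(SAMPLE, end=datetime(2025, 3, 14, tzinfo=timezone.utc))
print(len(kept), "AI bot requests kept;", unparsed, "lines did not parse")
```

The unparsed count helps with the empty-file check: an empty result alongside a high unparsed count points to a log format mismatch rather than an absence of crawlers.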

Aggregate crawl activity by bot

Raw log lines aren’t interpretable at volume — aggregate into counts before drawing conclusions.

  • Count requests per bot per day and unique URLs crawled per bot, to separate discovery breadth from crawl frequency
  • Measure crawl velocity (requests per hour per bot); spikes usually mark a crawl burst following a sitemap resubmission, IndexNow ping, or link discovery
  • Note first-seen and last-seen dates per bot — a bot that stopped appearing is more diagnostically significant than one appearing consistently
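
The three aggregates above fit in a single pass over the filtered requests. The sketch below assumes each log line has already been reduced to a (bot token, timestamp, URL path) tuple; the records shown are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical pre-parsed records: (bot token, timestamp, URL path).
RECORDS = [
    ("GPTBot", datetime(2025, 3, 10, 13, 5), "/guide"),
    ("GPTBot", datetime(2025, 3, 10, 13, 40), "/guide"),
    ("GPTBot", datetime(2025, 3, 11, 2, 15), "/pricing"),
    ("OAI-SearchBot", datetime(2025, 3, 12, 8, 0), "/guide"),
]


def aggregate(records):
    """Per bot: requests per day, requests per hour, unique URLs, first and last seen."""
    stats = defaultdict(lambda: {
        "per_day": defaultdict(int),
        "per_hour": defaultdict(int),
        "urls": set(),
        "first_seen": None,
        "last_seen": None,
    })
    for bot, ts, path in records:
        s = stats[bot]
        s["per_day"][ts.date()] += 1
        # Truncate to the hour to get a crawl-velocity bucket.
        s["per_hour"][ts.replace(minute=0, second=0, microsecond=0)] += 1
        s["urls"].add(path)
        s["first_seen"] = ts if s["first_seen"] is None else min(s["first_seen"], ts)
        s["last_seen"] = ts if s["last_seen"] is None else max(s["last_seen"], ts)
    return stats


for bot, s in sorted(aggregate(RECORDS).items()):
    print(bot, sum(s["per_day"].values()), "requests,", len(s["urls"]), "unique URLs,",
          "peak", max(s["per_hour"].values()), "per hour,",
          "last seen", s["last_seen"].date())
```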

Identify what each bot is crawling

Crawl pattern reveals which content each platform prioritizes and which it ignores.

  • Extract URL paths per bot with frequency counts
  • Separate content pages from assets (CSS, JS, images); asset requests suggest the bot renders pages in a headless browser, and asset requests with no matching content requests are worth a closer look
  • Cross-reference crawled URLs against your priority URL list — gaps are candidates for sitemap or internal-link fixes, not robots.txt changes
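
A rough way to produce these per-bot views is sketched below. It assumes requests are available as (bot, URL) pairs and that assets can be recognized by file extension; the extension list and the priority URL list are placeholders to replace with your own.

```python
import posixpath
from collections import Counter
from urllib.parse import urlsplit

# Assumption: assets are identifiable by extension. Extend the set for your site.
ASSET_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif",
                    ".svg", ".webp", ".woff2", ".ico"}

# Hypothetical (bot, URL) pairs and priority list.
RECORDS = [
    ("GPTBot", "/guide"),
    ("GPTBot", "/guide"),
    ("GPTBot", "/static/app.js?v=3"),
    ("PerplexityBot", "/pricing"),
]
PRIORITY_URLS = ["/guide", "/pricing", "/docs/start"]


def is_asset(url):
    """True when the URL path (query string ignored) ends in a known asset extension."""
    path = urlsplit(url).path
    return posixpath.splitext(path)[1].lower() in ASSET_EXTENSIONS


def crawl_profile(records, priority_urls):
    """Per bot: content path counts, asset path counts, and priority URLs never crawled."""
    profile = {}
    for bot, url in records:
        entry = profile.setdefault(bot, {"content": Counter(), "assets": Counter()})
        kind = "assets" if is_asset(url) else "content"
        entry[kind][urlsplit(url).path] += 1
    for entry in profile.values():
        entry["gaps"] = sorted(set(priority_urls) - set(entry["content"]))
    return profile


for bot, entry in sorted(crawl_profile(RECORDS, PRIORITY_URLS).items()):
    print(bot, entry["content"].most_common(3), "gaps:", entry["gaps"])
```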

Audit response codes per bot

Response-code distribution reveals whether access controls are working as intended.

  • 301s/302s on content pages: the bot is hitting non-canonical URLs — check the redirect chain resolves in one hop
  • 403s from AI bots: a WAF or IP-range block is in effect — verify it’s intentional and applied to the correct bot class (training vs. real-time search)
  • 429s: confirm the throttling rate is appropriate — aggressive throttling of OAI-SearchBot or Bingbot has direct search-traffic consequences
  • 5xx from AI bots: origin failure under load, invisible in standard UX monitoring

Validate: for each bot receiving 200s, check whether robots.txt disallows those paths for that bot. A 200 served on a disallowed path means the bot is ignoring the directive.
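
The response-code audit and the validation step can be combined in a short script. This sketch uses Python's standard urllib.robotparser as a stand-in for a crawler's own robots.txt logic; its rule matching can differ in edge cases from what a given bot implements, so treat hits as leads to verify. The robots.txt content and records are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and pre-parsed (bot, path, status) records.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""
RECORDS = [
    ("GPTBot", "/guide", 200),
    ("GPTBot", "/private/report", 200),
    ("GPTBot", "/old-guide", 301),
    ("OAI-SearchBot", "/guide", 403),
]


def status_distribution(records):
    """Count responses per status code for each bot."""
    dist = defaultdict(Counter)
    for bot, _path, status in records:
        dist[bot][status] += 1
    return dist


def disallowed_but_served(records, robots_txt):
    """Return (bot, path) pairs that received a 200 on a path robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [(bot, path) for bot, path, status in records
            if status == 200 and not parser.can_fetch(bot, path)]


for bot, codes in sorted(status_distribution(RECORDS).items()):
    print(bot, dict(codes))
print("served despite Disallow:", disallowed_but_served(RECORDS, ROBOTS_TXT))
```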

Cross-reference logs against robots.txt directives

Logs and robots.txt should tell the same story. When they diverge, one of them is the problem.

  • For bots with zero 200s on Allowed paths, check WAF and CDN rules — a block at the edge makes robots.txt irrelevant
  • Verify /robots.txt itself returns 200 for each bot; a 5xx on robots.txt leads well-behaved bots to treat the entire domain as disallowed, and some crawlers handle a 429 the same way
  • Check crawl timing: requests clustered within seconds of each other may mean a Crawl-delay directive isn’t being honored (Crawl-delay is non-standard, and not every crawler supports it)
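
Two of these checks lend themselves to automation: failed robots.txt fetches and tightly clustered requests. The sketch below assumes pre-parsed (bot, timestamp, path, status) records; the 10-second gap is a placeholder for whatever Crawl-delay value your robots.txt declares, and the function names are hypothetical.

```python
from datetime import datetime

# Hypothetical pre-parsed records: (bot, timestamp, path, status).
RECORDS = [
    ("GPTBot", datetime(2025, 3, 10, 13, 0, 0), "/robots.txt", 200),
    ("GPTBot", datetime(2025, 3, 10, 13, 0, 1), "/guide", 200),
    ("GPTBot", datetime(2025, 3, 10, 13, 0, 2), "/pricing", 200),
    ("GPTBot", datetime(2025, 3, 10, 13, 0, 30), "/docs", 200),
    ("ClaudeBot", datetime(2025, 3, 10, 14, 0, 0), "/robots.txt", 503),
]


def robots_fetch_failures(records):
    """Map each bot to the set of non-200 statuses it received for /robots.txt."""
    failures = {}
    for bot, _ts, path, status in records:
        if path == "/robots.txt" and status != 200:
            failures.setdefault(bot, set()).add(status)
    return failures


def fast_request_share(records, bot, min_gap_seconds):
    """Share of gaps between consecutive requests from `bot` shorter than the minimum."""
    times = sorted(ts for name, ts, _path, _status in records if name == bot)
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
    if not gaps:
        return 0.0
    return sum(gap < min_gap_seconds for gap in gaps) / len(gaps)


print("robots.txt failures:", robots_fetch_failures(RECORDS))
print("GPTBot gaps under 10s:", round(fast_request_share(RECORDS, "GPTBot", 10), 2))
```

A high share of short gaps is only meaningful for bots that document support for Crawl-delay; for the rest it is an input to rate-limit tuning rather than evidence of misbehavior.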

Identify crawl inefficiencies

Crawl budget wasted on low-value URLs delays freshness and re-indexation of priority content.

  • Redirect chains (A→B→C) waste crawl budget — reduce to single-hop
  • High crawl volume on pagination, filter, or faceted-nav URLs at the expense of content pages signals an internal-linking problem
  • URLs returning 404 that are still in the crawl queue are stale seed entries — update the sitemap
  • Crawls of the same content at multiple URLs dilute crawl efficiency; check parameter-handling rules
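
To put numbers on these inefficiencies, each request can be bucketed and reported as a share of the total. This is a sketch under stated assumptions: the parameter names treated as pagination or facet markers are placeholders, and a redirect counted here is a single hop, so chains still need to be traced separately.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Assumption: these query parameters mark pagination, sort, or facet URLs on your site.
LOW_VALUE_PARAMS = {"page", "sort", "filter", "color", "size"}

# Hypothetical (URL, status) pairs for one bot.
RECORDS = [
    ("/guide", 200),
    ("/shop?page=2&sort=asc", 200),
    ("/shop?color=red", 200),
    ("/old-guide", 301),
    ("/removed", 404),
]


def classify_request(url, status):
    """Bucket a request as not_found, redirect, low_value_param, or content."""
    if status == 404:
        return "not_found"
    if status in (301, 302, 307, 308):
        return "redirect"
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if LOW_VALUE_PARAMS & set(query):
        return "low_value_param"
    return "content"


def waste_report(records):
    """Return each bucket's share of total requests."""
    counts = Counter(classify_request(url, status) for url, status in records)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


for kind, share in sorted(waste_report(RECORDS).items()):
    print(f"{kind}: {share:.0%}")
```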

Set up ongoing monitoring

A one-time log pass is a snapshot. Monitoring is the operational goal.

  • Configure a daily summary per bot: request count, unique URL count, 4xx count, 5xx count
  • Alert on: zero requests from a real-time search bot for 72+ hours, a 403 rate above 5% for any bot, or a 5xx rate above 2% during known crawl windows
  • Archive raw AI bot logs separately, 90 days minimum — investigations frequently need historical comparison
  • Review the UA pattern reference file monthly
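
The daily summary and alert thresholds above can be prototyped as follows before wiring them into a scheduler or alerting tool. It is a sketch only: the set of bots treated as real-time search bots is an assumption, the records are hypothetical, and the 5xx check runs over the whole day rather than over known crawl windows.

```python
from datetime import datetime, timedelta

# Assumption: which tokens count as real-time search bots for the 72-hour check.
SEARCH_BOTS = {"OAI-SearchBot", "PerplexityBot"}

# Hypothetical pre-parsed records: (bot, timestamp, path, status).
DAY = datetime(2025, 3, 10)
RECORDS = (
    [("GPTBot", DAY.replace(hour=1), f"/p/{i}", 200) for i in range(18)]
    + [("GPTBot", DAY.replace(hour=2), "/admin", 403)] * 2
    + [("OAI-SearchBot", DAY.replace(hour=3), "/guide", 200)]
)


def daily_summary(records, day):
    """Per bot for one date: request count, unique URLs, 4xx, 403, and 5xx counts."""
    summary = {}
    for bot, ts, path, status in records:
        if ts.date() != day:
            continue
        s = summary.setdefault(
            bot, {"requests": 0, "urls": set(), "4xx": 0, "403": 0, "5xx": 0})
        s["requests"] += 1
        s["urls"].add(path)
        if 400 <= status < 500:
            s["4xx"] += 1
        if status == 403:
            s["403"] += 1
        if status >= 500:
            s["5xx"] += 1
    return summary


def alerts(summary, last_seen, now):
    """Apply the three thresholds: 72h silence, 403 rate above 5%, 5xx rate above 2%."""
    out = []
    for bot in sorted(SEARCH_BOTS):
        seen = last_seen.get(bot)
        if seen is None or now - seen >= timedelta(hours=72):
            out.append(f"{bot}: no requests for 72+ hours")
    for bot, s in sorted(summary.items()):
        if s["403"] / s["requests"] > 0.05:
            out.append(f"{bot}: 403 rate above 5%")
        if s["5xx"] / s["requests"] > 0.02:
            out.append(f"{bot}: 5xx rate above 2%")
    return out


summary = daily_summary(RECORDS, DAY.date())
last_seen = {"OAI-SearchBot": DAY.replace(hour=3), "PerplexityBot": DAY - timedelta(days=5)}
for message in alerts(summary, last_seen, now=DAY + timedelta(days=1)):
    print(message)
```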

Decision points:

  • OAI-SearchBot absent from logs for 7+ days → check robots.txt for new Disallow rules and CDN firewall changelogs first; absence of a real-time search bot has immediate citation consequences
  • GPTBot receiving 200s but model output about your content is wrong → 200s confirm access, not training-data selection; access does not guarantee inclusion in the training set

Watch for these failure modes

  • Analyzing CDN logs without confirming they include bot traffic
  • Matching on "bot" as a UA substring: this catches monitoring agents and SEO tools alongside AI crawlers; use the specific published token
  • Treating bot absence as proof of a block — a well-behaved bot on a multi-day crawl cycle needs at least 14 days before you conclude it’s blocked
  • Assuming CDN 200s equal origin 200s — cached CDN responses don’t appear in origin logs and can mask 5xx conditions
  • Making robots.txt changes based on one day of log data