Parse Log Files for AI Bot Behavior
Last reviewed:
Robots.txt and Google Search Console tell you what should happen. Server logs tell you what did happen, and for AI bots the gap between the two is frequently large and consequential. This procedure isolates AI bot behavior from general traffic, interprets response-code patterns, and sets up ongoing monitoring to catch future access failures.
Preconditions: raw server access logs (Apache/Nginx) or CDN access logs (Cloudflare, Fastly) covering at least 7–14 days; command-line access or a log analysis tool (GoAccess, Screaming Frog Log Analyzer); current robots.txt for cross-reference.
Assemble a complete UA pattern reference
AI crawler UA strings change version numbers; pattern matching on the token beats exact string matching.
- Current confirmed tokens (version numbers vary): GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, claude-web, PerplexityBot, Perplexity-User, Bingbot, Applebot, Applebot-Extended, meta-externalagent
- Match on token presence, not exact string: grep -i "GPTBot" matches GPTBot/1.1 and future version strings without modification
- Distinguish training crawlers from real-time search bots in your pattern list; mixing them conflates unrelated access events
- Build a versioned reference file (ai-bots.txt), one pattern per line, so future UA additions and deprecations are tracked
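A minimal sketch of how such a reference file could be loaded and applied. The two-column layout (token, then class) is an assumption added here for illustration, since the text only specifies one pattern per line, and the training/search labels are this sketch's reading of each vendor's stated purpose rather than an authoritative classification.

```python
import re

# Hypothetical contents of ai-bots.txt: "<token> <class>" per line, '#' starts a comment.
# The class column is an illustrative assumption, not part of any published format.
AI_BOTS_TXT = """\
# token          class
GPTBot           training
OAI-SearchBot    search
ClaudeBot        training
PerplexityBot    search
"""

def load_patterns(text):
    """Return a list of (token, bot_class, compiled_regex) from the reference text."""
    patterns = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        token, bot_class = line.split()
        # Match the token anywhere in the UA, ignoring case and version suffixes.
        patterns.append((token, bot_class, re.compile(re.escape(token), re.IGNORECASE)))
    return patterns

def classify(user_agent, patterns):
    """Return (token, bot_class) for the first token found in the UA, or None."""
    for token, bot_class, regex in patterns:
        if regex.search(user_agent):
            return token, bot_class
    return None

PATTERNS = load_patterns(AI_BOTS_TXT)
```

If two tokens share a prefix (for example Applebot and Applebot-Extended), list the longer token first so the first match is the more specific one.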
Extract a date-scoped sample and filter for AI bots
Full unfiltered logs at scale are not workable — scope and filter before analysing.
- Extract a 14-day minimum window, then pipe through your AI UA patterns (case-insensitive, because some bots such as Bingbot send a lowercase token):
  grep -iE "(GPTBot|OAI-SearchBot|PerplexityBot|ClaudeBot|claude-web|Bingbot|Applebot|meta-externalagent)" access.log > ai-bots.log
- Verify the filtered file is non-empty before continuing; an empty file could mean no crawlers visited, or that the UA string format in your logs differs from what the pattern expects
- If using CDN logs, confirm they include bot traffic — some CDN configurations cache bot responses without logging the original request; verify against origin logs
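The same filter can be sketched in Python when you also want the fields parsed out for later steps. This assumes the Apache/Nginx "combined" log format; the regex and the function name filter_ai_lines are illustrative and will need adjusting for CDN or custom formats.

```python
import re

# Assumption: Apache/Nginx "combined" format. Adjust if your log format differs.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)

# Same token list as the grep command, matched case-insensitively.
AI_UA = re.compile(
    r"GPTBot|OAI-SearchBot|PerplexityBot|ClaudeBot|claude-web|Bingbot|Applebot|meta-externalagent",
    re.IGNORECASE,
)

def filter_ai_lines(lines):
    """Return parsed records (dicts of strings) for lines whose UA contains an AI bot token."""
    records = []
    for line in lines:
        match = COMBINED.match(line)
        if match and AI_UA.search(match["ua"]):
            records.append(match.groupdict())
    return records
```

An empty result from a non-empty log is worth investigating before moving on: it may mean the regex does not fit the log format rather than that no bots visited.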
Aggregate crawl activity by bot
Raw log lines aren’t interpretable at volume — aggregate into counts before drawing conclusions.
- Count requests per bot per day and unique URLs crawled per bot, to separate discovery breadth from crawl frequency
- Identify crawl velocity (requests per hour per bot) — spikes indicate a crawl burst triggered by a sitemap resubmission, IndexNow ping, or link discovery
- Note first-seen and last-seen dates per bot — a bot that stopped appearing is more diagnostically significant than one appearing consistently
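The aggregation above could be sketched as follows, assuming each request has already been reduced to a (bot, timestamp, path) tuple with the timestamp parsed into a datetime. The function name summarize and the output keys are illustrative choices.

```python
from collections import defaultdict
from datetime import datetime

def summarize(records):
    """Aggregate (bot, timestamp, path) records into per-bot crawl metrics."""
    per_day = defaultdict(int)    # (bot, date) -> requests that day
    per_hour = defaultdict(int)   # (bot, hour bucket) -> requests in that hour
    urls = defaultdict(set)       # bot -> distinct paths requested
    first_seen, last_seen = {}, {}
    for bot, ts, path in records:
        per_day[(bot, ts.date())] += 1
        per_hour[(bot, ts.replace(minute=0, second=0, microsecond=0))] += 1
        urls[bot].add(path)
        first_seen[bot] = min(first_seen.get(bot, ts), ts)
        last_seen[bot] = max(last_seen.get(bot, ts), ts)
    # Peak hourly volume per bot is a simple proxy for crawl-burst size.
    peak_hourly = defaultdict(int)
    for (bot, _hour), count in per_hour.items():
        peak_hourly[bot] = max(peak_hourly[bot], count)
    return {
        "requests_per_day": dict(per_day),
        "unique_urls": {bot: len(paths) for bot, paths in urls.items()},
        "peak_requests_per_hour": dict(peak_hourly),
        "first_seen": first_seen,
        "last_seen": last_seen,
    }
```

Comparing unique_urls with requests_per_day separates breadth from frequency: a bot with many requests but few unique URLs is re-fetching, not discovering.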
Identify what each bot is crawling
Crawl pattern reveals which content each platform prioritizes and which it ignores.
- Extract URL paths per bot with frequency counts
- Separate content pages from assets (CSS, JS, images) — asset crawls without accompanying content crawls indicate a headless-browser rendering pipeline
- Cross-reference crawled URLs against your priority URL list — gaps are candidates for sitemap or internal-link fixes, not robots.txt changes
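One way to sketch the content/asset split and the priority-URL gap check for a single bot. The extension list is an assumption to adapt to your site, and crawl_profile is a hypothetical helper name.

```python
from collections import Counter
from urllib.parse import urlsplit
import posixpath

# Assumption: these extensions identify static assets on your site.
ASSET_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif",
                    ".svg", ".webp", ".ico", ".woff", ".woff2"}

def is_asset(path):
    """True if the URL path (query string ignored) ends in a known asset extension."""
    ext = posixpath.splitext(urlsplit(path).path)[1].lower()
    return ext in ASSET_EXTENSIONS

def crawl_profile(paths, priority_urls):
    """Return (content_counts, asset_counts, priority_gaps) for one bot's requested paths."""
    counts = Counter(urlsplit(p).path for p in paths)
    content = {p: n for p, n in counts.items() if not is_asset(p)}
    assets = {p: n for p, n in counts.items() if is_asset(p)}
    # Priority URLs the bot never requested: candidates for sitemap or internal-link fixes.
    gaps = sorted(set(priority_urls) - set(content))
    return content, assets, gaps
```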
Audit response codes per bot
Response-code distribution reveals whether access controls are working as intended.
- 301s/302s on content pages: the bot is hitting non-canonical URLs — check the redirect chain resolves in one hop
- 403s from AI bots: a WAF or IP-range block is in effect — verify it’s intentional and applied to the correct bot class (training vs. real-time search)
- 429s: confirm the throttling rate is appropriate; aggressive throttling of OAI-SearchBot or Bingbot has direct search-traffic consequences
- 5xx from AI bots: origin failure under load, invisible in standard UX monitoring
Validate: for each bot receiving 200s, check whether robots.txt sets Disallow rules for those paths. A bot that receives a 200 on a disallowed path fetched it despite the directive.
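The response-code audit and the Disallow check can be sketched with Python's standard urllib.robotparser. This assumes records are (bot token, path, integer status) tuples and that the bot token is what appears in your robots.txt User-agent lines; the function names are illustrative. Note that the standard-library parser implements basic prefix matching only, so wildcard rules may be evaluated differently than a given crawler would.

```python
from collections import Counter, defaultdict
from urllib.robotparser import RobotFileParser

def status_distribution(records):
    """Return {bot: Counter({status: count})} from (bot, path, status) records."""
    dist = defaultdict(Counter)
    for bot, _path, status in records:
        dist[bot][status] += 1
    return dist

def disallowed_hits(records, robots_txt):
    """Return (bot, path) pairs that received a 200 on a path robots.txt disallows for that bot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (bot, path)
        for bot, path, status in records
        if status == 200 and not parser.can_fetch(bot, path)
    ]
```

A non-empty result from disallowed_hits is a prompt to check the timeline first: requests made before a Disallow rule was deployed are not violations.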
Cross-reference logs against robots.txt directives
Logs and robots.txt should tell the same story. When they diverge, one of them is the problem.
- For bots with zero 200s on Allowed paths, check WAF and CDN rules — a block at the edge makes robots.txt irrelevant
- Verify /robots.txt itself returns 200 for each bot; a 429 or 503 on robots.txt causes well-behaved bots to treat the entire domain as blocked
- Check crawl timing; clustering within seconds may mean a Crawl-delay directive isn't being honoured
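The crawl-timing check can be sketched by comparing gaps between consecutive requests from one bot against the configured Crawl-delay. The helper name and return shape are illustrative; timestamps are assumed to be datetimes for a single bot.

```python
from datetime import datetime, timedelta

def crawl_delay_violations(timestamps, crawl_delay_seconds):
    """Return (violations, total_gaps) for one bot's request timestamps.

    A violation is a gap between consecutive requests shorter than the Crawl-delay.
    """
    ordered = sorted(timestamps)
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    violations = sum(1 for gap in gaps if gap < crawl_delay_seconds)
    return violations, len(gaps)
```

A high violation share is a signal rather than proof: requests arriving through several CDN edges or from parallel crawler instances can cluster even when each instance paces itself.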
Identify crawl inefficiencies
Crawl budget wasted on low-value URLs delays freshness and re-indexation of priority content.
- Redirect chains (A→B→C) waste crawl budget — reduce to single-hop
- High crawl volume on pagination, filter, or faceted-nav URLs at the expense of content pages signals an internal-linking problem
- URLs returning 404 that are still in the crawl queue are stale seed entries — update the sitemap
- Duplicate content crawls at multiple URLs dilute crawl efficiency — check parameter-handling rules
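For the duplicate-crawl check, one possible approach is to normalise each crawled URL by dropping parameters that do not change the page, then group URLs that collapse to the same key. The IGNORED_PARAMS set below is a hypothetical example; which parameters are content-neutral depends entirely on your site.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical: parameters assumed not to change page content on this site.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sort", "sessionid"}

def canonical_key(url):
    """Return the path plus the remaining query parameters in sorted order."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    query = urlencode(kept)
    return parts.path + ("?" + query if query else "")

def duplicate_groups(urls):
    """Return {canonical_key: [crawled variants]} for keys reached via more than one URL."""
    groups = defaultdict(set)
    for url in urls:
        groups[canonical_key(url)].add(url)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}
```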
Set up ongoing monitoring
A one-time log pass is a snapshot. Monitoring is the operational goal.
- Configure a daily summary per bot: request count, unique URL count, 4xx count, 5xx count
- Alert on: zero requests from a real-time search bot for 72+ hours, a 403 rate above 5% for any bot, or a 5xx rate above 2% during known crawl windows
- Archive raw AI bot logs separately, 90 days minimum — investigations frequently need historical comparison
- Review the UA pattern reference file monthly
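The alert thresholds listed above could be evaluated per bot with a small check like this. The function signature and alert labels are illustrative, and the sketch does not model "known crawl windows"; it assumes the counts passed in already cover the window you care about.

```python
from datetime import datetime, timedelta

# Thresholds taken from the alert list above.
SILENCE_LIMIT = timedelta(hours=72)
MAX_403_RATE = 0.05
MAX_5XX_RATE = 0.02

def check_alerts(is_search_bot, last_seen, now, total, count_403, count_5xx):
    """Return a list of alert labels for one bot's summary."""
    alerts = []
    # Silence only matters for real-time search bots; training crawlers run on longer cycles.
    if is_search_bot and now - last_seen >= SILENCE_LIMIT:
        alerts.append("silent_72h")
    if total > 0:
        if count_403 / total > MAX_403_RATE:
            alerts.append("high_403_rate")
        if count_5xx / total > MAX_5XX_RATE:
            alerts.append("high_5xx_rate")
    return alerts
```

Rates are computed only when there is traffic, so a silent bot raises the silence alert without also producing a division error or a misleading 0% error rate.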
Decision points:
- OAI-SearchBot absent from logs for 7+ days → check robots.txt for new Disallow rules and CDN firewall changelogs first; absence of a real-time search bot has immediate citation consequences.
- GPTBot showing 200s but training output is wrong → 200s confirm access, not training-data selection; access does not guarantee inclusion in the training set.
Watch for these failure modes
- Analysing CDN logs without confirming they include bot traffic
- Matching on bot as a UA substring: this matches monitoring agents and SEO tools alongside AI crawlers; use the specific published token
- Treating bot absence as proof of a block: a well-behaved bot on a multi-day crawl cycle needs at least 14 days before you conclude it's blocked
- Assuming CDN 200s equal origin 200s — cached CDN responses don’t appear in origin logs and can mask 5xx conditions
- Making robots.txt changes based on one day of log data