Machine-Readable Infrastructure

Log Parsing Regex Snippets

These commands assume Nginx combined log format. Adapt the log path and field positions for Apache, Caddy, or CDN log exports. For ongoing monitoring rather than one-off queries, use the parse-logs-for-ai-bot-behavior playbook.

grep extraction patterns

# Extract all AI crawler lines from a combined access log
# Replace /var/log/nginx/access.log with your actual log path

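# All known AI crawlers in one pass
# -i matters: some bots send a lowercase token (Bing's User-Agent contains "bingbot")
# Google-Extended and Applebot-Extended are robots.txt control tokens, not User-Agent
# strings, so they are not expected to match any log line
grep -iE "GPTBot|OAI-SearchBot|PerplexityBot|Bingbot|Google-Extended|ClaudeBot|Applebot-Extended|CCBot|Amazonbot|YouBot|Bytespider|DuckAssistBot|Meta-ExternalAgent|Meta-ExternalFetcher" /var/log/nginx/access.log > ai-bots.log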

# Training crawlers only
grep -E "GPTBot|Google-Extended|ClaudeBot|Applebot-Extended|CCBot|Bytespider" /var/log/nginx/access.log > training-bots.log

# Search/retrieval crawlers only (-i because Bing's User-Agent token is lowercase "bingbot")
grep -iE "OAI-SearchBot|PerplexityBot|Bingbot|YouBot|DuckAssistBot|Amazonbot" /var/log/nginx/access.log > search-bots.log

# Count hits per crawler (sort descending)
# -i catches lowercase tokens such as "bingbot"; names print as they appear in the log
grep -oiE "GPTBot|OAI-SearchBot|PerplexityBot|Bingbot|Google-Extended|ClaudeBot|CCBot|Amazonbot|YouBot|Bytespider|DuckAssistBot" /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn

# Count hits per crawler by response code
# match() records which bot matched, so each output row is "count bot status"
awk 'match($0, /GPTBot|OAI-SearchBot|PerplexityBot/) {
  print substr($0, RSTART, RLENGTH), $9
}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn
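
The patterns above read a single log file. Where logs rotate and older files are gzip-compressed, the same pattern can be run across all of them in one pass. The following is a minimal sketch, assuming GNU gzip (whose -f flag, combined with -c, passes uncompressed files through unchanged); the function name ai_bot_lines and the example paths are hypothetical, not part of any tool.

```shell
# Sketch: run one bot pattern across current and rotated logs.
# ai_bot_lines is a hypothetical helper name; adjust paths to your rotation scheme.
ai_bot_lines() {
  # $1: extended regex of bot tokens; remaining args: log files (plain or .gz)
  pattern=$1
  shift
  # gzip -cdf decompresses .gz inputs and copies plain files through as-is
  gzip -cdf -- "$@" | grep -iE -- "$pattern"
}

# Example call (illustrative paths):
# ai_bot_lines "GPTBot|ClaudeBot" /var/log/nginx/access.log /var/log/nginx/access.log.*.gz
```

Piping the result into the sort | uniq -c steps shown earlier gives counts over the whole retained window instead of only the current file.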

awk analysis patterns

# Unique URLs crawled by GPTBot
# Nginx default log format: $remote_addr - $remote_user [$time_local] "$request" $status ...
awk '/GPTBot/ {
  split($7, a, "?");
  print a[1]
}' /var/log/nginx/access.log | sort -u

# Hourly request rate for OAI-SearchBot
awk '/OAI-SearchBot/ {
  split($4, t, ":");
  printf "%s %s\n", substr($4,2,11), t[2]
}' /var/log/nginx/access.log | sort | uniq -c

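# 4xx and 5xx responses served to AI crawlers
# tolower() makes the match case-insensitive (Bing's User-Agent token is lowercase "bingbot")
awk '(tolower($0) ~ /gptbot|oai-searchbot|perplexitybot|bingbot|claudebot/) && ($9 ~ /^[45]/) {
  print $9, $7
}' /var/log/nginx/access.log | sort | uniq -c | sort -rn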

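# Crawl rate by hour for all AI bots combined
# Same bot list as the one-pass extraction; -i catches lowercase tokens such as "bingbot"
grep -iE "GPTBot|OAI-SearchBot|PerplexityBot|Bingbot|Google-Extended|ClaudeBot|Applebot-Extended|CCBot|Amazonbot|YouBot|Bytespider|DuckAssistBot|Meta-ExternalAgent|Meta-ExternalFetcher" /var/log/nginx/access.log \
  | awk '{ split($4, t, ":"); print substr($4,2,11)":"t[2] }' \
  | sort | uniq -c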
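
The snippets above match bot names anywhere in the line, so a bot name that appears in a URL or referrer would also count. One way to tighten this is to split each line on double quotes, which in the combined format puts the User-Agent in field 6. This is a sketch under that assumption; the function name ua_bot_counts and the shortened bot list are illustrative only.

```shell
# Sketch: count hits per bot by matching only the User-Agent field.
# Assumes combined format: ... "$request" $status $bytes "$referer" "$user_agent"
ua_bot_counts() {
  # $1: log file; the regex is lowercase and compared against a lowercased UA
  awk -F'"' '
    {
      ua = tolower($6)
      if (match(ua, /gptbot|oai-searchbot|perplexitybot|bingbot|claudebot|ccbot/))
        count[substr(ua, RSTART, RLENGTH)]++
    }
    END { for (bot in count) print count[bot], bot }
  ' "$1" | sort -rn
}

# Example call (illustrative path):
# ua_bot_counts /var/log/nginx/access.log
```

If your log format places the User-Agent elsewhere, change the field number rather than the pattern.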

Field notes

  • UA matching alone confirms the string, not the bot — for any access decision based on log data, verify IPs against the published IP range files or ASN lookup.
  • Bingbot covers both Microsoft Bing and Bing Copilot real-time retrieval — the same crawl pipeline, so filtering for Bingbot captures both.
  • CCBot (Common Crawl) is a frequent training-data source for multiple models, including earlier GPT versions — blocking or allowing it affects multiple downstream pipelines, not one.
  • Log rotation means gaps — if logs rotate daily, automate extraction to a persistent store; a 44-day look-back window catches most crawl cycle patterns for the major AI crawlers.
  • A high 404 rate from a training bot indicates stale crawl targets; a high 403 rate indicates a block that may not be reflected in robots.txt.
  • CDN edge logs miss origin-blocked requests — if you have edge-level bot blocks (WAF rules or Workers), check CDN and origin logs separately.
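
The 404 and 403 signals described in the notes above are easier to read as per-bot totals than as raw lines. A minimal sketch follows, assuming the combined format (status in field 9); the function name bot_error_share and the four-bot list are hypothetical choices for illustration.

```shell
# Sketch: per-bot totals with 403 and 404 counts, for spotting blocks or stale targets.
# Assumes the status code is whitespace-separated field 9 (combined log format).
bot_error_share() {
  # $1: log file. Prints one line per bot: name total=N 403=N 404=N
  awk '
    match(tolower($0), /gptbot|claudebot|ccbot|bytespider/) {
      bot = substr(tolower($0), RSTART, RLENGTH)
      total[bot]++
      if ($9 == 403) c403[bot]++
      if ($9 == 404) c404[bot]++
    }
    END {
      for (bot in total)
        printf "%s total=%d 403=%d 404=%d\n", bot, total[bot], c403[bot], c404[bot]
    }
  ' "$1" | sort
}

# Example call (illustrative path):
# bot_error_share /var/log/nginx/access.log
```

What counts as a high rate is left to the reader; the sketch only reports the counts.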