Log Parsing Regex Snippets
Last reviewed:
These commands assume Nginx combined log format. Adapt the log path and field positions for Apache, Caddy, or CDN log exports. For ongoing monitoring rather than one-off queries, use the parse-logs-for-ai-bot-behavior playbook.
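Before adapting field positions to another log format, it can help to see how awk splits one line of your own log. The following is a minimal sketch, not part of any tool; the helper name show_fields is hypothetical.

```shell
# Print each whitespace-separated field of the first log line with its awk index,
# so you can confirm which $N holds the request path and the status code.
# show_fields is a hypothetical helper name used only for illustration.
show_fields() {
  head -n 1 "$1" | awk '{ for (i = 1; i <= NF; i++) printf "$%d\t%s\n", i, $i }'
}

# Usage: show_fields /var/log/nginx/access.log
```

In the Nginx combined format the path should show up as $7 and the status as $9. If your format puts them elsewhere, change those indexes in the awk commands on this page.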
grep extraction patterns
# Extract all AI crawler lines from a combined access log
# Replace /var/log/nginx/access.log with your actual log path
# All known AI crawlers in one pass
# -i makes the match case-insensitive; some crawlers (bingbot, for example) send a lowercase token
grep -iE "GPTBot|OAI-SearchBot|PerplexityBot|Bingbot|Google-Extended|ClaudeBot|Applebot-Extended|CCBot|Amazonbot|YouBot|Bytespider|DuckAssistBot|Meta-ExternalAgent|Meta-ExternalFetcher" /var/log/nginx/access.log > ai-bots.log
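# Training crawlers only
# Google-Extended and Applebot-Extended are robots.txt control tokens rather than request user agents,
# so they are not expected to appear in access logs; leaving them in the pattern is harmless
grep -E "GPTBot|Google-Extended|ClaudeBot|Applebot-Extended|CCBot|Bytespider" /var/log/nginx/access.log > training-bots.log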
# Search/retrieval crawlers only (-i because the bingbot token is lowercase in real requests)
grep -iE "OAI-SearchBot|PerplexityBot|Bingbot|YouBot|DuckAssistBot|Amazonbot" /var/log/nginx/access.log > search-bots.log
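# Count hits per crawler (sort descending)
# -i catches lowercase tokens such as bingbot; -o prints each token as it appears in the log
grep -oiE "GPTBot|OAI-SearchBot|PerplexityBot|Bingbot|Google-Extended|ClaudeBot|Applebot-Extended|CCBot|Amazonbot|YouBot|Bytespider|DuckAssistBot|Meta-ExternalAgent|Meta-ExternalFetcher" /var/log/nginx/access.log \
| sort | uniq -c | sort -rn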
# Count hits per crawler by response code
# match() locates the crawler token so it can be printed next to the status code ($9)
awk 'match($0, /GPTBot|OAI-SearchBot|PerplexityBot/) {print substr($0, RSTART, RLENGTH), $9}' /var/log/nginx/access.log \
| sort | uniq -c | sort -rn
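The grep commands above read a single live log file. A minimal sketch for running the same pattern across rotated copies as well, including gzip-compressed ones, is shown here. The helper name grep_all_logs is hypothetical, and the access.log* glob in the usage line is an assumption about how your rotation names files.

```shell
# Run a crawler pattern across the live log and its rotated copies.
# gzip -cdf decompresses .gz files and, because of -f together with -c,
# copies files that are not gzip-compressed to stdout unchanged.
# grep_all_logs is a hypothetical helper name used only for illustration.
grep_all_logs() {
  pattern=$1
  shift
  gzip -cdf "$@" | grep -iE "$pattern"
}

# Usage: grep_all_logs "GPTBot|ClaudeBot" /var/log/nginx/access.log* > ai-bots.log
```

Lines come out in glob order, not chronological order, so sort on the timestamp afterwards if ordering matters.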
awk analysis patterns
# Unique URLs crawled by GPTBot
# Nginx combined log format: $remote_addr - $remote_user [$time_local] "$request" $status ...
# so $7 is the request path; split() drops the query string
awk '/GPTBot/ {
split($7, a, "?");
print a[1]
}' /var/log/nginx/access.log | sort -u
# Hourly request rate for OAI-SearchBot
awk '/OAI-SearchBot/ {
split($4, t, ":");
printf "%s %s\n", substr($4,2,11), t[2]
}' /var/log/nginx/access.log | sort | uniq -c
# 4xx and 5xx responses served to AI crawlers
awk '($0 ~ /GPTBot|OAI-SearchBot|PerplexityBot|[Bb]ingbot|ClaudeBot/) && ($9 ~ /^[45]/) {
print $9, $7
}' /var/log/nginx/access.log | sort | uniq -c | sort -rn
# Crawl rate by hour for several AI bots combined (extend the pattern to cover more crawlers)
grep -iE "GPTBot|OAI-SearchBot|PerplexityBot|Bingbot" /var/log/nginx/access.log \
| awk '{ split($4, t, ":"); print substr($4,2,11)":"t[2] }' \
| sort | uniq -c
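Raw 4xx/5xx counts are easier to interpret as a share of each crawler's total requests. The sketch below assumes the combined log format, where $9 is the status code; the helper name error_share and the choice of crawlers in the pattern are illustrative.

```shell
# For each crawler token, print: token, total hits, 4xx/5xx hits, error share.
# Assumes combined log format, where $9 is the status code.
# error_share is a hypothetical helper name; extend the pattern as needed.
error_share() {
  awk 'match($0, /GPTBot|OAI-SearchBot|PerplexityBot|[Bb]ingbot|ClaudeBot/) {
    bot = substr($0, RSTART, RLENGTH)   # the token exactly as it appears in the line
    total[bot]++
    if ($9 ~ /^[45]/) errors[bot]++
  }
  END {
    for (bot in total)
      printf "%s\t%d\t%d\t%.1f%%\n", bot, total[bot], errors[bot], 100 * errors[bot] / total[bot]
  }' "$1"
}

# Usage: error_share /var/log/nginx/access.log | sort -t "$(printf '\t')" -k2,2nr
```

The match is taken from anywhere in the line, so a token that appears in a URL or referrer is counted too.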
Field notes
- UA matching alone confirms the string, not the bot. For any access decision based on log data, verify source IPs against the operator's published IP range files or with an ASN lookup.
- Bingbot covers both Microsoft Bing and Bing Copilot real-time retrieval. They share the same crawl pipeline, so filtering for Bingbot captures both.
- CCBot (Common Crawl) is a frequent training-data source for multiple models, including earlier GPT versions. Blocking or allowing it affects multiple downstream pipelines, not one.
- Log rotation means gaps. If logs rotate daily, automate extraction to a persistent store; a 44-day look-back window catches most crawl cycle patterns for the major AI crawlers.
- A high 404 rate from a training bot indicates stale crawl targets; a high 403 rate indicates a block that may not be reflected in robots.txt.
- CDN edge logs miss origin-blocked requests — if you have edge-level bot blocks (WAF rules or Workers), check CDN and origin logs separately.
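
The IP verification mentioned in the first note can be approximated offline. This is a minimal sketch, assuming you have saved the operator's published IPv4 ranges to a local file with one CIDR per line. The helper name unverified_ips and the file name ranges.txt are hypothetical, and IPv6 is not handled.

```shell
# List source IPs for one crawler token that fall outside a local CIDR list.
# Arguments: token, ranges file (one IPv4 CIDR per line, must not be empty), log file.
# unverified_ips and ranges.txt are hypothetical names. The token match is
# case-sensitive, so pass "bingbot" in lowercase.
unverified_ips() {
  awk -v bot="$1" '
    function ip2int(ip,   p) {
      split(ip, p, ".")
      return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4]
    }
    # First file: turn each CIDR into a [first, last] integer interval.
    NR == FNR {
      if (split($1, c, "/") != 2) next   # skip blank or malformed lines
      size = 2 ^ (32 - c[2])
      base = ip2int(c[1])
      n++
      first[n] = base - (base % size)
      last[n] = first[n] + size - 1
      next
    }
    # Second file: test the client IP ($1) of each line that mentions the token.
    index($0, bot) && $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
      v = ip2int($1); ok = 0
      for (i = 1; i <= n; i++) if (v >= first[i] && v <= last[i]) { ok = 1; break }
      if (!ok) bad[$1]++
    }
    END { for (ip in bad) print bad[ip], ip }
  ' "$2" "$3" | sort -rn
}

# Usage: unverified_ips GPTBot ranges.txt /var/log/nginx/access.log
```

Each output line is a request count followed by an IP that claimed the token but matched no listed range. Treat those as candidates for a closer look, since a stale ranges file produces false positives.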