Machine-Readable Infrastructure

Edge Worker Bot Management

Edge rules run on every request before it reaches the origin, including the request for robots.txt itself. A 403 returned here is final: robots.txt is only advisory, so an allow directive in the file cannot override an edge block. Use this layer deliberately, and only when robots.txt genuinely isn’t enough: bots that ignore robots.txt, rate limiting before the origin degrades, differential responses by bot UA, or routing bots to a cached version.

Cloudflare Worker

// Edge Worker Bot Management
// Deploy as a Cloudflare Worker. Adapt for Fastly VCL or Akamai EdgeWorkers.

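// CONFIGURE: UA substrings for training crawlers you want to block.
// Google-Extended and Applebot-Extended are robots.txt control tokens only:
// the fetching is done by Google's and Apple's existing crawlers, which never
// send those strings in a User-Agent header, so listing them here would match
// nothing. Control those two in robots.txt instead.
const TRAINING_BOTS = [
  "GPTBot",
  "ClaudeBot",
  "CCBot",
  "FacebookBot",
  "anthropic-ai"
];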

// CONFIGURE: UA substrings for AI search crawlers you want to allow
const SEARCH_BOTS = [
  "OAI-SearchBot",
  "PerplexityBot",
  "YouBot",
  "Amazonbot"
];

// CONFIGURE: "block" | "allow" | "redirect"
const TRAINING_ACTION = "block";
const SEARCH_ACTION   = "allow";
const REDIRECT_URL    = "https://example.com/bot-policy"; // used only when action = "redirect"

export default {
  async fetch(request) {
    // Match case-insensitively: UA casing is not guaranteed to be stable.
    const ua = (request.headers.get("User-Agent") || "").toLowerCase();
    const matches = list => list.some(bot => ua.includes(bot.toLowerCase()));

    // Training is checked first, so a UA on both lists gets TRAINING_ACTION.
    const action = matches(TRAINING_BOTS) ? TRAINING_ACTION
                 : matches(SEARCH_BOTS)   ? SEARCH_ACTION
                 : "allow";

    if (action === "block") return new Response("Forbidden", { status: 403 });

    if (action === "redirect") {
      // Never redirect the policy page itself: if REDIRECT_URL is served
      // through this same Worker, a matched bot would loop forever.
      const { origin, pathname } = new URL(request.url);
      if (origin + pathname !== REDIRECT_URL) return Response.redirect(REDIRECT_URL, 301);
    }

    // "allow", unmatched UAs, and unknown action values pass through to origin
    return fetch(request);
  }
};
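
The introduction lists "routing bots to a cached version" as one reason to act at the edge, but the Worker above only blocks, redirects, or passes through. A minimal sketch of that option follows, assuming Cloudflare's `cf.cacheEverything` and `cf.cacheTtl` fetch options are what you want. `BOT_CACHE_TTL_SECONDS` and `botFetchOptions` are hypothetical names introduced here for illustration, and the one-hour TTL is an arbitrary placeholder.

```javascript
// Sketch only: BOT_CACHE_TTL_SECONDS and botFetchOptions are hypothetical
// names, and the one-hour TTL is a placeholder value.
const BOT_CACHE_TTL_SECONDS = 3600;

// Build the second argument to fetch(). For matched bots, ask the edge cache
// to store the response (cf.cacheEverything) for a fixed TTL (cf.cacheTtl).
// For every other request, return no overrides.
function botFetchOptions(isBot, ttlSeconds = BOT_CACHE_TTL_SECONDS) {
  if (!isBot) return {};
  return { cf: { cacheEverything: true, cacheTtl: ttlSeconds } };
}

// Inside the Worker's fetch handler, the pass-through line would become
// something like:
//   return fetch(request, botFetchOptions(matchedSearchBot));
// where matchedSearchBot is whatever boolean your handler computes.
```

Keeping the decision in a small pure function makes it easy to unit-test outside the Workers runtime; whether cached responses are appropriate for a given bot depends on how personalised the origin's pages are.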

Fastly VCL snippet

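// Fastly VCL equivalent. As a VCL snippet of type "recv", paste only the
// if-block; in custom VCL, keep the sub wrapper and the #FASTLY recv macro.
sub vcl_recv {
#FASTLY recv
  // Block training crawlers (keep in sync with TRAINING_BOTS in the Worker).
  // Google-Extended and Applebot-Extended are robots.txt-only tokens that
  // never appear in a request UA, so they are left out of the pattern.
  if (req.http.User-Agent ~ "(?i)GPTBot|ClaudeBot|CCBot|FacebookBot|anthropic-ai") {
    error 403 "Forbidden";
  }
}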

Field notes

  • UA-only matching is spoofable: for high-trust decisions, combine UA matching with IP range verification. Several crawler operators publish IP range files or ASNs; check each operator's documentation, because not all of them do.
  • Separate training and search bot lists: GPTBot (training) and OAI-SearchBot (ChatGPT Search, real-time) are different tokens requiring independent decisions; conflating them is a common misconfiguration.
  • Keep the UA string list current — outdated lists silently fail to catch new or renamed crawlers.
  • Log 403s from this layer separately from origin 403s — edge-level blocks are invisible to most server-side log analysis unless explicitly forwarded to a logging sink.
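  • Akamai EdgeWorkers does not use the fetch-handler pattern: logic goes in event handlers such as onClientRequest(request), the UA is read with request.getHeader("User-Agent") (which returns an array of values), and a block is issued with request.respondWith(403, {}, "Forbidden").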
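
The IP range verification suggested in the first field note can be sketched as a CIDR check run alongside the UA match. This is an illustration under stated assumptions, not a definitive implementation: `ipv4ToInt`, `inCidr`, `isVerifiedBot`, and `VERIFIED_RANGES` are hypothetical names, the ranges shown are RFC 5737 documentation blocks rather than real crawler ranges, and only IPv4 is handled.

```javascript
// Convert a dotted-quad IPv4 string to an unsigned integer; null if malformed.
function ipv4ToInt(ip) {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let n = 0;
  for (const p of parts) {
    if (!/^\d{1,3}$/.test(p)) return null;
    const octet = Number(p);
    if (octet > 255) return null;
    n = n * 256 + octet;
  }
  return n;
}

// True when ip falls inside cidr (for example "192.0.2.0/24"). IPv4 only.
function inCidr(ip, cidr) {
  const [base, bitsStr] = cidr.split("/");
  const bits = Number(bitsStr);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null) return false;
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;
  const size = 2 ** (32 - bits); // number of addresses in the block
  return Math.floor(ipInt / size) === Math.floor(baseInt / size);
}

// PLACEHOLDER data: RFC 5737 documentation ranges, NOT real crawler ranges.
// Replace with the ranges each operator actually publishes.
const VERIFIED_RANGES = {
  GPTBot: ["192.0.2.0/24"],
  PerplexityBot: ["198.51.100.0/24"],
};

// True only when the UA names a listed bot AND the IP is in that bot's ranges.
function isVerifiedBot(ua, ip, ranges = VERIFIED_RANGES) {
  const lowerUa = ua.toLowerCase();
  return Object.entries(ranges).some(
    ([token, cidrs]) =>
      lowerUa.includes(token.toLowerCase()) && cidrs.some(c => inCidr(ip, c))
  );
}
```

In a Cloudflare Worker the client address is available as `request.headers.get("CF-Connecting-IP")`. A UA that claims to be a listed bot but fails the range check is a candidate for stricter treatment than a genuine crawler; how the published ranges are fetched and refreshed is left out of this sketch.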