AI Crawling

Why It Matters

As of 2025–2026, AI crawler traffic is growing rapidly as a share of total bot traffic, with training-purpose crawling accounting for approximately 80% of all AI bot activity. For content creators, AI Crawling is significant in two ways. First, you need to be able to control whether your content is used as training data for AI models without authorization. Second, if you want your content to be cited and surfaced in AI search engines (Perplexity, ChatGPT Search, Gemini, etc.), you must allow the relevant search crawlers to access your site. In other words, managing AI Crawling is a strategic challenge of balancing content protection with securing AI visibility (LLM Visibility).

Major AI Crawlers

As of 2026, the major AI crawlers, their operators, and primary purposes are as follows:

| User-Agent | Operator | Primary Purpose |
| --- | --- | --- |
| GPTBot | OpenAI | Model training data collection |
| OAI-SearchBot | OpenAI | ChatGPT search result generation |
| ChatGPT-User | OpenAI | Real-time page retrieval during user conversations |
| ClaudeBot | Anthropic | Model training data collection |
| Claude-SearchBot | Anthropic | Claude search result indexing |
| Claude-User | Anthropic | Real-time page retrieval for user queries |
| Google-Extended | Google | Gemini model training control token |
| PerplexityBot | Perplexity | Web crawling for AI search |
| CCBot | Common Crawl | Open web archive (used for training many AI models) |
| Bytespider | ByteDance | TikTok search and AI features |
| meta-externalagent | Meta | Meta AI feature support |
| Applebot-Extended | Apple | Apple Intelligence training |
| Amazonbot | Amazon | Alexa and Amazon AI services |

Googlebot accounts for 38.7% of all AI-related bot requests, followed by GPTBot at 12.8%, meta-externalagent at 11.6%, and ClaudeBot at 11.4%. Together, these four crawlers represent roughly 74.5% of all AI bot traffic.

How to Allow or Block AI Crawlers

AI crawler access is controlled through the robots.txt file. Most major AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) officially state that they comply with robots.txt directives.

Example: Blocking all AI training crawlers:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

Example: Blocking training while allowing AI search visibility:

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search/real-time retrieval crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Note that Google-Extended is a control token rather than a traditional crawler, so it does not appear directly in server logs. It is used to restrict Gemini training without blocking Googlebot itself.
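A policy like the second example can be sanity-checked locally before it is deployed. The following is a minimal sketch using Python's standard-library urllib.robotparser. The example.com URL and the check_access helper are placeholders introduced for illustration, and the standard-library parser may differ in detail from how each crawler interprets robots.txt, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# The "block training, allow search" policy from the example above.
ROBOTS_TXT = """\
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search/real-time retrieval crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def check_access(robots_txt, user_agents, url="https://example.com/post"):
    """Return {user_agent: bool} saying whether each agent may fetch url.

    check_access and the default URL are hypothetical, not part of any
    crawler's tooling.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {ua: parser.can_fetch(ua, url) for ua in user_agents}


if __name__ == "__main__":
    agents = ["GPTBot", "ClaudeBot", "CCBot",
              "ChatGPT-User", "Claude-SearchBot", "PerplexityBot"]
    for agent, allowed in check_access(ROBOTS_TXT, agents).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Agents with no matching group (and no wildcard group) are treated as allowed by this parser, which mirrors the default behavior of robots.txt.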

Strategic Considerations

Trade-off between training blocking and AI search visibility: Blocking all AI crawlers wholesale protects your content but prevents it from being cited in AI search results. Selectively allowing access, by distinguishing training bots from search and retrieval bots, is the most commonly recommended strategy as of 2026.
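The selective strategy can be expressed as a small generator that turns a purpose classification into robots.txt groups. This is only a sketch: the purpose labels are taken from the table in "Major AI Crawlers", but grouping them into "training", "search", and "retrieval" is an assumption made for illustration, and build_robots_txt is a hypothetical helper.

```python
# Purpose labels follow the crawler table; the three-way grouping is an
# illustrative assumption, not an official classification.
CRAWLER_PURPOSE = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "Google-Extended": "training",
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "retrieval",
    "Claude-User": "retrieval",
}


def build_robots_txt(purposes, blocked=("training",)):
    """Emit one robots.txt group per agent, blocking the given purposes."""
    groups = []
    for agent, purpose in purposes.items():
        rule = "Disallow: /" if purpose in blocked else "Allow: /"
        groups.append(f"User-agent: {agent}\n{rule}")
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(CRAWLER_PURPOSE))
```

Keeping the classification in one mapping means a policy change (for example, also blocking "retrieval") is a one-argument change rather than a hand edit of every group.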

Regular audits are essential: AI companies frequently introduce new crawler User-Agents. When Anthropic consolidated its previous anthropic-ai and Claude-Web agents into ClaudeBot, sites that did not update their rules were inadvertently left accessible. You should review your robots.txt at least once per quarter.
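A quarterly audit of this kind can be partly automated by comparing the User-agent groups in robots.txt against a list of crawler tokens you care about. The sketch below is a simplified check under stated assumptions: agents_without_rules is a hypothetical helper, the known-agent list must be maintained by hand, and a wildcard (`*`) group is deliberately not counted as coverage, so the function reports only agents that lack an explicit group.

```python
def agents_without_rules(robots_txt, known_agents):
    """Return the known agents that have no explicit User-agent group."""
    declared = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            declared.add(value.strip().lower())
    return [agent for agent in known_agents if agent.lower() not in declared]


if __name__ == "__main__":
    # A robots.txt that still targets only the older Anthropic agents.
    legacy_rules = """\
User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: GPTBot
Disallow: /
"""
    print(agents_without_rules(legacy_rules, ["GPTBot", "ClaudeBot", "CCBot"]))
```

In the legacy example, ClaudeBot and CCBot are reported because no group names them, which is the gap described above for sites that never updated their Anthropic rules.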

Cloudflare Pay-per-Crawl: In July 2025, Cloudflare began blocking AI crawlers by default on new domains and launched Pay-per-Crawl in beta, an HTTP 402-based marketplace that allows site owners to receive micropayments of $0.01–$0.05 per AI bot crawl request. This has attracted attention as a new option for content monetization. In September 2025, Cloudflare followed with the Content Signals Policy, a robots.txt extension for declaring how content may be used (search, ai-input, ai-train). By June 2026, Cloudflare reported that automated traffic had overtaken human traffic at 57.5% of all HTTP requests, with crawl-to-referral ratios of roughly 857:1 for OpenAI and 11,000:1 for Anthropic, making the economics of allowing crawls an increasingly explicit consideration.

Server log monitoring: Even after configuring robots.txt, it is important to verify through server logs that crawlers are actually complying with your directives. Some smaller AI crawlers have been reported to ignore robots.txt, in which case firewall-level blocking may be necessary.
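Compliance can be spot-checked by replaying access-log entries against your own robots.txt rules. The sketch below assumes logs in the common "combined" format, matches crawlers by a hand-maintained list of User-Agent tokens, and uses Python's standard-library parser to decide what is disallowed. The find_violations helper, the token list, and the sample log lines are all illustrative assumptions, and User-Agent strings can be spoofed, so a match is a hint rather than proof of identity.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Combined Log Format: ... "METHOD PATH PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Subset of tokens from the crawler table; extend as needed.
AI_AGENT_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot"]


def find_violations(log_lines, robots_txt, tokens=AI_AGENT_TOKENS):
    """Count requests per AI crawler token that hit a disallowed path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not in the assumed log format
        path = match.group("path")
        if path == "/robots.txt":
            continue  # fetching robots.txt itself is expected behavior
        user_agent = match.group("ua").lower()
        for token in tokens:
            if token.lower() in user_agent and not parser.can_fetch(token, path):
                violations[token] += 1
    return violations


if __name__ == "__main__":
    robots = "User-agent: GPTBot\nDisallow: /\n"
    # Invented sample lines using documentation-reserved IP ranges.
    sample = [
        '203.0.113.5 - - [10/Jan/2026:12:00:00 +0000] "GET /post HTTP/1.1" 200 5120 "-" "GPTBot/1.0 (sample)"',
        '198.51.100.7 - - [10/Jan/2026:12:00:05 +0000] "GET /post HTTP/1.1" 200 5120 "-" "PerplexityBot/1.0 (sample)"',
    ]
    for token, count in find_violations(sample, robots).items():
        print(f"{token}: {count} disallowed request(s)")
```

Tokens that show up repeatedly in the result are candidates for the firewall-level blocking mentioned above.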

How Powerblog Helps

Powerblog's robots.txt allows search engine crawlers by default. Per-bot AI crawler settings (allow/block) can be managed through the dashboard's robots.txt editor.
