AI crawler traffic has exploded. According to Cloudflare's 2025 data, GPTBot requests increased 305% year-over-year, jumping from the #9 crawler to #3. PerplexityBot saw even more dramatic growth: a 157,490% increase in raw requests.
These bots power the AI systems that increasingly determine whether your products get recommended. Understanding how they work—and how they differ from traditional search crawlers—is essential for AI visibility.
The AI Crawler Landscape
Who's Crawling Your Site?
The major AI crawlers active in 2025:
| Bot | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training + ChatGPT Search |
| ChatGPT-User | OpenAI | Real-time ChatGPT browsing |
| ClaudeBot | Anthropic | Claude training data |
| PerplexityBot | Perplexity | Real-time search results |
| Bytespider | ByteDance | TikTok AI features |
| Applebot-Extended | Apple | Apple Intelligence features |
| Meta-ExternalAgent | Meta | AI training |
| Google-Extended | Google | Gemini training |
Traffic Volume by Bot
Cloudflare's data on AI crawler growth (May 2024 to May 2025):
| Bot | Growth | Current Share |
|---|---|---|
| GPTBot | +305% | Jumped from #9 to #3 |
| ChatGPT-User | +2,825% | 1.3% share |
| PerplexityBot | +157,490% | 0.2% share (but massive growth) |
| ClaudeBot | −46% (peaked mid-year) | 5.4% share (down from 11.7%) |
The pattern varies by bot: GPTBot and PerplexityBot are accelerating, while ClaudeBot has pulled back after aggressive early-year crawling.
How AI Crawlers Differ from Googlebot
Crawling Behavior
Googlebot:
- Systematic, ongoing crawling
- Respects crawl-delay directives
- Focuses on indexing all content
- Updates based on content freshness signals
- Well-established, predictable patterns
AI Crawlers:
- More sporadic, burst-oriented crawling
- Variable respect for crawl-delay
- Focused on extracting training data
- Less predictable patterns
- Some (like ChatGPT-User) crawl in real-time per user query
What They Extract
Googlebot indexes content for search rankings. AI crawlers extract content for:
- Training data (GPTBot, ClaudeBot, Meta-ExternalAgent)
- Real-time answers (ChatGPT-User, PerplexityBot)
- Knowledge synthesis (combining your content with other sources)
This means AI bots care about:
- Clean, extractable text content
- Structured data they can parse
- Factual, citable information
- Clear attribution signals
Crawl-to-Refer Ratios
A critical metric: how much crawling generates how much referral traffic?
| Platform | Crawl-to-Refer Ratio | Meaning |
|---|---|---|
| Perplexity | Under 200:1 | Fewer than 200 crawl requests per referral visit |
| OpenAI | Up to 3,700:1 | Up to 3,700 crawl requests per referral visit |
| Anthropic | 25,000:1 to 100,000:1 | Mostly training, minimal referrals |
Why this matters:
Perplexity's low ratio means they efficiently convert crawling into actual traffic for publishers. They crawl your content and send visitors back.
Anthropic's high ratio means they're primarily extracting training data—lots of crawling, minimal direct traffic. Your content improves Claude, but users don't click through to you.
OpenAI is in the middle, with ChatGPT Search driving more referrals than pure training crawls.
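If you want to compute this metric for your own site, it's a simple division of two numbers you already have: crawl requests from your server logs and referral visits from your analytics. A minimal sketch (the counts below are hypothetical, not Cloudflare's figures):

```python
def crawl_to_refer_ratio(crawl_requests, referral_visits):
    """Crawl requests per referral visit; infinity when a bot sends no traffic."""
    if referral_visits == 0:
        return float("inf")
    return crawl_requests / referral_visits

# Hypothetical monthly totals from your own logs and analytics:
ratio = crawl_to_refer_ratio(crawl_requests=74000, referral_visits=20)
print(f"{ratio:,.0f}:1")  # 3,700:1
```

Run this per bot to see which platforms actually send visitors back for the crawling they do.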
The Blocking Debate
Current Blocking Trends
AI bots face more aggressive blocking than traditional search crawlers:
- AI crawlers have the highest number of fully disallowed directives
- Googlebot and Bingbot are typically restricted on specific paths, not sitewide
- News publishers and content creators are most likely to block
Should You Block?
Arguments for blocking:
- Protect training data: You may not want your content used to train AI models
- Reduce server load: AI crawlers can be aggressive
- Control licensing: Some publishers want payment for AI training use
- Protect competitive advantage: Unique content helps AI competitors
Arguments against blocking:
- Lose AI visibility: Blocked sites can't be cited in AI answers
- Miss referral traffic: Especially from Perplexity's efficient referral model
- Remove from recommendations: AI can't recommend what it can't see
- Competitive disadvantage: Competitors who allow crawling get visibility you don't
The Visibility Trade-off
For e-commerce and most businesses, blocking AI crawlers is usually counterproductive:
Block AI crawlers → AI can't access your content
→ AI can't cite or recommend you
→ Users asking AI for product recommendations never see you
→ Competitors who allow crawling capture that visibility
Unless you have specific concerns about training data extraction (news publishers, content licensing businesses), the visibility benefits typically outweigh the costs.
Robots.txt Configuration
Identifying AI Bots
The major AI user-agents to know:
# OpenAI
User-agent: GPTBot
User-agent: ChatGPT-User
# Anthropic
User-agent: ClaudeBot
User-agent: anthropic-ai
# Perplexity
User-agent: PerplexityBot
# Google AI
User-agent: Google-Extended
# Meta
User-agent: Meta-ExternalAgent
# Apple
User-agent: Applebot-Extended
# ByteDance
User-agent: Bytespider
Allow All (Recommended for Most)
For maximum AI visibility:
# Allow all AI crawlers
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: ClaudeBot
Allow: /
Or simply don't mention them—absence of disallow means allow.
Block Training, Allow Search
Some businesses want to appear in AI search but not contribute to training data. This is tricky because:
- ChatGPT-User = real-time search (you probably want this)
- GPTBot = training + search preparation (harder to separate)
- PerplexityBot = real-time search (you probably want this)
- ClaudeBot = primarily training
A nuanced approach:
# Block pure training bots
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# Allow search-focused bots
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
Important caveat: This distinction isn't perfect. GPTBot serves both training and search preparation, and the line is blurry.
Selective Path Blocking
Block specific sections while allowing others:
User-agent: GPTBot
Disallow: /admin/
Disallow: /internal-docs/
Allow: /products/
Allow: /blog/
Allow: /
This protects sensitive areas while allowing AI access to public content.
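You can sanity-check rules like these before deploying them. Python's standard-library `urllib.robotparser` evaluates allow/disallow rules for a given user-agent (note: its first-match semantics can differ from Google's longest-match rule for overlapping patterns, so treat this as a quick check, not a guarantee of how every crawler behaves):

```python
from urllib.robotparser import RobotFileParser

# The selective-blocking rules from above, as robots.txt lines
rules = """\
User-agent: GPTBot
Disallow: /admin/
Disallow: /internal-docs/
Allow: /products/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "/admin/settings"))   # False
print(rp.can_fetch("GPTBot", "/products/widget"))  # True
```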
Technical Optimization for AI Crawlers
Page Speed
AI crawlers, like all bots, have limited crawl budgets. Faster pages mean:
- More pages crawled per session
- More content extracted
- Better coverage of your site
Optimization priorities:
- Server response time under 200ms
- Total page load under 3 seconds
- Efficient asset delivery (CDN, compression)
JavaScript Rendering
AI crawlers vary in JavaScript handling:
| Bot | JS Rendering |
|---|---|
| GPTBot | Limited—prefers pre-rendered content |
| PerplexityBot | Limited—focuses on HTML content |
| Googlebot | Full rendering capability |
For AI visibility:
- Use server-side rendering (SSR) or static generation
- Ensure critical content is in initial HTML
- Don't hide important content behind JavaScript interactions
- Test with JavaScript disabled to see what AI crawlers see
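A quick way to approximate the non-rendering crawler view is to fetch the raw HTML with a bot-style user-agent and check that your critical copy is present before any JavaScript runs. A sketch (the user-agent string and demo HTML are illustrative):

```python
import urllib.request

def fetch_as_bot(url, user_agent="GPTBot/1.1"):
    """Fetch raw HTML the way a non-rendering crawler would (no JS execution)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def critical_content_present(html, phrases):
    """Return the phrases missing from the initial HTML (i.e. JS-injected)."""
    return [p for p in phrases if p not in html]

# Demo against a static snippet; swap in fetch_as_bot(your_url) for a live check:
html = "<article><h1>Acme Widget</h1><p>In stock. $19.99</p></article>"
missing = critical_content_present(html, ["Acme Widget", "$19.99"])
print(missing)  # []
```

If the missing list is non-empty for product names, prices, or key descriptions, that content likely isn't reaching AI crawlers.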
Content Accessibility
Make content easily extractable:
Do:
- Use semantic HTML (h1, h2, p, article, etc.)
- Include structured data (Schema.org)
- Keep important content in text (not images)
- Use clear, descriptive headings
Don't:
- Hide content in tabs/accordions for initial load
- Put key information only in images
- Use heavy JavaScript for basic content
- Rely on client-side rendering for main content
Crawl Efficiency
Help bots find what matters:
Sitemap.xml:
- Include all important pages
- Update when content changes
- Prioritize high-value pages
- Use lastmod dates accurately
Internal linking:
- Link to important pages from high-traffic pages
- Use descriptive anchor text
- Create clear site hierarchy
- Avoid orphan pages
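Putting the sitemap guidance together, a minimal entry looks like this (URLs and dates are illustrative; the key discipline is keeping `lastmod` accurate to real content changes):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/widget</loc>
    <lastmod>2025-06-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/widget-guide</loc>
    <lastmod>2025-05-20</lastmod>
  </url>
</urlset>
```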
Monitoring AI Crawler Activity
Server Log Analysis
Your server logs show AI crawler behavior:
# Find GPTBot requests
grep "GPTBot" access.log | wc -l
# See what GPTBot crawls most
grep "GPTBot" access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
# Check crawl frequency over time
grep "GPTBot" access.log | awk '{print $4}' | cut -d: -f1 | uniq -c
What to look for:
- Which pages are crawled most?
- Is crawl frequency increasing or decreasing?
- Are important pages being missed?
- Any errors (4xx, 5xx) affecting AI crawlers?
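For recurring analysis, a small script can answer these questions across all AI bots at once. A sketch assuming combined log format (the sample lines and bot list are illustrative):

```python
import re
from collections import Counter

AI_BOTS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")

# Matches the request path and status code of a combined-log-format line.
REQUEST = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3})')

def summarize(log_lines):
    """Count per-bot hits and 4xx/5xx errors from access-log lines."""
    hits, errors = Counter(), Counter()
    for line in log_lines:
        bot = next((b for b in AI_BOTS if b in line), None)
        if not bot:
            continue
        hits[bot] += 1
        m = REQUEST.search(line)
        if m and m.group("status")[0] in "45":
            errors[bot] += 1
    return hits, errors

sample = [
    '1.2.3.4 - - [10/Oct/2025:12:00:01 +0000] "GET /products/widget HTTP/1.1" 200 512 "-" "Mozilla/5.0 ... GPTBot/1.1"',
    '1.2.3.4 - - [10/Oct/2025:12:00:02 +0000] "GET /admin/ HTTP/1.1" 403 64 "-" "Mozilla/5.0 ... GPTBot/1.1"',
]
hits, errors = summarize(sample)
print(hits["GPTBot"], errors["GPTBot"])  # 2 1
```

A rising error count for a bot you intend to allow is worth investigating immediately.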
Robots.txt Testing
Verify your robots.txt works as intended:
Google Search Console's robots.txt report (the old standalone robots.txt Tester tool has been retired)
OpenAI's guidance: https://platform.openai.com/docs/gptbot
Test with different user-agents to ensure correct allow/disallow behavior.
Crawl-to-Visibility Correlation
Track whether AI crawling translates to visibility:
- Monitor AI crawler activity in logs
- Track your citations in ChatGPT/Perplexity
- Look for correlation between crawl coverage and citation frequency
If AI bots are crawling but you're not getting cited, the issue is content quality or authority—not crawler access.
Common Mistakes
1. Blocking All Bots
Some sites use overly aggressive blocking:
# DON'T DO THIS (blocks everything)
User-agent: *
Disallow: /
This blocks Googlebot, AI bots, and all legitimate crawlers.
2. Assuming Compliance
Robots.txt is a voluntary protocol. Well-established companies (Google, OpenAI, Anthropic) generally comply. Lesser-known or poorly-designed bots may ignore it entirely.
For actual security, you need:
- Authentication for sensitive content
- Rate limiting at the server level
- Firewall rules for specific IPs/user-agents
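At the web-server level, one common pattern is to combine a user-agent block with rate limiting. An nginx sketch (zone name, rate, and the targeted bot are illustrative; user-agent strings are spoofable, so pair this with IP rules for anything security-critical):

```nginx
# http context: shared rate-limit zone keyed by client IP
limit_req_zone $binary_remote_addr zone=bots:10m rate=1r/s;

server {
    location / {
        # Hard-block a bot that ignores robots.txt
        if ($http_user_agent ~* "Bytespider") {
            return 403;
        }
        # Throttle everything else to the zone's rate, with a small burst
        limit_req zone=bots burst=5;
    }
}
```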
3. Forgetting ChatGPT-User
Many sites block GPTBot but forget ChatGPT-User:
User-agent: GPTBot
Disallow: /
# But ChatGPT-User is still allowed!
When ChatGPT users browse the web, they use ChatGPT-User agent. If you want to block OpenAI entirely, block both.
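Blocking both looks like this:

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
```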
4. Not Testing Changes
After editing robots.txt:
- Verify syntax is correct
- Test with multiple user-agents
- Monitor for unintended blocking
- Check server logs for crawler behavior changes
See How AI Crawlers View Your Site
PageX analyzes your site from an AI crawler perspective—identifying accessibility issues, content extraction problems, and optimization opportunities.
Frequently Asked Questions
Do AI crawlers respect robots.txt?
Major AI companies (OpenAI, Anthropic, Google, Perplexity) generally respect robots.txt. However, compliance is voluntary—there's no enforcement mechanism. Lesser-known bots may ignore directives entirely.
Will blocking AI crawlers hurt my Google rankings?
No, blocking AI crawlers (GPTBot, ClaudeBot) doesn't affect Google rankings. These are separate from Googlebot. However, blocking Google-Extended might affect Gemini visibility.
How often do AI crawlers visit my site?
It varies significantly by site authority and content freshness. High-authority sites with frequent updates may see daily visits. Smaller sites might see weekly or less frequent crawling.
Can I charge AI companies for crawling my content?
Some publishers are pursuing licensing deals with AI companies. For most businesses, this isn't practical—the value is in visibility, not licensing fees. Major publishers (news organizations, content platforms) have more leverage for licensing negotiations.
Should I add AI-specific structured data?
Standard Schema.org markup serves both traditional search and AI systems. There's no special "AI-only" structured data needed. Focus on comprehensive implementation of Product, FAQ, HowTo, and other relevant schemas.
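For reference, a minimal Product schema in JSON-LD looks like this (all values are illustrative placeholders):

```json
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Acme Widget",
  "description": "Example product description.",
  "offers": {
    "@type": "Offer",
    "price": "19.99",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  }
}
```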
Related Reading
- The State of AI Search in 2025 - Market data on AI search growth and adoption
- Schema Markup for AI Search - Structured data that helps AI extract your content
- 10 GEO Mistakes That Make Your Store Invisible - Common technical errors including crawler blocking
- Measuring AI Search Success - Tools to track if optimization is working
- How to Rank in ChatGPT Search - Platform-specific optimization tactics