robots.txt for AI: Rolling Out the Red Carpet (or Slamming the Door)

Most sites run default platform robots.txt with zero AI-specific rules. That's not a strategy - it's an accident. Explicit Allow rules for GPTBot, ClaudeBot, and PerplexityBot signal that your content is open for citation.

One of 48 criteria in AEO Rank, the citation-readiness score we run against every site we audit.

By Alex Shortov

Effort: low. Impact: low.

Quick Answer

Add explicit Allow rules in your robots.txt for every AI crawler that ships its own user agent: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Perplexity-User, Google-Extended, Bingbot, Applebot, Applebot-Extended, Meta-ExternalAgent, and CCBot. Without these rules, AI systems do not know if you want to be cited. This takes 5 minutes and it is one of the 48 criteria we score in every audit.

Audit Note

In our audits, we've measured AI crawler directives in robots.txt on live sites, compared implementations, and audited the gaps that keep scores low.

How do I set up robots.txt to allow AI crawlers like GPTBot and ClaudeBot?

Add explicit Allow rules for GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended, Bingbot, Applebot, Meta-ExternalAgent, and CCBot in robots.txt.

Should I block or allow AI bots in my robots.txt file?

Allow named AI bots explicitly because blanket blocks create complete AI invisibility, while clear allows signal deliberate participation and lift your AI Discovery pillar.

What AI crawlers exist and which ones should my site allow access to?

Major AI crawlers include GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, Bingbot, Meta-ExternalAgent, and CCBot, all of which should get Allow rules.


[Infographic: AI Crawler Access Directives. GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, and CCBot are all shown as allowed.]
[Video walkthrough: robots.txt for AI: Rolling Out the Red Carpet (or Slamming the Door). Companion to the AEO Rank criterion explained below.]

What this article answers

  • How do I set up robots.txt to allow AI crawlers like GPTBot and ClaudeBot?
  • Should I block or allow AI bots in my robots.txt file?
  • What AI crawlers exist and which ones should my site allow access to?

Key takeaways

  • Add explicit Allow rules for every documented AI agent: OpenAI (GPTBot, OAI-SearchBot, ChatGPT-User), Anthropic (ClaudeBot, Claude-Web, anthropic-ai), Perplexity (PerplexityBot, Perplexity-User), Google (Google-Extended, Googlebot), Microsoft (Bingbot), Apple (Applebot, Applebot-Extended), Meta (Meta-ExternalAgent), and Common Crawl (CCBot).
  • Leave undocumented or controversial scrapers under the User-agent: * wildcard rather than naming them - listing a specific Disallow can backfire if the bot ignores it (Bytespider is the canonical example).
  • Always include a Sitemap reference in your robots.txt so crawlers can navigate your site efficiently.
  • Remember robots.txt is advisory, not enforced - it is a communication tool, not a security measure.

Which AI Crawlers Are Trying to Access Your Site?

Six AI-specific crawlers (GPTBot, CCBot, Google-Extended, PerplexityBot, ClaudeBot, Bytespider) request content for training and retrieval, each controllable via robots.txt rules.

robots.txt is a text file at your domain root (example.com/robots.txt) that tells crawlers what they can access. Simple concept. But the crawler landscape has shifted dramatically - there’s now a whole category of AI-specific bots that collect content for training and retrieval.

Here’s who’s knocking:

  • GPTBot - OpenAI (ChatGPT, GPT-based products)
  • CCBot - Common Crawl (feeds into many AI training sets)
  • Google-Extended - Google (Gemini)
  • PerplexityBot - Perplexity AI
  • ClaudeBot - Anthropic (Claude; anthropic-ai is the legacy token)
  • Bytespider - ByteDance (TikTok’s AI features)

Your robots.txt can explicitly Allow or Disallow each one. That’s granular control over which AI systems can use your content - and which can’t.

Six AI-specific crawlers request content for training and retrieval, and each maps to a different downstream engine.

Crawler           Vendor         Engine Powered
GPTBot            OpenAI         ChatGPT
CCBot             Common Crawl   Many training sets
Google-Extended   Google         Gemini
PerplexityBot     Perplexity     Perplexity
ClaudeBot         Anthropic      Claude
Bytespider        ByteDance      TikTok AI features
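
To see how per-crawler rules resolve, here is a minimal sketch using Python's standard-library urllib.robotparser. The policy and the example.com URL are hypothetical, chosen only to show one crawler allowed and another turned away. Real crawlers may handle edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: welcome GPTBot, turn CCBot away, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for agent in ("GPTBot", "CCBot", "PerplexityBot"):
    # PerplexityBot has no group of its own, so it falls back to "*".
    print(agent, can_crawl(ROBOTS_TXT, agent, "https://example.com/blog/post"))
# GPTBot True
# CCBot False
# PerplexityBot True
```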

Why Is Having No AI Crawler Policy the Worst Option?

Default robots.txt files send no signal of AI-friendliness, leaving each bot’s behavior to its own defaults and surrendering your control over what gets cited.

We check robots.txt on every audit. Here’s what we find 80% of the time: the default robots.txt from Shopify, WordPress, or whatever platform the site runs on. Zero AI-specific rules. Zero intentionality.

This creates two problems:

First - no signal of AI-friendliness. When you explicitly Allow AI crawlers, you’re telling these systems “my content is available and welcome for citation.” That’s a signal. Default configs send no signal at all.

Second - no control. Without explicit rules, you’re leaving it up to each bot’s default behavior. Some crawl everything. Some play it safe and skip you. You have no say.

For AEO, the strategic play is clear: Allow AI crawlers on content you want cited (blog posts, product pages, FAQ, knowledge base) and block areas that don’t need indexing (admin panels, checkout flows, internal tools).
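
One way to combine "allow the content, block the back office" is to generate the file, sketched below. The agent list, the private paths, and the build_robots_txt helper are all illustrative assumptions, not a prescribed layout. The private paths are repeated in every named group because, under the Robots Exclusion Protocol, a crawler that matches its own group ignores the User-agent: * group entirely.

```python
from urllib.robotparser import RobotFileParser

# Illustrative values only: swap in your own agents, paths, and sitemap URL.
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]
PRIVATE_PATHS = ["/admin/", "/checkout/", "/internal/"]


def build_robots_txt(agents, private_paths, sitemap_url):
    """Render one group per agent: block private paths, allow everything else."""
    groups = []
    for agent in ["*", *agents]:
        lines = [f"User-agent: {agent}"]
        # Disallow lines come first because Python's parser uses the first
        # matching rule; spec-compliant crawlers use the longest match instead.
        lines += [f"Disallow: {path}" for path in private_paths]
        lines.append("Allow: /")
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + f"\n\nSitemap: {sitemap_url}\n"


robots_txt = build_robots_txt(
    AI_AGENTS, PRIVATE_PATHS, "https://yoursite.com/sitemap.xml"
)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("GPTBot", "https://yoursite.com/blog/post"))      # True
print(parser.can_fetch("GPTBot", "https://yoursite.com/checkout/cart"))  # False
```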

This criterion carries 1% raw weight in the AI Discovery pillar of our scoring model. That sounds small, but it gates whether any of the other AI Discovery work shows up at all - a blocked robots.txt makes every other discovery signal moot.

How Do You Configure robots.txt for AI Crawlers?

Add these rules to your robots.txt. This is the full 2026 AEO allow-list - every major AI vendor’s documented user agent, including their separate live-search and user-triggered variants:

User-agent: *
Allow: /

# OpenAI
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic
User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: anthropic-ai
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# Google (Gemini opt-in token; AI Overviews follow Googlebot)
User-agent: Google-Extended
Allow: /

User-agent: Googlebot
Allow: /

# Microsoft Bing (powers ChatGPT Search)
User-agent: Bingbot
Allow: /

# Apple Intelligence + Spotlight
User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Allow: /

# Meta AI
User-agent: Meta-ExternalAgent
Allow: /

# Common Crawl (powers many open-source training sets)
User-agent: CCBot
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

A few important notes:

  • More than one user agent per vendor: OpenAI has GPTBot (training) plus OAI-SearchBot and ChatGPT-User (live retrieval). Anthropic has ClaudeBot plus the legacy Claude-Web / anthropic-ai. Perplexity has PerplexityBot plus Perplexity-User. Listing only the training crawler is the most common mistake - it cuts you out of live citation answers.
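  • Apple ships two agents too: Applebot powers Spotlight and Siri search. Applebot-Extended is a separate control token that governs whether content Applebot has already crawled can be used to train Apple Intelligence models; it does not crawl pages itself. Apple treats these as distinct decisions.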
  • Bytespider and Grok crawler: neither vendor publishes documented allow/disallow guidance and Bytespider has been observed ignoring robots.txt directives. Most operators leave them under the User-agent: * blanket rather than naming them - same effective allow, fewer surprises if the UA token changes.
  • Crawl-delay: leaving it off (as we do) is fine for most sites. Add Crawl-delay: 2 only if you actually see AI crawlers stressing your server.
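
If you do add a Crawl-delay, you can check that it is attached to the group you intended. The sketch below uses Python's urllib.robotparser with a hypothetical file that sets a delay for PerplexityBot only. Whether a given crawler honors the value is an assumption; Crawl-delay is a non-standard directive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: a delay for one crawler, none for the rest.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Crawl-delay: 2
Allow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns the delay for the matching group, or None if unset.
print(parser.crawl_delay("PerplexityBot"))  # 2
print(parser.crawl_delay("GPTBot"))         # None
```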

Shopify: Edit robots.txt.liquid in your theme (Online Store > Themes > Edit code > Templates > robots.txt.liquid).

WordPress: Use Yoast SEO to edit robots.txt, or edit the file directly in your root directory.

Next.js / static sites: Create a public/robots.txt file or generate it dynamically via an API route. Our site generates it dynamically.

Start here: Open yoursite.com/robots.txt right now. If you don’t see GPTBot or ClaudeBot mentioned, you’ve got work to do. Five minutes of work.
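
That manual check can be scripted. The sketch below scans robots.txt text for User-agent lines and reports which agents from the allow-list are not named. EXPECTED_AGENTS and both helper functions are assumptions made for illustration, and the sample file stands in for whatever your platform ships by default.

```python
import re

# Agents from the allow-list in this article; adjust to your own policy.
EXPECTED_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "Perplexity-User", "Google-Extended", "Bingbot",
    "Applebot", "Applebot-Extended", "Meta-ExternalAgent", "CCBot",
]

USER_AGENT_LINE = re.compile(
    r"^\s*user-agent\s*:\s*([^#\s]+)", re.IGNORECASE | re.MULTILINE
)


def named_agents(robots_txt: str) -> set[str]:
    """Collect every User-agent token in the file, lowercased."""
    return {token.lower() for token in USER_AGENT_LINE.findall(robots_txt)}


def missing_agents(robots_txt: str, expected=EXPECTED_AGENTS) -> list[str]:
    """Return the expected agents that have no group of their own."""
    named = named_agents(robots_txt)
    return [agent for agent in expected if agent.lower() not in named]


# Stand-in for a default platform file with no AI-specific rules.
default_file = "User-agent: *\nDisallow: /admin/\n"
print(missing_agents(default_file))  # every expected agent is missing
```

An empty list means every agent on your list has its own group. It does not tell you whether that group allows or blocks the crawler.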

What robots.txt Mistakes Should You Avoid?

Blanket-blocking AI bots, forgetting the Sitemap directive, writing overly broad Disallow rules, and treating robots.txt as security each undermine your AEO citation surface.

Blocking all AI crawlers. We’ve seen this - sites that blanket-block every AI bot thinking they’re “protecting their content.” The result? Complete AI invisibility. Nobody’s citing you. Nobody’s recommending you. That’s not protection - that’s disappearance.

Forgetting robots.txt is advisory, not enforced. Well-behaved bots follow it. Malicious scrapers don’t. It’s not a security measure - it’s a communication tool.

Missing the Sitemap reference. Always include Sitemap: https://yoursite.com/sitemap.xml in your robots.txt. It’s the roadmap that makes crawling efficient.

Overly broad Disallow rules. Disallow: /blog when you meant to block /blog/drafts - now your entire blog is invisible to AI. Be specific.
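
Disallow rules match by path prefix, which is why /blog catches every post. A small sketch with Python's urllib.robotparser shows the difference; the allowed helper and the yoursite.com URLs are illustrative only.

```python
from urllib.robotparser import RobotFileParser


def allowed(rule_lines, url, agent="GPTBot"):
    """Check `url` against a single wildcard group built from `rule_lines`."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", *rule_lines])
    return parser.can_fetch(agent, url)


post = "https://yoursite.com/blog/my-post"
draft = "https://yoursite.com/blog/drafts/wip"

# Too broad: the /blog prefix matches every URL under the blog.
print(allowed(["Disallow: /blog"], post))           # False

# Specific: only the drafts folder is blocked.
print(allowed(["Disallow: /blog/drafts"], post))    # True
print(allowed(["Disallow: /blog/drafts"], draft))   # False
```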

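Not testing after changes. A typo in robots.txt can accidentally block your entire site. Run the file through a robots.txt validator before deploying, then check the robots.txt report in Google Search Console to confirm the live file parses cleanly (Google has retired its standalone robots.txt Tester).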

Platform limitations. Shopify’s robots.txt is partially platform-managed. Know what you can and can’t control on your stack.

Score Impact in Practice

Robots.txt AI directives carry only 1% raw weight, and five minutes of work is enough to score 7-9/10, yet 80% of audited sites still ship default platform configurations.

The AI Crawler Directives criterion carries 1% raw weight in the AI Discovery pillar. Sites with explicit Allow rules for AI crawlers score 7-9/10 on this criterion. Sites with default platform robots.txt files that contain no AI-specific rules score 3-4/10. Sites that actively block AI crawlers score 0/10.

In practice, robots.txt is one of the lowest-effort, lowest-risk criteria to max out. It takes 5 minutes to add the rules, there’s virtually no downside, and it signals intentionality to every AI engine that checks. Despite this, roughly 80% of the sites we audit have no AI-specific rules in their robots.txt.

Among Y Combinator startups we’ve benchmarked, adoption is slightly higher - about 30% have explicit AI crawler rules. But even in this technically sophisticated cohort, the majority run default platform configs. The sites that do include AI directives tend to score higher across all technical criteria, not because robots.txt directly improves other scores, but because attention to AI crawlers correlates with attention to crawlability in general.

How AI Engines Evaluate This

Each AI crawler checks robots.txt before crawling any page on your site. The behavior on finding (or not finding) specific rules varies by engine.

GPTBot (OpenAI) respects robots.txt strictly. If your robots.txt has no GPTBot-specific rule, GPTBot falls back to the general User-agent: * rules. If those rules allow access, GPTBot will crawl, but the permission is inherited from the wildcard rather than stated for OpenAI specifically. An explicit User-agent: GPTBot / Allow: / removes that ambiguity and signals that you welcome AI indexing.
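
The fallback behavior can be sketched with Python's urllib.robotparser. The file below is hypothetical. It also shows the flip side: once a crawler has its own group, the wildcard rules no longer apply to it, so any Disallow you still want must be repeated inside that group.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: a wildcard group plus a dedicated GPTBot group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# ClaudeBot has no group of its own, so the "*" rules apply to it.
print(parser.can_fetch("ClaudeBot", "https://yoursite.com/blog/post"))       # True
print(parser.can_fetch("ClaudeBot", "https://yoursite.com/private/report"))  # False

# GPTBot matches its own group and ignores the "*" Disallow.
print(parser.can_fetch("GPTBot", "https://yoursite.com/private/report"))     # True
```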

ClaudeBot (Anthropic) checks for both anthropic-ai and ClaudeBot user-agent strings. Anthropic has been particularly careful about respecting opt-out signals. If your robots.txt blocks either user-agent string, Claude will not use your content for responses. The flip side: an explicit Allow for anthropic-ai is a positive signal that feeds into Anthropic’s source confidence scoring.

PerplexityBot checks robots.txt and also looks for a Crawl-delay directive. Perplexity’s crawler is high-frequency because it builds answers in real time, so the Crawl-delay value matters more here than for training-focused crawlers. If you see it stressing your server, Crawl-delay: 2 can ease the load while keeping the door open for citation. Crawl-delay is a non-standard directive that not every crawler honors, so treat it as a request rather than a guarantee.

Google-Extended is the robots.txt token that controls whether Google may use your content for its generative AI models (Gemini). It’s separate from Googlebot, which handles Search indexing, and Google documents it as a control token rather than a crawler with its own HTTP user-agent string. AI Overviews are a Search feature, so they follow your Googlebot rules, not Google-Extended. You can allow Googlebot while blocking Google-Extended, or vice versa. Most sites benefit from allowing both, but the distinction gives you granular control if you want AI Overviews visibility without contributing to Gemini’s training data.
