DataCrawlPro

AI crawler robots.txt reference

A practical, source-linked reference for website owners reviewing AI crawler visibility. Last reviewed: June 26, 2026.

Safe guidance only

Treat robots.txt as a signal, not a private-data protection layer.
Review official crawler documentation before copying rules from old lists.
Use authentication for private data instead of relying on hidden URLs.
Monitor logs and crawler behavior when public data is commercially valuable.
Separate search visibility decisions from AI-training or model-grounding decisions where the crawler operator supports that distinction.
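
The log-monitoring point above can be sketched as a small tally of requests per crawler token. This is a minimal illustration, assuming the user-agent strings have already been extracted from your access logs. The `count_crawler_hits` helper and the sample strings are hypothetical, and the token list should be checked against each operator's current documentation.

```python
from collections import Counter

# Tokens taken from this reference; verify each one against official docs.
CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "CCBot", "PerplexityBot", "Applebot", "Meta-ExternalAgent",
]


def count_crawler_hits(user_agents):
    """Count requests whose user-agent string contains a known crawler token."""
    counts = Counter()
    for user_agent in user_agents:
        lowered = user_agent.lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # attribute each request to at most one token
    return counts


if __name__ == "__main__":
    # Made-up sample strings, not real crawler user agents.
    sample = [
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
        "CCBot/2.0",
        "Mozilla/5.0 Firefox/126.0",
    ]
    for token, hits in count_crawler_hits(sample).items():
        print(token, hits)
```

User-agent strings can be spoofed, so a tally like this shows what requests claim to be, not who actually sent them.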
Crawler names

AI crawler and robots.txt controls to review

Crawler names and policies change. Use these as starting points, then verify each official source before publishing new rules.

OpenAI crawlers

OpenAI documents separate crawlers for ChatGPT search surfaces, model-training collection, and user-triggered requests.

OAI-SearchBot, GPTBot, ChatGPT-User

Review OpenAI's current documentation before changing robots.txt because blocking search and training crawlers can have different visibility tradeoffs.
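
To see how search and training rules can be separated, the sketch below tests a draft robots.txt with Python's standard `urllib.robotparser` before anything is published. The rule set, the `is_allowed` helper, and the example.com URL are illustrative assumptions, not a recommended policy. The Python parser only approximates how a given crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft: opt out of training collection, keep search access.
DRAFT_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, token, url):
    """Return True if the draft rules let `token` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    url = "https://example.com/pricing"  # placeholder URL
    for token in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
        print(token, is_allowed(DRAFT_ROBOTS_TXT, token, url))
```

In this draft, ChatGPT-User falls through to the wildcard group. Whether a user-triggered agent follows robots.txt in the same way as an automated crawler is something to confirm in OpenAI's documentation.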

Google AI and crawler controls

Google documents crawler tokens and notes that Google-Extended is a control token for certain Gemini and Vertex AI uses, not a separate HTTP user agent.

Google-Extended, Googlebot, GoogleOther, Google-CloudVertexBot

Do not assume Google-Extended blocks Google Search crawling. Check Google's crawler documentation before changing search-critical rules.
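
The sketch below checks which tokens a draft rule set would block, showing that a group addressed to Google-Extended does not match Googlebot. The draft rules and the `blocked_tokens` helper are hypothetical. Because Google-Extended is a control token rather than an HTTP user agent, this only models robots.txt group matching, not real request traffic, and Python's parser matches tokens more loosely than a production crawler may.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft that addresses only the Google-Extended control token.
DRAFT_ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""

GOOGLE_TOKENS = [
    "Google-Extended", "Googlebot", "GoogleOther", "Google-CloudVertexBot",
]


def blocked_tokens(robots_txt, tokens, url):
    """Return the tokens that the draft rules would stop from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in tokens if not parser.can_fetch(token, url)]


if __name__ == "__main__":
    # example.com is a placeholder domain.
    print(blocked_tokens(DRAFT_ROBOTS_TXT, GOOGLE_TOKENS, "https://example.com/"))
```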

Anthropic Claude crawlers

Anthropic documents separate robots for model-development crawling, user-directed retrieval, and search-result quality.

ClaudeBot, Claude-User, Claude-SearchBot

Blocking the user-directed or search bots may reduce Claude's ability to retrieve or surface public pages in user workflows.

Common Crawl

Common Crawl documents CCBot as its crawler for building public web crawl datasets.

CCBot

Use the official CCBot reference and verification notes if you need to distinguish real Common Crawl requests from spoofed user agents.
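
One common way to separate genuine crawler requests from spoofed user agents is to compare the client IP with ranges the operator publishes. The sketch below assumes such ranges are available. The values used are RFC 5737 documentation placeholders, not Common Crawl's real addresses, and the helper names are hypothetical.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real crawler ranges.
# Replace with whatever the operator's official reference currently lists.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def ip_in_published_ranges(ip, ranges=PUBLISHED_RANGES):
    """Return True if `ip` falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in ranges)


def looks_spoofed(user_agent, ip, token="CCBot"):
    """Flag requests that claim the token but come from outside the ranges."""
    claims_token = token.lower() in user_agent.lower()
    return claims_token and not ip_in_published_ranges(ip)


if __name__ == "__main__":
    print(looks_spoofed("CCBot/2.0", "192.0.2.10"))
    print(looks_spoofed("CCBot/2.0", "203.0.113.5"))
```

Published ranges change over time, so a check like this is only as current as the list it is given.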

Perplexity crawlers

Perplexity documents PerplexityBot as a crawler for surfacing and linking websites in Perplexity search results.

PerplexityBot

Review the official crawler page and current IP guidance before making allow or disallow decisions.

Apple crawler controls

Apple documents Applebot for search-related crawling and Applebot-Extended as an additional control for how content may be used by Apple.

Applebot, Applebot-Extended

Use Apple's own support page for current details because Applebot and Applebot-Extended can affect different purposes.

Meta crawlers

Meta documents crawler behavior for link previews and web crawling use cases, including Meta-ExternalAgent.

facebookexternalhit, Meta-ExternalAgent

Check Meta's webmaster documentation before making crawler rules because preview, indexing, and AI-related use cases may differ.

FAQ

Crawler control questions

Use official documentation and practical controls together; do not treat crawler policy as security.

Does robots.txt guarantee AI crawlers will stay away?

No. robots.txt is a public preference file for cooperative crawlers. It is useful, but private data should still be protected with authentication, access control, monitoring, and server-side controls.
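
The difference between a preference file and an access control can be sketched in a few lines. In this hypothetical example, the server decides the response from the request path and the authentication state alone, so the outcome is the same whether or not a client honors robots.txt. The path prefixes and the `response_status` helper are assumptions for illustration.

```python
# Hypothetical private areas of a site.
PRIVATE_PREFIXES = ("/account/", "/admin/")


def response_status(path, authenticated):
    """Return the HTTP status the server would send for this request."""
    if path.startswith(PRIVATE_PREFIXES) and not authenticated:
        return 401  # enforced server-side, regardless of user agent
    return 200


if __name__ == "__main__":
    print(response_status("/admin/users", authenticated=False))
    print(response_status("/pricing", authenticated=False))
```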

Should I block every AI crawler?

Not automatically. Some crawler controls affect search visibility, user-triggered retrieval, model training, or product-specific features differently. Review each official source before changing rules.

Can DataCrawlPro review my AI crawler exposure?

Yes. DataCrawlPro can review public crawler visibility, robots.txt signals, llms.txt clarity, public structured data, and practical exposure controls.

Related services

Service pages connected to this resource

These pages explain how DataCrawlPro scopes public or authorized data extraction, Python scripts, scraping exposure audits, pricing, and contact review.

Link-worthy resources

More public resources to cite or share

These resources are designed to be useful on their own: calculators, checklists, glossary entries, crawler references, and sample audit material.

Each of the following is a public DataCrawlPro resource for planning, evaluation, responsible-use review, or website-owner education.

Web Scraping Cost Calculator
Website Scraping Risk Checklist
AI Crawler robots.txt Reference
Public Data Exposure Glossary
Web Scraping Comparison Guides
Sample Website Scraping Risk Audit Report