Using robots.txt and Meta Tags for AI: A Simple First Step Toward Ethical Data Use
AI models today are often trained on vast datasets scraped from the open web, including blogs, code, art, music, and written works. This content is used without the creators’ knowledge, consent, or compensation.
While the technology advances rapidly, ethical frameworks and consent mechanisms have lagged far behind. There is, however, a simple, low-friction place to start: adapting the already well-known and widely respected robots.txt standard for AI training use.
This paper proposes a new convention for signaling whether content may be used for training AI systems. It builds on existing infrastructure to establish clear, public boundaries.
The Problem
- Creators’ content is being used to train AI without permission
- AI tools are beginning to displace those same creators in the market
- There is currently no standardized mechanism to opt out of training datasets
- Legal protections are slow, unclear, or nonexistent
- Trust is eroding
The Precedent: robots.txt
Since the 1990s, websites have used the robots.txt file to instruct search engines and crawlers on how to interact with their content. For example:
User-agent: *
Disallow: /private/
Most major search engines (Google, Bing, etc.) respect this file. While it’s not enforceable in court, it forms a widely respected social and technical norm.
Proposal: AI-Respecting robots.txt
Let's extend the robots.txt model to include AI-focused directives. For example:
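User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Here OpenAI's GPTBot and Google's Google-Extended are told not to use any of the site's content for training; a table of known AI user-agents and a complete sample robots.txt appear later in this paper.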
Why This Matters
Using robots.txt and meta tags is a conversation starter. It’s a small step on the way to a better system.
- Creators get a voice without sweeping regulation, within a simple, pre-existing framework
- AI companies can demonstrate good faith by honoring boundaries using a system that's already accepted
- It creates a timestamped record of intent that is potentially useful in future legal or ethical disputes
- It helps establish a social contract between creators and AI developers
- It can evolve into more enforceable frameworks or licensing protocols
What Comes Next
- Encourage platforms (WordPress, Ghost, Medium) to support automated robots.txt or meta tag AI directives
- Ask AI companies to publicly commit to respecting opt-outs
- Build registries or tools for creators to easily manage their AI access preferences
- Explore legislation that recognizes robots.txt and meta tag signals as valid content boundaries
Call to Action
Creators: Start adding AI robots.txt directives and <meta> tags to your pages and web content. Extra lines in robots.txt and custom <meta> tags won't break anything; agents that don't recognize them will simply ignore them.
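For instance, a page opting out of training could carry the following anywhere in its <head> (the ai-train name is the convention proposed in this paper, not a ratified standard, so treat it as a sketch):

<meta name="ai-train" content="no">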
Developers: Help build tooling to make this easy and visible. For example, update existing plugins, SEO tools, and site generators to emit these meta tags and robots.txt directives.
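As a rough sketch of what such tooling could look like, the PHP snippet below fetches a site's robots.txt and reports which of the AI user-agents from the table above are blocked outright. The function name, the short agent list, and the naive adjacent-line matching are all illustrative, not a robust robots.txt parser:

function list_blocked_ai_agents( string $site_url ): array {
    // Fetch the site's robots.txt (requires allow_url_fopen; a real tool
    // would use a proper HTTP client and handle errors explicitly).
    $robots = @file_get_contents( rtrim( $site_url, '/' ) . '/robots.txt' );
    if ( false === $robots ) {
        return array(); // robots.txt unreachable
    }
    // A few agents from the table above; extend as new bots appear.
    $ai_agents = array( 'GPTBot', 'CCBot', 'AnthropicBot', 'Google-Extended' );
    $blocked   = array();
    foreach ( $ai_agents as $agent ) {
        // Naive check: a "User-agent: X" line immediately followed by
        // "Disallow: /". Real robots.txt groups can be more complex.
        $pattern = '/^User-agent:[ \t]*' . preg_quote( $agent, '/' )
            . '[ \t]*\R+Disallow:[ \t]*\/[ \t]*$/mi';
        if ( preg_match( $pattern, $robots ) ) {
            $blocked[] = $agent;
        }
    }
    return $blocked;
}

// Example usage: see which AI bots a site opts out of.
// print_r( list_blocked_ai_agents( 'https://example.com' ) );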
AI Companies: Respect the signals. Seek consent. Compensate when appropriate. This will improve your reputation and goodwill with the people you depend on.
Policymakers: Recognize that the consent of creators matters and that simple technical tools can help codify it. Help protect small creators against the power of large corporations.
Known AI Training Bots and Their User-Agents
Bot Name | User-Agent | Organization | Notes |
---|---|---|---|
GPTBot | GPTBot | OpenAI | Used for ChatGPT data collection |
CCBot | CCBot | Common Crawl | Used by many AI orgs for training |
AnthropicBot | AnthropicBot | Anthropic | Claude's training data collector |
Google-Extended | Google-Extended | Google | Used to control Bard/Gemini access |
ClaudeBot | ClaudeBot | Anthropic | May be used in future; unclear naming consistency |
AnyBot | Anybot / AnyaiBot | AnyAI (startup) | Data access for smaller AI orgs |
MistralBot | MistralBot | Mistral AI | European AI startup |
YouBot | YouBot | You.com | AI-powered search engine |
MetaBot | MetaAI or facebookexternalhit | Meta | Used for LLaMA / image-text scraping |
PerplexityBot | PerplexityBot | Perplexity AI | Crawler used for answer engine |
NeevaBot | NeevaBot | Neeva (acquired by Snowflake) | Previously search/AI training |
PhindBot | PhindBot | Phind | Developer-oriented search AI |
KomoBot | KomoBot | Komo AI | Search + generative AI |
YandexGPT | YandexGPTBot | Yandex | GPT-style Russian model training |
DeepSeekBot | DeepSeekBot | DeepSeek | New AI research org |
AmazonBot | Amazonbot | Amazon | May support internal AI work |
Sample robots.txt for AI Training Crawler Control
# Block AI training bots by default
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: AnthropicBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: MetaAI
Disallow: /
User-agent: facebookexternalhit
Disallow: /
User-agent: Anybot
Disallow: /
User-agent: AnyaiBot
Disallow: /
User-agent: MistralBot
Disallow: /
User-agent: YouBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: NeevaBot
Disallow: /
User-agent: PhindBot
Disallow: /
User-agent: KomoBot
Disallow: /
User-agent: YandexGPTBot
Disallow: /
User-agent: DeepSeekBot
Disallow: /
User-agent: Amazonbot
Disallow: /
# Note: the robots.txt standard does not support wildcards in the
# User-agent field, so catch-all patterns like "*AI*" or "*bot" will not
# match anything. The only true catch-all is "User-agent: *", which would
# also block search engines; unknown AI bots must be blocked by name as
# they are identified.
# Optional: explicitly allow major search engine crawlers
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: DuckDuckBot
Disallow:
User-agent: Applebot
Disallow:
WordPress Function to Add a Blanket ai-train Meta Tag
Here’s a simple WordPress-compatible function you can add to your theme’s functions.php file to inject a custom meta tag into the <head> of every page or post on your site. You can make it more nuanced with per-post rules using post meta (a sketch of that variant follows below), but for simplicity, this version is a site-wide default “no”.
function add_ai_train_meta_tag() {
    // Emit a site-wide "do not train" signal in the <head> of every page.
    echo '<meta name="ai-train" content="no">' . "\n";
}
add_action( 'wp_head', 'add_ai_train_meta_tag' );
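And here is a sketch of the per-post variant mentioned above. It assumes a hypothetical custom field named ai_train that editors can set to "yes" or "no" on individual posts; anything else falls back to the site-wide "no". Use it in place of the function above, not alongside it, or the page will emit two tags:

function add_ai_train_meta_tag_per_post() {
    $value = 'no'; // site-wide default
    if ( is_singular() ) {
        // "ai_train" is a hypothetical custom field; set it to "yes" or
        // "no" on a post to override the site-wide default.
        $override = get_post_meta( get_the_ID(), 'ai_train', true );
        if ( 'yes' === $override || 'no' === $override ) {
            $value = $override;
        }
    }
    echo '<meta name="ai-train" content="' . esc_attr( $value ) . '">' . "\n";
}
add_action( 'wp_head', 'add_ai_train_meta_tag_per_post' );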