Using robots.txt and Meta Tags for AI: A Simple First Step Toward Ethical Data Use
AI models today are often trained on vast datasets scraped from the open web, including blogs, code, art, music, and written works. This content is used without the creators’ knowledge, consent, or compensation.
While the technology advances rapidly, ethical frameworks and consent mechanisms have lagged far behind. There is, however, a simple, low-friction place to start: adapting the already well-known and widely respected robots.txt standard for AI training use.
This paper proposes a new convention for signaling whether content may be used for training AI systems. It builds on existing infrastructure to establish clear, public boundaries.
The Problem
- Creators’ content is being used to train AI without permission
- AI tools are beginning to displace those same creators in the market
- There is currently no standardized mechanism to opt out of training datasets
- Legal protections are slow, unclear, or nonexistent
- Trust is eroding
The Precedent: robots.txt
Since the 1990s, websites have used the robots.txt file to instruct search engines and crawlers on how to interact with their content. For example:
User-agent: *
Disallow: /private/
Most major search engines (Google, Bing, etc.) respect this file. While it’s not enforceable in court, it forms a widely respected social and technical norm.
Proposal: AI-Respecting robots.txt
Let's extend the robots.txt model to include AI-focused directives. For example:
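User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Here OpenAI's GPTBot and Google's Google-Extended are told not to use any of the site's content for training; a table of known AI user-agents and a complete sample robots.txt appear later in this paper.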
Why This Matters
Using robots.txt and meta tags is a conversation starter. It’s a small step on the way to a better system.
- Creators get a voice without sweeping regulation, within a simple, pre-existing framework
- AI companies can demonstrate good faith by honoring boundaries using a system that's already accepted
- It creates a timestamped record of intent that is potentially useful in future legal or ethical disputes
- It helps establish a social contract between creators and AI developers
- It can evolve into more enforceable frameworks or licensing protocols
What Comes Next
- Encourage platforms (WordPress, Ghost, Medium) to support automated robots.txt or meta tag AI directives
- Ask AI companies to publicly commit to respecting opt-outs
- Build registries or tools for creators to easily manage their AI access preferences
- Explore legislation that recognizes robots.txt and meta tag signals as valid content boundaries
Call to Action
Creators: Start adding AI robots.txt directives and <meta> tags to your pages and web content. Extra lines in robots.txt and custom <meta> tags won't break anything; agents that don't recognize them will simply ignore them.
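For instance, a page opting out of training could carry the following anywhere in its <head> (the ai-train name is the convention proposed in this paper, not a ratified standard, so treat it as a sketch):

<meta name="ai-train" content="no">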
Developers: Help build tooling to make this easy and visible. For example, update existing plugins, SEO tools, and site generators to emit these meta tags and robots.txt directives.
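As a rough sketch of what such tooling could look like, the PHP snippet below fetches a site's robots.txt and reports which of the AI user-agents from the table above are blocked outright. The function name, the short agent list, and the naive adjacent-line matching are all illustrative, not a robust robots.txt parser:

function list_blocked_ai_agents( string $site_url ): array {
    // Fetch the site's robots.txt (requires allow_url_fopen; a real tool
    // would use a proper HTTP client and handle errors explicitly).
    $robots = @file_get_contents( rtrim( $site_url, '/' ) . '/robots.txt' );
    if ( false === $robots ) {
        return array(); // robots.txt unreachable
    }
    // A few agents from the table above; extend as new bots appear.
    $ai_agents = array( 'GPTBot', 'CCBot', 'AnthropicBot', 'Google-Extended' );
    $blocked   = array();
    foreach ( $ai_agents as $agent ) {
        // Naive check: a "User-agent: X" line immediately followed by
        // "Disallow: /". Real robots.txt groups can be more complex.
        $pattern = '/^User-agent:[ \t]*' . preg_quote( $agent, '/' )
            . '[ \t]*\R+Disallow:[ \t]*\/[ \t]*$/mi';
        if ( preg_match( $pattern, $robots ) ) {
            $blocked[] = $agent;
        }
    }
    return $blocked;
}

// Example usage: see which AI bots a site opts out of.
// print_r( list_blocked_ai_agents( 'https://example.com' ) );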
AI Companies: Respect the signals. Seek consent. Compensate when appropriate. This will improve your reputation and goodwill with the people you depend on.
Policymakers: Recognize that the consent of creators matters and that simple technical tools can help codify it. Help protect small creators against the power of large corporations.
Known AI Training Bots and Their User-Agents
Bot Name | User-Agent | Organization | Notes |
---|---|---|---|
GPTBot | GPTBot | OpenAI | Used for ChatGPT data collection |
CCBot | CCBot | Common Crawl | Used by many AI orgs for training |
AnthropicBot | AnthropicBot | Anthropic | Claude's training data collector |
Google-Extended | Google-Extended | Google | Used to control Bard/Gemini access |
ClaudeBot | ClaudeBot | Anthropic | May be used in future; unclear naming consistency |
AnyBot | Anybot / AnyaiBot | AnyAI (startup) | Data access for smaller AI orgs |
MistralBot | MistralBot | Mistral AI | European AI startup |
YouBot | YouBot | You.com | AI-powered search engine |
MetaBot | MetaAI or facebookexternalhit | Meta | Used for LLaMA / image-text scraping |
PerplexityBot | PerplexityBot | Perplexity AI | Crawler used for answer engine |
NeevaBot | NeevaBot | Neeva (acquired by Snowflake) | Previously search/AI training |
PhindBot | PhindBot | Phind | Developer-oriented search AI |
KomoBot | KomoBot | Komo AI | Search + generative AI |
YandexGPT | YandexGPTBot | Yandex | GPT-style Russian model training |
DeepSeekBot | DeepSeekBot | DeepSeek | New AI research org |
AmazonBot | Amazonbot | Amazon | May support internal AI work |
Sample robots.txt for AI Training Crawler Control
# Block AI training bots by default
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: AnthropicBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: MetaAI
Disallow: /
User-agent: facebookexternalhit
Disallow: /
User-agent: Anybot
Disallow: /
User-agent: AnyaiBot
Disallow: /
User-agent: MistralBot
Disallow: /
User-agent: YouBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: NeevaBot
Disallow: /
User-agent: PhindBot
Disallow: /
User-agent: KomoBot
Disallow: /
User-agent: YandexGPTBot
Disallow: /
User-agent: DeepSeekBot
Disallow: /
User-agent: Amazonbot
Disallow: /
# Note: the robots.txt standard does not support wildcards in the
# User-agent field, so catch-all patterns like "*AI*" or "*bot" will not
# match anything. The only true catch-all is "User-agent: *", which would
# also block search engines; unknown AI bots must be blocked by name as
# they are identified.
# Optional: explicitly allow major search engine crawlers
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: DuckDuckBot
Disallow:
User-agent: Applebot
Disallow:
WordPress Function to Add a Blanket ai-train Meta Tag
Here’s a simple WordPress-compatible function you can add to your theme’s functions.php file to inject a custom meta tag into the <head> of every page or post on your site. You can make it more nuanced with per-post rules using post meta (a sketch of that variant follows below), but for simplicity, this version is a site-wide default “no”.
function add_ai_train_meta_tag() {
    // Emit a site-wide "do not train" signal in the <head> of every page.
    echo '<meta name="ai-train" content="no">' . "\n";
}
add_action( 'wp_head', 'add_ai_train_meta_tag' );
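And here is a sketch of the per-post variant mentioned above. It assumes a hypothetical custom field named ai_train that editors can set to "yes" or "no" on individual posts; anything else falls back to the site-wide "no". Use it in place of the function above, not alongside it, or the page will emit two tags:

function add_ai_train_meta_tag_per_post() {
    $value = 'no'; // site-wide default
    if ( is_singular() ) {
        // "ai_train" is a hypothetical custom field; set it to "yes" or
        // "no" on a post to override the site-wide default.
        $override = get_post_meta( get_the_ID(), 'ai_train', true );
        if ( 'yes' === $override || 'no' === $override ) {
            $value = $override;
        }
    }
    echo '<meta name="ai-train" content="' . esc_attr( $value ) . '">' . "\n";
}
add_action( 'wp_head', 'add_ai_train_meta_tag_per_post' );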