Every day, billions of web pages are crawled not by search engines looking to index your content, but by AI companies looking to train their large language models. The content you spend hours crafting — your blog posts, product descriptions, research papers, creative writing — is being harvested and used to train AI models by companies like OpenAI, Anthropic, Google, Meta, and ByteDance. The worst part? Most website owners don't even know it's happening.
The good news is that there's a simple, free, and universally recognized way to tell these bots to stay out: your robots.txt file. In this comprehensive guide, we'll show you exactly which AI bots are crawling the web in 2026, how to identify them in your server logs, and how to write the precise robots.txt rules needed to block every single one of them from accessing your website.
Block AI Bots Instantly
Don't want to write robots.txt rules by hand? Our generator blocks 24+ AI bots with a single toggle.
Open Robots.txt Generator →

The AI Scraping Problem in 2026
The scale of AI scraping has reached unprecedented levels. According to industry reports published in early 2026, the total volume of web crawling attributable to AI training bots now exceeds the combined crawl volume of all traditional search engines. This represents a fundamental shift in how the web is being consumed — your content is no longer being crawled primarily to help users find it, but to feed machine learning pipelines.
The implications are significant for content creators, publishers, and businesses of all sizes. When your content is used to train an AI model, that model can then generate derivative content that competes directly with your original work. A blog post you wrote about kitchen renovations might train a model that then generates thousands of similar articles for competitors. Your carefully researched product descriptions might be synthesized into competing listings. Your unique voice and brand personality become raw material for a machine that can replicate it at scale.
This isn't hypothetical — it's already happening. Multiple lawsuits filed in 2024 and 2025 by publishers, authors, and content creators against AI companies have highlighted the massive scope of unauthorized content harvesting. While the legal landscape continues to evolve, the most immediate protection available to every website owner is the robots.txt file.
Complete List of AI Bots to Block in 2026
The AI scraping landscape changes rapidly as new companies launch crawlers and existing ones rebrand or expand their bot networks. Here is the most comprehensive, up-to-date list of AI training bots as of March 2026:
| User-agent | Company | Purpose | Respects robots.txt? |
|---|---|---|---|
| GPTBot | OpenAI | Training GPT models | Yes |
| ChatGPT-User | OpenAI | Browse mode in ChatGPT | Yes |
| OAI-SearchBot | OpenAI | SearchGPT web search | Yes |
| ClaudeBot | Anthropic | Training Claude models | Yes |
| anthropic-ai | Anthropic | Legacy Claude crawler | Yes |
| Google-Extended | Google | Training Gemini (not Search indexing) | Yes |
| CCBot | Common Crawl | Open training corpus | Yes |
| Bytespider | ByteDance | Training Doubao/TikTok AI | Inconsistently (widely reported to ignore it) |
| FacebookBot | Meta | Training LLaMA models | Yes |
| Meta-ExternalAgent | Meta | AI training collection | Yes |
| PerplexityBot | Perplexity AI | AI search crawling | Yes |
| cohere-ai | Cohere | Training language models | Yes |
| Diffbot | Diffbot | Knowledge graph extraction | Yes |
| Applebot-Extended | Apple | Training Apple Intelligence | Yes |
| Amazonbot | Amazon | Training Alexa/AI models | Yes |
| YouBot | You.com | AI search crawling | Yes |
Google-Extended does NOT affect Google Search indexing. Google-Extended controls only whether your content can be used to train Google's Gemini AI models. Googlebot — the search indexer — is a completely separate user-agent.
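To make the distinction concrete, here is a minimal robots.txt sketch that blocks Gemini training while leaving Google Search crawling untouched:

```
# Googlebot (Search indexing) — unaffected, allowed as normal
User-agent: Googlebot
Allow: /

# Google-Extended (Gemini AI training) — blocked
User-agent: Google-Extended
Disallow: /
```

The two user-agents are evaluated independently, so blocking one has no effect on the other.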
How to Write the Block Rules
Adding AI bot protection to your robots.txt is straightforward. Each bot requires its own User-agent and Disallow block. While you can technically group some bots together, the safest approach is to give each bot its own dedicated block to ensure maximum compatibility across different crawler implementations.
Here is the copy-paste robots.txt snippet that blocks all major AI training bots. Add this to your existing robots.txt file, after your standard search engine rules:
# ==========================================
# AI SCRAPER PROTECTION (Updated March 2026)
# ==========================================
# OpenAI
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
# Anthropic
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
# Google AI Training (does NOT affect Search)
User-agent: Google-Extended
Disallow: /
# Common Crawl
User-agent: CCBot
Disallow: /
# ByteDance / TikTok
User-agent: Bytespider
Disallow: /
# Meta / Facebook
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# Perplexity
User-agent: PerplexityBot
Disallow: /
# Others
User-agent: cohere-ai
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: YouBot
Disallow: /

Selective Blocking vs. Full Blocking
Not everyone wants to block all AI bots entirely. Some website owners prefer a more nuanced approach, allowing certain AI services like SearchGPT or Perplexity to access their content (since those services drive traffic) while blocking pure training bots. Here are the three main approaches:
| Approach | What It Does | Best For |
|---|---|---|
| Full Block | Blocks all AI bots from every page | Publishers, content creators, agencies with original content |
| Selective Block | Blocks training bots but allows AI search bots | Businesses wanting traffic from AI search engines |
| Partial Allow | Blocks bots from premium content but allows free content | Freemium publishers and SaaS companies |
For selective blocking, you would simply omit the User-agent blocks for bots you want to allow. For example, if you want Perplexity AI to be able to cite your content in search results (which drives referral traffic), remove the PerplexityBot block from your robots.txt.
For partial allowing, use path-based rules instead of blanket Disallow: /. For instance, you might allow GPTBot to access your blog (Allow: /blog/) but block it from your product pages (Disallow: /products/).
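Putting that together, a partial-allow configuration for GPTBot might look like the sketch below. The /blog/ and /products/ paths are illustrative; substitute your own site structure. Under the Robots Exclusion Protocol (RFC 9309), the longest matching rule wins, so Allow: /blog/ takes precedence over Disallow: / for blog URLs:

```
# GPTBot may crawl the blog, but nothing else
User-agent: GPTBot
Allow: /blog/
Disallow: /
```

Test path-based rules with a robots.txt checker before deploying, since precedence handling can differ between older crawler implementations.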
How to Detect AI Bots in Your Server Logs
Before implementing blocks, it's helpful to understand which AI bots are actually visiting your site and how much bandwidth they're consuming. You can analyze your server access logs to identify bot traffic patterns and prioritize which bots to block first.
On an Apache or Nginx server, access logs typically contain the full User-Agent string for each request. Here's how to search for AI bot activity in common log formats. For Apache logs on Linux, you can use:
# Search for GPTBot in Apache access logs
grep -i "gptbot\|claudebot\|ccbot\|bytespider" /var/log/apache2/access.log
# Count requests per AI bot
awk -F'"' '{print $6}' /var/log/apache2/access.log | grep -io "gptbot\|claudebot\|ccbot\|bytespider\|facebookbot" | sort | uniq -c | sort -rn

If you're running a cloud-hosted site on platforms like Cloudflare, Vercel, or Netlify, you can typically access bot analytics through their dashboard. Cloudflare's Bot Analytics, for example, provides a dedicated view showing AI bot traffic categorized by the type of bot and the pages they access most frequently.
Many site owners are surprised to discover that AI bots account for a significant percentage of their total server traffic. In some cases, a single aggressive AI scraper like Bytespider can generate more requests than Googlebot. This not only wastes bandwidth but can also slow down your site for real users if your server is resource-constrained.
Beyond robots.txt: Additional Protection Layers
While robots.txt is the standard first line of defense, it's important to understand its limitations. The Robots Exclusion Protocol is purely voluntary — it relies on crawlers choosing to respect your directives. Here are additional measures you can implement alongside robots.txt for comprehensive AI protection:
HTTP Header Controls: Some AI companies also check HTTP response headers. You can add headers like X-Robots-Tag: noai, noimageai to signal that your content should not be used for AI training. While this standard is still emerging, several major AI companies have indicated they will respect these headers.
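As a sketch, here is how you might send that header from Apache using mod_headers (for nginx, the equivalent is an add_header directive). Note that noai and noimageai are emerging, non-standardized signals, so crawler support varies:

```
# .htaccess — advertise a no-AI-training preference via response headers.
# "noai, noimageai" is an emerging convention, not a formal standard.
<IfModule mod_headers.c>
  Header set X-Robots-Tag "noai, noimageai"
</IfModule>
```

Treat this as a supplementary signal alongside robots.txt, not a replacement for it.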
Server-Level Blocking: If you suspect a bot is ignoring your robots.txt, you can block its IP ranges at the server level. OpenAI, Google, and other major companies publish the IP ranges used by their bots. Adding these to your server's firewall or .htaccess file provides hard blocking that cannot be bypassed by ignoring robots.txt.
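On Apache 2.4, an IP-range block can be expressed in .htaccess with Require directives. The CIDR ranges below are RFC 5737 documentation placeholders — look up the ranges each company currently publishes (e.g. OpenAI's GPTBot documentation) before using this:

```
# .htaccess — hard-block crawler IP ranges at the server level.
# Replace the placeholder ranges with the bot operator's published CIDRs.
<RequireAll>
  Require all granted
  Require not ip 192.0.2.0/24
  Require not ip 198.51.100.0/24
</RequireAll>
```

Published IP ranges change over time, so re-check them periodically or automate the lookup.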
Cloudflare Bot Management: If your site is behind Cloudflare, you can create custom firewall rules that challenge or block requests from known AI bot user-agents. This provides server-level enforcement without modifying your web server configuration directly.
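A Cloudflare custom rule for this might use an expression along the following lines (a sketch in Cloudflare's Rules language, with the action set to Block or Managed Challenge; extend the list with any user-agents you care about):

```
(http.user_agent contains "GPTBot")
or (http.user_agent contains "ClaudeBot")
or (http.user_agent contains "Bytespider")
```

Because this matches the User-Agent header, it stops bots that ignore robots.txt but can be evaded by agents that spoof their identity — pair it with IP verification for stronger enforcement.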
Rate Limiting: For bots that you allow partial access to, implement rate limiting to prevent them from overwhelming your server. Nginx and Apache both support rate limiting based on User-Agent strings, allowing you to throttle AI bots while keeping search engine bots unthrottled.
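In nginx, one common pattern (a sketch; directives go in the http context, and the user-agent patterns are illustrative) is to map matching user-agents to a rate-limit key. Requests with an empty key bypass the limit, so normal visitors and search engine bots are unaffected:

```
# Map AI-bot user-agents to a per-IP key; everyone else gets an empty key.
map $http_user_agent $ai_bot_limit_key {
    default          "";
    ~*gptbot         $binary_remote_addr;
    ~*claudebot      $binary_remote_addr;
    ~*bytespider     $binary_remote_addr;
}

# Empty keys are not rate-limited; matched bots get 1 request/second.
limit_req_zone $ai_bot_limit_key zone=aibots:10m rate=1r/s;

server {
    location / {
        limit_req zone=aibots burst=5;
        # ... normal site configuration ...
    }
}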
Meta Tags: For page-level control, you can add <meta name="robots" content="noai, noimageai"> to individual pages. This emerging standard provides granular, page-by-page control over AI training data collection.
Platform-Specific Implementation
The method for editing your robots.txt file varies by platform. Here's how to add AI bot protection on the most popular platforms:
WordPress
If you're using an SEO plugin like Yoast or Rank Math, navigate to the plugin's settings to find the robots.txt editor. If you're editing the file directly, use FTP/SFTP to access your site root and edit or create the robots.txt file. Many managed WordPress hosts also provide a file manager in their control panel.
Shopify
As of 2026, Shopify allows merchants to customize their robots.txt file through the robots.txt.liquid template in their theme files. Navigate to Online Store → Themes → Edit Code and look for the robots.txt.liquid template to add your AI bot rules.
Static Sites (Netlify, Vercel, GitHub Pages)
For static sites, simply create or edit the robots.txt file in your project's public or root directory. On Netlify, place it in the /public/ folder. On Vercel, place it in the /public/ folder. On GitHub Pages, place it in the repository root.
Squarespace and Wix
These platforms have traditionally restricted robots.txt customization. Check your platform's current documentation for the latest options, as both have been expanding developer controls in response to the AI scraping concerns raised by their users.
One-Click AI Bot Protection
Our free Robots.txt Generator includes a pre-built list of 24+ AI bots. Just flip the toggle and download your file.
Generate Protected robots.txt →

Frequently Asked Questions
Does blocking AI bots affect my Google rankings?
No. Google-Extended only controls whether your content is used to train Gemini models — it has no effect on Googlebot, Google Search indexing, or your rankings.
What is GPTBot and should I block it?
GPTBot is OpenAI's crawler for collecting training data for GPT models. If you don't want your content used for that purpose, block it by adding User-agent: GPTBot followed by Disallow: / to your robots.txt. OpenAI has officially stated that GPTBot respects robots.txt directives.
How many AI bots should I block?
It depends on your goals. For maximum protection, block every bot in the table above (full block). If you benefit from referral traffic, consider a selective block that allows AI search bots like OAI-SearchBot and PerplexityBot while blocking pure training crawlers.
Will blocking AI bots slow down my website?
No. robots.txt is a small static text file that crawlers fetch once; it adds no overhead for human visitors. If anything, blocking aggressive AI crawlers reduces server load and can improve performance.
Can AI bots ignore my robots.txt?
Yes. The Robots Exclusion Protocol is voluntary, so a misbehaving bot can simply ignore your directives. Major companies state that their bots comply, but for those that don't, use the server-level blocking, firewall rules, or bot-management options described above.
Related Resources
- Block GPTBot with robots.txt — Related reading
- Automated Link Checking in 2026 — Related reading
- Automated Feed Crawling and Discovery Optimization — Related reading
- Robots.txt Syntax Explained — Master every directive and wildcard
- Robots.txt vs. Meta Robots Tag — Understanding when to use each method
- Crawl Budget Optimization Guide — Reduce wasted server resources
- Free Robots.txt Generator — Generate AI-protected robots.txt in seconds