How to Block AI Bots with robots.txt in 2026

Your content is being scraped by AI training bots right now. Here's the definitive guide to stopping GPTBot, ClaudeBot, CCBot, and 20+ other AI crawlers using a properly configured robots.txt file.

Updated March 2026 · 13 min read

Every day, billions of web pages are crawled not by search engines looking to index your content, but by AI companies looking to train their large language models. The content you spend hours crafting — your blog posts, product descriptions, research papers, creative writing — is being harvested and used to train AI models by companies like OpenAI, Anthropic, Google, Meta, and ByteDance. The worst part? Most website owners don't even know it's happening.

The good news is that there's a simple, free, and universally recognized way to tell these bots to stay out: your robots.txt file. In this comprehensive guide, we'll show you exactly which AI bots are crawling the web in 2026, how to identify them in your server logs, and how to write the precise robots.txt rules needed to block every single one of them from accessing your website.

Block AI Bots Instantly

Don't want to write robots.txt rules by hand? Our generator blocks 24+ AI bots with a single toggle.

Open Robots.txt Generator →

The AI Scraping Problem in 2026

The scale of AI scraping has reached unprecedented levels. According to industry reports published in early 2026, the total volume of web crawling attributable to AI training bots now exceeds the combined crawl volume of all traditional search engines. This represents a fundamental shift in how the web is being consumed — your content is no longer being crawled primarily to help users find it, but to feed machine learning pipelines.

The implications are significant for content creators, publishers, and businesses of all sizes. When your content is used to train an AI model, that model can then generate derivative content that competes directly with your original work. A blog post you wrote about kitchen renovations might train a model that then generates thousands of similar articles for competitors. Your carefully researched product descriptions might be synthesized into competing listings. Your unique voice and brand personality become raw material for a machine that can replicate it at scale.

This isn't hypothetical — it's already happening. Multiple lawsuits filed in 2024 and 2025 by publishers, authors, and content creators against AI companies have highlighted the massive scope of unauthorized content harvesting. While the legal landscape continues to evolve, the most immediate protection available to every website owner is the robots.txt file.

Complete List of AI Bots to Block in 2026

The AI scraping landscape changes rapidly as new companies launch crawlers and existing ones rebrand or expand their bot networks. The table below lists the major AI crawler user-agents as of March 2026, covering both training crawlers and the agents behind AI search and browsing features:

| User-agent | Company | Purpose | Respects robots.txt? |
|---|---|---|---|
| GPTBot | OpenAI | Training GPT models | Yes |
| ChatGPT-User | OpenAI | Browse mode in ChatGPT | Yes |
| OAI-SearchBot | OpenAI | SearchGPT web search | Yes |
| ClaudeBot | Anthropic | Training Claude models | Yes |
| anthropic-ai | Anthropic | Legacy Claude crawler | Yes |
| Google-Extended | Google | Training Gemini (not Search indexing) | Yes |
| CCBot | Common Crawl | Open training corpus | Yes |
| Bytespider | ByteDance | Training Doubao/TikTok AI | Yes |
| FacebookBot | Meta | Training LLaMA models | Yes |
| Meta-ExternalAgent | Meta | AI training collection | Yes |
| PerplexityBot | Perplexity AI | AI search crawling | Yes |
| cohere-ai | Cohere | Training language models | Yes |
| Diffbot | Diffbot | Knowledge graph extraction | Yes |
| Applebot-Extended | Apple | Training Apple Intelligence | Yes |
| Amazonbot | Amazon | Training Alexa/AI models | Yes |
| YouBot | You.com | AI search crawling | Yes |

Important Distinction: Blocking Google-Extended does NOT affect Google Search indexing. Google-Extended controls only whether your content can be used to train Google's Gemini AI models. It is a robots.txt control token rather than a separate crawler, so it will not show up as its own user-agent in your server logs. Googlebot, the search indexer, is governed by its own, completely separate user-agent rules.

How to Write the Block Rules

Adding AI bot protection to your robots.txt is straightforward. Each rule group starts with one or more User-agent lines, followed by the Disallow rules that apply to them. The protocol allows several User-agent lines to share one group, but giving each bot its own dedicated block is the safest approach for compatibility across different crawler implementations.

Here is the copy-paste robots.txt snippet that blocks all major AI training bots. Add this to your existing robots.txt file, after your standard search engine rules:

    # ==========================================
    # AI SCRAPER PROTECTION (Updated March 2026)
    # ==========================================

    # OpenAI
    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /

    # Anthropic
    User-agent: ClaudeBot
    Disallow: /

    User-agent: anthropic-ai
    Disallow: /

    # Google AI Training (does NOT affect Search)
    User-agent: Google-Extended
    Disallow: /

    # Common Crawl
    User-agent: CCBot
    Disallow: /

    # ByteDance / TikTok
    User-agent: Bytespider
    Disallow: /

    # Meta / Facebook
    User-agent: FacebookBot
    Disallow: /

    User-agent: Meta-ExternalAgent
    Disallow: /

    # Perplexity
    User-agent: PerplexityBot
    Disallow: /

    # Others
    User-agent: cohere-ai
    Disallow: /

    User-agent: Diffbot
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

    User-agent: Amazonbot
    Disallow: /

    User-agent: YouBot
    Disallow: /
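
If you would rather generate these rules than maintain them by hand, the snippet above can be produced from a plain list of user-agents. The following is a minimal Python sketch, not part of any official tooling: the function name `build_ai_block_rules` and its `allow` parameter are hypothetical, and the bot list simply mirrors the table in this guide.

```python
# Hypothetical helper: builds the AI-bot section of a robots.txt file.
# The list mirrors the user-agents named in this guide.
AI_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "anthropic-ai",
    "Google-Extended", "CCBot", "Bytespider",
    "FacebookBot", "Meta-ExternalAgent", "PerplexityBot",
    "cohere-ai", "Diffbot", "Applebot-Extended", "Amazonbot", "YouBot",
]


def build_ai_block_rules(bots=AI_BOTS, allow=()):
    """Return robots.txt text with one 'Disallow: /' group per bot.

    Bots named in `allow` (matched case-insensitively) are left out,
    which means no rule is written for them and they stay unblocked.
    """
    allowed = {name.lower() for name in allow}
    groups = [
        f"User-agent: {bot}\nDisallow: /"
        for bot in bots
        if bot.lower() not in allowed
    ]
    # Blank lines between groups keep every bot in its own dedicated block.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Example: block everything in the list except PerplexityBot.
    print(build_ai_block_rules(allow=["PerplexityBot"]))
```

Keeping the list in one place makes it easier to add a new crawler later without retyping the surrounding blocks.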

Selective Blocking vs. Full Blocking

Not everyone wants to block all AI bots entirely. Some website owners prefer a more nuanced approach, allowing certain AI services like SearchGPT or Perplexity to access their content (since those services drive traffic) while blocking pure training bots. Here are the three main approaches:

| Approach | What It Does | Best For |
|---|---|---|
| Full Block | Blocks all AI bots from every page | Publishers, content creators, agencies with original content |
| Selective Block | Blocks training bots but allows AI search bots | Businesses wanting traffic from AI search engines |
| Partial Allow | Blocks bots from premium content but allows free content | Freemium publishers and SaaS companies |

For selective blocking, you would simply omit the User-agent blocks for bots you want to allow. For example, if you want Perplexity AI to be able to cite your content in search results (which drives referral traffic), remove the PerplexityBot block from your robots.txt.

For partial allowing, use path-based rules instead of blanket Disallow: /. For instance, you might allow GPTBot to access your blog (Allow: /blog/) but block it from your product pages (Disallow: /products/).
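
Path-based rules are easy to get subtly wrong, so it can help to test them before deploying. Here is a small sketch using Python's standard-library `urllib.robotparser`; the `RULES` text and the `can_fetch` wrapper are illustrative assumptions, not rules taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: GPTBot may read the blog but not product pages,
# while ClaudeBot is blocked everywhere.
RULES = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /products/

User-agent: ClaudeBot
Disallow: /
"""


def can_fetch(rules_text, user_agent, path):
    """Return True if `user_agent` may fetch `path` under `rules_text`."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("GPTBot", "/blog/post"),
        ("GPTBot", "/products/widget"),
        ("ClaudeBot", "/blog/post"),
        ("Googlebot", "/products/widget"),
    ]:
        print(agent, path, can_fetch(RULES, agent, path))
```

One caveat: Python's parser applies the first matching rule in a group, whereas the Robots Exclusion Protocol standard (RFC 9309) prefers the most specific match. For non-overlapping paths like these the result is the same, but overlapping Allow and Disallow paths may be interpreted differently by real crawlers.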

How to Detect AI Bots in Your Server Logs

Before implementing blocks, it's helpful to understand which AI bots are actually visiting your site and how much bandwidth they're consuming. You can analyze your server access logs to identify bot traffic patterns and prioritize which bots to block first.

On an Apache or Nginx server, access logs typically contain the full User-Agent string for each request. Here's how to search for AI bot activity in common log formats. For Apache logs on Linux, you can use:

    # Find requests from common AI bots in the Apache access log
    grep -iE "gptbot|claudebot|ccbot|bytespider" /var/log/apache2/access.log

    # Count requests per AI bot (field 6 is the User-Agent in the combined log format)
    awk -F'"' '{print $6}' /var/log/apache2/access.log \
      | grep -ioE "gptbot|claudebot|ccbot|bytespider|facebookbot" \
      | tr '[:upper:]' '[:lower:]' \
      | sort | uniq -c | sort -rn
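
The same count can be done in Python if you want to feed the numbers into a report or script. This is a rough sketch that assumes the combined log format, where the User-Agent is the last quoted field on each line; the function name `count_ai_bot_hits` and the token list are illustrative.

```python
import re
from collections import Counter

# Illustrative subset of the bots listed in this guide.
AI_BOT_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "FacebookBot"]

# Assumption: combined log format, so the User-Agent is the last quoted field.
_UA_FIELD = re.compile(r'"([^"]*)"\s*$')
_BOT = re.compile("|".join(re.escape(t) for t in AI_BOT_TOKENS), re.IGNORECASE)


def count_ai_bot_hits(lines):
    """Count requests per AI bot, matching only against the User-Agent field."""
    canonical = {token.lower(): token for token in AI_BOT_TOKENS}
    counts = Counter()
    for line in lines:
        ua = _UA_FIELD.search(line)
        if not ua:
            continue  # line is not in the expected format
        bot = _BOT.search(ua.group(1))
        if bot:
            counts[canonical[bot.group(0).lower()]] += 1
    return counts


if __name__ == "__main__":
    # The path is an assumption; adjust it to wherever your server writes logs.
    with open("/var/log/apache2/access.log", errors="replace") as log:
        for bot, hits in count_ai_bot_hits(log).most_common():
            print(f"{hits:8d}  {bot}")
```

Matching only the User-Agent field avoids counting ordinary visitors who happen to request a URL containing a bot name.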

If you're running a cloud-hosted site on platforms like Cloudflare, Vercel, or Netlify, you can typically access bot analytics through their dashboard. Cloudflare's Bot Analytics, for example, provides a dedicated view showing AI bot traffic categorized by the type of bot and the pages they access most frequently.

Many site owners are surprised to discover that AI bots account for a significant percentage of their total server traffic. In some cases, a single aggressive AI scraper like Bytespider can generate more requests than Googlebot. This not only wastes bandwidth but can also slow down your site for real users if your server is resource-constrained.

Beyond robots.txt: Additional Protection Layers

While robots.txt is the standard first line of defense, it's important to understand its limitations. The Robots Exclusion Protocol is purely voluntary — it relies on crawlers choosing to respect your directives. Here are additional measures you can implement alongside robots.txt for comprehensive AI protection:

HTTP Header Controls: You can also send an X-Robots-Tag: noai, noimageai response header to signal that your content should not be used for AI training. These directives are still emerging rather than part of a formal standard, and crawler support varies, so treat the header as an additional signal rather than a guarantee.

Server-Level Blocking: If you suspect a bot is ignoring your robots.txt, you can block its IP ranges at the server level. OpenAI, Google, and other major companies publish the IP ranges used by their bots. Adding these to your server's firewall or .htaccess file provides hard blocking that cannot be bypassed by ignoring robots.txt.
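
To illustrate the idea behind IP-based blocking, here is a minimal Python sketch using the standard-library `ipaddress` module. The CIDR blocks are placeholders drawn from the reserved documentation ranges, not any company's real crawler ranges, and `is_blocked` is a hypothetical helper; in practice you would load the ranges each vendor publishes and enforce them in your firewall or web server.

```python
import ipaddress

# Placeholder ranges reserved for documentation (RFC 5737 / RFC 3849).
# Replace them with the ranges published by the crawler operators.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked(client_ip, ranges=BLOCKED_RANGES):
    """Return True if `client_ip` falls inside any blocked network.

    Raises ValueError if `client_ip` is not a valid IPv4 or IPv6 address.
    """
    addr = ipaddress.ip_address(client_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to check in one pass.
    return any(addr in network for network in ranges)


if __name__ == "__main__":
    for ip in ("192.0.2.44", "203.0.113.9", "2001:db8::1"):
        print(ip, is_blocked(ip))
```

Published ranges change over time, so any real deployment would need to refresh the list periodically.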

Cloudflare Bot Management: If your site is behind Cloudflare, you can create custom firewall rules that challenge or block requests from known AI bot user-agents. This enforces the block at Cloudflare's edge, before requests reach your origin, without modifying your web server configuration directly.

Rate Limiting: For bots that you allow partial access to, implement rate limiting to prevent them from overwhelming your server. Nginx and Apache both support rate limiting based on User-Agent strings, allowing you to throttle AI bots while keeping search engine bots unthrottled.
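
The logic behind this kind of throttling is usually a token bucket. The sketch below shows it in Python at the application level, purely to illustrate the mechanism; it is not how Nginx or Apache implement their limits. The class, function, and token names are hypothetical.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to the time elapsed, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Illustrative list of bots to throttle; search engine bots are not in it.
THROTTLED_TOKENS = ("gptbot", "claudebot", "bytespider")
_buckets = {}


def should_serve(user_agent, rate=1.0, capacity=5):
    """Return False when a throttled bot has exceeded its request budget."""
    ua = user_agent.lower()
    for token in THROTTLED_TOKENS:
        if token in ua:
            if token not in _buckets:
                _buckets[token] = TokenBucket(rate, capacity)
            return _buckets[token].allow()
    return True  # everyone else is unthrottled
```

A request handler would call `should_serve(user_agent)` and answer with HTTP 429 when it returns False.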

Meta Tags: For page-level control, you can add <meta name="robots" content="noai, noimageai"> to individual pages. These values are an emerging convention rather than a formal standard, but where crawlers honor them they provide granular, page-by-page control over AI training data collection.
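
Two of the measures above, user-agent blocking and the X-Robots-Tag header, can also be applied in application code. Here is a minimal sketch as a Python WSGI middleware; the class name `AIBotGuard` and the token list are hypothetical, and the `noai, noimageai` value is the emerging convention described above rather than a guaranteed control.

```python
# Illustrative subset of the bots listed in this guide.
BLOCKED_TOKENS = ("gptbot", "claudebot", "ccbot", "bytespider")


class AIBotGuard:
    """WSGI middleware: 403 for listed AI bots, X-Robots-Tag for everyone else."""

    def __init__(self, app, blocked_tokens=BLOCKED_TOKENS):
        self.app = app
        self.blocked = tuple(token.lower() for token in blocked_tokens)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in self.blocked):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]

        def tagged_start_response(status, headers, exc_info=None):
            # Append the opt-out header to whatever the wrapped app sends.
            headers = list(headers) + [("X-Robots-Tag", "noai, noimageai")]
            return start_response(status, headers, exc_info)

        return self.app(environ, tagged_start_response)
```

Wrapping an existing WSGI application is then a one-liner, for example `application = AIBotGuard(application)`. User-agent strings can be spoofed, so this complements rather than replaces IP-based blocking.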

Platform-Specific Implementation

The method for editing your robots.txt file varies by platform. Here's how to add AI bot protection on the most popular platforms:

WordPress

If you're using an SEO plugin like Yoast or Rank Math, navigate to the plugin's settings to find the robots.txt editor. If you're editing the file directly, use FTP/SFTP to access your site root and edit or create the robots.txt file. Many managed WordPress hosts also provide a file manager in their control panel.

Shopify

As of 2026, Shopify allows merchants to customize their robots.txt file through the robots.txt.liquid template in their theme files. Navigate to Online Store → Themes → Edit Code and look for the robots.txt.liquid template to add your AI bot rules.

Static Sites (Netlify, Vercel, GitHub Pages)

For static sites, create or edit robots.txt so that it is served from the root of your site. On Netlify and Vercel, that usually means placing it in the directory your build publishes, such as the /public/ folder. On GitHub Pages, place it in the repository root.

Squarespace and Wix

These platforms have traditionally restricted robots.txt customization. Check your platform's current documentation for the latest options, as both have been expanding developer controls in response to the AI scraping concerns raised by their users.

Frequently Asked Questions

Does blocking AI bots affect my Google rankings?
No. AI training bots like GPTBot and ClaudeBot are entirely separate from search engine crawlers. Blocking them has zero impact on your Google or Bing rankings because Google uses Googlebot for indexing, not GPTBot. Similarly, blocking Google-Extended only prevents Gemini training — it does not affect Google Search whatsoever.
What is GPTBot and should I block it?
GPTBot is OpenAI's web crawler used to collect training data for GPT models. If you don't want your content used to train AI, you should block GPTBot by adding User-agent: GPTBot followed by Disallow: / to your robots.txt. OpenAI has officially stated that GPTBot respects robots.txt directives.
How many AI bots should I block?
As of 2026, there are at least 24 known AI training bots from companies including OpenAI, Anthropic, Google, Meta, ByteDance, Apple, Amazon, and others. It's best to block all of them unless you explicitly want your content used for AI training. Our Robots.txt Generator maintains an updated blocklist.
Will blocking AI bots slow down my website?
Quite the opposite — blocking AI bots can actually improve server performance by reducing the number of crawl requests your server handles. AI scrapers can be very aggressive crawlers that consume significant bandwidth and server resources, often more than traditional search engine bots.
Can AI bots ignore my robots.txt?
Technically, robots.txt is a voluntary protocol. While major companies like OpenAI, Google, and Anthropic have committed to respecting robots.txt, smaller or rogue scrapers may ignore it. For additional protection, consider server-level blocking via .htaccess, Cloudflare firewall rules, or IP-based restrictions.