Every day, billions of web pages are crawled not by search engines looking to index your content, but by AI companies looking to train their large language models. The content you spend hours crafting — your blog posts, product descriptions, research papers, creative writing — is being harvested and used to train AI models by companies like OpenAI, Anthropic, Google, Meta, and ByteDance. The worst part? Most website owners don't even know it's happening.
The good news is that there's a simple, free, and universally recognized way to tell these bots to stay out: your robots.txt file. In this comprehensive guide, we'll show you exactly which AI bots are crawling the web in 2026, how to identify them in your server logs, and how to write the precise robots.txt rules needed to block every single one of them from accessing your website.
Block AI Bots Instantly
Don't want to write robots.txt rules by hand? Our generator blocks 24+ AI bots with a single toggle.
Open Robots.txt Generator →

The AI Scraping Problem in 2026
The scale of AI scraping has reached unprecedented levels. According to industry reports published in early 2026, the total volume of web crawling attributable to AI training bots now exceeds the combined crawl volume of all traditional search engines. This represents a fundamental shift in how the web is being consumed — your content is no longer being crawled primarily to help users find it, but to feed machine learning pipelines.
The implications are significant for content creators, publishers, and businesses of all sizes. When your content is used to train an AI model, that model can then generate derivative content that competes directly with your original work. A blog post you wrote about kitchen renovations might train a model that then generates thousands of similar articles for competitors. Your carefully researched product descriptions might be synthesized into competing listings. Your unique voice and brand personality become raw material for a machine that can replicate it at scale.
This isn't hypothetical — it's already happening. Multiple lawsuits filed in 2024 and 2025 by publishers, authors, and content creators against AI companies have highlighted the massive scope of unauthorized content harvesting. While the legal landscape continues to evolve, the most immediate protection available to every website owner is the robots.txt file.
Complete List of AI Bots to Block in 2026
The AI scraping landscape changes rapidly as new companies launch crawlers and existing ones rebrand or expand their bot networks. Here is the most comprehensive, up-to-date list of AI training bots as of March 2026:
| User-agent | Company | Purpose | Respects robots.txt? |
|---|---|---|---|
| GPTBot | OpenAI | Training GPT models | Yes |
| ChatGPT-User | OpenAI | Browse mode in ChatGPT | Yes |
| OAI-SearchBot | OpenAI | SearchGPT web search | Yes |
| ClaudeBot | Anthropic | Training Claude models | Yes |
| anthropic-ai | Anthropic | Legacy Claude crawler | Yes |
| Google-Extended | Google | Training Gemini (not Search indexing) | Yes |
| CCBot | Common Crawl | Open training corpus | Yes |
| Bytespider | ByteDance | Training Doubao/TikTok AI | Inconsistently (widely reported to ignore it) |
| FacebookBot | Meta | Training LLaMA models | Yes |
| Meta-ExternalAgent | Meta | AI training collection | Yes |
| PerplexityBot | Perplexity AI | AI search crawling | Yes |
| cohere-ai | Cohere | Training language models | Yes |
| Diffbot | Diffbot | Knowledge graph extraction | Yes |
| Applebot-Extended | Apple | Training Apple Intelligence | Yes |
| Amazonbot | Amazon | Training Alexa/AI models | Yes |
| YouBot | You.com | AI search crawling | Yes |
Google-Extended does NOT affect Google Search indexing. Google-Extended controls only whether your content can be used to train Google's Gemini AI models. Googlebot — the search indexer — is a completely separate user-agent.
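To make the distinction concrete, here is a minimal robots.txt sketch that blocks Gemini training while leaving Google Search crawling untouched:

```
# Googlebot (Search indexing) — unaffected, allowed as normal
User-agent: Googlebot
Allow: /

# Google-Extended (Gemini AI training) — blocked
User-agent: Google-Extended
Disallow: /
```

The two user-agents are evaluated independently, so blocking one has no effect on the other.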
How to Write the Block Rules
Adding AI bot protection to your robots.txt is straightforward. Each bot requires its own User-agent and Disallow block. While you can technically group some bots together, the safest approach is to give each bot its own dedicated block to ensure maximum compatibility across different crawler implementations.
Here is the copy-paste robots.txt snippet that blocks all major AI training bots. Add this to your existing robots.txt file, after your standard search engine rules:
# ==========================================
# AI SCRAPER PROTECTION (Updated March 2026)
# ==========================================
# OpenAI
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
# Anthropic
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
# Google AI Training (does NOT affect Search)
User-agent: Google-Extended
Disallow: /
# Common Crawl
User-agent: CCBot
Disallow: /
# ByteDance / TikTok
User-agent: Bytespider
Disallow: /
# Meta / Facebook
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# Perplexity
User-agent: PerplexityBot
Disallow: /
# Others
User-agent: cohere-ai
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: YouBot
Disallow: /

Selective Blocking vs. Full Blocking
Not everyone wants to block all AI bots entirely. Some website owners prefer a more nuanced approach, allowing certain AI services like SearchGPT or Perplexity to access their content (since those services drive traffic) while blocking pure training bots. Here are the three main approaches:
| Approach | What It Does | Best For |
|---|---|---|
| Full Block | Blocks all AI bots from every page | Publishers, content creators, agencies with original content |
| Selective Block | Blocks training bots but allows AI search bots | Businesses wanting traffic from AI search engines |
| Partial Allow | Blocks bots from premium content but allows free content | Freemium publishers and SaaS companies |
For selective blocking, you would simply omit the User-agent blocks for bots you want to allow. For example, if you want Perplexity AI to be able to cite your content in search results (which drives referral traffic), remove the PerplexityBot block from your robots.txt.
For partial allowing, use path-based rules instead of blanket Disallow: /. For instance, you might allow GPTBot to access your blog (Allow: /blog/) but block it from your product pages (Disallow: /products/).
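Putting that together, a partial-allow configuration for GPTBot might look like the sketch below. The /blog/ and /products/ paths are illustrative; substitute your own site structure. Under the Robots Exclusion Protocol (RFC 9309), the longest matching rule wins, so Allow: /blog/ takes precedence over Disallow: / for blog URLs:

```
# GPTBot may crawl the blog, but nothing else
User-agent: GPTBot
Allow: /blog/
Disallow: /
```

Test path-based rules with a robots.txt checker before deploying, since precedence handling can differ between older crawler implementations.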
How to Detect AI Bots in Your Server Logs
Before implementing blocks, it's helpful to understand which AI bots are actually visiting your site and how much bandwidth they're consuming. You can analyze your server access logs to identify bot traffic patterns and prioritize which bots to block first.
On an Apache or Nginx server, access logs typically contain the full User-Agent string for each request. Here's how to search for AI bot activity in common log formats. For Apache logs on Linux, you can use:
# Search for GPTBot in Apache access logs
grep -i "gptbot\|claudebot\|ccbot\|bytespider" /var/log/apache2/access.log
# Count requests per AI bot
awk -F'"' '{print $6}' /var/log/apache2/access.log | grep -io "gptbot\|claudebot\|ccbot\|bytespider\|facebookbot" | sort | uniq -c | sort -rn

If you're running a cloud-hosted site on platforms like Cloudflare, Vercel, or Netlify, you can typically access bot analytics through their dashboard. Cloudflare's Bot Analytics, for example, provides a dedicated view showing AI bot traffic categorized by the type of bot and the pages they access most frequently.
Many site owners are surprised to discover that AI bots account for a significant percentage of their total server traffic. In some cases, a single aggressive AI scraper like Bytespider can generate more requests than Googlebot. This not only wastes bandwidth but can also slow down your site for real users if your server is resource-constrained.
Beyond robots.txt: Additional Protection Layers
While robots.txt is the standard first line of defense, it's important to understand its limitations. The Robots Exclusion Protocol is purely voluntary — it relies on crawlers choosing to respect your directives. Here are additional measures you can implement alongside robots.txt for comprehensive AI protection:
HTTP Header Controls: Some AI companies also check HTTP response headers. You can add headers like X-Robots-Tag: noai, noimageai to signal that your content should not be used for AI training. While this standard is still emerging, several major AI companies have indicated they will respect these headers.
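As a sketch, here is how you might send that header from Apache using mod_headers (for nginx, the equivalent is an add_header directive). Note that noai and noimageai are emerging, non-standardized signals, so crawler support varies:

```
# .htaccess — advertise a no-AI-training preference via response headers.
# "noai, noimageai" is an emerging convention, not a formal standard.
<IfModule mod_headers.c>
  Header set X-Robots-Tag "noai, noimageai"
</IfModule>
```

Treat this as a supplementary signal alongside robots.txt, not a replacement for it.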
Server-Level Blocking: If you suspect a bot is ignoring your robots.txt, you can block its IP ranges at the server level. OpenAI, Google, and other major companies publish the IP ranges used by their bots. Adding these to your server's firewall or .htaccess file provides hard blocking that cannot be bypassed by ignoring robots.txt.
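On Apache 2.4, an IP-range block can be expressed in .htaccess with Require directives. The CIDR ranges below are RFC 5737 documentation placeholders — look up the ranges each company currently publishes (e.g. OpenAI's GPTBot documentation) before using this:

```
# .htaccess — hard-block crawler IP ranges at the server level.
# Replace the placeholder ranges with the bot operator's published CIDRs.
<RequireAll>
  Require all granted
  Require not ip 192.0.2.0/24
  Require not ip 198.51.100.0/24
</RequireAll>
```

Published IP ranges change over time, so re-check them periodically or automate the lookup.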
Cloudflare Bot Management: If your site is behind Cloudflare, you can create custom firewall rules that challenge or block requests from known AI bot user-agents. This provides server-level enforcement without modifying your web server configuration directly.
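A Cloudflare custom rule for this might use an expression along the following lines (a sketch in Cloudflare's Rules language, with the action set to Block or Managed Challenge; extend the list with any user-agents you care about):

```
(http.user_agent contains "GPTBot")
or (http.user_agent contains "ClaudeBot")
or (http.user_agent contains "Bytespider")
```

Because this matches the User-Agent header, it stops bots that ignore robots.txt but can be evaded by agents that spoof their identity — pair it with IP verification for stronger enforcement.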
Rate Limiting: For bots that you allow partial access to, implement rate limiting to prevent them from overwhelming your server. Nginx and Apache both support rate limiting based on User-Agent strings, allowing you to throttle AI bots while keeping search engine bots unthrottled.
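In nginx, one common pattern (a sketch; directives go in the http context, and the user-agent patterns are illustrative) is to map matching user-agents to a rate-limit key. Requests with an empty key bypass the limit, so normal visitors and search engine bots are unaffected:

```
# Map AI-bot user-agents to a per-IP key; everyone else gets an empty key.
map $http_user_agent $ai_bot_limit_key {
    default          "";
    ~*gptbot         $binary_remote_addr;
    ~*claudebot      $binary_remote_addr;
    ~*bytespider     $binary_remote_addr;
}

# Empty keys are not rate-limited; matched bots get 1 request/second.
limit_req_zone $ai_bot_limit_key zone=aibots:10m rate=1r/s;

server {
    location / {
        limit_req zone=aibots burst=5;
        # ... normal site configuration ...
    }
}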
Meta Tags: For page-level control, you can add <meta name="robots" content="noai, noimageai"> to individual pages. This emerging standard provides granular, page-by-page control over AI training data collection.
Platform-Specific Implementation
The method for editing your robots.txt file varies by platform. Here's how to add AI bot protection on the most popular platforms:
WordPress
If you're using an SEO plugin like Yoast or Rank Math, navigate to the plugin's settings to find the robots.txt editor. If you're editing the file directly, use FTP/SFTP to access your site root and edit or create the robots.txt file. Many managed WordPress hosts also provide a file manager in their control panel.
Shopify
As of 2026, Shopify allows merchants to customize their robots.txt file through the robots.txt.liquid template in their theme files. Navigate to Online Store → Themes → Edit Code and look for the robots.txt.liquid template to add your AI bot rules.
Static Sites (Netlify, Vercel, GitHub Pages)
For static sites, simply create or edit the robots.txt file in your project's public or root directory. On Netlify, place it in the /public/ folder. On Vercel, place it in the /public/ folder. On GitHub Pages, place it in the repository root.
Squarespace and Wix
These platforms have traditionally restricted robots.txt customization. Check your platform's current documentation for the latest options, as both have been expanding developer controls in response to the AI scraping concerns raised by their users.
One-Click AI Bot Protection
Our free Robots.txt Generator includes a pre-built list of 24+ AI bots. Just flip the toggle and download your file.
Generate Protected robots.txt →

Frequently Asked Questions
Does blocking AI bots affect my Google rankings?
No. Google-Extended only controls whether your content is used to train Gemini models — it has no effect on Googlebot, Google Search indexing, or your rankings.
What is GPTBot and should I block it?
GPTBot is OpenAI's crawler for collecting training data for GPT models. If you don't want your content used for that purpose, block it by adding User-agent: GPTBot followed by Disallow: / to your robots.txt. OpenAI has officially stated that GPTBot respects robots.txt directives.
How many AI bots should I block?
It depends on your goals. For maximum protection, block every bot in the table above (full block). If you benefit from referral traffic, consider a selective block that allows AI search bots like OAI-SearchBot and PerplexityBot while blocking pure training crawlers.
Will blocking AI bots slow down my website?
No. robots.txt is a small static text file that crawlers fetch once; it adds no overhead for human visitors. If anything, blocking aggressive AI crawlers reduces server load and can improve performance.
Can AI bots ignore my robots.txt?
Yes. The Robots Exclusion Protocol is voluntary, so a misbehaving bot can simply ignore your directives. Major companies state that their bots comply, but for those that don't, use the server-level blocking, firewall rules, or bot-management options described above.
Related Resources
- Block GPTBot with robots.txt — Related reading
- Automated Link Checking in 2026 — Related reading
- Automated Feed Crawling and Discovery Optimization — Related reading
- Robots.txt Syntax Explained — Master every directive and wildcard
- Robots.txt vs. Meta Robots Tag — Understanding when to use each method
- Crawl Budget Optimization Guide — Reduce wasted server resources
- Free Robots.txt Generator — Generate AI-protected robots.txt in seconds