The robots.txt file is one of the oldest and most powerful tools in a webmaster's arsenal. Despite being a simple plain-text file, it controls billions of dollars' worth of web traffic by telling search engine crawlers — and now AI scrapers — which parts of your site they are allowed to access. Yet a surprising number of websites get the syntax wrong, leading to pages being accidentally blocked from Google or, worse, leaving sensitive directories wide open to data-mining bots.
In this comprehensive reference guide, we'll walk through every directive in the robots.txt specification, explain how Google, Bing, and newer AI bots interpret each one, and provide copy-paste examples you can adapt for your own site. Whether you're a seasoned technical SEO or just deploying your first website, this guide will become your go-to robots.txt cheat sheet.
Skip the Manual Work
Generate a perfectly formatted robots.txt file — including AI bot protection — with our free tool.
Open Robots.txt Generator →

1. What Is robots.txt and Why Does It Matter?
The Robots Exclusion Protocol (REP) was introduced in 1994 as an informal standard for website owners to communicate with web crawlers. The protocol specifies that a plain-text file named robots.txt must be placed at the root of a domain (e.g., https://example.com/robots.txt). When a compliant crawler visits a site, it first checks this file before crawling any pages.
The robots.txt file serves several critical purposes in modern web management. First, it protects sensitive directories from being crawled and indexed. Pages like admin panels, internal search results, or staging environments should never appear in search results, and robots.txt provides the first line of defense. Second, it conserves your crawl budget. Google allocates a finite number of pages it will crawl on your site during each crawl session. If crawlers waste time on low-value pages like faceted navigation, paginated archives, or print-friendly URLs, your most important content may be discovered and indexed more slowly.
Third — and this is the biggest development in recent years — robots.txt now plays a pivotal role in controlling AI scraping. Companies like OpenAI, Anthropic, Meta, and ByteDance deploy their own crawlers (GPTBot, ClaudeBot, FacebookBot, Bytespider) to harvest web content for training large language models. Without explicit Disallow rules targeting these user-agents, your original articles, product descriptions, and creative content may end up in AI training datasets without your consent.
2. Core Directives: The Building Blocks of robots.txt
Every robots.txt file is composed of one or more "groups." Each group begins with a User-agent line and is followed by one or more directive lines. Let's break down each directive in detail.
User-agent
The User-agent directive specifies which crawler the rules that follow apply to. You can target a specific bot or use the wildcard * to apply rules to all crawlers at once. User-agent values are matched case-insensitively by major crawlers, though it's good practice to copy the exact token each bot publishes in its documentation.
# Target all crawlers
User-agent: *
Disallow: /private/
# Target only Googlebot
User-agent: Googlebot
Disallow: /no-google/
# Target an AI scraper
User-agent: GPTBot
Disallow: /

When a crawler encounters multiple groups, it looks for the most specific match first. If there is a group explicitly naming its user-agent, it follows those rules. If not, it falls back to the User-agent: * block. This means you can create a permissive default policy and then override it with stricter rules for specific bots.
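This group-selection logic can be simulated with Python's standard-library urllib.robotparser. The sketch below parses a file with a permissive default and a stricter GPTBot group; it approximates how a compliant crawler resolves groups, not Google's exact parser:

```python
from urllib import robotparser

# A permissive default group plus a stricter group for one bot.
RULES = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(RULES)

# Generic crawlers fall back to the * group: only /private/ is blocked.
print(rp.can_fetch("SomeBot", "https://example.com/blog/"))      # True
print(rp.can_fetch("SomeBot", "https://example.com/private/x"))  # False

# GPTBot matches its own group and is blocked everywhere.
print(rp.can_fetch("GPTBot", "https://example.com/blog/"))       # False
```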
Disallow
The Disallow directive tells crawlers which URL paths they must not access. The path is relative to the site root and is case-sensitive. An empty Disallow: value means "allow everything" — this is sometimes used intentionally in a User-agent block to indicate that a specific bot is welcome to crawl all content.
# Block the entire /admin/ directory
Disallow: /admin/
# Block a specific file
Disallow: /config.php
# Allow everything (empty disallow)
Disallow:

One of the most common mistakes is using Disallow: / without realizing it blocks the entire website. This single slash tells all targeted bots that every page on your domain is off-limits. While this might be intentional for staging environments, accidentally deploying this rule on a production site can cause your pages to be deindexed from Google within days.
Allow
The Allow directive was not part of the original 1994 specification, but it is supported by Google and Bing and is now standardized in RFC 9309. It overrides a Disallow rule for a more specific path. This is extremely useful when you want to block an entire directory but allow specific files within it.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

In the example above, the entire /wp-admin/ directory is blocked, but admin-ajax.php is explicitly allowed because many WordPress themes and plugins require this file to function correctly on the front end. Without this Allow exception, certain dynamic features on your WordPress site could break for search engine renderers.
Sitemap
The Sitemap directive specifies the full URL to your XML sitemap file. Unlike other directives, Sitemap is not tied to any specific User-agent group — it is a standalone declaration typically placed at the bottom of the file. You can include multiple Sitemap directives if your site uses sitemap index files or separate sitemaps for different content types.
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

While submitting your sitemap through Google Search Console and Bing Webmaster Tools is the recommended primary method, including it in robots.txt provides a helpful fallback and ensures that any compliant crawler — including lesser-known search engines — can discover your sitemap automatically.
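On Python 3.8+, the standard-library robotparser can also extract Sitemap declarations, which is a quick way to confirm they parse as absolute URLs:

```python
from urllib import robotparser

LINES = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(LINES)

# site_maps() returns every Sitemap URL found, or None if there were none.
print(rp.site_maps())
```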
Crawl-delay
The Crawl-delay directive specifies the minimum number of seconds a crawler should wait between successive requests. This was initially introduced by Yandex and is also respected by Bing. Google does not honor the Crawl-delay directive — instead, Google provides crawl rate controls through Google Search Console.
User-agent: Bingbot
Crawl-delay: 10
User-agent: Yandex
Crawl-delay: 5

For small servers or shared hosting environments, setting a crawl delay can prevent bot traffic from overwhelming your server resources. However, setting too high a value (e.g., 30+ seconds) can dramatically slow down how quickly Bing discovers and indexes new pages.
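Python's robotparser (3.6+) exposes this value via crawl_delay(), which is handy when writing your own polite crawler:

```python
from urllib import robotparser

LINES = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: Yandex
Crawl-delay: 5
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(LINES)

print(rp.crawl_delay("Bingbot"))   # 10
print(rp.crawl_delay("Yandex"))    # 5
print(rp.crawl_delay("OtherBot"))  # None (no group applies)
```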
3. Wildcards and Pattern Matching
Google and Bing extend the basic robots.txt specification with support for two wildcard characters. These are not part of the original standard, so not all bots support them, but the major search engines do.
| Wildcard | Meaning | Example | Effect |
|---|---|---|---|
| `*` | Matches any sequence of characters | `Disallow: /search*sort=` | Blocks all search pages with sort parameters |
| `$` | Anchors the match to the end of the URL | `Disallow: /*.pdf$` | Blocks all URLs ending in .pdf |
| `*` + `$` | Combined for precision targeting | `Disallow: /*?sessionid=*$` | Blocks session ID URLs |
Wildcards are incredibly powerful for managing large, parameter-heavy websites like e-commerce stores. Instead of manually listing hundreds of filtered product URLs, a single wildcard pattern like Disallow: /*?color=*&size=* can block all product filter combinations from being crawled.
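Note that Python's standard-library robotparser does not implement these wildcard extensions (it treats paths as plain prefixes). If you want to test wildcard rules yourself, one approach is to translate each pattern into a regular expression. The robots_pattern_to_regex helper below is a hypothetical sketch, not a library function:

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a Google-style robots.txt path pattern into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn each * into "match anything".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

pdf_rule = robots_pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/docs/report.pdf")))      # True
print(bool(pdf_rule.match("/docs/report.pdf?v=2")))  # False ($ anchors the end)

sort_rule = robots_pattern_to_regex("/search*sort=")
print(bool(sort_rule.match("/search?q=shoes&sort=asc")))  # True
```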
4. Directive Precedence: How Conflicts Are Resolved
When a URL matches both an Allow and a Disallow rule, crawlers use the "most specific path wins" principle. The rule with the longer path pattern takes precedence. If both rules have the same path length, Google defaults to allowing the URL — while other bots may differ.
| Rule | URL | Result | Reason |
|---|---|---|---|
| `Disallow: /p` | /page | Blocked | /p matches the start of /page |
| `Allow: /page` | /page | Allowed | /page is more specific than /p |
| `Disallow: /folder` | /folder/page | Blocked | Disallow matches the prefix |
| `Allow: /folder/page` | /folder/page | Allowed | Allow is more specific |
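The longest-match behavior can be sketched in a few lines of Python. The resolve helper below is hypothetical and mirrors Google's documented rule, including the tie-break in favor of Allow:

```python
def resolve(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of ("Allow" | "Disallow", path-prefix) pairs.
    The longest matching prefix wins; on a tie, Allow wins
    (Google's default). No match at all means the URL is allowed.
    """
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if path.startswith(prefix):
            is_allow = directive == "Allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed

print(resolve("/page", [("Disallow", "/p")]))                      # False
print(resolve("/page", [("Disallow", "/p"), ("Allow", "/page")]))  # True
print(resolve("/folder/page", [("Disallow", "/folder")]))          # False
```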
5. Common Syntax Mistakes to Avoid
Even experienced developers make robots.txt errors that can have serious consequences for SEO. Here are the most frequent mistakes we encounter during site audits:
Mistake #1: Using a relative Sitemap URL. The Sitemap directive requires an absolute URL including the protocol and domain. Writing Sitemap: /sitemap.xml instead of Sitemap: https://example.com/sitemap.xml will cause most crawlers to ignore the directive entirely.
Mistake #2: Adding a trailing space after a path. Some text editors insert invisible trailing spaces. A directive like Disallow: /admin/ (with a trailing space) may not match /admin/ as intended, leading to the directory being left unprotected.
Mistake #3: Using tabs instead of spaces. While most modern crawlers handle tabs gracefully, the specification calls for a single space between the directive name and its value. Using tabs can cause parsing issues with older or less common bots.
Mistake #4: Forgetting the colon. Writing Disallow /admin/ without the colon after Disallow makes the entire line invalid. The crawler will skip it silently, and your admin directory will be left open to crawling.
Mistake #5: Placing robots.txt in a subdirectory. The file must be served from the exact path /robots.txt at the root of your domain. A file at /blog/robots.txt or /public/robots.txt will be completely ignored by all crawlers.
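Several of these mistakes can be caught mechanically before deployment. The lint_robots function below is a hypothetical sketch that flags missing colons, trailing whitespace in values, and relative Sitemap URLs:

```python
def lint_robots(text):
    """Flag a few common robots.txt mistakes. Illustrative only."""
    problems = []
    directives = ("user-agent", "disallow", "allow", "sitemap", "crawl-delay")
    for n, line in enumerate(text.splitlines(), start=1):
        stripped = line.split("#")[0]  # ignore comments
        if not stripped.strip():
            continue
        # Mistake #4: a directive line with no colon is silently skipped.
        if ":" not in stripped and stripped.strip().lower().startswith(directives):
            problems.append(f"line {n}: missing colon")
            continue
        name, _, value = stripped.partition(":")
        # Mistake #2: invisible trailing whitespace changes the matched path.
        if value != value.rstrip():
            problems.append(f"line {n}: trailing whitespace in value")
        # Mistake #1: Sitemap requires an absolute URL.
        if name.strip().lower() == "sitemap" and not value.strip().lower().startswith(
            ("http://", "https://")
        ):
            problems.append(f"line {n}: Sitemap must be an absolute URL")
    return problems

report = lint_robots("Disallow /admin/\nDisallow: /private/ \nSitemap: /sitemap.xml")
print(report)
```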
6. robots.txt for Different Platforms
Different CMS platforms and frameworks have unique directory structures that require specific robots.txt configurations. Here is a quick reference for the most popular platforms:
| Platform | Key Directories to Block | Key Directories to Allow |
|---|---|---|
| WordPress | `/wp-admin/`, `/wp-includes/` | `/wp-admin/admin-ajax.php` |
| Shopify | `/admin/`, `/cart/`, `/checkout/` | `/collections/`, `/products/` |
| Next.js | `/_next/data/`, `/api/` | `/_next/static/` |
| Django | `/admin/`, `/static/admin/` | `/static/`, `/media/` |
| Laravel | `/storage/`, `/vendor/` | `/public/` |
Our Robots.txt Generator includes built-in presets for WordPress, Shopify, and Next.js that automatically configure all the correct rules for each platform. Simply select your platform and the generator handles the rest — including AI bot protection rules.
7. AI Bot Directives: The 2026 Addition
The most significant change to robots.txt in recent years is the explosion of AI bots that crawl the web for training data. Unlike search engine bots, these crawlers do not index your pages or drive traffic to your site — they harvest your content to improve their language models. Here is the current list of known AI training bots and their respective operators:
| User-agent | Operator | Purpose |
|---|---|---|
| `GPTBot` | OpenAI | Training GPT models |
| `ChatGPT-User` | OpenAI | Live browsing in ChatGPT |
| `ClaudeBot` | Anthropic | Training Claude models |
| `Google-Extended` | Google | Training Gemini (not Search) |
| `CCBot` | Common Crawl | Open-source training corpora |
| `Bytespider` | ByteDance | Training TikTok/Doubao models |
| `FacebookBot` | Meta | Training LLaMA models |
| `PerplexityBot` | Perplexity AI | AI search engine crawling |
Blocking these bots does not affect your search engine rankings whatsoever. Googlebot and Bingbot are entirely separate user-agents from Google-Extended and other AI training bots. You can safely block all AI scrapers while maintaining full visibility in Google and Bing search results.
8. Testing and Validating Your robots.txt
Before deploying a new robots.txt file, always validate it to catch syntax errors and unintended blocking rules. There are several methods for testing your configuration:
Google Search Console: Search Console's robots.txt report shows which robots.txt files Google found for your site, when each was last crawled, and any parsing errors or warnings it detected. This is the most authoritative check since it reflects Google's actual parser. (The older standalone Robots.txt Tester, which allowed URL-by-URL testing, has been retired.)
Bing Webmaster Tools: Bing offers a similar robots.txt analysis feature that shows how Bingbot will interpret your rules. Since Bing's parser has some differences from Google's (particularly around crawl-delay support), testing on both platforms is recommended for enterprise sites.
Our Generator's Built-in Validator: When you use the DominateTools Robots.txt Generator, the output panel includes a real-time validator that highlights potential issues as you configure your rules. It checks for common mistakes like blocking your sitemap, using invalid paths, and missing AI bot rules.
A solid testing workflow involves three steps: First, generate or write your robots.txt file. Second, test it against your most important URLs (homepage, key landing pages, blog posts, and admin URLs). Third, deploy it and monitor Google Search Console's crawl stats for any unexpected changes in crawl behavior over the following two weeks.
9. Real-World robots.txt Examples
Let's put everything together with a production-ready example that covers search engine optimization, AI protection, and platform-specific rules for a WordPress site:
# Search engine crawlers
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /search/
Disallow: /tag/
Disallow: /*?s=
Disallow: /*?replytocom=
Disallow: /feed/
# AI Scraper Protection
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
# Sitemap
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

This configuration allows Google and Bing to crawl all important content while blocking admin areas, search result pages, tag archives, comment feeds, and query-string duplicates. It then explicitly blocks every major AI training bot from accessing any page on the site. Finally, it points search engines to two sitemap files for efficient content discovery.
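As a final sanity check before deploying a file like this, you can exercise the non-wildcard rules with Python's standard-library robotparser (shown here against a trimmed version of the rules). Note its limits: it applies rules in file order rather than Google's longest-match logic and ignores wildcard patterns, so treat this as a smoke test rather than a Google-accurate simulation:

```python
from urllib import robotparser

ROBOTS = """\
User-agent: *
Disallow: /wp-admin/

User-agent: GPTBot
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS)

# Search engines may crawl content pages but not the admin area.
print(rp.can_fetch("Bingbot", "https://example.com/blog/post/"))  # True
print(rp.can_fetch("Bingbot", "https://example.com/wp-admin/"))   # False

# The AI-bot group blocks GPTBot site-wide.
print(rp.can_fetch("GPTBot", "https://example.com/blog/post/"))   # False
```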
Generate Your robots.txt in Seconds
Don't write robots.txt by hand. Use our free generator with built-in AI protection, platform presets, and real-time validation.
Open Robots.txt Generator →

10. Frequently Asked Questions
What is the correct syntax for a robots.txt file?
A valid file is built from groups: a User-agent: line naming the bot (or * for all bots), followed by Disallow: or Allow: lines specifying paths. A Sitemap: directive at the end points search engines to your XML sitemap. The file must be named exactly robots.txt and placed at your domain's root.
Does robots.txt support wildcards?
Yes, in the major search engines. The asterisk (*) matches any sequence of characters, and the dollar sign ($) anchors a match to the end of a URL. For example, Disallow: /*.pdf$ blocks all PDF files. These wildcards are not part of the original specification, so less common bots may not support them.
Is robots.txt case-sensitive?
Partially. The directive names (User-agent, Disallow, Allow) are case-insensitive, but the path values ARE case-sensitive. /Admin/ and /admin/ are treated as different paths by crawlers. Always match the exact casing used in your website's URL structure.
What happens if my robots.txt has a syntax error?
Crawlers silently skip lines they cannot parse, so a single typo usually doesn't invalidate the whole file. However, a malformed User-agent line may cause the entire block to be skipped, potentially allowing bots to crawl everything. Always validate in Google Search Console before deploying.
Where should I place the robots.txt file?
At the root of your domain, so it resolves at https://yourdomain.com/robots.txt. It cannot be placed in subdirectories, and each subdomain needs its own robots.txt file. For example, blog.example.com needs a separate robots.txt from www.example.com.
Related Resources
- How to Block AI Bots Using robots.txt — Step-by-step guide to protecting your content from AI scrapers
- Robots.txt vs. Meta Robots Tag — When to use each, and how they interact
- Best Robots.txt for WordPress in 2026 — WordPress-specific configuration tips
- Crawl Budget Optimization Guide — Maximize how Google crawls your site
- Free Robots.txt Generator — Create your perfect robots.txt in seconds