Your robots.txt Is Doing the Opposite of What You Think

I see this on more sites than I'd like to admit.

Someone wants to allow GPTBot, ClaudeBot, or PerplexityBot to crawl their site. So they add a clean Allow: / block for each one.

Their wildcard rules already disallow /admin/, /cart/, /checkout/, and a few other paths they don't want crawled. It looks correct. It reads correctly to a human.

It's completely wrong.

GPTBot just got permission to crawl /admin/, /cart/, and everything else you tried to block.

This is the most common robots.txt mistake I encounter. And it's getting more dangerous as people stack AI crawler rules on top of older configurations without checking how they interact.

The Mistake That Wipes Out Your Wildcard Rules

Here's the kind of robots.txt I'm talking about:

User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Disallow: /private/

Most people read this as "allow GPTBot full access, and additionally block these paths for everyone."

That's not how crawlers parse it.

The wildcard group does not combine with the GPTBot group. GPTBot reads its own block, sees Allow: / with no disallows, and is now free to crawl /admin/, /cart/, /private/, and everything else.

The same logic applies to any named user-agent. ClaudeBot, PerplexityBot, Bingbot. The moment you give a bot its own block, that bot stops reading your wildcard rules entirely.

Yes, this means your perfectly tuned User-agent: * disallow list goes silent for every named bot in the same file.

I've seen this exact pattern leak admin pages, internal search results, and staging directories. The site owner had no idea. The wildcard block looked thorough.
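You don't have to take my word for it. Here's a minimal sketch that runs the file above through Python's standard-library urllib.robotparser. The helper name can_crawl and the agent name SomeOtherBot are mine, made up for illustration. Python's parser is not a full RFC 9309 implementation (it applies the first matching rule rather than the longest match, and it doesn't expand wildcards in paths), but it selects groups the same way.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the example above, verbatim.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Disallow: /private/
"""


def can_crawl(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # SomeOtherBot is a hypothetical crawler with no group of its own,
    # so it falls back to the wildcard group.
    for agent in ("GPTBot", "SomeOtherBot"):
        for path in ("/admin/", "/cart/", "/blog/"):
            verdict = "allowed" if can_crawl(ROBOTS_TXT, agent, path) else "blocked"
            print(f"{agent:12} {path:8} {verdict}")
```

GPTBot comes back allowed on every path, including /admin/ and /cart/. The bot without its own group is blocked on those two and allowed on /blog/.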

What the Spec Actually Says

This isn't an interpretation issue. RFC 9309, which the IETF published in September 2022 as the formal specification for robots.txt, is explicit:

If no matching group exists, crawlers MUST obey the group with a user-agent line with the "*" value, if present.

The load-bearing phrase is "if no matching group exists."

The wildcard group is a fallback for bots that don't have their own block. It's not a base layer that named groups inherit from.

Google's documentation puts it more bluntly:

Only one group is valid for a particular crawler... Other groups are ignored. User agent specific groups and global groups (*) are not combined.

Two separate primary sources. One mechanism.

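The crawler picks the most specific group that matches its name. It runs that group's rules. It ignores every other group in the file, including the wildcard. (If the same bot is named in more than one group, RFC 9309 says those groups get merged. The wildcard group is still left out.)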

Order doesn't matter. There is no inheritance, no fallthrough, no scope chain.

So if you have a User-agent: * block and a User-agent: GPTBot block, GPTBot never applies the wildcard block. As far as GPTBot's rules are concerned, it might as well not exist.
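The selection logic is small enough to sketch in a few lines. This is my own simplified illustration of the fallback rule, not any crawler's actual code. The function names parse_groups and select_rules are hypothetical, and it matches the user-agent token by exact case-insensitive comparison, which real crawlers may handle differently.

```python
EXAMPLE = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
Disallow: /cart/
"""


def parse_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased user-agent token to its list of (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current_agents: list[str] = []
    reading_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if reading_rules:  # a user-agent line after rules starts a new group
                current_agents, reading_rules = [], False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif key in ("allow", "disallow") and current_agents:
            reading_rules = True
            for agent in current_agents:
                groups[agent].append((key, value))
    return groups


def select_rules(
    groups: dict[str, list[tuple[str, str]]], product_token: str
) -> list[tuple[str, str]]:
    """Return the rules of the named group, or the "*" group only as a fallback."""
    token = product_token.lower()
    if token in groups:
        return groups[token]    # named group wins; "*" is never consulted
    return groups.get("*", [])  # fallback when no named group matches


if __name__ == "__main__":
    groups = parse_groups(EXAMPLE)
    print(select_rules(groups, "GPTBot"))   # only the Allow: / rule
    print(select_rules(groups, "Bingbot"))  # the wildcard disallows
```

Notice there is no code path where the two lists get concatenated. That's the whole point.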

Why This Is Worse for AI Crawlers

For traditional Googlebot, this mistake usually exposes admin pages and internal search results. Annoying, but not catastrophic. Google handles low-value pages reasonably well.

AI crawlers don't have that filter.

GPTBot, ClaudeBot, and similar training crawlers collect content that can end up in training datasets. Once your internal search URLs land in a training corpus, including whatever queries your users typed, you have no realistic way to remove them.

There's no equivalent of Search Console's removal tool for OpenAI.

It gets worse if you've added named blocks specifically to allow AI crawlers in for visibility. You wanted them to see your blog posts and product pages. They now also see your shopping cart and that staging directory you forgot about three migrations ago.

I've audited sites where this single mistake exposed thousands of internal URLs to AI bots.

The audit tools most people use don't flag it. The syntax is technically valid. You'll get a green checkmark on a robots.txt that's quietly leaking everywhere.

Keep in mind that the list of AI user-agent tokens keeps growing. GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-User, PerplexityBot, Bytespider, CCBot, and more. Google-Extended and Applebot-Extended belong on the list too, although they are control tokens rather than separate crawlers.

Every single one of those names, once it gets its own group in your file, stops following your wildcard block by default.

How to Fix It

First question: do you actually need different rules for that bot?

If GPTBot should follow the same rules as everyone else, you don't need a named block at all. Delete it. The wildcard will handle GPTBot automatically.

User-agent: *
Disallow: /admin/
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Disallow: /private/

That's a complete robots.txt when every bot, AI or otherwise, follows the same rules.

You only need a per-bot block when you want different behavior for that specific bot. Blocking ClaudeBot from the entire site while letting others crawl normally. Blocking GPTBot from your documentation while still letting Google index it for search. That kind of thing.

In that case, the per-bot block must be self-contained. Every path you want enforced has to be repeated inside it.

User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /admin/
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Disallow: /private/
Disallow: /docs/

User-agent: *
Disallow: /admin/
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Disallow: /private/

Yes, it's repetitive. Yes, it scales badly. There's no inheritance mechanism in robots.txt.

That's a known limitation of the standard.

If you're maintaining a long list of disallows across many bots, generate the file from a template instead of editing by hand. Single source of truth for the paths beats copy-paste every time.
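Here's a rough sketch of what that template could look like in Python. The names SHARED_DISALLOWS, BOT_POLICIES, render_group, and render_robots_txt are all made up for this example, and the policy format (block_all, extra) is just one way to model it. The paths and bots are the ones from the file above.

```python
# Single source of truth: paths every bot must stay out of.
SHARED_DISALLOWS = ["/admin/", "/search/", "/cart/", "/checkout/", "/private/"]

# Hypothetical per-bot policy: extra paths on top of the shared list,
# or block_all=True to shut the bot out of the whole site.
BOT_POLICIES = {
    "ClaudeBot": {"block_all": True},
    "GPTBot": {"extra": ["/docs/"]},
}


def render_group(agent: str, disallows: list[str]) -> str:
    """Render one self-contained group for a single user-agent."""
    lines = [f"User-agent: {agent}"]
    lines += [f"Disallow: {path}" for path in disallows]
    return "\n".join(lines)


def render_robots_txt(shared: list[str], policies: dict[str, dict]) -> str:
    """Render named groups first, then the wildcard group, each with full rules."""
    groups = []
    for agent, policy in policies.items():
        if policy.get("block_all"):
            paths = ["/"]
        else:
            # Repeat the shared paths inside the named group: no inheritance.
            paths = list(shared) + list(policy.get("extra", []))
        groups.append(render_group(agent, paths))
    groups.append(render_group("*", list(shared)))
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(render_robots_txt(SHARED_DISALLOWS, BOT_POLICIES), end="")
```

Running it reproduces the three-group file shown above. Add a path to SHARED_DISALLOWS once and it shows up in every group that needs it.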

robots.txt looks like one of the simplest files in the SEO toolkit. Plain text. A few keywords. Surely you can't get it wrong.

You can. And I see it constantly. Especially now that everyone is rushing to add AI crawler rules without checking how the existing rules interact with them.

The mistake isn't a typo. It's a misunderstanding of how the spec actually works.

Wildcard groups don't stack. Per-bot groups override them entirely. Every path you want blocked needs to be listed explicitly for every bot you've named.

That's not a robots.txt best practice. That's the robots.txt specification.

If your file has both a User-agent: * block and any named bot, audit it today. There's a real chance the bots you wanted to control are running free.
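If you want a starting point for that audit, here's a small sketch. It pulls every named user-agent out of a robots.txt and asks Python's urllib.robotparser whether that bot may fetch each path you consider sensitive. The function names named_agents and find_leaks are my own, and the list of sensitive paths is something you supply. The standard-library parser doesn't expand wildcards inside paths, so treat the result as a first pass, not a verdict.

```python
import re
from urllib.robotparser import RobotFileParser


def named_agents(robots_txt: str) -> list[str]:
    """Return every user-agent token in the file except "*", in file order."""
    tokens = re.findall(r"(?im)^[ \t]*user-agent[ \t]*:[ \t]*([^#\s]+)", robots_txt)
    return list(dict.fromkeys(t for t in tokens if t != "*"))


def find_leaks(robots_txt: str, sensitive_paths: list[str]) -> list[tuple[str, str]]:
    """List (agent, path) pairs where a named bot may fetch a sensitive path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (agent, path)
        for agent in named_agents(robots_txt)
        for path in sensitive_paths
        if parser.can_fetch(agent, path)
    ]


if __name__ == "__main__":
    # Inline sample for illustration; in practice, read your own robots.txt.
    sample = (
        "User-agent: GPTBot\n"
        "Allow: /\n"
        "\n"
        "User-agent: *\n"
        "Disallow: /admin/\n"
        "Disallow: /cart/\n"
    )
    for agent, path in find_leaks(sample, ["/admin/", "/cart/"]):
        print(f"LEAK: {agent} may crawl {path}")
```

An empty result means every named bot in the file is blocked from every path you listed. Anything else is a group that needs its disallows repeated.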