
The Mystery Behind Google’s Indexing of Blocked Content


Google’s John Mueller clarifies why disallowed pages occasionally get indexed and advises that specific Search Console reports regarding these pages can be disregarded.

 

Google’s John Mueller addressed a question about why pages blocked from crawling by robots.txt can still end up indexed, and explained why the related Search Console reports about those URLs are safe to ignore.

 

Bot Traffic to Query Parameter URLs

 

A user reported that bots were generating links to non-existent query-parameter URLs (e.g., ?q=xyz) pointing at pages that carry noindex meta tags and are also blocked by robots.txt. The problem was that Google was crawling these links, getting blocked by the robots.txt file (and therefore never seeing the noindex tag), and then flagging them in Google Search Console as “Indexed, though blocked by robots.txt.” This prompted the user to ask:

“But here’s the big question: why would Google index pages when they can’t see the content? What’s the advantage in that?”

In response, Google’s John Mueller confirmed that if Google can’t crawl the page, they can’t see the noindex meta tag. He also made an interesting point about the site: operator, advising not to worry about those results because the average user won’t see those indexed pages.

 

Mueller explained:

“Yes, you’re correct: if we can’t crawl the page, we can’t see the noindex. That said, if we can’t crawl the pages, there’s not much for us to index. So, while you might see some of those pages with a targeted site query, the average user won’t see them, so I wouldn’t fuss over it. Noindex is also fine (without robots.txt disallow); it just means the URLs will end up being crawled (and end up in the Search Console report for crawled/not indexed — neither of these statuses causes issues to the rest of the site). The important part is that you don’t make them crawlable + indexable.”

Mueller’s response reassures that while query parameter URLs blocked by robots.txt may be reported as indexed, they are not an issue for overall site visibility as long as they are not both crawlable and indexable.
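To make that concrete, here is a minimal sketch of the setup described above; the domain and the exact Disallow pattern are illustrative. The robots.txt rule stops Googlebot from fetching the query-parameter URLs at all, so the noindex tag in the page’s HTML is never retrieved:

# robots.txt: blocks crawling of any URL containing ?q=
User-agent: *
Disallow: /*?q=

<!-- In the blocked page's <head>: never seen, because the fetch itself is blocked -->
<meta name="robots" content="noindex">

This is exactly the combination that produces the “Indexed, though blocked by robots.txt” status: the URL is discovered through links, but its content (and its noindex directive) stays invisible to Googlebot.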

 

Takeaways:

 

1. Confirmation of the Limitations of Site: Search

Mueller’s response highlights the limitations of using the site: operator for diagnostics. This tool doesn’t reflect Google’s regular search index but functions separately, making it unreliable for assessing what Google has indexed.

 

Advanced search operators like site: are not connected to the main search index and should not be relied on to understand how Google ranks or indexes content.
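For reference, a targeted site query of the kind Mueller mentions usually looks something like the line below (the domain and parameter are illustrative). It can surface these blocked URLs, but that is not evidence that they carry any weight in normal search results:

site:example.com inurl:q=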

 

2. Noindex Tag Without Robots.txt is Preferable

Using a noindex tag without blocking the page via robots.txt is recommended in cases where bots link to non-existent pages that Google discovers. This allows Google to crawl the page and recognize the noindex directive, ensuring it won’t appear in search results. This approach is more effective if the goal is to keep a page out of Google’s index.
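As a sketch of what that looks like in practice (paths and placement are illustrative): remove any Disallow rule covering these URLs so Googlebot can fetch the page, and let the directive do the work.

# robots.txt: no Disallow rule covering the query-parameter URLs,
# so Googlebot is allowed to fetch them
User-agent: *
Disallow:

<!-- In the page's <head>: Googlebot crawls the URL, reads this, and keeps it out of the index -->
<meta name="robots" content="noindex">

# Alternatively, the same directive can be sent as an HTTP response header,
# which also works for non-HTML resources:
X-Robots-Tag: noindex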

 

3. “Crawled, Not Indexed” Entries in Search Console Are Not Harmful

URLs with a noindex tag will appear as “crawled, not indexed” in Search Console, but this doesn’t negatively affect the rest of the site. These entries indicate that Google crawled the page but didn’t index it, which is fine when the noindex tag is used intentionally. This report can also help identify pages accidentally blocked from indexing, allowing publishers to investigate potential issues.

 

4. How Google Handles Noindex Tags on Robots.txt-Blocked Pages

If a page is blocked from crawling by a robots.txt file but linked elsewhere, Googlebot may still discover and index the page based on that link. Since Googlebot can’t crawl the page, it won’t be able to see and apply the noindex tag, leading to unintended indexing.

Google’s documentation on the noindex meta tag warns:

“For the noindex rule to be working, the page or resource should not be blocked by a robots.txt file. If blocked, the crawler can’t access the page to see the noindex rule, and the page could still appear in search results, especially if linked to from other pages.”
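If you want to check whether your own robots.txt blocks a given URL, here is a minimal sketch using Python’s standard urllib.robotparser; the rules and URL are illustrative, and note that this parser only understands plain prefix rules, not Google’s wildcard syntax. If the check returns False, Googlebot cannot fetch the page and will never see its noindex tag.

# Quick local check: is Googlebot allowed to fetch this URL under these rules?
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/search?q=xyz"
print(parser.can_fetch("Googlebot", url))  # False: blocked, so the noindex stays invisible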

 

5. Site: Search vs. Regular Search in Google’s Indexing

Site: searches are limited to a particular domain and are disconnected from Google’s main search index, meaning they don’t reflect the actual state of indexed pages. This makes them unreliable for diagnosing indexing or ranking issues, as they operate independently from the standard search process.

 

If you’re still finding it all a bit confusing, consider our monthly SEO packages and let the experts handle it for you.

Shilpi Mathur
[email protected]