Google’s John Mueller answered a question about why Google indexes pages that are disallowed from crawling by robots.txt and why it’s safe to ignore the related Search Console reports about those crawls.
Bot Traffic To Query Parameter URLs
The person asking the question, Rick Horst (LinkedIn profile), documented that bots were creating links to non-existent query parameter URLs (?q=xyz) pointing to pages with noindex meta tags that are also blocked in robots.txt. What prompted the question is that Google is crawling the links to those pages, getting blocked by robots.txt (without seeing the noindex robots meta tag), and then getting reported in Google Search Console as “Indexed, though blocked by robots.txt.”
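To illustrate the kind of setup described, here is a minimal sketch using a hypothetical example.com domain and the ?q= parameter from the question (the site’s actual rules weren’t shared). The robots.txt blocks crawling of the query parameter URLs:

User-agent: *
Disallow: /*?q=

And the pages themselves carry a noindex meta tag, which Googlebot never gets to see because crawling is blocked:

<meta name="robots" content="noindex">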
The person asked the following question:
“But here’s the big question: why would Google index pages when they can’t even see the content? What’s the advantage in that?”
Google’s John Mueller confirmed that if they can’t crawl the page they can’t see the noindex meta tag. He also made an interesting mention of the site: search operator, advising to ignore the results because the “average” users won’t see those results.
He wrote:
“Yes, you’re correct: if we can’t crawl the page, we can’t see the noindex. That said, if we can’t crawl the pages, then there’s not a lot for us to index. So while you might see some of those pages with a targeted site:-query, the average user won’t see them, so I wouldn’t fuss over it. Noindex is also fine (without robots.txt disallow), it just means the URLs will end up being crawled (and end up in the Search Console report for crawled/not indexed — neither of these statuses cause issues to the rest of the site). The important part is that you don’t make them crawlable + indexable.”
Related: Google Reminds Websites To Use Robots.txt To Block Action URLs
Takeaways:
1. Confirmation Of Limitations Of Site: Search
Mueller’s answer confirms the limitations of using the site: advanced search operator for diagnostic purposes. One of those reasons is that it’s not connected to the regular search index; it’s a separate thing altogether.
Google’s John Mueller commented on the site: search operator in 2021:
“The short answer is that a site: query is not meant to be complete, nor used for diagnostics purposes.
A site query is a specific kind of search that limits the results to a certain website. It’s basically just the word site, a colon, and then the website’s domain.
This query limits the results to a specific website. It’s not meant to be a comprehensive collection of all the pages from that website.”
The site: operator doesn’t reflect Google’s search index, making it unreliable for understanding which pages Google has or hasn’t indexed. Like Google’s other advanced search operators, it is unreliable as a tool for understanding anything related to how Google ranks or indexes content.
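For example, a query like the following (example.com is a placeholder domain) limits results to a single site but isn’t a complete or diagnostic view of what Google has indexed from it:

site:example.com

Adding search terms after the operator, such as site:example.com widgets, narrows the results further but carries the same limitation.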
2. A noindex tag without a robots.txt disallow is fine for these kinds of situations where a bot is linking to non-existent pages that are getting discovered by Googlebot. Noindex tags on pages that aren’t blocked by a disallow in robots.txt allow Google to crawl the page and read the noindex directive, ensuring the page won’t appear in the search index, which is preferable if the goal is to keep a page out of Google’s search index.
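In practice that means leaving the URLs crawlable and relying on the page-level directive alone. As a sketch, either the HTML meta tag or the equivalent HTTP response header works:

<meta name="robots" content="noindex">

X-Robots-Tag: noindex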
3. URLs with the noindex tag will generate a “crawled/not indexed” entry in Search Console and won’t have a negative effect on the rest of the website.
These Search Console entries, in the context of pages that are purposely blocked, only indicate that Google crawled the page but didn’t index it, essentially saying that this happened, not (in this specific context) that there’s something wrong that needs fixing.
This entry is useful for alerting publishers to pages that are inadvertently blocked by a noindex tag, or by some other cause that’s preventing the page from being indexed. Then it’s something to investigate.
4. How Googlebot handles URLs with noindex tags that are blocked from crawling by a robots.txt disallow but are also discoverable through links
If Googlebot can’t crawl a page, then it’s unable to read and apply the noindex tag, so the page may still be indexed based on URL discovery from an internal or external link.
Google’s documentation of the noindex meta tag has a warning about using robots.txt to disallow pages that have a noindex tag in the meta data:
“For the noindex rule to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can’t access the page, the crawler will never see the noindex rule, and the page can still appear in search results, for example if other pages link to it.”
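As a sketch of the combination that documentation warns against (using a hypothetical /private/ path), this is the configuration in which the noindex directive never gets read:

User-agent: *
Disallow: /private/

<meta name="robots" content="noindex">

Because the robots.txt rule stops Googlebot from fetching pages under /private/, the meta tag on those pages is never seen, and the URLs can still be indexed if other pages link to them.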
5. How site: searches differ from regular searches in Google’s indexing process
Site: searches are limited to a specific domain and are disconnected from the primary search index, making them not reflective of Google’s actual search index and less useful for diagnosing indexing issues.
Read the question and answer on LinkedIn:
Why would Google index pages when they can’t even see the content?
Featured Image by Shutterstock/Krakenimages.com