Today, we noticed that a page on a client's site that has been blocked in the robots.txt is showing PageRank in the toolbar.

URL for offending page

However, when I went to view the cached version of the page, the URL in the cache was for the correct page:

This may be a result of Google realising that this is the same page and showing the correct version in the index. The bigger issue, however, is why Google is even aware of the URL since it's been told to KEEP OUT of the /demo/ folder altogether.
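
To double-check that a rule really does cover a URL, the standard-library robots.txt parser in Python can be used. This is only a minimal sketch: the robots.txt content and the example.com URLs below are hypothetical stand-ins, since the client's actual file and URLs are not reproduced in this post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /demo/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs: one inside the blocked folder, one outside it.
blocked = is_crawl_allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/demo/page.html")
allowed = is_crawl_allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/index.html")
print(blocked, allowed)
```

A check like this only confirms what the file asks crawlers to do. It says nothing about whether a URL is already known to a search engine.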

This isn't the only site where we've noticed the problem of blocked pages appearing in Google's index:

Google showing blocked pages 1

and

Google showing blocked pages 2

are just two of the sites we found with blocked pages in the index.

So what can we conclude? It looks like Google doesn't entirely honour the robots.txt file and in future we should consider password-protecting these pages instead.

What should Google learn from this? Web design and web development companies consider it evil when you say you won't spider anything blocked in the robots.txt and then do it anyway.

Discussion

Posted by JohnMu on
Hi Mindy,
Regarding the first URL, it looks like the robots.txt did not block us from crawling it back then, so we'll generally keep the old cache in our index in a situation like that. You can have it removed by allowing us to crawl it and using an appropriate robots meta tag.

The other URLs are how we generally index URLs that we are disallowed from crawling -- we just keep the URL alone. If you wish to have the URLs removed from the index you can either use the URL removal tool in Webmaster Tools or the appropriate robots meta tag.
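
(Editor's illustration, not part of the original comment.) The robots meta tag route can be sanity-checked with a short script. The sketch below assumes the page's HTML has already been fetched; the class name, function name, and sample markup are hypothetical. It only looks for a `noindex` directive in a `robots` or `googlebot` meta tag, and, as the comment above notes, the page has to be crawlable for a search engine to see the tag at all.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in ("robots", "googlebot"):
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def has_noindex(html: str) -> bool:
    """Return True if the HTML carries a robots meta tag with a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical sample markup for illustration.
sample = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(sample))
```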

Hope it helps!
Posted by Mindy on
Hi John

Thanks so much for the explanation. It's certainly helpful!

I'm still confused as to why, when you use the toolbar to view the cache, Google would be showing a DIFFERENT URL from the page that you're actually on at the time.

I also do not understand how having a robots.txt to tell Google to exclude a page is not enough to make Google ignore the page entirely.

For example, the pages listed from Farleys.com did not exist prior to their robots.txt file, yet they're showing up for a "site:" search. I've already asked for them to be excluded, yet you're now telling me I have to go in and ask again using a different method? It doesn't make a great deal of sense.