Yes, you read that right, and no, it is not clickbait!
This is the first time I have encountered an issue with GSC’s robots.txt tester. The tester claimed the URLs were blocked in robots.txt, but according to the URL Inspection tool, GSC Crawl Stats and the server logs, the pages were still getting crawled.
The Back Story
While working on the monthly crawls for one of our clients, we noticed that the crawler was taking unusually long. We had seen a spider trap on this project before and wondered whether it had popped up again.
The URLs in question followed a particular pattern of parameterized URLs generated by the faceted navigation, something you can easily find on most ecommerce sites.
This was a medium-sized site with 3,000+ pages, and the URLs generated by the faceted navigation were intentionally blocked.
With the Screaming Frog crawler set to respect robots.txt, these parameterized URLs should not have been crawled at all. The fact that they were hinted at a problem with the robots.txt instructions.
Please note: We’ll only discuss Googlebot in this article.
The first obvious check was to use GSC’s robots.txt tester to verify whether the URLs were indeed blocked in robots.txt.
As shown in the screenshot below, GSC’s robots.txt tester says the URL pattern is blocked for Googlebot.
Since the robots.txt tester said the pattern was blocked, the next check was to review the URL in the URL Inspector tool.
As shown in the screenshot below, the URL Inspector reports Crawl Allowed as Yes.
So far, it looked like the URL pattern in question could be crawled by Google. GSC’s Coverage reports already suggested that the URL had been crawled a while back.
However, can be crawled != is crawled.
Next step: confirm whether Googlebot was actively crawling the URL pattern.
As you will find in the screenshots below, both GSC Crawl Stats and Server Logs suggest the URLs were getting crawled.
The above command in terminal counts the occurrences of the string “param_color” in the log file “googlebot.log”.
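The terminal command itself is only visible in the original screenshot, so here is a rough Python equivalent of the same check. It is a minimal sketch, not the command we ran: the helper names are made up for illustration, and it counts matching lines (similar to what `grep -c` reports), on the assumption that each log line represents one request.

```python
from typing import Iterable


def count_matching_lines(lines: Iterable[str], needle: str) -> int:
    """Count how many lines contain the given substring."""
    return sum(1 for line in lines if needle in line)


def count_in_file(path: str, needle: str) -> int:
    """Count matching lines in a log file, tolerating odd byte sequences."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_matching_lines(handle, needle)


# Example call, assuming a log file that contains only Googlebot requests:
# count_in_file("googlebot.log", "param_color")
```

Any non-zero result for a pattern that is supposed to be blocked is worth a closer look.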
All evidence points to the fact that these URLs were getting crawled and it was undesired.
To confirm, I tested the robots.txt instructions, without any changes, on two other external validators.
One of them marked the URL as allowed.
The other one marked it as blocked just like GSC’s robots.txt tester.
Here are the original instructions. The parameter in question is “param_color”.
User-agent: *
User-agent: AhrefsBot
User-agent: dotbot
User-agent: Yandex
User-agent: MJ12bot
User-agent: SemrushBot
Crawl-delay: 5

User-agent: msnbot
User-agent: bingbot
Crawl-delay: 1

#Parameter handling
Disallow: *?param_color=*
…
#Rest of the instructions which do not matter here
Clearly, the issue was how the instructions were grouped.
As per the robots.txt specifications on grouping, the above instructions mean that all instructions after #Parameter handling apply only to the third user-agent group, i.e.
User-agent: msnbot
User-agent: bingbot
Crawl-delay: 1
This would mean the instruction Disallow: *?param_color=* doesn’t apply to Googlebot.
However, if you test these instructions on GSC’s robots.txt tester, it shows the URL will be blocked for Googlebot as shown in image 1 above.
Our assumption is that, since Googlebot doesn’t consider “crawl-delay”, the robots.txt tester behaves as if the crawl-delay instructions do not exist. One can say GSC’s robots.txt tester sees it like this:
User-agent: *
User-agent: AhrefsBot
User-agent: dotbot
User-agent: Yandex
User-agent: MJ12bot
User-agent: SemrushBot
User-agent: msnbot
User-agent: bingbot
#Parameter handling
Disallow: *?param_color=*
…
#Rest of the instructions which do not matter here
Note: extra line breaks without any instructions are of no meaning and are discarded. This entire instruction set is treated as one single group by GSC robots.txt tester.
However, in reality, Googlebot reads the crawl-delay line, recognizes it as an instruction and simply chooses not to act on it. The line still closes the list of user agents above it, so Googlebot ends up obeying the grouping rules stated in the specifications.
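To make the two readings concrete, here is a small Python sketch that groups the instructions both ways. It is a deliberately simplified illustration of our assumption, not Google's parser: the function name and the `drop_crawl_delay` switch are made up for this example, only Disallow lines are collected, and no URL pattern matching is done.

```python
from typing import Dict, List

# The instructions from this article, up to the param_color rule.
# The remaining rules are elided in the article and left out here.
ROBOTS_TXT = """\
User-agent: *
User-agent: AhrefsBot
User-agent: dotbot
User-agent: Yandex
User-agent: MJ12bot
User-agent: SemrushBot
Crawl-delay: 5
User-agent: msnbot
User-agent: bingbot
Crawl-delay: 1
#Parameter handling
Disallow: *?param_color=*
"""


def disallows_by_agent(robots_txt: str, drop_crawl_delay: bool) -> Dict[str, List[str]]:
    """Map each lowercased user-agent token to the Disallow patterns of its group."""
    rules: Dict[str, List[str]] = {}
    current_agents: List[str] = []
    group_has_rules = False  # becomes True once a non-user-agent line is seen

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue  # blank lines never end a group
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "crawl-delay" and drop_crawl_delay:
            continue  # hypothesis: the tester acts as if this line is absent
        if key == "user-agent":
            if group_has_rules:  # a user-agent after a rule starts a new group
                current_agents = []
                group_has_rules = False
            current_agents.append(value.lower())
            rules.setdefault(value.lower(), [])
        else:
            group_has_rules = True
            if key == "disallow":
                for agent in current_agents:
                    rules[agent].append(value)
    return rules


as_grouped_by_spec = disallows_by_agent(ROBOTS_TXT, drop_crawl_delay=False)
as_seen_by_tester = disallows_by_agent(ROBOTS_TXT, drop_crawl_delay=True)
print(as_grouped_by_spec["*"])   # no rule reaches the * group
print(as_seen_by_tester["*"])    # the param_color rule reaches the * group
```

With the crawl-delay lines kept, the Disallow rule lands only on msnbot and bingbot. With them dropped, every user agent collapses into one group and the rule appears to apply to everyone, including the * group that Googlebot falls back to.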
The fix was simple after this discovery: repeat the set of instructions under each of the three groups.
We tested the URL in the URL Inspector tool to ensure the fix was legit. Once the URL Inspector confirmed that the URL was no longer crawlable, we updated the robots.txt.
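As a rough sketch of what the repeated layout could look like, the Python snippet below rebuilds the file so that every group carries its own copy of the shared rules. The group split, the helper name and the variable names are assumptions made for illustration, and only the param_color rule is included because the rest of the instructions are elided in this article.

```python
from typing import List, Optional, Sequence, Tuple

# (user agents, crawl delay or None) for each group; an assumed split
Group = Tuple[Sequence[str], Optional[int]]

GROUPS: List[Group] = [
    (["*"], None),
    (["AhrefsBot", "dotbot", "Yandex", "MJ12bot", "SemrushBot"], 5),
    (["msnbot", "bingbot"], 1),
]

# Rules that every group should obey
SHARED_RULES = ["Disallow: *?param_color=*"]


def build_robots_txt(groups: Sequence[Group], shared_rules: Sequence[str]) -> str:
    """Emit one block per group, each followed by its own copy of the rules."""
    blocks: List[str] = []
    for agents, crawl_delay in groups:
        lines = [f"User-agent: {agent}" for agent in agents]
        if crawl_delay is not None:
            lines.append(f"Crawl-delay: {crawl_delay}")
        lines.extend(shared_rules)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


fixed_robots_txt = build_robots_txt(GROUPS, SHARED_RULES)
print(fixed_robots_txt)
```

Generating the file from one list of shared rules keeps the three copies from drifting apart the next time a rule is added.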
This interesting case is an example of a mismatch between what GSC’s robots.txt tester says and how the actual Googlebot behaves. If you feel a URL is getting crawled even though it is blocked in robots.txt, do a Live test in GSC’s URL Inspector. If it says the URL can be crawled, then something is wrong with your robots.txt.
I hope this article helps you with some of the issues that you could run into while setting up your robots.txt. If you are unsure of how efficiently your website has been set up from an SEO perspective, an SEO audit might be the best path to pursue.
Please feel free to write to us at firstname.lastname@example.org if you need any assistance with making your website SEO prepared.