Gary Illyes from Google confirms that robots.txt does not safeguard websites against unauthorized access.
Google’s Gary Illyes confirmed a widely recognized issue: robots.txt cannot reliably prevent unauthorized access by crawlers. He then provided an overview of the access control measures that all SEOs and website owners should know.
Microsoft Bing’s Fabrice Canel responded to Gary’s post, highlighting that Bing often encounters websites attempting to hide sensitive areas with robots.txt, inadvertently exposing these URLs to hackers. Canel remarked:
“Indeed, we and other search engines frequently encounter issues with websites that directly expose private content and attempt to conceal the security problem using robots.txt.”
Common Argument About Robots.txt
It seems that any time the topic of robots.txt comes up, someone always points out that it can’t block all crawlers.
Gary Illyes from Google agreed with that point:
“‘Robots.txt can’t prevent unauthorized access to content’ is a common argument popping up in discussions about robots.txt nowadays. This claim is valid, but I don’t think anyone familiar with robots.txt has claimed otherwise.”
Next, he took a deep dive into what blocking crawlers actually means. He framed it as a choice between solutions that enforce access control on the server side and solutions that cede that decision to the requester. It’s about access requests (from browsers or crawlers) and how the server responds. He listed these examples of control:
- A robots.txt (leaves it up to the crawler to decide whether or not to crawl).
- Firewalls (such as a WAF, or web application firewall; the firewall controls access).
- Password protection.
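The first point is the crux of Illyes’ argument: robots.txt is advisory, and enforcement happens entirely on the crawler’s side. A short sketch using Python’s standard-library `urllib.robotparser` (the site layout and bot name are hypothetical) shows that it is the client that parses the file and decides whether to honor it:

```python
from urllib.robotparser import RobotFileParser

# robots.txt is advisory: the *crawler* parses it and chooses whether
# to comply. A well-behaved client calls can_fetch() before requesting
# a URL; a hostile client simply skips this step and fetches anyway.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot", "https://example.com/private/report.pdf"))  # False
print(rp.can_fetch("MyBot", "https://example.com/public/page.html"))    # True
```

Nothing in this exchange touches the server: a `Disallow` line only works if the requester volunteers to run a check like this one.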
Here are his remarks:
“If you need access authorization, you need something that authenticates the requester and controls access. Firewalls may authenticate based on IP, your web server based on credentials handed to HTTP Auth or a certificate to its SSL/TLS client, or your CMS based on a username and a password, and then a 1P cookie.
There’s always some information that the requester passes to a network component that will allow it to identify the requester and control its access to a resource. Robots.txt, or any other file hosting directives for that matter, hands the decision of accessing a resource to the requester, which may not be what you want. These files are more like those annoying lane control stanchions at airports that everyone wants to barge through, but they don’t.
There’s a place for stanchions, but there’s also a place for blast doors and irises over your Stargate.
TL;DR: don’t think of robots.txt (or other files hosting directives) as a form of access authorization; use the proper tools for that; there are plenty.”
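By contrast, the mechanisms Illyes lists put the decision on the server side. A minimal sketch in Python (credentials and function name are hypothetical, for illustration only) of the kind of HTTP Basic Auth check a server performs before serving a protected resource:

```python
import base64
from typing import Optional

# Server-side access control: the server inspects the Authorization
# header and decides. Hypothetical credentials for illustration.
USERNAME, PASSWORD = "editor", "s3cret"

def authorized(authorization_header: Optional[str]) -> bool:
    """Return True only if the header carries the expected Basic credentials."""
    expected = "Basic " + base64.b64encode(
        f"{USERNAME}:{PASSWORD}".encode()).decode()
    return authorization_header == expected

# A request without credentials is refused (the server would respond
# 401 with a WWW-Authenticate challenge); a valid header passes.
print(authorized(None))  # False
good = "Basic " + base64.b64encode(b"editor:s3cret").decode()
print(authorized(good))  # True
```

In practice you would rely on your web server’s built-in HTTP Auth, a firewall rule, or your CMS’s login rather than rolling your own check; the point is that here the requester cannot opt out of the decision the way it can with robots.txt.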
If you still need help with this, check out our monthly SEO packages and let the experts help you.