Not Every URL Can Be Crawled, Says Google. Here’s What You Can Do.

In a recent Reddit discussion, a user asked why the SEO tool they were using was not discovering all of the links pointing to a website. Google’s John Mueller stepped in and answered that it is nearly impossible for any tool to discover 100% of the links pointing to a website, because the number of URLs is virtually infinite. For the same reason, nobody can crawl the whole web: no one, not even Google, has the resources to track and maintain a database of every URL.

John said:

“There’s no objective way to crawl the web properly. It’s theoretically impossible to crawl it all, since the number of actual URLs is effectively infinite. Since nobody can afford to keep an infinite number of URLs in a database, all web crawlers make assumptions, simplifications, and guesses about what is realistically worth crawling.

And even then, for practical purposes, you can’t crawl all of that all the time, the internet doesn’t have enough connectivity & bandwidth for that, and it costs a lot of money if you want to access a lot of pages regularly. Past that, some pages change quickly, others haven’t changed for 10 years – so crawlers try to save effort by focusing more on the pages that they expect to change, rather than those that they expect not to change.”
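To make the idea of "focusing more on the pages that they expect to change" concrete, here is a minimal sketch of one way a crawler could schedule revisits. It is an illustration only, not how Google or any particular SEO tool works. The names `PageRecord`, `recrawl_interval_days` and `crawl_order`, and the 1-day and 90-day bounds, are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    """Hypothetical per-URL history a crawler might keep."""
    url: str
    checks: int = 0    # how many times the page has been fetched
    changes: int = 0   # how many of those fetches found new content

    def change_rate(self) -> float:
        # Smoothed estimate, so a URL with no history starts at 0.5
        return (self.changes + 1) / (self.checks + 2)


def recrawl_interval_days(page: PageRecord,
                          min_days: float = 1.0,
                          max_days: float = 90.0) -> float:
    """Pages that change often get short intervals, static pages long ones."""
    rate = page.change_rate()
    return min_days + (1.0 - rate) * (max_days - min_days)


def crawl_order(pages):
    """Sort so the pages most likely to have changed come first."""
    return sorted(pages, key=recrawl_interval_days)


news = PageRecord("https://example.com/news", checks=10, changes=9)
archive = PageRecord("https://example.com/2013/old-post", checks=10, changes=0)
for page in crawl_order([archive, news]):
    print(page.url, round(recrawl_interval_days(page), 1))
```

With this toy heuristic, the frequently updated news page is revisited far sooner than the archive post that has not changed in any of its last ten checks.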

Just what is worth crawling then?

John then went on to explain that there is an unofficial bar every website must reach in order to be crawled and discovered. That bar is how search engines decide which URLs are worth spending crawl resources on. Simply put, it works on merit: a website has to offer something valuable before crawlers and search engines see any point in crawling it and bringing it to the public eye. He continued:

“And then, we touch on the part where crawlers try to figure out which pages are actually useful. The web is filled with junk that nobody cares about, pages that have been spammed into uselessness. These pages may still regularly change, they may have reasonable URLs, but they’re just destined for the landfill, and any search engine that cares about their users will ignore them.

Sometimes it’s not just obvious junk either. More & more, sites are technically ok, but just don’t reach “the bar” from a quality point of view to merit being crawled more.”

No fixed rules for crawling the web, but there is some system.

Mueller then went on to say that there are no fixed rules for how crawlers crawl the web, how often they crawl a URL, or which parts of the web they ignore entirely. He added that every SEO tool has its own way of deciding which URLs to crawl and how often to crawl them. He concluded:

“Therefore, all crawlers (including SEO tools) work on a very simplified set of URLs, they have to work out how often to crawl, which URLs to crawl more often, and which parts of the web to ignore. There are no fixed rules for any of this, so every tool will have to make their own decisions along the way. That’s why search engines have different content indexed, why SEO tools list different links, why any metrics built on top of these are so different.”
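One common simplification is to collapse URLs that differ only in trivial ways into a single entry before deciding what to crawl. The sketch below shows the general idea using Python’s standard library. The `normalize_url` function and the `TRACKING_PARAMS` list are assumptions chosen for illustration; real crawlers each apply their own, much more elaborate rules, and this version ignores edge cases such as IPv6 hosts.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters that do not change page content
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Reduce trivially different URLs to one canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop tracking parameters and sort the rest for a stable order
    pairs = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in TRACKING_PARAMS]
    query = urlencode(sorted(pairs))
    # The fragment (#...) never reaches the server, so it is discarded
    return urlunsplit((scheme, host, path, query, ""))


variants = [
    "HTTPS://Example.com:443/shoes?utm_source=x&color=red#top",
    "https://example.com/shoes?color=red&utm_medium=email",
]
print({normalize_url(u) for u in variants})
```

Both variants reduce to the same URL, so a crawler using a rule like this would store and fetch the page once instead of twice. Because every tool picks its own rules of this kind, their URL lists end up different.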

How can you get Google to crawl your website more often?

There are a few things you can do on your end to improve the chances that your URLs get crawled quickly in the first place, and then crawled regularly and effectively.

  1. Make sure every page on your website is unique and offers value
  2. Remove low-quality pages that are just there to fill the voids and add no real value
  3. Create a sitemap and submit it to Google Search Console
  4. Create high-quality links and remove bad or broken links
  5. Add relevant internal links and fix internal links marked nofollow
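Steps 3 and 5 in the list above can be partly automated. The following is a minimal sketch, assuming you already have a list of your page URLs and the HTML of a page to check. `build_sitemap` and `NofollowInternalLinks` are hypothetical helper names, and the example.com URLs are placeholders. The sitemap uses only the required `<loc>` element of the sitemaps.org format. Submitting the resulting file is still done by hand in Google Search Console.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls) -> str:
    """Return a bare-bones XML sitemap listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


class NofollowInternalLinks(HTMLParser):
    """Collect links to the same host that carry rel="nofollow"."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.host = urlsplit(page_url).hostname
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        rel = (attr.get("rel") or "").lower().split()
        if not href or "nofollow" not in rel:
            return
        target = urljoin(self.page_url, href)  # resolve relative links
        if urlsplit(target).hostname == self.host:
            self.found.append(target)


print(build_sitemap(["https://example.com/", "https://example.com/about"]))

checker = NofollowInternalLinks("https://example.com/blog/")
checker.feed('<a href="/pricing" rel="nofollow">Pricing</a>'
             '<a href="https://other.org/" rel="nofollow">Partner</a>')
print(checker.found)
```

The checker only reports the internal `/pricing` link; the nofollow link to the other site is left alone, since the tip is about internal links.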

Source: Search Engine Journal