We’ve all been asking this question for a while now, purely out of curiosity, of course, and with no intent of foul play. Is there a magic number, or rather a cursed figure, beyond which Google decides the content on a particular website is duplicate? Google’s very own John Mueller answered that question this past week. The topic sparked renewed interest after a user took to Twitter to ask John whether Google even measures something like that and uses it to identify original content and dismiss duplicates. John’s short answer was a solid “No”. But we’ll throw more light on the issue by digging deeper and highlighting the different opinions coming from Google’s end. Let’s begin, shall we?
So, is there actually a percentage cut-off that marks content as duplicate?
Twitter user Bill Hartzer put the question to John Mueller directly.
“Hey @johnmu is there a percentage that represents duplicate content? For example, should we be trying to make sure pages are at least 72.6 percent unique than other pages on our site? Does Google even measure it?”
John’s reply was quick and to the point.
“There is no number (also how do you measure it anyway?).”
Matt Cutts, then a software engineer at Google, also shared some very interesting insights into the topic back in 2013. Yes, those remarks are close to 10 years old, but they still hold much relevance. In an official video posted on Google Search Central’s YouTube channel, he said, and we quote:
“It’s important to realize that if you look at content on the web, something like 25% or 30% of all the web’s content is duplicate content. …People will quote a paragraph of a blog and then link to the blog, that sort of thing.”
He went on to add that much of the web’s duplicate content carries no malicious intent, such as spam or content theft, so Google doesn’t penalise such pages. He also noted that penalising websites over some duplicate content could ruin the quality of search results. Therefore, Google instead tries to filter pages with duplicate content out of the search results.
“What Google does when it finds duplicate content is try to group it all together and treat it as if it’s just one piece of content. It’s just treated as something that we need to cluster appropriately. And we need to make sure that it ranks correctly,” concluded Matt.
Interestingly, Google isn’t looking at percentages but checksums.
In the November 2020 episode of the Search Off the Record podcast, which lets you in on the inner workings of Google, the same issue was discussed at length by three Google stalwarts: John Mueller, Gary Illyes, and Martin Splitt. It was revealed that when it comes to detecting and dealing with duplicate content, it is all about checksums, not percentages. There is also a distinction between cases where only part of the content is duplicate and cases where the entire piece of content is duplicate.
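Google doesn’t publish the exact fingerprinting it uses, but the checksum idea itself is simple to illustrate. Here is a minimal, hypothetical sketch in Python: it normalizes a page’s text (collapsing whitespace and case) and hashes it, so two pages with effectively identical content produce the same fingerprint regardless of trivial formatting differences. The normalization rules and the choice of MD5 are our assumptions for illustration, not Google’s actual method.

```python
import hashlib

def content_checksum(text: str) -> str:
    """Toy content fingerprint: lowercase, collapse whitespace, then hash.
    (Illustrative only; Google's real fingerprinting is not public.)"""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

page_a = "Hello World!  This is   the main content."
page_b = "hello world! this is the main content."

# Same content modulo case/spacing -> same checksum
print(content_checksum(page_a) == content_checksum(page_b))  # True
```

In this model, comparing two pages costs one hash lookup rather than a word-by-word similarity percentage, which is one reason a checksum-style approach scales to the whole web while a “percent duplicate” metric would not.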
Here’s what the discussion between Gary Illyes and Martin Splitt highlighted. When Martin asked if detection of duplicate content and canonicalization were the same concept, Gary said:
“Well, it’s not, right? Because first you have to detect the dupes, basically cluster them together, saying that all of these pages are dupes of each other, and then you have to basically find a leader page for all of them. And that is canonicalization. So, you have the duplication, which is the whole term, but within that you have cluster building, like dupe cluster building, and canonicalization.”
So, if you are legitimately quoting someone or something from another piece of content on another website, you can count yourself safe. It is a recognised practice designed to put stronger emphasis on a given piece of content. But if you have intentionally or unintentionally posted a whole piece of duplicate content, you are in for trouble. You can listen to the discussion at 06:44 or go through its transcript.