
The Future of SEO After the Google Content Warehouse API Leak


If you missed it, 2,596 internal documents related to Google’s internal services have leaked.

A search marketer named Erfan Azimi brought them to Rand Fishkin’s attention, and we have analyzed them thoroughly.

Pandemonium ensued.

As you can imagine, the past 48 hours have been incredibly chaotic, and my attempt at taking a vacation has been utterly unsuccessful.

Naturally, a segment of the SEO community has quickly fallen into the usual cycle of fear, uncertainty, and doubt.

Reconciling new information can be challenging, and our cognitive biases can hinder us.

It’s essential to discuss this further and offer clarification so we can use what we’ve learned more productively.

After all, these documents provide the clearest insight into how Google evaluates page features.

In this article, I aim to be more explicit, address common questions, critiques, and concerns, and highlight additional actionable findings.

Finally, I want to offer a glimpse into how we will use this information to perform cutting-edge work for our clients. We hope to collectively develop the best ways to update our best practices based on what we’ve learned.

 

Reactions to the Leak: My Thoughts on Common Criticisms

 

Let’s start by addressing what people say in response to our findings. I’m not one to subtweet, so this is to all of you, and I say this with love. 😆

 

‘We already knew all that.’

No, in large part, you did not.

Generally speaking, the SEO community has operated based on a series of best practices from research-minded people from the late 1990s and early 2000s.

For instance, we’ve held the page title in such high regard for so long because early search engines were not full-text and only indexed the page titles.

These practices have been reluctantly updated based on information from Google, SEO software companies, and insights from the community. You filled numerous gaps with your own speculation and anecdotal evidence from your experiences.

If you’re more advanced, you capitalized on temporary edge cases and exploits, but you never knew the exact depth of what Google considers when it computes its rankings.

You also did not know most of its named systems, so you would not have been able to interpret much of what you see in these documents. So, you searched them for what you already understood and concluded that you knew everything in them.

That is the very definition of confirmation bias.

In reality, these documents contain many features that none of us knew about.

Like the 2006 AOL search data leak and the Yandex leak, value will be extracted from these documents for years. Most importantly, you also just got actual confirmation that Google uses features you might only have suspected. There is value in that alone, if only to act as proof when trying to get something implemented with your clients.

Finally, we now have a better sense of internal terminology. One way Google spokespeople evade explanation is through language ambiguity. We are now better armed to ask the right questions and stop living on the abstraction layer.

 

‘We should just focus on customers and not the leak.’

Sure. As an early and continued proponent of market segmentation in SEO, I agree that we should focus on our customers.

Yet we can’t deny that we live in a reality where most of the web has conformed to Google to drive traffic.

We operate in a channel that is considered a black box. When our customers ask us questions, we often respond, “It depends.”

I believe there is value in having an atomic understanding of what we’re working with so we can explain what it depends on. That helps with building trust and getting buy-in to execute our work.

Mastering our channel is in service of our focus on our customers.

 

‘The leak isn’t real’

Skepticism in SEO is healthy. Ultimately, you can decide to believe whatever you want, but here’s the reality of the situation:

Erfan had his Xoogler source authenticate the documentation. Rand worked through his authentication process. I also authenticated the documentation separately through my network and backchannel resources. I can confidently say that the leak is accurate and verified in several ways, including insights from people with deeper access to Google’s systems.

In addition to my sources, Xoogler Fili Wiese offered his insight on X. I’ve included his callout even though he vaguely cast doubt on my interpretations without providing any further information. But that’s a Xoogler for you, am I right? Finally, the documentation references specific internal ranking systems only Googlers know about. I touched on some of those systems and cross-referenced their functions with details from a Google engineer’s resume.

Oh, and Google verified it in a statement as I made my final edits.

 

‘This is a Nothingburger’

No doubt.

I’ll see you on page 2 of the SERPs while I have mine medium with cheese, mayo, ketchup, and mustard.

 

‘It doesn’t say CTR, so it’s not being used.’

So, let me get this straight: you think a marvel of modern technology, one that computes an array of data points across thousands of computers, generates and displays results from tens of billions of pages in a quarter of a second, and stores both clicks and impressions as features, is incapable of performing basic division on the fly?

… OK.
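To make the point concrete, here is a minimal sketch (entirely hypothetical, with invented feature names) of why storing raw clicks and impressions is enough: click-through rate never needs to exist as its own stored feature, because it is one division away.

```python
# Hypothetical sketch: if a system stores raw click and impression counts
# as per-document features, CTR can be derived on the fly with one division.
# The feature names "clicks" and "impressions" are illustrative, not Google's.

def ctr(features: dict) -> float:
    """Derive click-through rate from stored counts; 0.0 when no impressions."""
    impressions = features.get("impressions", 0)
    if impressions == 0:
        return 0.0
    return features.get("clicks", 0) / impressions

doc = {"clicks": 120, "impressions": 2400}
print(ctr(doc))  # 0.05
```

The design point is the one made above: a derived ratio is trivial to compute at query time, so the absence of a literal "CTR" field proves nothing about whether click data informs ranking.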

 

‘Be careful about drawing conclusions from this information’

I agree with this. Due to the caveats I highlighted, we all have the potential to be wrong in our interpretations here.

To that end, we should take measured approaches in developing and testing hypotheses based on this data.

My conclusions are based on my research into Google and precedents in Information Retrieval. Still, it is entirely possible that my findings are not correct.

 

‘The leak is to stop us from talking about AI Overviews.’

No.

The misconfigured documentation deployment happened in March. There’s some evidence that this has happened in other languages (sans comments) for two years.

The documents were discovered in May. Had someone discovered it sooner, it would have been shared sooner.

The timing of AI Overviews has nothing to do with it. Cut it out.

 

‘We don’t know how old it is.’

This is immaterial. Based on the dates in the files, it’s at least newer than August 2023.

We know that commits to the repository happen regularly, presumably as a function of updated code. We also know that much of the documentation has stayed the same in subsequent deployments.

We also know that when this code was deployed, it featured precisely the 2,596 files we have been reviewing, and many of those files were not previously in the repository. Unless whoever/whatever did the git push did so with out-of-date code, this was the latest version at the time. The documentation has other markers of recency, like references to LLMs and generative features, which suggests that it is at least from the past year. Either way, it has more detail than we have ever seen and is fresh enough for our consideration.

 

‘This isn’t all related to Search.’

That is correct. I indicated as much in my previous article.

I should have segmented the modules into their respective services earlier, but I have now taken the time to do so.

 

‘It’s just a list of variables’

Sure.

It’s a list of variables with descriptions that give you a sense of the level of granularity Google uses to understand and process the web.

If you care about ranking factors, this documentation is Christmas, Hanukkah, Kwanzaa, and Festivus.

 

‘It’s a conspiracy! You buried [thing I’m interested in].’

Why would I bury something and encourage people to review the documents and write about their findings?

Make it make sense.

 

‘This won’t change anything about how I do SEO’

This is a choice and, perhaps, a function of my purposely not being more prescriptive in how I presented the findings.

What we’ve learned should at least enhance your approach to SEO strategically in a few meaningful ways and can change it tactically. I’ll discuss that below.

 

FAQs about the Leaked Docs

 

I’ve fielded many questions in the past 48 hours, so I believe it’s valuable to compile the answers here.

 

What were the most exciting things you found?

It’s all intriguing, but one finding not mentioned in the original article stands out:

Google can set limits on results per content type.

They can specify only a certain number of blog posts or news articles to appear for a given SERP. Understanding these diversity limits could influence our choice of content formats when targeting keywords.

For example, if we know the limit for blog posts is three and we doubt our ability to outrank any of them, a video would be a more feasible format for that keyword.
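The per-content-type limit described above can be sketched as a simple filtering pass over a ranked list. This is an illustrative reconstruction of the idea, not Google's actual code; the cap values and the `apply_type_caps` function are assumptions for demonstration.

```python
# Illustrative sketch: applying per-content-type caps while assembling a
# result page. If the cap for blog posts is 3, the fourth-best blog post is
# skipped in favor of other formats, preserving SERP diversity.

from collections import Counter

def apply_type_caps(ranked_results, caps, page_size=10):
    """ranked_results: list of (url, content_type) tuples in rank order."""
    counts = Counter()
    page = []
    for url, ctype in ranked_results:
        if counts[ctype] >= caps.get(ctype, page_size):
            continue  # cap reached for this format; skip this result
        counts[ctype] += 1
        page.append(url)
        if len(page) == page_size:
            break
    return page

results = [("a.com/post", "blog"), ("b.com/post", "blog"),
           ("c.com/video", "video"), ("d.com/post", "blog"),
           ("e.com/post", "blog"), ("f.com/news", "news")]
print(apply_type_caps(results, {"blog": 3}, page_size=5))
```

Here the fourth blog post (`e.com/post`) is skipped even though it outranks the news result, which is exactly why knowing a format's cap should inform your choice of content type for a keyword.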

 

What should we take away from this leak?

Search operates with layers of complexity. Although we have a broader view, we still need insight into which elements of the ranking systems trigger specific actions or why. These documents offer clarity on signals and their nuances.

 

What are the implications for local search?

Andrew Shotland and his team at Local SEO Guide are delving into this perspective.

 

What about implications for YouTube Search?

Though I have yet to explore it, the documents have 23 modules with YouTube prefixes. It’s ripe for interpretation.

 

How does this impact the (_______) space?

It isn’t easy to ascertain. Google’s scoring functions behave differently based on queries and contexts. Different ranking systems activate for different verticals. For instance, the QualityTravelGoodSitesData module identifies official travel sites and scores them for a boost over non-official sites.
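A vertical-specific boost like the one attributed to QualityTravelGoodSitesData can be pictured as a query-conditional score multiplier. Everything below is a hypothetical sketch: the site list, the boost factor, and the `score` function are invented for illustration only.

```python
# Hypothetical sketch of a vertical-specific boost: official travel sites
# receive a score multiplier, but only when the query is classified as a
# travel query. The set membership and the 1.5 factor are illustrative.

TRAVEL_OFFICIAL_SITES = {"nps.gov", "visitlondon.com"}  # illustrative set

def score(base_score: float, site: str, query_vertical: str) -> float:
    if query_vertical == "travel" and site in TRAVEL_OFFICIAL_SITES:
        return base_score * 1.5  # boost applies only in the travel vertical
    return base_score

print(score(10.0, "nps.gov", "travel"))   # 15.0
print(score(10.0, "nps.gov", "recipes"))  # 10.0
```

The takeaway matches the answer above: the same page can score differently depending on which vertical's ranking systems the query activates, which is why space-by-space implications are hard to generalize.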

 

Is Google purposely harming small sites?

I can’t say for sure. However, there’s evidence of small sites losing traffic, impacting small businesses. Signals like links and clicks favor big brands, perpetuating their dominance. Google could level the playing field by boosting small sites against big brands.

 

Do you think Googlers are bad people?

No, they’re generally well-meaning individuals navigating a challenging landscape. The power dynamic between Googlers and the SEO community stems from information disparity. Googlers could improve their reputation by providing more transparent responses.

 

Is there anything related to AI Overviews?

While I didn’t find direct links, I’ve shed light on its workings before.

Is there anything related to generative AI?

There are features tied to video content. LLMs predict video topics based on associated attributes.

 

How Google Can Repair Its Relationship with the SEO Community

 

Many have asked how we can mend our relationship with Google moving forward.

It is essential to return to a more productive space to enhance the web. Ultimately, our goals align: making search better.

While I don’t claim a complete solution, an apology and acknowledgment of their role in misdirection would be a positive start. Here are a few other ideas worth considering:

 

Develop Working Relationships with Us: While Google understandably avoids showing favoritism in the organic realm, fostering genuine relationships with the SEO community is crucial. Implementing a structured program with objectives and key results (OKRs), akin to how other platforms engage with influencers, could be beneficial. Today, interactions are somewhat ad hoc, with select individuals invited to events like I/O or exclusive meetings during the (now-defunct) Google Dance.

 

Bring Back the Annual Google Dance: Reintroduce this event with Lily Ray as the DJ, focusing on celebrating annual OKRs achieved through our partnership.

 

Collaborate on Content: Strengthen bidirectional relationships like those cultivated by individuals like Martin Splitt through various video series. More collaborative efforts between Google and the SEO community can yield significant improvements.

 

Increase Engineer Engagement: Hearing directly from search engineers has proven immensely valuable. Presentations like Paul Haahr’s at SMX West 2016 remain memorable and informative. Encouraging more direct communication from the source would benefit us all.

 

Everybody, Keep Up the Good Work

 

In the past 48 hours, I’ve witnessed some remarkable contributions from the SEO community.

I’m invigorated by the enthusiasm with which everyone has engaged with this material and shared their perspectives, even when opinions differ. This kind of discourse is vital and exemplifies what makes our industry unique.

I urge everyone to continue onward. We’ve been honing our skills for this moment throughout our careers. If you still find it all challenging and perplexing, consider exploring our monthly SEO packages and let our experts assist you.

Shilpi Mathur
navyya.shilpi@gmail.com