Why Google’s Algorithm Doesn’t Care What You Write

Most SEOs understand that on-page factors have declined in importance within Google’s algorithm and SEO, and will continue to do so.

We all know that Google takes no notice of the meta keywords tag, and little notice of other tags, markup and HTML structures, a view my previous data has backed up.

But “in content” factors, i.e. ranking signals related to the actual text/content on a page, would appear to be a separate matter.

Spamming the meta description doesn’t hurt a user’s experience, but spamming the actual content the user sees is detrimental to the user and therefore not worth doing, right? Surely damaging the user’s experience isn’t worth those few extra visitors from Google, seeing as they will probably convert less well due to the comparatively less helpful content?

It sounds good and makes sense, and as a result we in the SEO community have come to the conclusion that while Google might ignore those other on-page factors, they probably have some really smart methods to figure out what a page is about and to judge the quality of the content on that page.

I decided to look at five really basic factors that you would think might feature in some element of the Google algorithm, or at least be closely related to a ranking factor.

While I understand there are better and more advanced methods (LDA, TF*IDF) for comparing the content on a page to a given keyword, or for judging the quality of a given piece of content, I tested these really basic factors to gauge Google’s likely weighting of “in content” factors.
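For the curious, a TF*IDF-style comparison looks roughly like the sketch below. This is purely illustrative, using scikit-learn’s off-the-shelf vectoriser and made-up pages; it is not the method used in this study.

```python
# Illustrative sketch: scoring pages against a keyword with off-the-shelf
# TF-IDF vectors. Not the method used in this study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pages = [
    "Buy cheap running shoes online with free delivery",
    "A history of the marathon from Athens to the modern day",
]
keyword = "running shoes"

vectorizer = TfidfVectorizer(stop_words="english")
page_vectors = vectorizer.fit_transform(pages)
keyword_vector = vectorizer.transform([keyword])

# Cosine similarity between the keyword vector and each page's vector.
scores = cosine_similarity(keyword_vector, page_vectors)[0]
for page, score in zip(pages, scores):
    print(f"{score:.3f}  {page}")
```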

I may test more advanced topic modelling algorithms/factors in the future but for now I have stuck to some old information retrieval reliables.

You are probably already aware of the general method behind my correlation studies, but if this is your first time here please read this and this.

Special thank you to Mike Tung of Diffbot for providing me with free access to their article API, which is undoubtedly the best text extraction service/algorithm out there. And as if to solidify that point, congrats to the team on their recent $2 million funding round.
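For context, pulling clean article text from Diffbot looks roughly like this. The endpoint version and response fields below are assumptions based on Diffbot’s public documentation, so check their current API reference before relying on them.

```python
# Rough sketch of fetching extracted article text from Diffbot's Article API.
# The endpoint version and response fields are assumptions from the public
# docs; consult Diffbot's current API reference.
import requests

DIFFBOT_TOKEN = "YOUR_TOKEN"  # placeholder, not a real token


def extract_article_text(url: str) -> str:
    response = requests.get(
        "https://api.diffbot.com/v3/article",
        params={"token": DIFFBOT_TOKEN, "url": url},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()
    # The extracted plain text lives on the first returned object.
    return data["objects"][0].get("text", "")


print(extract_article_text("https://example.com/some-article")[:200])
```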

Data

 

[Chart: In Content Factors Correlation Data]

This data is based on a dataset of over 1.2 million web pages from over 12,000 unique keywords, and the correlations are derived from Spearman’s rank correlation coefficient.
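For anyone unfamiliar with the statistic, here is a tiny, self-contained illustration of computing Spearman’s rank correlation between ranking position and a binary factor. The numbers are made up purely to show the mechanics.

```python
# Toy illustration of Spearman's rank correlation; the data is made up.
from scipy.stats import spearmanr

# Google ranking positions for ten results (1 = top result).
positions = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Whether each result exhibited the factor (e.g. keyword in content).
has_factor = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]

coefficient, p_value = spearmanr(positions, has_factor)
print(f"Spearman's rho = {coefficient:.3f} (p = {p_value:.3f})")
```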

Analysis

Images and Videos

Having images and videos on a page is generally accepted to be good for the user. What I was interested in was seeing whether this translated into increased Google rankings or not, and apparently it doesn’t.
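Measuring the factor itself is simple; the sketch below shows the kind of check involved, using BeautifulSoup, which is not necessarily the exact tooling used in this study.

```python
# Sketch of checking a page for images and embedded videos; illustrative only.
from bs4 import BeautifulSoup

html = """
<html><body>
  <p>Some article text.</p>
  <img src="chart.png" alt="A chart">
  <iframe src="https://www.youtube.com/embed/abc123"></iframe>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
has_images = bool(soup.find_all("img"))
# Embedded video usually means a <video> tag or a YouTube/Vimeo iframe.
has_videos = bool(soup.find_all("video")) or any(
    "youtube.com" in (frame.get("src") or "") or "vimeo.com" in (frame.get("src") or "")
    for frame in soup.find_all("iframe")
)
print(has_images, has_videos)  # True True
```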

Assuming Google likes pages with images and videos (a large but reasonable assumption), this is quite a good test of how low-level Google are willing to go to promote pages that are in line with what they want to see, i.e. are Google willing to reward these pages in the algorithm, or do they prefer the great PageRank algorithm to help them identify what users like and want?

Page Contains Bad Words

This is another fascinating test: I checked whether a page contains any bad word from this list.
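The check itself is a straightforward membership test, along these lines (the word list here is a harmless stand-in for the actual one):

```python
# Sketch of a bad-word presence check; the word list is a mild stand-in.
import re

BAD_WORDS = {"damn", "hell"}  # stand-ins for the actual list


def contains_bad_word(text: str) -> bool:
    # Tokenise on word characters so "hello" doesn't match "hell".
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in BAD_WORDS for token in tokens)


print(contains_bad_word("Hello world"))        # False
print(contains_bad_word("Well, damn it all"))  # True
```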

I thought that using these naughty words would surely get you relegated from the search results, or at least lower your ranking, but the data doesn’t support that.

There is a cautionary note, unfortunately: Google likely run a somewhat more advanced algorithm over content. Instead of checking for the mere presence of these bad words, they probably look at their likely intent and, as a result, either ban a page or give it no penalty at all.

A news article quoting a foul-mouthed sports star shouldn’t be banned from the search results because of its harmless and informative intent.

Because of this, the only pages to show up in my dataset were those that weren’t excluded from the search results and therefore likely received no penalty.

As a result, it would be unfair to draw conclusions regarding Google’s implementation of bans/penalties for pages using these bad words.

Is the Keyword Even in the Content?

Most of us would think this test is a sure thing. Forget keyword density; surely having the keyword in the content of a page is absolutely vital for that page to rank well for that keyword. But again, the data says otherwise.
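To be clear, the factor tested here is a simple binary: does the keyword appear anywhere in the content? Something along these lines:

```python
# Sketch of the binary "keyword in content" check; illustrative only.
def keyword_in_content(keyword: str, content: str) -> bool:
    return keyword.lower() in content.lower()


print(keyword_in_content("running shoes", "Our running shoes are cheap"))  # True
print(keyword_in_content("running shoes", "Our trainers are cheap"))       # False
```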

How can this be? The most basic and obvious step in checking whether a page is about a keyword is to check whether the page contains that keyword; how else would Google narrow down their massive index into something more manageable?

Well, there are anchor text, meta descriptions, title tags and many other areas that Google may look at to check whether a page is about a keyword or not.

But what this astoundingly low correlation suggests is not only that Google likely doesn’t implement such a factor when ranking pages, but also that Google probably isn’t using other super-advanced topic modelling algorithms either, as most of those algorithms are based on the assumption that the keyword is in the content, and all of them are based on the assumption that there is textual content at all.

Distance to 1st Keyword Match

I was a little more sceptical about this factor correlating well, and rightly so. This old-school factor might have been in use in the days of AltaVista, but most of us would agree it’s not so likely to be around any more.
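For reference, this factor is simply how far into the content the first occurrence of the keyword appears; a minimal sketch:

```python
# Sketch of "distance to 1st keyword match": the character offset of the
# first occurrence of the keyword in the content. Illustrative only.
def distance_to_first_match(keyword: str, content: str) -> int:
    # Returns -1 when the keyword never appears.
    return content.lower().find(keyword.lower())


print(distance_to_first_match("shoes", "Buy shoes online"))  # 4
print(distance_to_first_match("shoes", "Buy boots online"))  # -1
```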

Summary

While other topic modelling algorithms might correlate better than the above factors, most of them are based on the simple assumption that a page contains the keyword you are trying to model for, and that there is textual content, both of which are dangerous assumptions for Google to make.

Nobody can make a blanket statement like “Google don’t analyse what you write and don’t care what’s in the content of a page”, but the data does point us in that direction.

If you accept the theory that Google don’t take such a low-level and specific view of pages, or at least don’t weight such a view very highly, then it is easy to come up with a reasonable justification for that theory.

For example, if Google takes such a low-level view, how does it understand infographics or video blogs? How will such an algorithm scale as the web evolves further into a multimedia, and not just a textual, internet?

I don’t believe the data is in any way conclusive, and I do believe that other “clever” topic modelling algorithms may correlate well with rankings, but whether or not that means Google implements such factors within their algorithm is another debate.

What I will say is that I believe Google most likely take a much higher-level view of pages than we think, using links, social media, PageRank and other more scalable factors to determine the relevance and quality of a web page, rather than looking at on-page or in-content factors.

As a result, I would recommend that all webmasters create content, title tags and web pages with the user, not the search engine, in mind, and optimize for other, far more scalable factors like link building, social media, page loading speed, etc.

6 thoughts on “Why Google’s Algorithm Doesn’t Care What You Write”

  1. Brian

    Say a page wants to target a keyword: I think I’ve noticed that G likes it when that page links out to another website with the targeted keyword in the outbound link’s anchor text, and with the targeted keyword found on that other website’s linked-to page.

  2. alanbleiweiss

    I think this is a totally flawed study and the notion is way off the mark. First, just because a specific keyword isn’t in a page does not mean that content is not about the topical focus that a particular keyword generates. Did you do any research on the topical relationship of keyword to content?

    What about intent? Extending a keyword’s root meaning to the topical intent of the content is another serious consideration that, while only in its infancy, is fast becoming a very critical aspect and will only become more important in the coming years.

    Because of the intent factor, and the topical focus factor not being specific to a keyword, the more garbage you have in your content, the less likely the page will be evaluated as relevant as Google and Bing get better at deciphering that intent.

    And just as Panda and Penguin had a major impact, wiping out entire models of methodology, so too will future updates.

    So as far as I’m concerned, this is yet another situation where junk data based on false premises could cause even more novices and hacks to think they found a “winning” formula for putting out more garbage on the web, only to see that garbage tactic burned down the road.

    1. Mark Collier

      Hi Alan

      First of all, this article doesn’t purport to suggest that I have or know a winning SEO formula. In fact, the only piece of SEO advice given in the article is to focus on creating quality content for the user, build links, reduce page loading speed and use social media: all seemingly common-sense suggestions, with the basic premise being that you must create QUALITY content first. It is not an article suggesting you “put out more garbage on the web.”

      In addition, as I thoroughly covered in the article, I haven’t yet looked at more advanced topic modelling factors/algorithms like LDA or TF*IDF, and may do so in the future. As also explained in the article, many of these topic modelling algorithms are based on the premise that the keyword is in the article, and all are based on the premise that there is textual content available to analyse.

      Your “intent” and “topical relationships” would likely fall under one of the above, or another advanced topic modelling or content analysis algorithm.

  3. deskcoder

    Maybe I missed something, but I don’t understand the numbers at the bottom of the grid. So if a page contains images, it is going to rank a little lower in the SERPs?

