Monthly Archives: June 2012

Finally Killing Off Keyword Density

Any SEO worth their salt knows that keyword density is not a ranking factor, yet some people still believe it is a signal in the Google algorithm or is somehow related to ranking well in Google.

Often this myth is perpetuated by the untrained eye, or sold by snake-oil salesmen looking to oversimplify the Google algorithm in the hopes of screwing some SEO noobs out of a few bucks.

This short article’s goal is not to shock you with this amazing new revelation but to provide a single scientifically backed piece of proof that keyword density is to be ignored. This article is a handy link for all the SEO consultants who have clients with notions about keyword density and its importance.

Anecdotal evidence

Let’s think about keyword density and Google’s goals logically.

Google’s goal as a search engine is to provide relevant, useful results to users. That’s why users love Google and keep coming back for more, and that’s the only reason Google surged to dominance in the search engine field: not marketing budgets, not clever tricks, but providing the best results for search queries.

As a user, which would you prefer: a page that consistently and methodically mentions the same keyword, or a page that uses a similar word of the same meaning to make the writing sound more natural and flow? The second page, right?

So why would Google reward a page that is less useful to users than one that is more useful when that runs against their core goal as a search engine?

They wouldn’t.

Scientific evidence

I recently completed a study of over 12,000 keywords, comparing data on various search engine factors to the ranking of the top 100 results for each keyword in Google. In total I looked at 1.2 million web pages. I used Spearman’s Rank Correlation Coefficient to compare the data on the various factors I tested with each web page’s ranking in Google.

For example, I figured out the keyword density for each of the 1.2 million web pages and compared that with each page’s ranking.
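To make the measurement concrete, here is a minimal sketch of one common way to compute keyword density. The exact formula used in the study isn't stated, so this is an assumption: the share of a page's words that belong to an occurrence of the keyword phrase. The function names `tokenize` and `keyword_density` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_density(text, keyword):
    """Share of words in `text` that are part of a `keyword` match.

    Assumed definition (not confirmed by the article):
    (phrase matches * words in phrase) / total words.
    """
    words = tokenize(text)
    phrase = tokenize(keyword)
    if not words or not phrase:
        return 0.0
    n = len(phrase)
    # Count every position where the whole phrase appears in order.
    matches = sum(
        1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase
    )
    return matches * n / len(words)


print(keyword_density("Red shoes are great. Buy red shoes today!", "red shoes"))
```

Other definitions exist (for example, counting matches without weighting by phrase length), and they would give different numbers for multi-word keywords.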

Spearman (the statistical measure I used) gives you a number between -1 and 1 representing the nature and strength of the relationship between keyword density and ranking well in Google.

A negative number means there is a negative relationship, i.e. when keyword density decreases, ranking in Google increases.

As that number gets closer to either -1 or 1 the strength of the relationship increases. So a number near zero means there is no relationship between keyword density and ranking well in Google.
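For readers who want to see how such a number is produced, here is a small pure-Python sketch of Spearman's coefficient: rank both lists (tied values share their average rank), then take the ordinary Pearson correlation of the ranks. The helper names and the toy numbers are made up for illustration and are not the study's code or data.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one list is constant
    return cov / (sx * sy)


# Toy example: invented densities for five pages and an invented score.
densities = [0.01, 0.05, 0.02, 0.03, 0.04]
scores = [3, 1, 5, 2, 4]
print(spearman(densities, scores))
```

In practice a library routine such as `scipy.stats.spearmanr` does the same job; the hand-rolled version is only here to show what the coefficient measures.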

As it turns out, my study showed that the correlation between keyword density and ranking well in Google is -0.028126693. That means there is pretty much no relationship between keyword density and ranking well in Google, and if there is a small relationship then it is a negative one.

But what would a number be without a chart? Let’s compare keyword density’s correlation with some other factors I tested:

Chart: Keyword Density Compared

The final blow

There is no better form of proof than having logic and science confirmed by mother Google. Here’s what Matt Cutts, Google’s #1 webmaster spokesman, has to say on the matter: “Keyword Density: Not really a factor. Yes the keyword should be present but density is not important. Include the keyword but make writing sound natural.”

If logic, science and Google all say keyword density doesn’t matter, then it doesn’t matter, so don’t believe anybody who tells you it does and stop hounding your SEO guy/gal about it.

Why Google’s Algorithm Doesn’t Care What You Write

Most SEOs understand that on page factors have declined, and will continue to decline, in importance within the Google algorithm and SEO.

We all know that Google takes no notice of the meta keywords tag, and little notice of other tags, markup and HTML structures, which has been backed up by my data.

But “in content” factors i.e. ranking signals related to the actual text/content on a page, would appear to be a separate matter.

Spamming the meta description doesn’t hurt a user’s experience, but spamming the actual content the user sees is detrimental to the user and therefore not worth doing, right? Surely damaging the user’s experience isn’t worth those few extra visitors from Google, seeing as they will probably convert less due to the comparatively less helpful content?

It sounds good and makes sense, and as a result we in the SEO community have come to the conclusion that while Google might ignore those other on page factors, they probably have some really smart methods to figure out what a page is about and the quality of the content on that page.

I decided to look at 5 really basic factors that you would think may feature in some element of the Google algorithm or would be closely related to a ranking factor.

While I understand there are better and more advanced methods (LDA, TF*IDF) for comparing the content on a page to a given keyword or judging the quality of a given piece of content I tested these really basic factors to judge Google’s likely weighting of “in content” factors.

I may test more advanced topic modelling algorithms/factors in the future but for now I have stuck to some old information retrieval reliables.

You are probably already aware of the general method for my correlation studies but if this is your first time here please read this and this.

Special thank you to Mike Tung of Diffbot for providing me with free access to their article API, which is undoubtedly the best text extraction service/algorithm out there. And as if to solidify that point, congrats to the team on their recent 2 million dollar funding round.

Data

Chart: In Content Factors Correlation Data

This data is based on a dataset of over 1.2 million web pages from over 12,000 unique keywords and the correlations are derived from Spearman’s Rank Correlation Coefficient.

Analysis

Images and Videos

Having images and videos on a page is generally accepted to be good for the user. What I was interested in was seeing whether this translated into increased Google ranking or not, and apparently it doesn’t.

Assuming Google likes pages with images and videos (a large but reasonable assumption), this is quite a good test of how low-level Google are willing to go to promote pages that are in line with what they want to see, i.e. are Google willing to reward these pages in the algorithm, or do they prefer the great PageRank algorithm to help them identify what users like and want?

Page Contains Bad Words

This is another fascinating test: I checked whether a page contains any bad word from this list.
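A presence check of this kind can be sketched in a few lines. The word list below is a harmless hypothetical stand-in, since the list used in the study is not reproduced here, and matching on whole words is my assumption about how the check was done.

```python
import re

# Hypothetical placeholder list; not the list referenced in the article.
BAD_WORDS = {"darn", "heck"}


def contains_bad_word(text, bad_words=BAD_WORDS):
    """Return True if any whole word in `text` appears in `bad_words`."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return not words.isdisjoint(bad_words)


print(contains_bad_word("Well, heck!"))         # whole-word match
print(contains_bad_word("Heckle the speaker"))  # substring only, no match
```

Matching whole words rather than substrings avoids flagging innocent words that merely contain a listed word.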

I thought that using these naughty words would surely relegate you from the search results or at least lower your ranking, but the data doesn’t support that.

Unfortunately there is a cautionary note. Google likely run a somewhat more advanced algorithm over content: instead of checking for just the presence of these bad words, they probably look at their likely intent and as a result either ban a page or give it no penalty.

A news article quoting a foul mouthed sports star shouldn’t be banned from the search results because of its harmless and informative intent.

Because of this the pages that weren’t excluded from the search results and therefore likely received no penalty were the only ones to show up in my dataset.

As a result it would be unfair to draw conclusions regarding Google’s implementation of bans/penalties towards pages using these bad words.

Is the Keyword Even in the Content?

Most of us would think that this test is a sure thing. Forget keyword density; surely having the keyword in the content of a page is absolutely vital to that page ranking well for that keyword? But again the data says otherwise.

How can this be? The most basic and obvious step for checking whether a page is about a keyword is to check whether the page contains that keyword. How else would Google narrow down their massive index into something more manageable?

Well there is anchor text, meta description, title tags and many other areas that Google may look at to check whether a page is about a keyword or not.

But what this astoundingly low correlation suggests is not only that Google likely doesn’t implement such a factor (when ranking pages) but also that Google probably isn’t using other super-advanced topic modelling algorithms, as most of these algorithms are based on the assumption that the keyword is in the content and all of them are based on the assumption that there is textual content.

Distance to 1st Keyword Match

I was a little more sceptical about this factor correlating well, and rightly so. This old-school factor might have been in use in the days of Alta-Vista, but most of us would agree it’s not so likely to be around any more.
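For anyone unfamiliar with the factor, here is a rough sketch of how it could be measured. I am assuming "distance" means the number of words that come before the first occurrence of the keyword phrase; the function names are hypothetical, and a result of `None` doubles as a simple "keyword is not in the content" check.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def distance_to_first_match(text, keyword):
    """Number of words before the first `keyword` match, or None if absent.

    Assumed definition: a word offset, not a character or byte offset.
    """
    words = tokenize(text)
    phrase = tokenize(keyword)
    n = len(phrase)
    if n == 0:
        return None
    for i in range(len(words) - n + 1):
        if words[i:i + n] == phrase:
            return i
    return None


page = "Cheap flights: find cheap flights to Rome"
print(distance_to_first_match(page, "flights to rome"))
print(distance_to_first_match(page, "hotels") is None)
```

A character-based offset would be an equally plausible reading of the factor; the word-based version is just easier to compare across pages of different formatting.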

Summary

While other topic modelling algorithms might correlate higher than the above factors, most of them are based on the simple assumption that a page contains the keyword you are trying to model for and that there is textual content, which are dangerous assumptions for Google to make.

Nobody can make a blanket statement like, “Google don’t analyse what you write and don’t care what’s in the content of a page” but the data does point us in that direction.

If you are to accept the theory that Google don’t take such a low-level and specific view of pages or at least don’t weight such a view very highly then it is easy to come up with reasonable justification for that theory.

For example if Google takes such a low-level view then how does it understand infographics or video blogs? How will such an algorithm scale with the web as it evolves further into a multi-media and not just a textual internet?

I don’t believe the data is in any way conclusive, and I do believe that other “clever” topic modelling algorithms may correlate well with ranking, but whether or not that means Google implements such factors within their algorithm is another debate.

What I will say is that I believe Google most likely take a much higher-level view of pages than we think, using links, social media, PageRank and other more scalable factors to determine the relevance and quality of a web page rather than looking at on page or in content factors.

As a result I would recommend that all webmasters create content, title tags and web pages with the user and not the search engine in mind and optimize for other far more scalable factors like link building, social media, page loading speed, etc.