Monthly Archives: April 2012

On Page Factors Correlation Data

Sometimes the least interesting results are the most interesting.

Long gone are the days of stuffing metadata and H1 tags full of keywords and getting high rankings; all SEOs worth their salt know that. What we will see over the course of this post is a shocking lack of correlation for the traditional factors that are drilled into all newcomers to SEO by a media machine fuelled by outdated information.

Many industry leading pundits have touted such factors, and sentiment surveys run by SEOMoz show that many experts are likely to be massively out of touch with the true state of SEO.

In the coming days we will also see some fascinating results for other apparently less “in vogue” factors.

In this post we will examine the correlations I have found for 41 on page factors using Spearman’s Rank Correlation Coefficient and shed some subjective light on the possible meaning of these findings.
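To make the calculation concrete, here is a minimal sketch of Spearman’s rank correlation for a single keyword, written in plain Python. The sample data, the function names and the choice to negate the position (so that a positive coefficient means "associated with ranking higher") are assumptions for illustration, not details taken from the study.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns None when either input has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: five results for one keyword, and whether each page
# has the keyword in its title (1) or not (0).
positions = [1, 2, 3, 4, 5]
keyword_in_title = [1, 0, 1, 0, 0]

# Negate positions so that position 1 gets the highest rank (an assumption).
rho = spearman([-p for p in positions], keyword_in_title)
print(round(rho, 3))  # 0.577
```

With only five made-up results the coefficient is large; the point is the mechanics, not the number.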

If you haven’t already, I highly recommend reading up on what a correlation really means and viewing some of my earlier findings, which serve as a benchmark for the findings below. If this is your first time on the site, make sure you read “what the project is all about”.

Once again I will warn of the dangers of taking correlations at face value and stress the importance of a more holistic view of the data below; the above links explain my reasoning.

Quick Recap

TheOpenAlgorithm is a project to bring more science to SEO. The initial phase is a correlation study examining the relationship between the ranking of the top 100 results for over 12,000 keywords and over 150 factors that might impact ranking in Google.

I’ve already published a bunch of data and will continue to do so in the near future.

Looking more long-term, I hope to test more factors on more data and ultimately prove causation between factors and ranking in Google.

There are also some people who have been vital to the success of the project, who I’d like to thank.


Chart: On Page Factors' Correlations

For the eager beavers among us who may enjoy seeing the individual correlation calculations for each keyword I examined, here’s a handy Excel sheet with everything you need.
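As a rough sketch of how per-keyword figures like these can be rolled up into one number per factor, the snippet below averages a list of per-keyword coefficients and reports a standard error. Taking the mean is an assumption about how a headline figure might be produced, and the sample coefficients are invented.

```python
from math import sqrt
from statistics import mean, stdev

def summarise_factor(per_keyword_rho):
    """Average per-keyword correlations, skipping keywords with no usable value.

    A value of None stands for a keyword where the factor had no variance
    (for example, every result had the keyword in its title).
    """
    usable = [r for r in per_keyword_rho if r is not None]
    if len(usable) < 2:
        return None
    return {
        "keywords": len(usable),
        "mean_rho": mean(usable),
        # Standard error of the mean: shrinks as more keywords are included.
        "std_error": stdev(usable) / sqrt(len(usable)),
    }

# Hypothetical per-keyword coefficients for one factor.
rhos = [0.10, -0.05, None, 0.02, 0.01]
summary = summarise_factor(rhos)
print(summary)
```

The standard error is worth keeping next to the mean: with thousands of keywords, even a small average correlation can be measured quite precisely.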


The first thing that strikes you with regard to the chart is the startlingly low correlations for each of the above factors.

Let’s go through the most important correlations:

Title Tags

Chart: Title Tag Related Factors

Historically, title tags have been a favourite of SEOs. They are consistently cited as important, and even the steady demise in the value of on page factors doesn’t seem to have affected their weight in the average webmaster’s mind.

In fact, the two factors with the highest sentiment in a recent Moz survey were having the keyword present in the title tag and having it at the start of the title. The correlations in both this and the Moz study almost directly contradict the industry’s leading thinkers.

The correlations for all of the title tag factors recorded above are so close to zero that they can be considered random.

Classic factors such as having the keyword in your title tag, starting the title with a keyword or using variations of the keyword multiple times all showed near random correlation.

What this likely means is that these factors have little or no bearing on the ranking of a web page in Google.
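For readers wondering what title tag factors like these look like as data, here is a small sketch that turns a title and a keyword into three values. The exact definitions (case-insensitive matching, exact phrase match, counting non-overlapping occurrences) are assumptions; the study may have defined its factors differently.

```python
def title_factors(title, keyword):
    """Derive simple title tag factors for one page and one keyword."""
    # Lower-case and collapse whitespace so matching is not thrown by formatting.
    t = " ".join(title.lower().split())
    k = " ".join(keyword.lower().split())
    if not k:
        raise ValueError("keyword must not be empty")
    return {
        "keyword_in_title": k in t,
        "title_starts_with_keyword": t.startswith(k),
        "keyword_occurrences": t.count(k),
    }

# Hypothetical example page.
factors = title_factors("Cheap Flights to Dublin | Cheap Flights", "cheap flights")
print(factors)
```

Each value would then be paired with the page’s position in the results and fed into the correlation calculation.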

Meta Keywords

Chart: Meta Keywords Related Factors

Despite knowing that Google don’t take the meta keywords into account when ranking a web page, I decided to test related factors for two reasons, the first being to check if there were penalties for using it and the second as a handy benchmark for what can be considered a random correlation in this field of factors.

As we can see, having the keyword in the meta keywords is the 4th most positively correlated factor out of the 41. Anyone trying to justify the importance of another factor on the basis of a similar level of correlation is therefore being silly.
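The benchmark idea can be sketched in a few lines: treat the correlation of a factor Google is known to ignore as the noise floor, and only take seriously the factors that clear it. The factor names and the two small values below are placeholders for illustration; only the .12 for HTML length comes from this post.

```python
def compare_to_benchmark(mean_rho_by_factor, benchmark):
    """Flag which factors have a larger absolute correlation than the benchmark."""
    floor = abs(mean_rho_by_factor[benchmark])
    return {
        name: abs(rho) > floor
        for name, rho in mean_rho_by_factor.items()
        if name != benchmark
    }

# Placeholder values, apart from html_length.
correlations = {
    "keyword_in_meta_keywords": 0.03,  # benchmark: a factor Google ignores
    "keyword_in_title": 0.01,
    "html_length": 0.12,
}

above_noise = compare_to_benchmark(correlations, "keyword_in_meta_keywords")
print(above_noise)
```

A single benchmark is a crude floor, so this is a sanity check rather than a significance test.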

There doesn’t seem to be any penalty incurred as a result of using the meta keywords tag, although I wouldn’t rule it out: abusing the tag is likely one of several ingredients Google may use as a signal of low-quality, over-optimized or spam content.

Meta Description

Chart: Meta Description Related Factors

Again one of the great stalwarts of old-SEO is shown to have little or no value to the ranking of a web page. But before you throw out the meta description tag I must recognise that it has other value in terms of allowing you to control (to some degree) the highly important snippet shown to users on Google SERPs.

Heading Tags

Chart: Heading Tags Correlation Data

The use of keywords in H1, H2, H3 and H4 tags is one of webmasters’ more popular factors in search engine optimization, despite lately being discounted as less important by a number of SEOs.

We see no significant correlation for these factors in our study and thus no reason to consider them important or vital to the successful SEO strategy utilised by the intelligent webmaster.


Chart: Image Related Factors

One of my old personal favourites: keyword usage in img names, alts and titles has been consistently trending for quite a while now. This is primarily because these factors were relatively unpopular in the pre/early Google heyday of SEO, which has kept them off the typical list of on page factors Google may be ignoring.

While these factors may be important in getting traffic from Google Images, it appears likely that they make little difference to search engine ranking.


Chart: Links On Page Factors

While I would have liked to test some more advanced factors related to outbound and internal links, unfortunately that will have to wait until the second iteration of the study as I lacked the required computing power to test such factors.

For the factors I did test, some of which were fairly original and I was quite excited about, I once again saw close to random correlation.

Furthermore I can advise against using abnormal strategies in terms of anchor text or nofollow tags of outbound or internal links for the SEO benefit of the linking page. I recommend linking in the most user-friendly manner and using the nofollow tag for its intended purpose of flagging potentially editorially unsound links.

Unique Tags

Chart: Unique HTML Tags

I was also pretty interested to see the correlation for the presence of the canonical tag on a page, which I thought might be a good sign of an astute webmaster who is protecting against potential duplicate content penalties, but I was also worried about a possible cancelling effect caused by web pages groomed for Google utilising the tag the most. I saw no significant correlation, though I would still advise you to use the tag to protect against Google crawler errors.

Noframes and noscript tags are good indicators of user-conscious sites using iframes or JavaScript respectively. These factors are easy-to-test indicators of sites using these technologies. In the future I may look at different methods to test directly whether utilising iframes or JavaScript impacts ranking, but for now no definite conclusion can be drawn.

Rel=author and rel=me were new tags implemented to allow writers to identify their work on various sites and to allow Google to use their information to provide more relevant search results to users. In addition, they are of course very closely related to a rather hot topic of late: Google+ profiles and their importance to SEO. The mere presence of these tags has no correlation, but perhaps in future studies the power of the linked Google+ profile may be a factor I will test.
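Presence factors like the ones in this section are simple to extract from a page’s HTML. The sketch below uses Python’s standard html.parser module; the class name, the factor names and the decision to look for rel values on both a and link elements are assumptions, not a description of how the study collected its data.

```python
from html.parser import HTMLParser

class TagPresence(HTMLParser):
    """Record whether certain tags and rel values appear anywhere in a page."""

    def __init__(self):
        super().__init__()
        self.found = {
            "canonical": False,
            "noframes": False,
            "noscript": False,
            "rel_author": False,
            "rel_me": False,
        }

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        rel_values = (dict(attrs).get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values:
            self.found["canonical"] = True
        if tag in ("noframes", "noscript"):
            self.found[tag] = True
        if tag in ("a", "link"):
            if "author" in rel_values:
                self.found["rel_author"] = True
            if "me" in rel_values:
                self.found["rel_me"] = True

def tag_presence(html):
    parser = TagPresence()
    parser.feed(html)
    parser.close()
    return parser.found

# Hypothetical page.
page = (
    '<html><head><link rel="canonical" href="http://example.com/"></head>'
    '<body><noscript>Enable scripts</noscript>'
    '<a rel="author" href="/about">About me</a></body></html>'
)
presence = tag_presence(page)
print(presence)
```

Each True or False would become one data point per result, ready to be correlated against position.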


Chart: HTML Length

The length of the HTML and the length of the HTML within the <body> tag were the most highly correlated factors; in fact, with correlations of .12 they could be considered somewhat, if not hugely, significant.

While these factors are probably not implemented within the algorithm, they are good signs of what Google is looking for: quality content, which in many cases means long or at least sufficiently lengthy pages.
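To put a correlation of .12 in perspective, here is a small worked calculation using the standard t approximation for a correlation coefficient. It assumes a single keyword with 100 results, which is a simplification: the study’s figures are drawn from thousands of keywords, so this only illustrates the scale of the effect.

```python
from math import sqrt

def correlation_t_statistic(rho, n):
    """t statistic for testing rho against zero, with n - 2 degrees of freedom."""
    if n < 3 or abs(rho) >= 1:
        raise ValueError("need n >= 3 and |rho| < 1")
    return rho * sqrt((n - 2) / (1 - rho ** 2))

def variance_explained(rho):
    """Share of rank variance accounted for by the factor (rho squared)."""
    return rho ** 2

t = correlation_t_statistic(0.12, 100)
print(round(t, 2))                          # 1.2
print(round(variance_explained(0.12), 4))   # 0.0144
```

For one keyword, a t value of about 1.2 falls short of the roughly 2 usually required at the 5% level, and the factor accounts for about 1.4% of the variance in rank. Averaged over many keywords the estimate becomes far more precise, which is why a small but consistent correlation can still be meaningful.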


The consistent lack of significant correlation for on page/HTML related factors, many of which are reputed to be highly important to SEO, is an indictment of some of the poor information available to SEOs.

You’re relevant, but are you quality?

One mitigating factor stopping us all abandoning the old school SEO strategies is a theory that may explain some of the low correlations. That is that once Google identifies a web page, based on some on page factors, as relevant to a keyword, they then may throw that to one side (or insignificantly weight the relevance level) and then look at other factors such as links, website authority, social media, etc as a measure of quality.

This would mean that it is still important (although less so) to use keywords in some of these areas to make sure Google understands you are relevant to the query, and then to focus on other, more highly correlated factors.

This is of course merely a theory and not one that would require much implementation, assuming that if you are targeting a keyword you will already have quality content relevant to that keyword.

Tests on more results per query (e.g. the top 1,000 or 10,000 results per query) or on a set of less competitive keywords may reveal more about the viability of this theory. Without such tests or causal data I cannot be 100% sure in my devaluation of the importance of these factors, but the data available strongly suggests such a devaluation.
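The relevance-then-quality theory can be sketched as a two-stage process: a relevance gate that on page factors help a page pass, followed by an ordering that ignores relevance entirely. Everything here (the scores, the threshold and the field names) is hypothetical and only illustrates the theory; it is not a claim about how Google works.

```python
def two_stage_rank(pages, relevance_threshold=0.5):
    """Keep pages that clear the relevance gate, then order them by quality alone."""
    relevant = [p for p in pages if p["relevance"] >= relevance_threshold]
    return sorted(relevant, key=lambda p: p["quality"], reverse=True)

# Hypothetical pages with made-up scores between 0 and 1.
pages = [
    {"url": "a.example", "relevance": 0.9, "quality": 0.2},
    {"url": "b.example", "relevance": 0.6, "quality": 0.8},
    {"url": "c.example", "relevance": 0.3, "quality": 0.9},
    {"url": "d.example", "relevance": 0.7, "quality": 0.5},
]

ranking = [p["url"] for p in two_stage_rank(pages)]
print(ranking)  # ['b.example', 'd.example', 'a.example']
```

Under this model the most relevant page (a.example) ranks last among those that pass the gate, so relevance signals would show little correlation with position inside the results even though they decide who gets in.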

If you can manipulate it, Google probably aren’t using it

Arising from these results and the sustained trend of Google moving away from easily manipulated factors I have come to a pretty common sense and simple conclusion: that if a factor is directly and easily manipulated by a webmaster, the weight assigned to that factor is likely to be relatively low.

This would therefore presumably apply to an initiative I believe to be of high importance to search engine optimization. Thus this rule of thumb may be proven incorrect or vindicated in the future; I guess we will just have to wait and see.

Trust the numbers, not the opinions

While I am sure that this post will be fairly controversial as it runs against many long-held beliefs within the industry, the findings have been backed up by the similar results seen in the SEOMoz study and by Google’s continued stressing of the importance of focussing on quality not arbitrary on page factors.

Thus we must learn to trust the numbers. While of course numbers aren’t perfect and I have outlined potential flaws, at least they aren’t open to manipulation by human emotions based on untested and unvindicated opinions that have outlasted their sell-by date.

These results in particular show the need for emotionless analysis of SEO theories and the value, or lack thereof, of opinions and observations not backed by scientifically significant levels of data.

Was Matt Right?

How many times has Matt Cutts said not to focus on individual on page factors but to focus on creating a page of value for the user?

Many of us, myself included, were fooled into believing in the unquestionable value of these factors and saw this as a classic attempt to shape our ideas of “what SEO is”, with Cutts trying to stop people from manipulating weaknesses in the Google algorithm.

As it turns out he was probably speaking the gospel that I personally and many others needed, yet ignored.

The Self-Fulfilling Cycle of Industry Drivel

One clear observation arising from the above results is the need to question commonly-held beliefs not just within SEO but everywhere.

The problem with SEO is that search engines are always changing; theories established 3 years ago may be outdated and need continued testing and re-assessment.

But what we have seen within some areas of the industry is a belief becoming so ingrained in SEO’s psyche that it is near taboo to question it. It is indoctrinated into every individual entering the industry.

As a result, with no particular scientific proof, authors, including myself, wrote about the importance of such factors and about strategies to maximise gain from the manipulation of these factors.

This only further fuels the fire and increases the apparent importance of these arbitrary data points.

Essentially it’s a house of cards. Hopefully this data and further tests will prove to people that these factors just aren’t that important and, as a result, the house will tumble, although I suspect the beliefs of an industry are harder to change than this.


While I don’t rule out the viability of on page SEO, I believe that Google use much higher level factors than we within the industry have thought about or used. I am aware of enhanced relevancy algorithms such as TF*IDF and think these efforts at better understanding Google are admirable.
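For anyone unfamiliar with TF*IDF, here is a minimal sketch of the textbook weighting: a term scores highly for a document when it is frequent in that document and rare across the collection. The whitespace tokenisation and the plain log(N / df) form of the inverse document frequency are simplifying assumptions, and nothing here describes what Google actually computes.

```python
from collections import Counter
from math import log

def tf_idf(documents):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical three-document collection.
docs = ["seo tips", "seo tools", "seo news today"]
scores = tf_idf(docs)
print(scores[0])
```

Because "seo" appears in every document it scores zero everywhere, while "tips" is distinctive for the first document; that is the sense in which TF*IDF goes beyond simply counting keywords.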

On the whole I am fairly pessimistic about the value of on page SEO under current theories and strategies. I would suggest SEOs spend no more than 5% of their time on the subject, much of which should be focussed on creating a structure that encourages well-optimized content (similar to many of the WordPress features).

I would recommend focussing marketing budgets on creating unique, high quality content optimized for the user, and focussing the majority of the SEO effort on building a beautiful, fast website architecture, link building, social media exposure and other well correlated factors.