The Science of Correlation Studies

The initial part of my project to bring more science to SEO is based on doing a correlation analysis on individual potential factors within the Google algorithm.

An incredibly important part of the project is making an impact on how SEOs and webmasters do business. A core part of taking action based on correlation studies is understanding what a correlation study is, what its weaknesses are and what it’s good for.

Don’t worry if you don’t know anything about statistics, maths, science, SEO or programming; it’s not required to understand what a correlation is. Heck, I didn’t know what it was until a couple of months ago.

I’ll try to keep the jargon to a minimum, and hopefully you will be an expert on Spearman (whoops, that’s jargon) by the end of the post.

A Relationship

A correlation is just a fancy name for a relationship. Basically what I am trying to find out is:

  • Is there a relationship between a factor and ranking well in Google?
  • If so how big is the relationship?
  • And is it a positive relationship (helps rankings) or is it negative (lowers rankings)?

To be technical a correlation is a relationship between two variables.

In our case the two variables will always be the ranking of a page in Google and whatever factor we’re testing for.

Throughout this article I’ll use PageRank as our example factor.

So if I was trying to prove a relationship between PageRank and ranking in Google, I would go to Google, take the top 10, 20, 50, 100 or whatever number of results, and find the PageRank score for each of these results.

I would then calculate whether there was a relationship, the size of the correlation and the type (positive or negative).

But how do you calculate that?

Spearman’s Rank Correlation Co-efficient

Spearman’s Correlation Co-efficient is one of the many maths formulas used to calculate correlation.

For maths junkies, the reason I’m not using the others is that Spearman’s works on ranks and doesn’t assume a linear relationship between the variables. That just means that the type of data I’ll be testing may be better suited to Spearman’s than to the other formulas.

Here’s what the formula looks like:

[Image: Spearman’s Rank Correlation Co-efficient (the formula)]

I won’t go into what each part means, but if you’re interested here’s a great video that explains it for stats newbies.

The basic idea is that you feed your two variables into this formula and it gives you back a number between -1 and 1.
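For reference, a common textbook form of the formula, which is exact when no two values share a rank, is sketched below. Treat it as a reference rather than a definitive statement of the version used in this project. The symbols are the conventional ones: d_i is the difference between the two ranks of result i, and n is the number of results being compared.

```latex
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n \left( n^{2} - 1 \right)}
```

When the two rankings agree perfectly every d_i is 0, so the result is 1. When they are exact opposites the fraction works out to 2, so the result is -1.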

What the number symbolises is the actual relationship/correlation:

  • If the sign is negative, i.e. between 0 and -1, then the relationship is negative (hurts your rankings). E.g. maybe page loading time will show a negative correlation, because as loading time increases, ranking should get worse.
  • If the sign is positive, i.e. between 0 and +1, then the relationship is positive (helps your rankings). E.g. PageRank should have a positive correlation.
  • Think of it this way: if one increases and the other increases then the correlation should be positive, and if one increases while the other decreases then the correlation should be negative.
  • The closer you are to one, either positive or negative one, the stronger the relationship. I.e. a correlation of 0.4 is a stronger correlation than 0.25. The same goes for negative numbers: -0.5 is a stronger negative correlation than -0.15, so it would be worse for your site’s ranking.
  • If you get a correlation at 0 or close to 0 it means there is little or no correlation.

To recap, the correlation will be between -1 and 1. A negative number means a negative relationship and a positive number means a positive one. The closer the number is to either -1 or 1, the stronger the relationship. A correlation at or close to 0 means there is no relationship or a very weak one.

Example Spearman Calculation

In this example (totally fictitious) we are looking at the top 10 results in a Google search. And we have gone and found the PageRank for each of these results. Below is the table representing the two variables.

Ranking    PageRank
#1         6
#2         7
#3         4
#4         3
#5         5
#6         4
#7         3
#8         2
#9         1
#10        1

When you feed this information into the formula you get a correlation of 0.894 (remember this is only an example).

This correlation would mean there is a very strong positive relationship between PageRank and ranking well in Google.
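The fictitious example above can be reproduced with a short script. This is only a sketch under a few assumptions: it uses the simplified textbook formula, it ranks PageRank from highest to lowest so that a positive result means "higher PageRank goes with better positions", and tied scores share the average of the ranks they cover. The function names are made up for illustration and are not the code used in the project.

```python
def average_ranks(values, descending=False):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the group
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_simple(positions, factor_values):
    """Simplified Spearman: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(positions)
    position_ranks = average_ranks(positions)  # result #1 gets rank 1
    factor_ranks = average_ranks(factor_values, descending=True)  # best score gets rank 1
    d_squared = sum((p - f) ** 2 for p, f in zip(position_ranks, factor_ranks))
    return 1 - 6 * d_squared / (n * (n * n - 1))


positions = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
pagerank = [6, 7, 4, 3, 5, 4, 3, 2, 1, 1]
rho = spearman_simple(positions, pagerank)
print(round(rho, 3))  # 0.894
```

With tied values, as in this table, the simplified formula is only an approximation, so a tie-corrected calculation can give a very slightly different number.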

What I do is calculate the Spearman correlation between the top 100 results and whatever factor I am testing for, once for each of 12,000+ searches. I then get the mean (average) of those correlations, and that’s the one we’re interested in.

Gathering the Data

So we’re going to go a bit off track here and explain how I gather my data to be analysed.

The first thing I did was get myself a list of over 12,000 keywords: 800 from each of the 22 categories on Google Adwords Keywords Tool, with the duplicates removed.

Then I wrote a program to go and get the top 100 results for each keyword, so approx. 1.2 million URLs. Then I removed Google News, Images, Videos, Maps and Products results, because they are generated by a slightly different algorithm and we can’t control them anyway.
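The two clean-up steps described so far (removing duplicate keywords and dropping non-standard results) might look something like the sketch below. Everything in it is an assumption made for illustration: the result records, the "type" labels and the choice to treat keywords as duplicates regardless of case are hypothetical, and it is not the program used for this project.

```python
# Hypothetical labels for the result types to drop (assumed for illustration).
EXCLUDED_TYPES = {"news", "images", "videos", "maps", "products"}


def unique_keywords(keywords):
    """Remove duplicate keywords (ignoring case and spacing), keeping first-seen order."""
    seen = set()
    unique = []
    for keyword in keywords:
        normalised = keyword.strip().lower()
        if normalised and normalised not in seen:
            seen.add(normalised)
            unique.append(normalised)
    return unique


def organic_results(results):
    """Keep ordinary web results only, preserving their order on the page."""
    return [r for r in results if r["type"] not in EXCLUDED_TYPES]


keywords = unique_keywords(["Car Insurance", "car insurance ", "cheap flights"])
page = [
    {"url": "https://example.com/a", "type": "web"},
    {"url": "https://example.com/news", "type": "news"},
    {"url": "https://example.com/b", "type": "web"},
]
kept = organic_results(page)
print(keywords)                  # ['car insurance', 'cheap flights']
print([r["url"] for r in kept])  # ['https://example.com/a', 'https://example.com/b']
```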

Then I choose a factor to test, in this case PageRank, and figure out a way to get the desired data for that factor. For PageRank I would write a program that goes to the web, asks a special URL for the PageRank of each URL and stores it in a database.

Then I take the PageRank ranking and the search engine ranking and calculate the Spearman for each keyword. I store the correlation for each keyword, bring it into an Excel sheet and get the average.
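The per-keyword calculation and the final average can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual program: the data layout (a dictionary mapping each keyword to a list of position and factor-value pairs), the function names and the sample numbers are all made up, and it uses the simplified Spearman formula with the factor ranked from highest to lowest.

```python
from statistics import mean


def ranks(values, descending=False):
    """Average ranks (1 = first); tied values share the mean of their positions."""
    ordered = sorted(values, reverse=descending)
    # First 1-based position of the value, plus half of the extra tied slots.
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]


def spearman(positions, factor_values):
    """Simplified Spearman: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(positions)
    d_squared = sum(
        (p - f) ** 2
        for p, f in zip(ranks(positions), ranks(factor_values, descending=True))
    )
    return 1 - 6 * d_squared / (n * (n * n - 1))


def mean_correlation(data_by_keyword):
    """Return the correlation for each keyword and the mean across keywords."""
    per_keyword = {}
    for keyword, rows in data_by_keyword.items():
        if len(rows) < 2:
            continue  # a correlation needs at least two results
        positions = [position for position, _ in rows]
        values = [value for _, value in rows]
        per_keyword[keyword] = spearman(positions, values)
    return per_keyword, mean(list(per_keyword.values()))


# Made-up (position, PageRank) pairs for three hypothetical keywords.
data = {
    "keyword a": [(1, 5), (2, 4), (3, 3), (4, 2)],
    "keyword b": [(1, 2), (2, 3), (3, 4), (4, 5)],
    "keyword c": [(1, 6), (2, 7), (3, 4), (4, 3)],
}
per_keyword, average = mean_correlation(data)
print(round(average, 3))  # 0.267
```

Averaging in a script instead of a spreadsheet is purely a convenience here; the result is the same mean of per-keyword correlations.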

Obviously each factor requires a lot of thought, some programming and a ton of testing but that’s the basic process.

Weakness – Correlation is not Causation

Having a relationship doesn’t mean a factor is in the Google algorithm nor does it mean that there will be some indirect increase in ranking.

For example there might be a positive relationship between Facebook likes of a page and ranking well. But it is possible (if not likely) that Google isn’t counting this factor in the algorithm and the positive correlation is there because pages that satisfy other factors e.g. links, on page factors, etc. may get more Facebook likes.

There is also a possibility that the factor has a knock-on effect. For example, the more Facebook likes a page gets, the more people see it, the more normal links it gets, the more PageRank, the higher the ranking.

So it is possible to get a correlation without the factor being in the algorithm.
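A toy example makes this concrete. In the sketch below the numbers are invented, and the pretend "algorithm" orders ten pages by link count alone. Likes are never looked at, yet because well-linked pages also tend to collect likes, the likes still show a strong positive correlation with ranking. The names and data are hypothetical and the simplified Spearman formula is assumed.

```python
def rank_desc(values):
    """Rank distinct values so the largest gets rank 1 (this toy data has no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman(positions, factor_values):
    """Simplified Spearman: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(positions)
    d_squared = sum((p - r) ** 2 for p, r in zip(positions, rank_desc(factor_values)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up numbers for ten pages; likes roughly follow links, with some noise.
links = [50, 40, 35, 30, 22, 18, 12, 9, 5, 2]
likes = [160, 100, 120, 85, 96, 44, 36, 32, 7, 10]

# Toy "algorithm": order pages by link count only. Likes are never consulted.
by_links = sorted(range(len(links)), key=lambda i: links[i], reverse=True)
positions = [by_links.index(i) + 1 for i in range(len(links))]

likes_rho = spearman(positions, likes)
print(round(likes_rho, 3))  # 0.964
```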

Another, possibly more relatable, example is that of a professional singer. There would probably be a strong correlation between being a professional music artist and going to lots of high profile parties. That doesn’t mean that if you start going to these parties you will suddenly have an amazing voice. This is a classic case of correlation not causation. Going to the parties doesn’t cause you to be a successful singer; it’s just a by-product.

This is extremely important: correlation is only a guideline. The correlation needs to be viewed critically and tested by other means before you can say more definitively whether something is a factor in the great Google algorithm or not.

Counter Argument – SEO’s not About Causation

From a scientific point of view this is a flaw with the correlation study. It still holds serious scientific significance but if your goal is to prove whether a factor is indeed a factor in the Google algorithm then correlation studies are only a very good guideline.

But if your goal is to increase the ranking of a site or find out what it takes to rank higher, which is what SEOs are interested in, then correlation studies have greater value.

Take the Facebook likes example, where the increase in Facebook likes didn’t directly cause an increase in ranking but affected other factors which did. Does this mean that as an SEO you should ignore Facebook likes?

No, in this case it means that you should try to increase your Facebook likes to cause that knock-on effect.

So really it depends what your goal is.

If you are trying to figure out what it takes to rank high in Google then correlation studies are very good (but not perfect). If your goal is to prove whether something is a factor in the Google algorithm, they’re fairly good but of less value.

 

I hope I did a good job of explaining correlation and that you understand it better now. The main takeaways are that correlation is code for a relationship between two things, e.g. ranking and a factor such as PageRank; that the correlation is between -1 and 1, with negative numbers meaning a negative relationship and positive numbers a positive one; and that the closer you get to either -1 or 1, the stronger the relationship.

Finally, correlation results should be looked at as a great guide for how to do SEO and what’s important to focus on, but they are not the holy grail and should be tested further.

I believe this type of science and hard hitting statistics in SEO will vastly improve the industry by providing credible data and information that’s far more reliable than guesses or observations over one or two websites. I’m looking forward to sharing with you my findings and seeing what interesting tests arise from the correlations.

2 thoughts on “The Science of Correlation Studies”

  1. Icel

    Hey Mark,
    I got to your site via a link on SEOmoz – kudos on that.
    Really like what you’re doing here. Keep up the good work, hope to see you develop this project farther.
