On Page Factors Correlation Data

Sometimes the least interesting results are the most interesting.

Long gone are the days of stuffing metadata and H1 tags full of keywords to earn high rankings; all SEOs worth their salt know that. What we will see over the course of this post is a shocking lack of correlation for the traditional factors drilled into all newcomers to SEO by a media machine fuelled by outdated information.

Many industry-leading pundits have touted such factors, and sentiment surveys run by SEOMoz suggest that many experts are massively out of touch with the true state of SEO.

In the coming days we will also see some fascinating results for other apparently less “in vogue” factors.

In this post we will examine the correlations I have found for 41 on page factors using Spearman’s Rank Correlation Coefficient and shed some subjective light on the possible meaning of these findings.

If you haven’t already, I highly recommend reading up on what a correlation really means, viewing some of my earlier findings as a benchmark for the findings below and, if this is your first time on the site, making sure you read “what the project is all about”.

Once again I will warn of the dangers of taking correlations at face value and stress the importance of a more holistic view of the data below; the above links explain my reasoning.

Quick Recap

TheOpenAlgorithm is a project to bring more science to SEO. The initial phase is a correlation study examining the relationship between the ranking of the top 100 results for over 12,000 keywords and over 150 factors that might impact ranking in Google.

I’ve already published a bunch of data and will continue to do so in the near future.

Looking more long-term, I hope to test more factors on more data and ultimately prove causation between factors and ranking in Google.

There are also some people who have been vital to the success of the project, who I’d like to thank.


[Chart: On Page Factors’ Correlations]

For the eager beavers among us who may enjoy seeing the individual correlation calculations for each keyword I examined, here’s a handy Excel sheet with everything you need.


The first thing that strikes you with regard to the chart is the startlingly low correlations for each of the above factors.

Let’s go through the most important correlations:

Title Tags

[Chart: Title Tag Related Factors]

Historically, title tags have been a favourite of SEOs. Consistently cited as important, even the steady demise in the value of on page factors doesn’t seem to have affected their weight in the average webmaster’s mind.

In fact the two factors with the highest sentiment in a recent Moz survey were having the keyword present in the title tag and having it at the start of the title. The correlations in both this and the Moz study almost directly contradict the industry’s leading thinkers.

The correlations of all the title tag related factors recorded above are so close to zero that they can be considered random.

Classic factors such as having the keyword in your title tag, starting the title with a keyword or using variations of the keyword multiple times all showed near random correlation.

What this likely means is that these factors have little or no bearing on the ranking of a web page in Google.

Meta Keywords

[Chart: Meta Keywords Related Factors]

Despite knowing that Google don’t take the meta keywords into account when ranking a web page, I decided to test related factors for two reasons: first, to check whether there are penalties for using the tag, and second, as a handy benchmark for what can be considered a random correlation in this field of factors.

As we can see, having the keyword in the meta keywords is the 4th most positively correlated factor out of the 41, so anyone trying to justify the importance (based on correlation data) of other factors with similar levels of correlation is being silly.

There doesn’t seem to be any penalty incurred as a result of using the meta keywords tag, although I wouldn’t rule it out; abusing the tag is likely one of several ingredients Google may use as a signal of possible low-quality, over-optimized or spam content.

Meta Description

[Chart: Meta Description Related Factors]

Again one of the great stalwarts of old-school SEO is shown to have little or no value to the ranking of a web page. But before you throw out the meta description tag, I must recognise that it has other value: it allows you to control (to some degree) the highly important snippet shown to users on Google SERPs.

Heading Tags

[Chart: Heading Tags Correlation Data]

The use of keywords in H1, H2, H3 and H4 tags is one of webmasters’ more popular factors in search engine optimization, despite lately being discounted as less important by a number of SEOs.

We see no significant correlation for these factors in our study and thus no reason to consider them important or vital to the successful SEO strategy utilised by the intelligent webmaster.


[Chart: Image Related Factors]

One of my old personal favourites: keyword usage in img names, alts and titles has stayed in fashion for quite a while now, primarily because these factors were relatively less popular in the pre/early-Google hay-days of SEO, and so escaped the typical list of on page factors Google may be ignoring.

While these factors may be important in getting traffic from Google Images, it appears likely that they are indifferent towards increased search engine ranking.


[Chart: Links On Page Factors]

While I would have liked to test some more advanced factors related to outbound and internal links, unfortunately that will have to wait until the second iteration of the study as I lacked the required computing power to test such factors.

For the factors I did test, some of which were fairly original and I was quite excited about, I once again saw close to random correlation.

Furthermore I can advise against using abnormal strategies in terms of anchor text or nofollow tags of outbound or internal links for the SEO benefit of the linking page. I recommend linking in the most user-friendly manner and using the nofollow tag for its intended purpose of flagging potentially editorially unsound links.

Unique Tags

[Chart: Unique HTML Tags]

I was also pretty interested to see the correlation for the presence of the canonical tag on a page, which I thought might be a good sign of an astute webmaster protecting against potential duplicate content penalties. I was, however, worried about a possible cancelling effect caused by web pages groomed for Google utilising the tag the most. In the end I saw no significant correlation, though I would still advise you to use the tag to protect against Google crawler errors.

Noframes and noscript tags are good indicators of user-conscious sites using iframes or JavaScript respectively, and these factors are easy-to-test indicators of sites using those technologies. In the future I may look at methods to test directly whether utilising iframes or JavaScript impacts ranking, but for now no definite conclusion can be drawn.
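As an illustration of how such a simple presence factor can be computed (this is my own sketch, not the project’s actual code), a binary “tag present” check needs only a few lines:

```python
import re

def tag_present(html: str, tag: str) -> int:
    """Binary factor: 1 if an opening <tag ...> appears in the page, else 0."""
    return 1 if re.search(r"<\s*" + re.escape(tag) + r"\b", html, re.I) else 0

page = "<html><body><noscript>Please enable JavaScript.</noscript></body></html>"
print(tag_present(page, "noscript"))  # 1
print(tag_present(page, "noframes"))  # 0
```

A real crawler would parse the HTML properly, but a presence/absence factor is this cheap to extract, which is exactly why it is easy to test at scale.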

Rel=author and rel=me are new tags implemented to allow writers to identify their work on various sites and to allow Google to use that information to provide more relevant search results to users. They are also, of course, very closely related to a rather hot topic of late: Google+ profiles and their importance to SEO. The mere presence of these tags has no correlation, but the power of the linked Google+ profile may be a factor I test in future studies.


[Chart: HTML Length]

The length of the HTML, and of the HTML within the <body> tag, were the highest correlated factors; in fact, with correlations of .12 they could be considered somewhat, if not hugely, significant.

While these factors probably are not implemented within the algorithm, they are good signs of what Google is looking for; quality content, which in many cases means long or at least sufficiently lengthy pages.


The consistent lack of significant correlation for on page/HTML related factors, many of which are reputed to be highly important to SEO, is an indictment of some of the poor information available to SEOs.

You’re relevant, but are you quality?

One mitigating factor stopping us all from abandoning the old school SEO strategies is a theory that may explain some of the low correlations: once Google identifies a web page as relevant to a keyword, based on some on page factors, it may then throw that relevance to one side (or weight it insignificantly) and look at other factors such as links, website authority and social media as a measure of quality.

What this would mean is that it would still be important (although less so) to use keywords in some of these areas to make sure Google understands you are relevant to the query, and then focus on other, higher correlated factors.

This is of course merely a theory and not one that would require much implementation, assuming that if you are targeting a keyword you will already have quality content relevant to that keyword.

Tests on more results per query, e.g. the top 1,000 or 10,000 results per query, or on a set of less competitive keywords, may be more revealing as to the viability of this theory. Without such tests or causal data available I cannot be 100% sure in my devaluation of the importance of these factors, but the data available strongly suggests such a devaluation.

If you can manipulate it, Google probably aren’t using it

Arising from these results and the sustained trend of Google moving away from easily manipulated factors I have come to a pretty common sense and simple conclusion: that if a factor is directly and easily manipulated by a webmaster, the weight assigned to that factor is likely to be relatively low.

This would therefore presumably apply to Schema.org, an initiative I believe to be of high importance to search engine optimization. This rule of thumb may thus be proven incorrect or vindicated in the future; I guess we will just have to wait and see.

Trust the numbers, not the opinions

While I am sure that this post will be fairly controversial as it runs against many long-held beliefs within the industry, the findings have been backed up by the similar results seen in the SEOMoz study and by Google’s continued stressing of the importance of focussing on quality not arbitrary on page factors.

Thus we must learn to trust the numbers. While of course numbers aren’t perfect, and I have outlined potential flaws, at least they aren’t open to manipulation by human emotions based on untested and unvindicated opinions that have outlasted their sell-by date.

These results in particular show the need for emotionless analysis of SEO theories and the value, or lack thereof, of opinions and observations not backed by scientifically significant levels of data.

Was Matt Right?

How many times has Matt Cutts said not to focus on individual on page factors but to focus on creating a page of value for the user?

Many of us, myself included, fooled into believing in the unquestionable value of these factors, saw this as a classic attempt to shape our ideas of “what SEO is” and to stop people from manipulating weaknesses in the Google algorithm.

As it turns out he was probably speaking the gospel that I personally and many others needed, yet ignored.

The Self-Fulfilling Cycle of Industry Drivel

One clear observation arising from the above results is the need to question commonly-held beliefs not just within SEO but everywhere.

The problem with SEO is that search engines are always changing; theories established 3 years ago may be outdated and need continued testing and re-assessment.

But what we have seen within some areas of the industry is a belief becoming so ingrained in SEOs’ psyche that it is near taboo to question it. It is indoctrinated into every individual entering the industry.

As a result, with no particular scientific proof, authors (including myself) wrote about the importance of such factors and about strategies to maximise gain from manipulating them.

This only further fuels the fire and increases the apparent importance of these arbitrary data points.

Essentially it’s a house of cards, and hopefully this data and further tests will prove to people that these factors just aren’t that important, and as a result the house will tumble; although I suspect such a major change in the beliefs of an industry is harder won than this.


While I don’t rule out the viability of on page SEO, I believe that Google use much higher level factors than we within the industry have thought about or used. I am aware of enhanced relevancy algorithms such as TF*IDF and think these efforts at better understanding Google are admirable.

On the whole I am fairly pessimistic about the value of on page SEO under current theories and strategies. I would suggest SEOs spend no more than 5% of their time on the subject, much of which should be focussed on creating a structure that generates well optimized content (similar to many of the WordPress features).

I would recommend focussing marketing budgets on creating unique, high quality content optimized for the user, and focussing the majority of the SEO effort on building a beautiful, fast website architecture, link building, social media exposure and other well correlated factors.

Domain Name SEO

[Image: cartoon of wrenches building a domain name]

Last week I published data on 342,740 domains that I extracted from the dataset I have built for TheOpenAlgorithm project.

What I looked at was mostly from a user’s point of view: not very scientific, but pretty significant. I wanted to show what users were used to and found normal in terms of domains, which TLD (.com, .org, .net, etc.) they expected, plus some number crunching on average domain lengths.

If you don’t have time to read the post there were two basic findings.

  • If you’re buying a domain, get a .com. It turns out .com domains are heavily represented in the dataset, with 78% of the domains I looked at being a .com. Even if you want a local TLD, e.g. .fr or .ie, or you want to go for .org, .net, .info, etc., it’s really important that you try to get your hands on the .com version of the domain; users are so used to it that they assume a domain is .com unless it’s made particularly obvious to them.
  • The second finding, if less interesting, was that the average domain length was 15 characters, with the most common length being 8 characters. Obviously common sense applies here: shorter is better, but the domain must make sense to your audience.


Today I’m publishing my first correlation data!

If you’re not sure what a correlation study is or what it’s good for, read this article!

A correlation shows the size of the relationship between ranking well and whatever factor I am testing.

All correlations are between -1 and +1, with a negative (-) number meaning a negative relationship i.e. that as you do more of it your rankings are likely to suffer. And a positive number means as you do more of it your rankings are likely to increase.

The closer the correlation is to either extreme (-1 or +1), the stronger the relationship, i.e. the more powerful/important the factor.

For example a 0.8 correlation is more significant than a 0.3.

Correlation studies aren’t perfect and don’t prove causation. If you haven’t heard of correlations before then this post on what they are, how to do one and what are their weaknesses will be very helpful.

I’ll be looking at a few factors at a time and then publishing the data on that set of factors. I’ll be posting more data on various factors over the coming weeks and months and I hope to break them down into actionable chunks and sections within the algorithm that SEOs/readers can digest them easily.

But today I’m looking at the factors that SEOs and new webmasters should consider before they buy a domain name.

Domain TLD

Here’s the correlation data for the TLD of each domain.

[Chart: Domain TLDs/Endings Correlations]


The correlations are very low, which likely means that there is little or no relationship, but the minor correlations do show a mild preference for .com and .org domains, while .net, .info and .us were negatively correlated.

It’s hard to say what is causing this small correlation. We know Google has taken action against individual TLDs in the past, and it seems probable that some TLDs are targeted as spam or treated as credible, but the small correlations seen above are more likely to be a combination of Google’s preferences in domain TLDs and the types of sites that register each TLD.

While the correlation data above isn’t that significant, it does seem to be in line with other data I have seen (some of which is below), which leads me to believe there is merit in it. If you were down to a straight shoot-off between two domain TLDs (and couldn’t get both), this would be a neat tie-breaker.

Probably not something to get too worried about but interesting data.

As you will see throughout this post .com domains are way out in front and .org domains are the next best, but it appears that beyond those you are in the danger zone unless you are registering a local TLD like .fr or .co.uk.

Exact Match Domains

One of the most heralded and contentious factors among SEOs is exact match domains, i.e. domains that are the same as what the user searched for, e.g. keyword: “the open algorithm”, exact match = http://www.theopenalgorithm.com.
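For illustration, classifying a result as an exact match can be sketched in a few lines (my own sketch, not the study’s code; it assumes a bare domain with an optional "www." prefix):

```python
def is_exact_match(keyword: str, domain: str) -> bool:
    """True when the domain name (minus any "www." and the TLD) equals
    the keyword with its spaces stripped out."""
    name = domain.lower()
    if name.startswith("www."):
        name = name[4:]
    name = name.split(".")[0]  # drop the TLD
    return name == keyword.lower().replace(" ", "")

print(is_exact_match("the open algorithm", "www.theopenalgorithm.com"))  # True
print(is_exact_match("seo", "seomoz.org"))                               # False
```

Splitting on the first dot also handles compound endings like .co.uk, since only the leading label is compared against the keyword.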

Previous studies by SEOMoz have shown it to be a very powerful factor.

There are a number of points to be aware of before reviewing the findings.

Exact match domains make sense from a user’s point of view. If a domain is an exact match for the search query then it is likely to be relevant to the query. In addition it is possible that the user is searching specifically for that site.

There’s a potential anchor text boost in having an exact match domain. When you own an exact match domain it is likely that more sites will link to you with the keyword as the anchor text, for example I would imagine that tons of webmasters link to SearchEngineLand just as I did there, with the site’s name as its anchor text.

As a result there is the potential for the correlation for exact match domains to be slightly inflated due to its presumed benefit to anchor text related factors, unless Google have a clause in their algorithm to negate this effect.

If a domain is a company or some organisation then surely the fact that it is an entity is a better reason to show an exact match domain as opposed to purely the fact that it is an exact match domain. For example seomoz.org is highly relevant to the search query “seomoz”, but is seo.com the best for the search query “SEO”?

We know that Google do a reasonable job of figuring out whether a domain is an entity, which some SEOs believe is the reason why exact match domains do so well and why, as an SEO tactic, they might not be as powerful as the data suggests. This is down to the fact that domains that are entities, e.g. “seomoz.org”, are exact matches when people search for that entity. The argument is that these entity searches inflate the numbers for all the non-entity domains that happen to be exact matches in these types of studies and in my dataset.

Having said all that, I’ll let the data talk and you can infer your own conclusions.

[Chart: Exact Match Domains]


Comments: These correlations are really interesting and very significant. They are pretty much in line with SEOMoz’s results, although SEOMoz only tested for exact match and exact match .com domains. I have run tests on other less influential factors, and comparing the above data to those results and the SEOMoz correlation study, it seems that exact match domains are one of the most powerful factors, though they appear to have declined in significance over the last few years.

SEOMoz reported a 0.38 correlation for .com exact matches in their original study and then a 0.22 correlation in their latest, my data shows a further decline which seems important to note.

I would speculate that this decrease in the correlation of exact match domains can be attributed to Google refining their algorithms to detect when an exact match domain has been bought for ranking benefits and when it is an entity that deserves the push up the rankings.

The correlation data broken down into the various domain TLDs is important because beyond .com and .org exact matches there is a significant drop off in terms of a relationship between having one and ranking better in Google.

There are a couple of reasons why this may be the case:

  • Google may value these domain TLDs less and implement algorithms to penalise (or reward others more) for exact match domains.
  • The domains with the less common TLDs may have been bought because there is less demand for them and therefore getting a domain that the webmaster believes will rank well is easier. Thus it would be other algorithmic factors (potentially the entity extraction ones) that would penalise the non .com exact matches.

This information is very important if you are registering a domain; it seems highly likely that something, either directly in the algorithm or an indirect factor, is causing .com exact matches to rank significantly higher than their counterparts.

I parsed the search results from proxies within the US, which doesn’t have a widely used national TLD (.us is its official one, but in practice it’s not widespread and doesn’t hold the value of other local TLDs), so it’s hard to tell whether local TLDs that are more popular in their home countries, for example .co.uk or .ie, would have similar correlations to .com exact match domains.

The correlation would most likely vary based on the country and how the local TLD is used and managed in that country.

In the future I hope to run crawls from proxies within other English speaking countries with these prominent local TLDs, and will then be able to answer that question.

Note: The amount of data available for .info and .us domains wasn’t anywhere near as much as for the other TLDs, and thus the size of the scientific error is likely to be higher. I have a very large dataset (1.2 million URLs), and because of this, and the fact that the results seem in line with both common sense and the TLD correlation data above, I suggest that they are relatively accurate, though likely to be slightly less accurate than the .com, .org and .net correlations.

Hyphenated Exact Match Domains

Ah yes, the good old hyphenated exact match. Often portrayed as the next best option if you can’t get that exact match .com domain.

Let’s see if that portrayal has merit:

[Chart: Hyphenated Exact Match Domains]


Wow. That’s really interesting. Hyphenated exact match domains are just nowhere near as correlated as exact match domains without the hyphens.

Plus they aren’t very user friendly, so maybe it’s time to rethink our strategy on hyphens?

The only potential caveat vs. the non-hyphenated correlation data is that an entity is even less likely to use hyphens in its domain name.

Once again we see .com on top, with .org in second.

Note: due to insignificant quantities of data I didn’t test .info or .us domains for this factor.

More Data

Now we’re delving into some more unique factors.

[Chart: More Domain Related Correlation Data]

A partial match domain is when you have the keyword in your domain but it’s not an exact match domain, e.g. “tech” and techcrunch.com.

Partial match ratio means the percentage of the domain that is taken up by the partial match. With the TechCrunch example it would be: “tech” = 4 characters, domain (“techcrunch”) = 10 characters, partial match ratio = (4/10)*100 = 40%.
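That calculation is simple to sketch in code (the helper name is my own invention, not from the study):

```python
def partial_match_ratio(keyword: str, domain_name: str) -> float:
    """Percentage of the domain name (without the TLD) taken up by the keyword.

    Returns 0 when the keyword doesn't appear in the domain at all.
    """
    keyword = keyword.replace(" ", "").lower()
    domain_name = domain_name.lower()
    if keyword not in domain_name:
        return 0.0
    return len(keyword) / len(domain_name) * 100

print(partial_match_ratio("tech", "techcrunch"))  # 40.0
```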


Note: the * beside partial match domain, partial match ratio and keyword is first word in domain means that I excluded exact match domains when calculating the correlation for these factors. This is just common sense, because Google wouldn’t reward exact match domains twice for a very similar factor.

Also note that the negative correlations for the number of characters and hyphens in the domain name mean that as the number of characters/hyphens in the domain name increases, the ranking of that domain decreases, i.e. longer/more is worse, shorter/fewer is better.

I suspect that the number of characters in a domain name is not something Google worries about, unless it penalises very long domains. What this large negative correlation most likely shows is that there are other factors impacted by having a long domain.

For example, the social shareability of your domain is reduced, because in social cyberspace shorter is better. Websites with long domains probably also lose out on that brand factor or potential type-in traffic due to the increased character length.

With the low correlations for partial match domains and the partial match ratio, it appears as if having the keyword in some of your domain isn’t very beneficial. It’s either exact match or forget it.


There’s a ton of data to digest here with 20 factors tested.

.com domains, be they tested on their own or as part of exact and hyphenated exact match domains came out on top with .org consistently in second.

If you’re buying a domain, .com came out a convincing winner in both the correlation data and the user/usage data.

Exact match domains are very significantly correlated to ranking well but there is a significant drop off in influence for non .com domains. .org exact matches were relatively well correlated but beyond that there was a continued progression towards nearly no relationship.

It is likely that exact match domains bought on less popular TLDs, e.g. .info or .net, are either targeted directly by Google for looking suspicious, in that they are more likely to have been bought for their ranking potential, or are penalised by Google’s entity detection algorithms or through other factors.

Hyphenated exact matches beyond .com ones held nearly no correlation.

Domains with fewer characters and fewer hyphens in the domain name did significantly better.

And having a partial match domain, even a relatively well populated one, had only marginal benefit.

Actionable takeaways

Of course the data doesn’t prove causation, but with some common sense and mental analytics I have come up with a handy list of takeaways for the next time you are buying a domain:

  • Buy .com, .org, or a local TLD (.com preferably).
  • Avoid other TLDs like the plague!
  • Search hard for an exact match, but don’t dilute the brand of the site to get one.
  • No hyphens please (unless absolutely necessary).
  • Shorter is way better.
  • If you can’t get an exact match, don’t compromise branding to get a partial match; it’s not worth it (although having your main keyword in the domain name might be a good branding idea).
  • Entities are important, own your space with marketing, PR, clever link building, microdata, etc.

Hope you enjoyed the post and if you have any thoughts, ideas, criticisms, possible explanations please leave a comment below.

The Science of Correlation Studies

[Image: lab scientist representing science in search engine optimization]

The initial part of my project to bring more science to SEO is based on doing a correlation analysis on individual potential factors within the Google algorithm.

An incredibly important part of the project is making an impact on how SEOs and webmasters do business, and core to taking action based on correlation studies is understanding what a correlation study is, what its weaknesses are and what it’s good for.

Don’t worry if you don’t know anything about statistics, maths, science, SEO or programming; none of it is required to understand what a correlation is. Heck, I didn’t know what one was until a couple of months ago.

I’ll try to keep the jargon to a minimum and hopefully you will be an expert on Spearman (woops that’s jargon) by the end of the post.

A Relationship

A correlation is just a fancy name for a relationship. Basically what I am trying to prove is:

  • Is there a relationship between a factor and ranking well in Google?
  • If so how big is the relationship?
  • And is it a positive relationship (helps rankings) or is it negative (lowers rankings)?

To be technical a correlation is a relationship between two variables.

In our case the two variables will always be the ranking of a page in Google and whatever factor we’re testing for.

Throughout this article I’ll use PageRank as our example factor.

So if I was trying to prove a relationship between PageRank and ranking in Google, I would go to Google, take the top 10, 20, 50 or 100 results (whatever number) and find the PageRank score for each of these results.

I would then calculate if there was a relationship, the size of the correlation and the type (positive or negative).

But how do you calculate that?

Spearman’s Rank Correlation Coefficient

Spearman’s Rank Correlation Coefficient is one of many maths formulas used to calculate correlation.

For maths junkies: the reason I’m not using the others is that Spearman’s doesn’t assume a linear relationship between the variables. That just means the type of data I’ll be testing may not be as well suited to the other formulas as it is to Spearman’s.

Here’s what the formula looks like: ρ = 1 − (6 Σdᵢ²) / (n(n² − 1)), where dᵢ is the difference between the two ranks of each observation and n is the number of observations. [Image: Spearman’s Rank Correlation Coefficient formula]

I won’t go into what each part means, but if you’re interested here’s a great video that explains it for stats newbies.

The basic idea is that you feed your two variables into this formula and it gives you back a number between -1 and 1.
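As a sketch, the tie-free version of the formula is short enough to implement directly (real data usually contains ties, in which case you’d use a full implementation such as SciPy’s spearmanr instead):

```python
def spearman_no_ties(x, y):
    """Spearman's rho via the shortcut formula 1 - 6*sum(d^2) / (n*(n^2 - 1)).

    Only valid when neither list contains tied values; ties need the full
    rank-correlation treatment (e.g. scipy.stats.spearmanr).
    """
    n = len(x)
    rank_x = {v: i + 1 for i, v in enumerate(sorted(x))}  # rank 1 = smallest
    rank_y = {v: i + 1 for i, v in enumerate(sorted(y))}
    d_sq = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

print(spearman_no_ties([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0  (identical order)
print(spearman_no_ties([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0 (opposite order)
```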

What the number symbolises is the actual relationship/correlation:

  • If the sign is negative, i.e. between 0 and -1, then the relationship is negative (hurts your rankings), e.g. page loading time should show a negative correlation, because as loading time increases, ranking should decrease. If it is positive, i.e. between 0 and +1, then the relationship is positive, e.g. PageRank should have a positive correlation.
  • Think of it this way: if one increases and the other increases then the correlation should be positive, and if one increases while the other decreases then the correlation should be negative.
  • The closer you are to one, either positive or negative one, the stronger the relationship, i.e. a correlation of 0.4 is stronger than 0.25. The same goes for negative numbers: -.5 means the correlation is negative and would be worse for your site’s ranking than a -.15 correlation.
  • If you get a correlation at 0 or close to 0 it means there is little or no correlation.

To recap the correlation will be between -1 and 1. Negative number means negative relationship and vice versa. The closer the number is to either -1 or 1 the stronger the relationship. A correlation at or close to 0 means there is no or a very weak relationship.

Example Spearman Calculation

In this example (totally fictitious) we are looking at the top 10 results in a Google search. And we have gone and found the PageRank for each of these results. Below is the table representing the two variables.

Ranking   PageRank
#1        6
#2        7
#3        4
#4        3
#5        5
#6        4
#7        3
#8        2
#9        1
#10       1

When you feed this information into the formula you get a correlation of 0.894 (remember, this is only an example).

This correlation would mean there is a very strong positive relationship between PageRank and ranking well in Google.
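The calculation above can be sketched in a few lines of Python. This is only my illustrative implementation, not the project's actual code: rank both variables (averaging the ranks of any ties), then take the Pearson correlation of the two rank lists. Note that correlating the factor against the raw position numbers (where #1 is the smallest number) makes a beneficial factor come out negative, so the sign is flipped when quoting "positive means good for rankings" figures; tie handling means the exact value can differ slightly from the 0.894 above.

```python
# Illustrative Spearman's rho: rank both variables (averaging ranks for
# ties), then take the Pearson correlation of the two rank lists.

def ranks_with_ties(values):
    """1-based ranks, with tied values sharing their average rank."""
    ordered = sorted(values)
    out = []
    for v in values:
        first = ordered.index(v) + 1           # first slot holding v
        last = first + ordered.count(v) - 1    # last slot holding v
        out.append((first + last) / 2)
    return out

def spearman(x, y):
    rx, ry = ranks_with_ties(x), ranks_with_ties(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

positions = list(range(1, 11))              # Google positions #1-#10
pageranks = [6, 7, 4, 3, 5, 4, 3, 2, 1, 1]  # the fictitious table above
print(round(spearman(positions, pageranks), 3))  # -0.893 against raw
# position numbers; flip the sign to read it as "higher PageRank, better
# ranking"
```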

What I do is calculate the Spearman correlation between the factor I am testing and the ranking of the top 100 results, for each of the 12,000+ searches. I then get the mean (average) of those correlations, and that's the number we're interested in.

Gathering the Data

So we're going to go a little off track here and explain how I gather my data to be analysed.

The first thing I did was get myself a list of over 12,000 keywords: 800 from each of the 22 categories in the Google AdWords Keyword Tool, with duplicates removed.

Then I wrote a program to fetch the top 100 results for each keyword, approx. 1.2 million URLs in total. I then removed Google News, Images, Videos, Maps and Products results because they are generated by a slightly different algorithm, and we can't control them anyway.

Then I chose a factor to test, in this case PageRank, and figured out a way to get the desired data for it. For PageRank, I wrote a program that queries a special URL for the PageRank of each page and stores it in a database.

Then I take the PageRank ranking and the search engine ranking and calculate the Spearman for each keyword. I store the correlation for each keyword, bring it into an Excel sheet and get the average.

Obviously each factor requires a lot of thought, some programming and a ton of testing but that’s the basic process.
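The basic process can be sketched as follows. Everything here is a stand-in: the per-keyword factor values are made up, and for brevity the tie-free shortcut formula rho = 1 - 6Σd²/(n(n²-1)) is used, which only holds when there are no tied values.

```python
# Hypothetical sketch of the per-keyword correlate-then-average step.
from statistics import mean

def spearman_no_ties(positions, factor_values):
    """Spearman's rho via the tie-free shortcut formula."""
    n = len(positions)
    # Rank each factor value (no ties assumed, so ranks are unique).
    rank = {v: i + 1 for i, v in enumerate(sorted(factor_values))}
    d_sq = sum((p - rank[v]) ** 2 for p, v in zip(positions, factor_values))
    return 1 - (6 * d_sq) / (n * (n ** 2 - 1))

# Fictitious data: for each keyword, the factor value at positions #1-#5.
factor_by_keyword = {
    "seo tools": [9, 7, 8, 3, 1],
    "buy shoes": [5, 9, 2, 4, 1],
}

positions = [1, 2, 3, 4, 5]
per_keyword = [spearman_no_ties(positions, vals)
               for vals in factor_by_keyword.values()]
print(round(mean(per_keyword), 2))  # -0.85: against raw position numbers,
# a strong correlation where higher factor values sit at better spots
```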

Weakness – Correlation is not Causation

Having a relationship doesn’t mean a factor is in the Google algorithm nor does it mean that there will be some indirect increase in ranking.

For example there might be a positive relationship between Facebook likes of a page and ranking well. But it is possible (if not likely) that Google isn’t counting this factor in the algorithm and the positive correlation is there because pages that satisfy other factors e.g. links, on page factors, etc. may get more Facebook likes.

There is also a possibility that the factor has a knock on effect. For example the more Facebook likes a page gets, the more people see it, the more normal links it gets, the more PageRank, the higher the ranking.

So it is possible to get a correlation without the factor being in the algorithm.

Another, possibly more relatable, example is that of a professional music artist. There would probably be a strong correlation between being a professional music artist and going to lots of high-profile parties. That doesn't mean that if you start going to these parties you will suddenly have an amazing voice. This is a classic case of correlation, not causation: going to the parties doesn't cause you to be a successful singer, it's just a by-product.

This is extremely important: correlation is only a guideline. A correlation needs to be viewed critically and tested by other means before we can say more definitively whether something is a factor in the great Google algorithm or not.

Counter Argument – SEO’s not About Causation

From a scientific point of view this is a flaw with the correlation study. It still holds serious scientific significance, but if your goal is to prove whether something really is a factor in the Google algorithm, then correlation studies are only a very good guideline.

But if your goal is to increase the ranking of a site or find out what it takes to rank higher, which is what SEOs are actually interested in, then correlation studies have greater value.

In the Facebook likes example, the increase in Facebook likes didn't directly cause an increase in ranking, but it affected other factors which did. Does this mean that, as an SEO, you should ignore Facebook likes?

No. In this case it means you should try to increase your Facebook likes to cause that knock-on effect.

So really it depends what your goal is.

If you are trying to figure out what it takes to rank high in Google, then they are very good (but not perfect). If your goal is to prove whether something is a factor in the Google algorithm, they're fairly good but of less value.


I hope I did a good job of explaining correlation and that you understand it better now. The main takeaways are that correlation is code for the relationship between two things, e.g. ranking and a factor such as PageRank; that the correlation is between -1 and 1, with negative numbers meaning a negative correlation and vice versa; and that the closer you get to either -1 or 1, the stronger the relationship.

And that correlation results should be looked at as a great guide to how to do SEO and what's important to focus on, but not as the holy grail; they should be tested further.

I believe this type of science and hard hitting statistics in SEO will vastly improve the industry by providing credible data and information that’s far more reliable than guesses or observations over one or two websites. I’m looking forward to sharing with you my findings and seeing what interesting tests arise from the correlations.

What I Learnt About 342,740 Domains

I recently did a parse of 12,573 keywords, extracting the top 100 results per keyword on Google. And after cleaning up the data I was left with over 1.2 million web pages and 342,740 unique domains.

For the last week or so I have been looking for interesting information within this mountain of data.

I published data on top domains, sites, Google’s use of Images, Products and News results and some strange URLs I noticed.

This data is part of my project to bring more science to SEO, initially by doing a correlation study into Google’s algorithm.

Domain Data

I was looking into domain related data and I spotted some interesting patterns, nothing ground breaking but just some stuff you might find cool.

I should have some domain-related correlation data out later this week, so this is an insight into the domain dataset.

The domain name and the ending of that domain is a really important choice for a new webmaster. When you make that choice, you're choosing your brand for life, plus it's a super important decision from an SEO point of view.

There are a couple of considerations to take into account: the user and the search engine. The correlation data I'll show you later this week should take care of the SEO point of view.

But from a user's point of view you obviously want a domain that's memorable, easy to type and easy to link to. And that in and of itself is a big factor in SEO. If people are linking to the wrong domain, you're losing out on valuable link juice. If people can't type it, you're losing out on type-in traffic, and if users can't share it, then say goodbye to some social media clicks.

Domain TLDs

The domain ending or TLD (Top Level Domain) is probably the most important part of a website's address.

There is a technical difference between a TLD and a domain ending. For example .uk is a TLD but .co.uk is not, it’s technically a subdomain of the TLD. But webmasters and users don’t care about technical definitions, so I’m going to treat them as the same for the purposes of this article.

If you have a really catchy, social, SEO perfect site name it’s useless unless you have the right ending to that great name.

Turns out not that many people are typing www.greatname.washingtondc.museum.

I extracted the domain endings for all the sites in my data and collected a neat list of all the domain endings I could identify in these 1.2 million URLs.

I used a list of all known TLDs from the Mozilla crew (I think?), but I can’t find the link so if you know the list I’m talking about please post the link in the comments section.

Luckily I downloaded and cleaned up the list into a nicely formatted text file, so you can iterate through it and check for matches if you're running tests yourself.

Update: Thanks to Kris Turkaly who left a link to the list in the comments: http://publicsuffix.org/list/

After running my scripts and programs through the data it turns out there were 437 different domain endings in the dataset.

Thinking about it, that's a pretty small number of TLDs for 1.2 million URLs, but as you will see, just 3 domain endings dominate hugely.

I ranked them in order of the number of sites (out of the 342,740) that had each TLD. Here's a handy Excel list of all 437 TLDs in that descending order.
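A minimal sketch of the tallying step, assuming a tiny hand-picked suffix list rather than the full public suffix list the study used. Note that the longest endings must be checked first so .co.uk wins over .uk:

```python
# Toy version of the TLD tally: map each hostname to its ending, count
# sites per ending. The suffix list here is a tiny stand-in.
from collections import Counter

SUFFIXES = [".co.uk", ".com", ".org", ".net", ".uk"]  # longest match first

def domain_ending(host):
    for suffix in SUFFIXES:
        if host.endswith(suffix):
            return suffix
    return None  # ending not in our list

sites = ["example.com", "news.org", "shop.co.uk", "blog.com"]
counts = Counter(domain_ending(s) for s in sites)
print(counts.most_common())  # .com comes out on top with 2 sites
```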

And a nice graph of the top 5 domain TLDs:

(You can hover over each bar with your mouse and you’ll get the exact numbers)

It was hard to see some of the smaller TLDs, so here's the next 45 (blown up and zoomed in), with .us repeated, i.e. the 5th-50th most popular extensions. Even within this subset there's a really huge drop-off from the top of the list. Combine that with the top 5 domains and you see a gigantic dominance of the top 2 or 3 domain endings.

Of course this dataset isn't designed to find the most popular TLDs, but it's probably a pretty good indication of what users are used to.

Realistically .com and .org are the only global domain extensions you should be going for from a user’s point of view. And even if you own a .org you should be on the lookout for the .com variation.

Domain Length

I thought it would be interesting to see the distribution of domain lengths, so I counted the length in characters of all 342,740 sites without the domain ending.

Out of interest, the average domain name length was 14.75 characters, but the most common length was 8 characters, with around 1 in 12 sites having a name 8 characters long.

Again you can hover over each circle for exact stats.

Here’s an Excel file based on the above graph with the domain lengths and number of domains with the corresponding length.
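The length statistics can be reproduced along these lines. The sample names below are invented; the real run used all 342,740 domain names with the ending stripped:

```python
# Sketch of the length tally: character count of each name (ending
# removed), then the mean and the most common (modal) length.
from collections import Counter
from statistics import mean

names = ["examples", "mysite", "averyverylongdomainname", "shopname"]
lengths = [len(n) for n in names]

print(mean(lengths))                    # average length: 11.25
print(Counter(lengths).most_common(1))  # modal length and its count: [(8, 2)]
```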


This post is pretty good at showing what users are used to and are likely to accept. Of course, common sense is still required. For example, if you're in Ireland, Irish users are very much at home with .ie domain names, so grabbing that and the .com domain name might be a good idea.

Again, if you find a really great, catchy name that’s longer than normal then go for it and if you find a short one that ticks all the boxes go for it too.

Hope you enjoyed this post and stay tuned for some correlation data later this week.

Interesting Google Parse Data

I recently completed an automatic searching of Google of 12,573 keywords. I extracted the top 100 results for those keywords, so around 1.2 million web pages and 342,740 unique domains.

This is the dataset I am working off of for the initial round of The Open Algorithm Project.

Over the last two days I've been poring over and cleaning up the dataset, and I have seen some really interesting stuff. Things you don't notice when you do individual searches, even if you dig into the HTML of a search result page.

It’s sort of an eclectic mix of interesting things that I and my programs spotted and thought you might find interesting too.

Most Popular Domains

There were in total 342,740 unique domains. I’ve put together an Excel document with all the domains and the number of times each one showed up as a search result.

You can check if your site is in my dataset and contributing to the project.

Here’s my list of the 50 most popular sites in Google by the number of times they showed up in the search results. That’s the number of search results they accounted for, not the number of keywords they showed up for i.e. a site could show up multiple times for one keyword.

Interestingly the top 50 sites from the parse (which used US based IP addresses) is very similar to the top 50 sites in the US as judged by Alexa.

Note: The counting of Google.com includes universal search which accounted for 7,542 results.

View it full size here.

Most Popular Sites

I wasn't surprised by the fact that Blogspot, Tumblr and WordPress did well in the top domains data. But I figured that a lot of the search results they were ranking for were on subdomains that weren't really under their control. So I went back to the database and this time extracted the results down to the subdomain or domain they were hosted on, i.e. the site, not the domain.

Google are sort of half and half on how they treat subdomains. I suspect they have some algorithmic feature to distinguish between blogspot or wordpress.com style subdomains vs. blog.yoursite.com. They could be using more complicated algorithms, but I would imagine they are certainly counting the number of subdomains, and the diversity of content topics across those subdomains, as a general guide to what's associated with a root domain and what's not.

You can download the full list of 383,638 sites with the number of search result appearances from this Excel file. As you can see, compared to the top 50 domains, WordPress, Blogspot, Tumblr, etc. have taken a hit.

View it full size here.
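One simple way to picture the domain-vs-site split described above is a hosted-platform whitelist. This is purely my guess at a heuristic, not what Google (or my own scripts) actually do:

```python
# Toy domain-vs-site split: subdomains of known hosting platforms count
# as their own site; other subdomains fold into the root domain.
HOSTED_PLATFORMS = {"blogspot.com", "wordpress.com", "tumblr.com"}

def site_key(host):
    parts = host.split(".")
    root = ".".join(parts[-2:])          # naive registered domain
    if root in HOSTED_PLATFORMS and len(parts) > 2:
        return ".".join(parts[-3:])      # keep the hosted subdomain
    return root

print(site_key("myblog.blogspot.com"))   # stays its own site
print(site_key("blog.example.com"))      # folds into example.com
```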

Universal Search

I thought it might be interesting to extract when, where and how much Google shows Universal search results from images, news and products which of course are all Google entities.

Image Search:

Number of times image results showed up (out of 12,573 keywords): 2,327

Modal/most common position: 1

Mean/average position: 8.01

Google News:

Number of times news results showed up (out of 12,573 keywords): 2,195

Modal/most common position: 11 (remember I extracted the top 100 results)

Mean/average position: 9.86

Products Search:

Number of times products search results showed up (out of 12,573 keywords): 3,020

Modal/most common position: 2

Mean/average position: 8.8

Strange URLs

Update: It appears I was totally wrong on this one. While these strange URLs were indeed strange and did relate to a new Google feature, it wasn't quite the feature I had thought. Seer Interactive reported a change to the way Google formats its search results, which now uses these strange URLs.

Out of the approximately 1.2 million URLs I extracted, I got back 185 really weird ones.

You can view the 185 of them here: http://www.theopenalgorithm.com/media/2012/02/Weird-Google-Search-Result-URLs.xlsx

All 185 URLs redirected to what would be a normal search result. There were only a handful of websites that were redirected to from the 185.
Pure speculation, but the lengthy parameters (examples in the Excel file) suggest that Google is tracking clicks on these results in some special way, potentially to measure the reaction to a new algorithmic or other feature being trialled.

Interestingly, the CricBuzz URL redirects to www.cricbuzz.com/gl (gl presumably meaning Google), but not all clicks from Google to the CricBuzz site redirect to /gl. It seems as though CricBuzz are tracking clicks from specific types of Google search results (namely, live cricket scores); for what reason, other than internal analytics, is again only speculation.

All of the results with these strange URLs are in some way unique. Take the Google public data result, there’s a nice embedded graph:

Google Public Data search result

Or any of NCBI results:

Google's treatment of health related search results

I’ve highlighted 3 unique features of this result.

#1 The snippet is pulled cleverly by Google directly from the page.

#2 The special health identifier image to the left of the result.

#3 The links to sections within the article.

Of the 185 strange URLs pulled from the 1.2 million, 170+ were from the NCBI site.

It would seem as though this site is being treated differently by Google. Maybe they recognise it as the expert in its field and are trialling some sort of improved indexing of expert health related sites?

A number that sprang to mind was 1%. As Steven Levy suggests in his book about Google, In The Plex, about 1% of searches have some sort of trial or test running outside the regular algorithm. 185 out of 12,000+ searches isn't far off 1%… food for thought.

Of course it could be something totally different, maybe a Googler or SEO can shed some light on the intriguing results.

I would be interested to hear whether other people parsing Google search results over a long period of time have seen these types of results, and whether they are in some way connected to a future feature launch within the search engine.

FTP and IP address

I wrote my programs to handle http:// and https:// web pages only. I had never noticed Google return any other type of page and assumed that’s all that would come back from my parse.

Turns out Google are happy to crawl and index more than just that. I’ve seen a couple of ftp:// pages returned e.g. search for “garmin” on Google.com, click through to around page 8-11 and you will see:


In fact out of 1.2 million URLs Google returned 22 ftp:// results, all but one of them were PDFs.

Here’s a list of the 22.

Notably, a number of these seem to be meant to be private or intended only for a company intranet, so it's impressive, if not worrying, that Google has found these documents.

Another strange one was a site returned only as an IP address with no domain name. The site's domain name is realtek.com, but for some reason Google saw fit to return the IP instead. Maybe because the site doesn't work with www. in its domain name?


Hope you found the post interesting, if you spot anything fun or unusual in the data files I have linked to in this post let me know in the comments section.