Category Archives: Correlation Data

Why Google’s Algorithm Doesn’t Care What You Write

Most SEOs understand that on page factors have declined, and will continue to decline, in importance within the Google algorithm and SEO.

We all know that Google takes no notice of the meta keywords tag, and little notice of other tags, markup and HTML structures, a point that my data has backed up.

But “in content” factors, i.e. ranking signals related to the actual text/content on a page, would appear to be a separate matter.

Spamming the meta description doesn’t hurt a user’s experience, but spamming the actual content the user sees is detrimental to the user and therefore not worth doing, right? Surely damaging the user’s experience isn’t worth those few extra visitors from Google, seeing as they will probably convert less due to the comparatively less helpful content?

It sounds good and makes sense, and as a result we in the SEO community have come to the conclusion that while Google might ignore those other on page factors, they probably have some really smart methods to figure out what a page is about and the quality of the content on that page.

I decided to look at 5 really basic factors that you would think may feature in some element of the Google algorithm or would be closely related to a ranking factor.

While I understand there are better and more advanced methods (LDA, TF*IDF) for comparing the content on a page to a given keyword, or for judging the quality of a given piece of content, I tested these really basic factors to judge Google’s likely weighting of “in content” factors.

I may test more advanced topic modelling algorithms/factors in the future but for now I have stuck to some old information retrieval reliables.

You are probably already aware of the general method for my correlation studies but if this is your first time here please read this and this.

Special thank you to Mike Tung of Diffbot for providing me with free access to their article API, which is undoubtedly the best text extraction service/algorithm out there. And as if to solidify that point, congrats to the team on their recent 2 million dollar funding round.

Data

 

[Chart: In Content Factors Correlation Data]

This data is based on a dataset of over 1.2 million web pages from over 12,000 unique keywords and the correlations are derived from Spearman’s Rank Correlation Coefficient.
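For readers who want to see the mechanics, here is a minimal sketch of how a per-keyword Spearman coefficient and the reported mean can be computed. This is an illustration in Python with scipy on toy data, not the study’s actual code:

```python
from scipy.stats import spearmanr

def keyword_correlation(positions, factor_values):
    """Spearman's rho between ranking well and a factor, for one keyword.

    Positions are inverted (position 1 -> highest score) so that a factor
    which helps ranking produces a positive coefficient.
    """
    ranking_score = [max(positions) + 1 - p for p in positions]
    rho, _p_value = spearmanr(ranking_score, factor_values)
    return rho

# Toy data: two keywords, five results each (the study uses ~100 per keyword).
per_keyword = [
    keyword_correlation([1, 2, 3, 4, 5], [9, 7, 8, 3, 2]),
    keyword_correlation([1, 2, 3, 4, 5], [1, 1, 0, 1, 0]),
]
mean_rho = sum(per_keyword) / len(per_keyword)  # the charts report this mean
```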

Analysis

Images and Videos

Having images and videos on a page is generally accepted to be good for the user. What I was interested in was seeing whether this translated into increased Google rankings or not, and apparently it doesn’t.

Assuming Google likes pages with images and videos (a large but reasonable assumption), this is quite a good test of how low-level Google are willing to go to promote pages that are in line with what they want to see, i.e. are Google willing to reward these pages directly in the algorithm, or do they prefer to let the great PageRank algorithm identify what users like and want?

Page Contains Bad Words

This is another fascinating test: I checked whether a page contains any bad word from this list.
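As a sketch of how such a binary factor can be computed (the word list here is a pair of hypothetical stand-ins, not the linked list):

```python
import re

BAD_WORDS = {"damn", "hell"}  # hypothetical stand-ins for the linked list

def contains_bad_word(page_text: str) -> int:
    """Binary factor: 1 if any listed word appears as a whole word, else 0."""
    tokens = set(re.findall(r"[a-z']+", page_text.lower()))
    return int(bool(tokens & BAD_WORDS))
```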

I thought that using these naughty words would surely relegate you from the search results or at least lower your ranking, but the data doesn’t support that.

Unfortunately there is a cautionary note: Google likely run a somewhat more advanced algorithm over content. Instead of checking for just the presence of these bad words, they probably look at their likely intent, and as a result either ban a page or give it no penalty.

A news article quoting a foul mouthed sports star shouldn’t be banned from the search results because of its harmless and informative intent.

Because of this, the pages that weren’t excluded from the search results, and therefore likely received no penalty, were the only ones to show up in my dataset.

As a result it would be unfair to draw conclusions regarding Google’s implementation of bans/penalties towards pages using these bad words.

Is the Keyword Even in the Content?

Most of us would think that this test is a sure thing. Forget keyword density; surely having the keyword in the content of a page is absolutely vital to that page ranking well for that keyword. But again, the data says otherwise.

How can this be? The most basic and obvious step for checking whether a page is about a keyword is to check whether the page contains that keyword; how else would Google narrow down their massive index into something more manageable?

Well, there are anchor text, meta descriptions, title tags and many other areas that Google may look at to check whether a page is about a keyword or not.

But what this astoundingly low correlation suggests is not only that Google likely doesn’t implement such a factor (when ranking pages) but also that Google probably isn’t using other super-advanced topic modelling algorithms, as most of these algorithms are based on the assumption that the keyword is in the content and all of them are based on the assumption that there is textual content.

Distance to 1st Keyword Match

I was a little more sceptical about this factor correlating well, and rightly so. This old-school factor might have been in use in the days of AltaVista, but most of us would agree it’s not so likely to be around any more.
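For clarity, here is a minimal sketch of how both of these match-based factors can be computed, assuming simple case-insensitive matching over the extracted article text (the study’s exact matching rules aren’t shown):

```python
def keyword_in_content(content: str, keyword: str) -> int:
    """Binary factor: does the page text contain the exact keyword phrase?"""
    return int(keyword.lower() in content.lower())

def distance_to_first_match(content: str, keyword: str) -> int:
    """Character offset of the first keyword match, or -1 if absent."""
    return content.lower().find(keyword.lower())
```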

Summary

While other topic modelling algorithms might correlate higher than the above factors most of them are based on the simple assumption that a page contains the keyword you are trying to model for and that there is textual content, which are dangerous assumptions for Google to make.

Nobody can make a blanket statement like, “Google don’t analyse what you write and don’t care what’s in the content of a page” but the data does point us in that direction.

If you are to accept the theory that Google don’t take such a low-level and specific view of pages, or at least don’t weight such a view very highly, then it is easy to come up with reasonable justification for that theory.

For example if Google takes such a low-level view then how does it understand infographics or video blogs? How will such an algorithm scale with the web as it evolves further into a multi-media and not just a textual internet?

I don’t believe the data is in any way conclusive, and I do believe that other “clever” topic modelling algorithms may correlate well with ranking, but whether or not that means Google implements such factors within their algorithm is another debate.

What I will say is that I believe Google most likely take a much higher-level view of pages than we think, using links, social media, PageRank and other more scalable factors to determine the relevance and quality of a web page rather than looking at on page or in content factors.

As a result I would recommend that all webmasters create content, title tags and web pages with the user and not the search engine in mind and optimize for other far more scalable factors like link building, social media, page loading speed, etc.

Links – Huge Correlation Between Link Building and Google Ranking

Links have been an integral part of SEO since Google joined the scene.

But recently link building’s popularity has taken a bit of a hit, with many believing that Google have reduced the weighting of PageRank in the algorithm. The emergence of social signals and other factors indicating user satisfaction has, according to many within the industry, eclipsed (or will in the future eclipse) links as the primary ranking factor.

But this speculation hasn’t been mirrored in my data. Over the course of this post we will examine over 40 link related factors, all of which correlate very well, and a number of which are the most heavily weighted factors in my study.

The main finding from this data is how well links correlate to ranking in Google. I have tested over 150 potential ranking factors in 6 categories and, without a doubt, links stand head and shoulders above any other section of factors.

Link building is a bit of an ugly duckling within the industry: everybody knows its importance, but very few are effective in its practice.

Unlike changing title tags, building quality links requires skill, creativity and determination. It’s not easy work, it’s not the low-hanging fruit, but based on the data below, it appears to be the most rewarding.

While I won’t discuss link building strategies in this post, I would like to mention that I feel many strategies are extremely inefficient and unproductive and a lot of the theory behind this area of SEO is fundamentally flawed. I will be publishing some more of these ideas, with anecdotal evidence in the future.

The project

The below data is based on a dataset of the top 100 results in Google, for 12,573 keywords.

I have analysed this data using Spearman’s Rank Correlation Coefficient, looking for relationships between individual factors and ranking in Google.

I have already published some of the results from the study including domain name related factors, on page factors and domain authority signals.

This is all part of a greater project to bring more science to SEO and make it a truly data driven industry.

There are inherent issues with correlations, and they don’t prove anything per se, but as I have covered these issues before I won’t rehash old information. What I will suggest is that if this is your first time on the site, please read this and this.

I would like to thank SEOMoz for providing incredible access to both their amazing Mozscape API, from which the below results are derived and their expertise and advice. In particular I’d like to thank Rand Fishkin, Dr. Matt Peters and the API support team for all their help.

Data

This Excel spreadsheet provides the keyword-by-keyword correlation figures from which the mean correlations in this post are derived.

Breakdown

Google’s algorithm doesn’t just look at how many links there are to a page, it looks at quality signals, website authority indicators and tries to protect against manipulation.

Basically, just building links isn’t good enough; certain kinds of links are better than others.

Below I have covered the types and areas of link building that are thought to be utilised within the algorithm.

General Links

The correlations for general links, as compared to specific counts such as the # of IPs/Cblocks/Domains/Subdomains linking, are significantly lower.

This supports the fact that Google looks at several factors and classifiers when considering the quality of the source of a link.

While this certainly isn’t an interesting finding in itself, it is important in that it supports a known fact, and therefore increases the likelihood that the data gathered and the resulting correlations are correct and do represent what’s actually happening within the Google algorithm.

Below, I investigate which particular classifiers and types of links would be best in a link profile.

Cblocks and IPs

 

Both the number of unique Cblocks and IPs linking to a site are thought to indicate the diversity of a link profile.

Google want to see a variety of sites “voting” for a website’s content. The weighting of each additional link from the same site is reduced relative to a link from a new source.

Knowing this, many webmasters began to build “lens sites” whose sole goal was to link to the mother site.

It is believed that, to counter this, Google implemented an algorithm that could figure out whether a link was coming from the same source (i.e. the same webmaster) as the site being linked to.

There are a number of factors that Google likely use in such an algorithm, but it would make sense that Google treat links coming from the same IP or Cblock as more likely to be coming from the same webmaster, and thus marginally less trustworthy.
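For anyone unfamiliar with the metric, a Cblock here is just the first three octets of an IPv4 address, so counting link diversity is straightforward once you have the linking IPs (a sketch with made-up addresses):

```python
def cblock(ip: str) -> str:
    """Class-C block of an IPv4 address, e.g. '93.184.216.34' -> '93.184.216'."""
    return ".".join(ip.split(".")[:3])

linking_ips = ["93.184.216.34", "93.184.216.35", "203.0.113.7"]  # made-up
unique_ips = len(set(linking_ips))                        # 3
unique_cblocks = len({cblock(ip) for ip in linking_ips})  # 2
```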

While the data doesn’t prove or disprove this theory, it does show a higher level of correlation for the # of Cblocks/IPs linking than for a general count of the # of links to a page/site/subdomain. Although the difference is small it could support the above theory.

With this data and using some common sense, I would recommend following the current industry practice of building a diversified link profile.

Domains and Subdomains

 

Again the above data further enhances the argument for a diversified link profile.

It also shows a potentially interesting, albeit small, difference between the # of unique domains vs. subdomains linking, with the # of unique domains coming out on top.

While the difference is too small to draw a concrete conclusion, such data would certainly point us in the direction of building links from a diversified set of domains, treating subdomains on the same root domain as related to each other, and therefore treating each additional link from a separate subdomain on the same root domain as slightly less valuable than the link before it.

Links to the page

The above data conforms to the seemingly obvious conclusion that if you want to get a page to rank well, then building links directly to that page is the best way to get that to happen.

While most SEOs will find that stupidly basic, I have seen some SEOs suggesting that domain level links would be more powerful or a better use of time. The data just doesn’t support that strategy if you are trying to increase the ranking of a specific page.

Links to the page’s domain vs. subdomain

 

Interestingly, the strong performance of domains vs. subdomains as the source of a link is not matched in the location/target of a link. If we are to believe that such marginal differences are important, then the data may suggest (as a number of industry watchers have stated) that Google treat subdomains as separate from the root domain when looking at the host’s (which could be the domain or subdomain) authority.

This seems strange, and I may be reading too much into the data, but if the above statement were the case, then Google’s treatment of subdomains as separate sources of content would not be matched by their treatment of subdomains on the same root domain as essentially the same source of links.

If such a conclusion were to be made, then it would most likely be explained away by the likelihood that Google doesn’t just look at whether a host is a subdomain or not, and likely uses much more advanced algorithms to figure out whether a subdomain should be considered part of the same domain.

Thus Google would understand that blogname.wordpress.com is not related to wordpress.com but blog.exampledomain.com is related to exampledomain.com.
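One plausible way to approximate that distinction, assuming the tldextract library and the Public Suffix List (whose “private” section lists hosted platforms like wordpress.com), is sketched below; this is my illustration, not a known Google method:

```python
import tldextract  # pip install tldextract

# Including the PSL's private section treats hosted platforms as suffixes,
# so each user blog counts as its own registered domain.
extract = tldextract.TLDExtract(include_psl_private_domains=True)

extract("blogname.wordpress.com").registered_domain  # 'blogname.wordpress.com'
extract("blog.exampledomain.com").registered_domain  # 'exampledomain.com'
```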

Nofollow vs. Followed

 

Here is a classic case of inter-related factors impacting each other’s correlations. We know that nofollowed links carry no direct SEO benefit, although they may result in some other factors being impacted, e.g. someone clicks on a nofollow link and then shares the page on Twitter.

A page with a lot of nofollow links pointing to it, is far more likely to have a lot of followed links pointing to it.

This is because there are standard ratios that different types of links hold within a link profile, and any deliberate alteration by a webmaster is only likely to result in a small shift in those ratios.

There are many inter-factor relationships going on in the above data. Nofollow links may indeed carry no search engine benefit, but could still show strong correlations, as above.

Marginal differences in the correlations shown by different categories of links, e.g. followed vs. nofollowed, may be more important than they appear at face value.

This is why I have read a lot into such small differences.

SEOMoz Metrics

SEOMoz have created a number of algorithms that are meant to mimic Google’s link related algorithms. I don’t know the exact make-up of these algorithms, but I thought it would be interesting to test their performance, to check whether using these metrics as a measure of the success of your link building is a good idea.

If you are interested, here’s the general make-up of these algorithms: MozTrust, MozRank, Domain Authority, Page Authority.
 

Wow! Moz really seem to have done a great job developing their algorithms. In keeping with the above data on the value of page level metrics, Page Authority comes out at an astounding .36 correlation, which is massive, making it the highest correlated factor out of the 150+ I have tested.

Comments

The link related data is, in my opinion, on par with the on page factors as the most interesting and important to the SEO industry. Both lead to the same conclusion: on page factors are far less important than off page factors.

Links aren’t just about SEO

Building links isn’t just an exercise in SEO, it’s also an exercise in marketing. Links can drive a lot of direct traffic from people clicking on them and can also build your brand name.

It’s important to factor the direct traffic value of links into your link building decisions. This is particularly evident where a second, third or fourth link from the same site may seem like a step down in SEO importance, but may still provide high value direct traffic.

Links aren’t dead!

If I read another article proclaiming PageRank or link building is dead, I’ll scream. It’s very simple: the scientific data does not support the speculative claims of the reduced value of PageRank or link building.

In fact, in many cases their level of correlation has increased, not decreased, since Moz conducted their 2011 study.

Link related factors are far and above the highest correlated set of factors.

While we in the SEO industry recognise the importance of links, I don’t think we convert this mental idea into action. I don’t believe that SEOs spend the right proportion of their time on link building. And SEO blogs, conferences and experts certainly don’t talk enough about how to do great link building.

There definitely isn’t enough data available on what the best link building strategies are, with the majority of link related blog posts stemming from speculation, not data driven proof, something I hope to address scientifically through this project.

I welcome presentations like this from Mike King, that back up strategies with solid data.

Bottom line – spend a whole lot more time link building.

Domain Authority

New Correlation Data Suggests Enhanced Importance of Site Wide SEO

 

SEOs are huge believers in signals relating to Google’s overall perception of a website.

It makes a lot of sense: if Google can understand that Wikipedia’s articles are typically of a higher standard than eHow’s, then they can make better decisions on the quality and relevance of web pages on these domains.

By using this data search engines can also make quick decisions regarding new content published by these sites. Fresh content won’t have gained the links and other time-related ranking signals of an established article, but may still be relevant to the user. This may be particularly true with news or “query deserves freshness” results.

In addition to gathering data that might indicate the quality of content published on the site, it is thought that Google gathers data on what geographical location, type of user, industry, etc the site targets. Much of this data is difficult or in many cases impossible to gather without being Google, for example a site’s average SERP CTR or bounce rate.

Overall it would be fair to say that Google utilises different models to gather and analyse domain level data pointing to the authority of a website as a whole.

The potential value of domain level factors to the webmaster is immense. If you make a single site-wide improvement, it may impact the ranking of several thousand pages on the site. Domain level SEO offers easy to implement strategies that can hold a much higher ROI than page by page factors.

What data is collected by Google and how much influence it has in the overall ranking of a web page has been theorised and debated for many a year.

Overall, what we will see in this article is that domain authority signals are relatively highly correlated, and that for the most part many of the industry’s theories surrounding these factors have largely been correct, which is refreshing in light of some stunning on page factor correlation data.

The study

Over the past 2 months I have gathered data on 31 domain authority signals, for the top 100 results in Google, for 12,573 keywords.

I have analysed this data using Spearman’s Rank Correlation Coefficient, looking for relationships between individual factors and ranking in Google.

I have also studied several other areas of SEO. I have published some of these results (including domain name related factors and on page factors) although some results haven’t been made public yet and will be published over the coming weeks.

This is all part of a greater project to bring more science to SEO and make it a truly data driven industry.

There are inherent issues with correlations, and they don’t prove anything per se, but as I have covered these issues before I won’t rehash old information. What I will suggest is that if this is your first time on the site, please read this and this.

I would like to thank Link Research Tools for generously providing me with free access to their highly useful API from which all the below correlations are derived.

Please note: while domain level link metrics could be included in this post I have decided to deal with all link related factors in a separate post which will be published in the near future.

Data

[Chart: Domain Authority Signals]

If you wish to see the keyword by keyword correlations that resulted in the mean correlations reported above, feel free to download this spreadsheet with all the relevant data.

Definitions

Here are some handy definitions in case you aren’t sure what some of the above factors are:

  • Domain age is the time since the domain was first registered.
  • PageSpeed rating is Google’s score out of 100 for how well a page performs on several indicators of how quickly it loads. The higher the score, the faster the performance.
  • Days to domain expiry is the time until the domain expires or needs to be re-registered.
  • Alexa and Compete rank are both independent measures of how much traffic a site gets. The lower the score, the more traffic the site is supposedly getting.
  • Basic, intermediate and advanced reading levels are Google measures of what reading standard a given page is at.

 

Trust indicators

[Chart: Domain Trust Indicators]

Google are always trying to figure out how trustworthy a site and its content is. Many theories have emerged as to what factors likely impact the trustworthiness of a whole site.

Domain age is a classic, and while I personally am sceptical about its use as a direct ranking factor, it does seem to have a strong relationship with ranking well in Google, with a near 0.2 correlation, which is highly significant.

How much of this can be written off due to the increased time established sites have had to build links and content, and of course the pure common sense that a site running for a significant length of time will only have survived by providing for users’ needs, is hard to determine. Domain age is a factor that’s impossible to manipulate, worthy of consideration only in the procurement of a new web property.

But by saying that it’s impossible to manipulate, I am then strengthening the case for Google’s use of the factor. So the truth is, it’s difficult to say whether it’s a factor or not. It does correlate well, so I would suggest that if you come across a situation where domain age is being considered, give it some, but not substantial, weight in whatever decision you are making.

Homepage PageRank, and PageRank in general, is one of the most hotly debated topics on the SEO circuit. We all know of the PageRank Toolbar’s problems and its unrepresentative view of the real PageRank Google calculates and uses within their algorithm.

But at the same time, the social data Google may pull from APIs may be more complete than the data I have access to, and the internal Google link graph is even larger than the gigantic SEOMoz link graph, yet we treat these representations of what Google sees as perfectly good.

My point is not that social data and link counts should be disregarded but that perhaps some, if not all of our suspicion at the value of PageRank as a metric is misplaced.

The importance of PageRank is backed up by its mighty performance in the correlation study: it is the highest correlated domain level authority signal, at .244.

This and data on domain level link metrics which I will be publishing in the coming weeks has solidified my view that Google certainly weights and utilises domain link popularity in the ranking of content on a site.

Thus it is reasonable to recommend the already popular theory of building links to the homepage and domain as a whole.

Whether homepage link building warrants special treatment is dubious, and I would in general advise a strategy of building links to the domain as a whole, linking to the homepage only when it feels right and not because of any particular strategy.

Days to domain expiry is an intriguing idea: that how far into the future the webmaster registers a domain is an indicator of the webmaster’s intent to create a long-term user resource.

The marginal correlation of .089 probably suggests minimal or no weight within the algorithm. That said, it is an easy and inexpensive factor to manipulate, and even a marginal boost in search engine performance would be worth the puny risk.

There have been theories in the past which suggest its importance to newly registered sites, which again complies with basic common sense.

I can recommend registering your domain for 3+ years as a simple, one time, SEO strategy that may or may not impact ranking but certainly has no significant downside.

Site size

[Chart: Site Size Correlation Data]

Alexa and Compete rank: I doubt whether the amount of traffic a site gets is a ranking factor, but the significant correlation may be indicative of a deeper positive relationship between Google and larger sites.

Whether this is due to ranking factors in favour of larger sites, these sites performing better in non-discriminative factors or something else is worth pondering.

What I will say is that, in general, sites are large because they are useful to users, and it’s a search engine’s job to try to find sites that are helpful and useful for users.

The same logic should hold for the number of pages in Google’s index for a site; while this is highly unlikely to be a direct ranking factor, it is perhaps an indicator of other factors actually implemented in the algorithm.

If the data is taken at face value, then it would appear somewhat surprising that larger sites are performing worse, although the reliability of Google’s provision of this data appears to have impacted results.

I would like to test this factor and other similar indicators further before drawing a definite conclusion.

Geographic targeting

[Chart: IP Location of Web Server]

The near random correlations for the geographic location of host servers are not surprising, and in fact not very interesting at all.

I tested it purely to check whether there was any significant correlation, but I didn’t expect there to be, as I conducted the searches from which these correlations are drawn on Google.com.

The theory of geographic targeting is largely purported to apply outside the USA. In the future I hope to conduct studies on non-US versions of Google and to recheck this factor, but for the meantime the data is inconclusive and the current theories within the industry on server location should be followed.

Reading Levels

[Chart: Homepage (Google) Reading Levels]

While the data is somewhat flawed, in that Link Research Tools didn’t return data for a significant number of domains for this factor, and in that homepage reading levels may not match page level reading levels, the idea and the testing of such a factor is very interesting.

It is something that I believe Google to be using as a factor in the personalisation of search results. For example, if they have figured out that you are an eight year old, then maybe you don’t want Shakespeare or research papers returned; you want content written in the language that you, as an eight year old, use. Not to mention the fact that not many eight year olds are searching for “Macbeth” or “quantum physics”.

A broad correlation study is not conducive to making a recommendation on what language you as a webmaster should use, but it is an interesting topic and something that you should consider when you are writing. Who are your audience and are you writing in their language?

Registrar

[Chart: Domain Registrars]

This was a rather cheeky test, and was never likely to reveal a ranking factor; more likely it represents the success achieved by sites registered through the various registrars.

I wasn’t surprised to see GoDaddy with the worst correlation, as its add-on products and clientèle don’t quite indicate quality or high editorial standards, not that many registrars do.

Once you understand, and are disciplined in, your implementation of SEO and general website ownership standards and strategies, then the registrar you choose shouldn’t impact your ranking. But if you are new to the game or likely to be led astray, then a registrar and host that promotes these standards may prove a more fruitful path.

Miscellaneous

[Chart: Other Domain Authority Signals]

The PageSpeed rating is important: it suggests that if a site follows good principles with regard to the loading of content, it will be rewarded with higher rankings. Tests on a page by page basis would be even more conclusive, but this reasonably high correlation for homepage level PageSpeed vindicates some of the excitement generated by Google announcing it used site loading speed in rankings.

The incredibly large correlation for both total and nofollowed external links on the homepage of a site is puzzling to say the least, although the internal data seems more explainable.

While I have some ideas on what may be causing such large correlations, primarily surrounding the type of site that would link to another website from its homepage, I have no real explanation. If you have an idea, guess or have experienced this in the field then please leave a comment below the post.

Social metrics

[Chart: Homepage Level Social Media Metrics]

Wow! I saved the best till last.

Some super interesting social media correlations, with the general theme being that social media is really important.

The fact that Facebook and Google + links to the homepage of a site are the lowest correlated of the bunch is rather strange. The Facebook data could be explained by a possible block on Google accessing FB data. But Google Plus?

Perhaps this indicates that homepage social media shares are not used as a ranking factor, but that the other social networks have such strong user bases recommending quality content that these shares actually represent a measure of the quality of the site as a whole, hence the high correlation.

Also, the fact that Google + has a relatively small user base may mean that its disruptive influence on other factors is minimised, e.g. links arising from the additional traffic sent to a site by high levels of sharing on Google +.

Another explanation is that Google is using Digg, Reddit and StumbleUpon data more than we know about and we should focus more effort on these social networks and Twitter.

But again I’m not certain what these correlations mean; if you have any ideas about them, or you have seen Reddit, Digg or StumbleUpon marketing result in increased rankings for your site, then please leave a comment below.

Further study of these factors on a page level basis would tell us more about these speculations.

Summary

The correlations for domain level authority signals are comparatively higher than those seen for on page factors.

Domain level factors are ideal starting points for an SEO and often provide a one time, easy change that could, based on the above results, have a substantial impact on ranking.

Even if you disregard the individual factors above as ranking signals, it would still be more than fair to conclude that domain level SEO is very powerful and you should be constantly trying to improve the domain, through site-wide enhancements.

Some of the results, in particular the social and homepage links are somewhat puzzling and I am looking forward to hearing what people think are the likely causes of such strong correlations.

I will be publishing the link related domain authority factors in the coming weeks, so stay tuned.

On Page Factors Correlation Data

Sometimes the least interesting results are the most interesting.

Long gone are the days of stuffing meta data and H1 tags full of keywords and getting high rankings; all SEOs worth their salt know that. What we will see over the course of this post is a shocking lack of correlation for the traditional factors that are drilled into all newcomers to SEO by a media machine fuelled by outdated information.

Many industry leading pundits have touted such factors, and sentiment surveys run by SEOMoz show that many experts are likely to be massively out of touch with the true state of SEO.

In the coming days we will also see some fascinating results for other apparently less “in vogue” factors.

In this post we will examine the correlations I have found for 41 on page factors using Spearman’s Rank Correlation Coefficient and shed some subjective light on the possible meaning of these findings.

If you haven’t already, I highly recommend reading up on what a correlation really means, viewing some of my earlier findings to be used as a benchmark against the below findings and if this is your first time on the site make sure you read “what the project is all about“.

Once again I will warn of the dangers of taking correlations at face value and the importance of a more holistic view of the data below; the above links explain my reasoning.

Quick Recap

TheOpenAlgorithm is a project to bring more science to SEO. The initial phase is a correlation study examining the relationship between the ranking of the top 100 results for over 12,000 keywords and over 150 factors that might impact ranking in Google.

I’ve already published a bunch of data and will continue to do so in the near future.

Looking more long-term, I hope to test more factors on more data and ultimately prove causation between factors and ranking in Google.

There are also some people who have been vital to the success of the project, who I’d like to thank.

Data

[Chart: On Page Factors' Correlations]

For the eager beavers among us who may enjoy seeing the individual correlation calculations for each keyword I examined, here’s a handy Excel sheet with everything you need.

Analysis

The first thing that strikes you with regard to the chart is the startlingly low correlations for each of the above factors.

Let’s go through the most important correlations:

Title Tags

[Chart: Title Tag Related Factors]

Historically, title tags are a favourite of SEOs. Consistently cited as important, even the steady demise in the value of on page factors doesn’t seem to have affected their weight in the average webmaster’s mind.

In fact the two factors with the highest sentiment in a recent Moz survey were; having the keyword present in the title tag and having it at the start of the title. The correlations in both this and the Moz study, almost directly contradict the industry’s leading thinkers.

The correlations of all the title tag related factors recorded above are so close to zero that they can be considered random.

Classic factors such as having the keyword in your title tag, starting the title with a keyword or using variations of the keyword multiple times all showed near random correlation.

What this likely means is that these factors have little or no bearing on the ranking of a web page in Google.
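For reference, the title tag factors themselves are cheap to compute; here is a sketch assuming BeautifulSoup for the HTML parsing (the study’s own implementation isn’t shown):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def title_factors(html: str, keyword: str) -> dict:
    """Binary title tag factors for one page/keyword pair."""
    soup = BeautifulSoup(html, "html.parser")
    title = (soup.title.get_text() if soup.title else "").strip().lower()
    kw = keyword.lower()
    return {
        "keyword_in_title": int(kw in title),
        "title_starts_with_keyword": int(title.startswith(kw)),
    }

title_factors("<title>SEO Tips for Beginners</title>", "seo tips")
# {'keyword_in_title': 1, 'title_starts_with_keyword': 1}
```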

Meta Keywords

[Chart: Meta Keywords Related Factors]

Despite knowing that Google don’t take the meta keywords tag into account when ranking a web page, I decided to test related factors for two reasons: first, to check whether there were penalties for using it, and second, as a handy benchmark for what can be considered a random correlation in this field of factors.

As we can see, having the keyword in the meta keywords is the 4th most positively correlated factor out of the 41; thus anyone trying to justify the importance (based on correlation data) of any other factor with a similar level of correlation is being silly.

There doesn’t seem to be any penalty incurred as a result of using the meta keywords tag, although I wouldn’t rule it out, with abuse of the tag likely being one of several ingredients Google may use as a signal for possible low-quality, over-optimized or spam content.

Meta Description

[Chart: Meta Description Related Factors]

Again, one of the great stalwarts of old-school SEO is shown to have little or no value to the ranking of a web page. But before you throw out the meta description tag, I must recognise that it has other value in allowing you to control (to some degree) the highly important snippet shown to users on Google SERPs.

Heading Tags

[Chart: Heading Tags Correlation Data]

The use of keywords in H1, H2, H3 and H4 tags is one of webmasters’ more popular factors in search engine optimization, despite lately being discounted as less important by a number of SEOs.

We see no significant correlation for these factors in our study, and thus no reason to consider them important or vital to the successful SEO strategy utilised by the intelligent webmaster.

Images

[Chart: Image Related Factors]

One of my old personal favourites: keyword usage in image names, alts and titles has been trending for quite a while now, primarily due to its relatively reduced popularity in the pre/early-Google heyday of SEO, resulting in its exclusion from the typical list of on page factors Google may be ignoring.

While these factors may be important in getting traffic from Google Images, it appears likely that they are indifferent towards increased search engine ranking.

Links

[Chart: Links On Page Factors]

While I would have liked to test some more advanced factors related to outbound and internal links, unfortunately that will have to wait until the second iteration of the study as I lacked the required computing power to test such factors.

For the factors I did test, some of which were fairly original and I was quite excited about, I once again saw close to random correlation.

Furthermore, I can advise against abnormal strategies with the anchor text or nofollow tags of outbound or internal links for the SEO benefit of the linking page. I recommend linking in the most user-friendly manner and using the nofollow tag for its intended purpose of flagging potentially editorially unsound links.

Unique Tags

[Chart: Unique HTML Tags]

I was also pretty interested to see the correlation for the presence of the canonical tag on a page, which I thought might be a good sign of an astute webmaster protecting against potential duplicate content penalties. But I was also worried about a possible cancelling effect caused by web pages groomed for Google utilising the tag the most. I saw no significant correlation, though I would still advise you to use the tag to protect against Google crawler errors.

Noframes and noscript tags are good indicators of user-conscious sites using iframes or JavaScript respectively, and are easy-to-test indicators of sites using these technologies. In the future I may look at different methods to test directly whether utilising iframes or JavaScript impacts ranking, but for now no definite conclusion can be drawn.

Rel=author and rel=me are new tags implemented to allow writers to identify their work on various sites and to allow Google to use that information to provide more relevant search results to users. In addition they are of course very closely related to a rather hot topic of late: Google + profiles and their importance to SEO. The mere presence of these tags has no correlation, but perhaps in future studies the power of the linked Google + profile may be a factor I test.
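As with the title factors, these presence flags are simple to extract; a sketch, again assuming BeautifulSoup:

```python
from bs4 import BeautifulSoup

def unique_tag_factors(html: str) -> dict:
    """0/1 presence flags for the tags tested above."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "canonical":  int(soup.find("link", rel="canonical") is not None),
        "noframes":   int(soup.find("noframes") is not None),
        "noscript":   int(soup.find("noscript") is not None),
        "rel_author": int(soup.find(rel="author") is not None),
        "rel_me":     int(soup.find(rel="me") is not None),
    }
```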

Length

[Chart: HTML Length]

The length of the HTML, and of the HTML within the <body> tag, were the highest correlated factors; in fact, with correlations of .12 they could be considered somewhat, if not hugely, significant.

While these factors probably are not implemented within the algorithm, they are good signs of what Google is looking for; quality content, which in many cases means long or at least sufficiently lengthy pages.

Summary

The consistent lack of significant correlation for on page/HTML related factors many of which are reputed to be highly important to SEO is an indictment of some of the poor information available to SEOs.

You’re relevant, but are you quality?

One mitigating factor stopping us all abandoning the old-school SEO strategies is a theory that may explain some of the low correlations: that once Google identifies a web page as relevant to a keyword, based on some on page factors, they may throw that relevance to one side (or weight it insignificantly) and then look at other factors such as links, website authority and social media as a measure of quality.

This would mean it would still be important (although less so) to use keywords in some of these areas, to make sure Google understands you are relevant to the query, and then to focus on other, higher correlated factors.

This is of course merely a theory and not one that would require much implementation, assuming that if you are targeting a keyword you will already have quality content relevant to that keyword.

Tests on more results per query, e.g. the top 1,000 or 10,000 results per query, or on a set of less competitive keywords, may be more revealing as to the viability of this theory. Without such tests or causal data available, I cannot be 100% sure in my devaluation of the importance of these factors, but the data available strongly suggests such a devaluation.

If you can manipulate it, Google probably aren’t using it

Arising from these results and the sustained trend of Google moving away from easily manipulated factors I have come to a pretty common sense and simple conclusion: that if a factor is directly and easily manipulated by a webmaster, the weight assigned to that factor is likely to be relatively low.

This would therefore presumably apply to Schema.org, an initiative I believe to be of high importance to search engine optimization. Thus this rule of thumb may be proven incorrect or vindicated in the future; I guess we will just have to wait and see.

Trust the numbers, not the opinions

While I am sure that this post will be fairly controversial as it runs against many long-held beliefs within the industry, the findings have been backed up by the similar results seen in the SEOMoz study and by Google’s continued stressing of the importance of focussing on quality not arbitrary on page factors.

Thus we must learn to trust the numbers. While of course numbers aren’t perfect, and I have outlined potential flaws, at least they aren’t open to manipulation by human emotions based on untested and unvindicated opinions that have outlasted their sell-by date.

These results in particular show the need for emotionless analysis of SEO theories, and the value, or lack thereof, of opinions and observations not backed by scientifically significant levels of data.

Was Matt Right?

How many times has Matt Cutts said not to focus on individual on page factors but to focus on creating a page of value for the user?

Many of us, myself included, fooled into believing in the unquestionable value of these factors, saw this as a classic attempt to shape our ideas of “what SEO is”, and as Cutts trying to stop people from manipulating weaknesses in the Google algorithm.

As it turns out he was probably speaking the gospel that I personally and many others needed, yet ignored.

The Self-Fulfilling Cycle of Industry Drivel

One clear observation arising from the above results is the need to question commonly-held beliefs not just within SEO but everywhere.

The problem with SEO is that search engines are always changing, theories established 3 years ago may be outdated and need continued testing and re-assessment.

But what we have seen within some areas of the industry is a belief becoming so ingrained in the SEO psyche that it is near taboo to question it. It is indoctrinated into every individual entering the industry.

As a result, with no particular scientific proof, authors (including myself) wrote about the importance of such factors, and strategies to maximise gain from the manipulation of these factors.

This only further fuels the fire and increases the apparent importance of these arbitrary data points.

Essentially it’s a house of cards, and hopefully this data and further tests will prove to people that these factors just aren’t that important, and as a result the house will tumble. Although I suspect such a major change in the beliefs of an industry is harder won than this.

 

While I don’t rule out the viability of on page SEO, I believe that Google use much higher level factors than we within the industry have thought about or used. I am aware of enhanced relevancy algorithms such as TF*IDF and think these efforts at better understanding Google are admirable.

On the whole I am fairly pessimistic about the value of on page SEO under current theories and strategies. I would suggest SEOs spend no more than 5% of their time on the subject, much of which should be focussed on creating a structure that generates well optimized content (similar to many of the WordPress features).

I would recommend focussing marketing budgets on creating unique, high quality content, optimized for the user and focus the majority of the SEO effort on building a beautiful, fast website architecture, link building, social media exposure and other well correlated factors.

Domain Name SEO

Last week I published data on 342,740 domains that I extracted from the dataset I have built for TheOpenAlgorithm project.

What I looked at was mostly from a user’s point of view: not very scientific, but pretty significant. I wanted to show what the user was used to and found normal in terms of domains: what domain TLD (.com, .org, .net, etc.) they expected, plus some number crunching on average domain lengths.

If you don’t have time to read the post there were two basic findings.

  • If you’re buying a domain, get a .com. It turns out .com domains are heavily represented in the dataset, with 78% of the domains I looked at being a .com. Even if you want a local TLD, e.g. .fr or .ie, or you want to go for .org, .net, .info, etc., it’s really important that you try to get your hands on the .com version of the domain; users are so used to it that they assume a domain is .com unless it’s particularly obvious to them.
  • The second finding, if less interesting, was that the average domain length was 15 characters, with the most common length being 8 characters. Obviously common sense applies here: shorter = better, but the domain must make sense to your audience.

 

Today I’m publishing my first correlation data!

If you’re not sure what a correlation study is or what it’s good for, read this article!

A correlation shows the size of the relationship between ranking well and whatever factor I am testing.

All correlations are between -1 and +1. A negative (-) number means a negative relationship, i.e. as you do more of it, your rankings are likely to suffer; a positive number means that as you do more of it, your rankings are likely to increase.

The closer the correlation is to either extreme (-1 or +1), the stronger the relationship, i.e. the more powerful/important the factor.

For example a 0.8 correlation is more significant than a 0.3.

Correlation studies aren’t perfect and don’t prove causation. If you haven’t heard of correlations before then this post on what they are, how to do one and what are their weaknesses will be very helpful.
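As a toy illustration of the sign convention (in Python with scipy, not part of the study itself):

```python
from scipy.stats import spearmanr

ranking_score = [5, 4, 3, 2, 1]   # 5 = ranks first ... 1 = ranks last
factor_up     = [10, 8, 9, 4, 2]  # tends to grow as rankings improve
factor_down   = [1, 3, 2, 8, 9]   # tends to grow as rankings worsen

rho_up, _ = spearmanr(ranking_score, factor_up)      # +0.9
rho_down, _ = spearmanr(ranking_score, factor_down)  # -0.9
```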

I’ll be looking at a few factors at a time and then publishing the data on that set of factors. I’ll be posting more data on various factors over the coming weeks and months, and I hope to break it down into actionable chunks and sections within the algorithm that SEOs/readers can digest easily.

But today I’m looking at the factors that SEOs and new webmasters should consider before they buy a domain name.

Domain TLD

Here’s the correlation data for the TLD a domain had.

[Chart: Domain TLDs/Endings Correlations]

 

The correlations are very low, which likely means that there is little or no relationship, but the minor correlations do show a mild preference for .com and .org domains, while .net, .info and .us were negatively correlated.

It’s hard to say what is causing this small correlation. We know Google has taken action against individual TLDs in the past, and it seems probable that some TLDs are targeted as spam or credible, but the small correlations seen here are more likely to be a combination of Google’s preferences in domain TLDs and the types of sites that register each domain TLD.

While the correlation data isn’t that significant, it does seem to be in line with other data I have seen (some of which is below), which leads me to believe there is merit in it; if you were down to a straight shoot-off between two domain TLDs (and couldn’t get both), this would be a neat tie-breaker.

Probably not something to get too worried about but interesting data.

As you will see throughout this post .com domains are way out in front and .org domains are the next best, but it appears that beyond those you are in the danger zone unless you are registering a local TLD like .fr or .co.uk.

Exact Match Domains

One of the most heralded and contentious factors among SEOs is exact match domains, i.e. domains that are the same as what the user searched for, e.g. keyword: “the open algorithm”, exact match = http://www.theopenalgorithm.com.
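To make the definition concrete, here is a sketch of the check, assuming simple single-part TLDs like .com (multi-part TLDs like .co.uk would need a suffix list) and Python 3.9+; it also covers the hyphenated variant discussed later:

```python
def is_exact_match(domain: str, keyword: str, hyphenated: bool = False) -> bool:
    """Does the domain name (TLD stripped) exactly match the keyword?"""
    sep = "-" if hyphenated else ""
    name = domain.lower().removeprefix("www.").rsplit(".", 1)[0]
    return name == keyword.lower().replace(" ", sep)

is_exact_match("theopenalgorithm.com", "the open algorithm")          # True
is_exact_match("the-open-algorithm.com", "the open algorithm", True)  # True
```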

Previous studies by SEOMoz have shown it to be a very powerful factor.

There are a number of points to be aware of before reviewing the findings.

Exact match domains make sense from a user’s point of view. If a domain is an exact match for the search query then it is likely to be relevant to the query. In addition it is possible that the user is searching specifically for that site.

There’s a potential anchor text boost in having an exact match domain. When you own an exact match domain it is likely that more sites will link to you with the keyword as the anchor text, for example I would imagine that tons of webmasters link to SearchEngineLand just as I did there, with the site’s name as its anchor text.

As a result there is the potential for the correlation for exact match domains to be slightly inflated due to its presumed benefit to anchor text related factors, unless Google have a clause in their algorithm to negate this effect.

If a domain is a company or some organisation then surely the fact that it is an entity is a better reason to show an exact match domain as opposed to purely the fact that it is an exact match domain. For example seomoz.org is highly relevant to the search query “seomoz”, but is seo.com the best for the search query “SEO”?

We know that Google do a reasonable job of figuring out whether a domain is an entity or not, which some SEOs believe is the reason why exact match domains do so well, and why as an SEO tactic they might not be as powerful as the data suggests. This is down to the fact that domains that are entities, e.g. “seomoz.org”, are exact matches when people search for that entity. The argument is that these entity searches inflate the apparent performance of the non-entity domains that happen to be exact matches, in these types of studies and in my dataset.

Having said all that, I’ll let the data talk and you can infer your own conclusions.

[Chart: Exact Match Domains]

 

Comments: These correlations are really interesting and very significant. They are pretty much in line with SEOMoz’s results, although they only tested for exact match and exact match .com domains. I have run tests on other, less influential factors, and comparing the above data to those results and the SEOMoz correlation study, it seems that exact match domains are one of the most powerful factors, but they seem to have declined in significance over the last few years.

SEOMoz reported a 0.38 correlation for .com exact matches in their original study and then a 0.22 correlation in their latest, my data shows a further decline which seems important to note.

I would speculate that this decrease in the correlation of exact match domains can be attributed to Google refining their algorithms to detect when an exact match domain is bought for ranking benefits and when it is an entity that deserves the push up the rankings.

The correlation data broken down into the various domain TLDs is important because, beyond .com and .org exact matches, there is a significant drop-off in the relationship between having one and ranking better in Google.

There are a couple of reasons why this may be the case:

  • Google may value these domain TLDs less and implement algorithms to penalise (or reward others more) for exact match domains.
  • The domains with the less common TLDs may have been bought because there is less demand for them and therefore getting a domain that the webmaster believes will rank well is easier. Thus it would be other algorithmic factors (potentially the entity extraction ones) that would penalise the non .com exact matches.

This information is very important if you are registering a domain: it seems highly likely that something, either directly in the algorithm or an indirect factor, is causing .com exact matches to rank significantly higher than their counterparts.

I parsed the search results from proxies within the US, which doesn’t have a widely used national TLD (although .us is its official one, in practice it’s not widespread and doesn’t hold the value of other local TLDs), so it’s hard to tell whether other local TLDs, for example .co.uk or .ie, which are more popular in their home countries, would have similar correlations to .com exact match domains.

The correlation would most likely vary based on the country and how the local TLD is used and managed in that country.

In the future I hope to run crawls from proxies within other English speaking countries with these prominent local TLDs, and will then be able to answer that question.

Note: The amount of data available for .info and .us domains wasn’t anywhere near as much as for the other TLDs, and thus the size of the scientific error is likely to be higher. I have a very large dataset (1.2 million URLs), and because of this, and the fact that the results seem in line with both common sense and the TLD correlation data above, I suggest that they are relatively accurate, though likely to be slightly less accurate than the .com, .org and .net correlations.

Hyphenated Exact Match Domains

Ah yes, the good old hyphenated exact match. Often portrayed as the next best option if you can’t get that exact match .com domain.

Let’s see if that portrayal has merit:

[Chart: Hyphenated Exact Match Domains Correlation Data]

 

Wow. That’s really interesting. Hyphenated exact match domains are just nowhere near as correlated as exact match domains without the hyphens.

Plus they aren’t very user friendly, so maybe it’s time to rethink our strategy on hyphens?

The only caveat vs. the non-hyphenated correlation data is that a site using hyphens in its domain name is again even less likely to be an entity.

Once again we see .com on top, with .org in second.

Note: due to insignificant quantities of data I didn’t test .info or .us domains for this factor.

More Data

Now we’re delving into some more unique factors.

[Chart: More Domain Related Correlation Data]

Partial match domain is when you have the keyword in your domain but it’s not an exact match domain, e.g. “tech” and techcrunch.com.

Partial match ratio means the percentage of the domain taken up by the partial match. With the TechCrunch example it would be: “tech” = 4 characters, domain (“techcrunch”) = 10 characters, partial match ratio = (4/10)*100 = 40%.
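Here is the same arithmetic as a short sketch (per the note below, exact matches are excluded before this factor is computed):

```python
def partial_match_ratio(domain_name: str, keyword: str) -> float:
    """% of the domain name (TLD already stripped) covered by the keyword."""
    kw = keyword.replace(" ", "").lower()
    name = domain_name.lower()
    if kw not in name or kw == name:  # no match, or an exact match: excluded
        return 0.0
    return len(kw) / len(name) * 100

partial_match_ratio("techcrunch", "tech")  # 40.0
```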

 

Note: the * beside partial match domain, partial match ratio and keyword is first word in domain means that I have excluded exact match domains when calculating the correlations for these factors. This is just common sense, because Google wouldn’t reward exact match domains twice for a very similar factor.

Also note that the negative correlations for the number of characters and hyphens in the domain name mean that as the length/number of characters/hyphens in the domain name increases, the ranking of that domain decreases, i.e. longer/more is worse, shorter/fewer is better.

I suspect that the number of characters in a domain name is not something Google worries about, unless it penalises very long domains. But what this large negative correlation most likely shows is that there are other factors that are impacted by having a long domain name.

For example, the social shareability of your domain is reduced, because in social cyberspace shorter = better. Also, websites probably lose out on that brand factor or potential type-in traffic due to the increased character length.

With the low correlations for partial match domains and the partial match ratio, it appears as if having the keyword in part of your domain isn’t very beneficial. It’s either exact match or forget it.

Summary

There’s a ton of data to digest here with 20 factors tested.

.com domains, whether tested on their own or as part of the exact and hyphenated exact match domain factors, came out on top, with .org consistently in second.

If you’re buying a domain, .com came out a convincing winner in both the correlation data and the user/usage data.

Exact match domains are very significantly correlated with ranking well, but there is a significant drop-off in influence for non-.com domains. .org exact matches were relatively well correlated, but beyond that there was a continued progression towards nearly no relationship.

It is likely that exact match domains bought on less popular TLDs, e.g. .info or .net, are either targeted directly by Google for looking suspicious, in that they are more likely to be bought for their ranking potential, or they are bought for their ranking potential and Google penalises them through their entity detection algorithms or other factors.

Hyphenated exact matches beyond .com ones held nearly no correlation.

Domains with fewer characters and fewer hyphens in the domain name did significantly better.

And having a partial match domain, even one with a high partial match ratio, had only marginal benefit.

Actionable takeaways

Of course the data doesn’t prove causation, but with some common sense and mental analytics I have come up with a handy list of takeaways for the next time you are buying a domain:

  • Buy .com, .org, or a local TLD (.com preferably).
  • Avoid other TLDs like the plague!
  • Search hard for an exact match, but don’t dilute the brand of the site to get one.
  • No hyphens please (unless absolutely necessary).
  • Shorter is way better.
  • If you can’t get an exact match, don’t compromise branding to get a partial match; it’s not worth it (although having your main keyword in the domain name might be a good branding idea).
  • Entities are important, own your space with marketing, PR, clever link building, microdata, etc.

Hope you enjoyed the post and if you have any thoughts, ideas, criticisms, possible explanations please leave a comment below.