On Page Factors Correlation Data

Sometimes the least interesting results are the most interesting.

Long gone are the days of stuffing metadata and H1 tags full of keywords and getting high rankings; all SEOs worth their salt know that. What we will see over the course of this post is a shocking lack of correlation for the traditional factors that are drilled into all newcomers to SEO by a media machine fuelled by outdated information.

Many industry leading pundits have touted such factors, and sentiment surveys run by SEOMoz show that many experts are likely to be massively out of touch with the true state of SEO.

In the coming days we will also see some fascinating results for other apparently less “in vogue” factors.

In this post we will examine the correlations I have found for 41 on page factors using Spearman’s Rank Correlation Coefficient and shed some subjective light on the possible meaning of these findings.
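For readers who have not met Spearman's coefficient before, here is a minimal Python sketch of how it can be computed for one keyword. It ranks both lists (ties share an average rank) and then takes the ordinary Pearson correlation of the ranks. The function names and the six-result sample data are illustrative assumptions, not the study's actual code or data.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole block of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based position of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else None  # None: a constant input has no defined correlation

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical data for one keyword: result positions 1..6 and a 0/1 flag
# for a factor such as "keyword in title".
positions = [1, 2, 3, 4, 5, 6]
keyword_in_title = [1, 0, 1, 1, 0, 0]

# Negate the position so a positive rho means "factor present goes with better ranking".
rho = spearman([-p for p in positions], keyword_in_title)
```

Note that a factor with the same value on every page returns `None` here, because a correlation is undefined when one of the inputs never varies.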

If you haven’t already, I highly recommend reading up on what a correlation really means and viewing some of my earlier findings, which serve as a benchmark for the findings below. If this is your first time on the site, make sure you read “what the project is all about”.

Once again I will warn of the dangers of taking correlations at face value and of the importance of taking a more holistic view of the data below; the above links explain my reasoning.

Quick Recap

TheOpenAlgorithm is a project to bring more science to SEO. The initial phase is a correlation study examining the relationship between the ranking of the top 100 results for over 12,000 keywords and over 150 factors that might impact ranking in Google.

I’ve already published a bunch of data and will continue to do so in the near future.

Looking more long-term, I hope to test more factors on more data and ultimately prove causation between factors and ranking in Google.

There are also some people who have been vital to the success of the project, who I’d like to thank.

Data

Chart: On Page Factors' Correlations

The eager beavers among us may enjoy seeing the individual correlation calculations for each keyword I examined; here’s a handy Excel sheet with everything you need.

Analysis

The first thing that strikes you with regard to the chart is the startlingly low correlations for each of the above factors.

Let’s go through the most important correlations:

Title Tags

Chart: Title Tag Related Factors

Historically, title tags are a favourite of SEOs. Consistently cited as important, even the steady demise in the value of on page factors doesn’t seem to have affected their weight in the average webmaster’s mind.

In fact the two factors with the highest sentiment in a recent Moz survey were; having the keyword present in the title tag and having it at the start of the title. The correlations in both this study and the Moz study almost directly contradict the industry’s leading thinkers.

The correlations of all of the title tag factors recorded above are so close to zero that they can be considered random.

Classic factors such as having the keyword in your title tag, starting the title with a keyword or using variations of the keyword multiple times all showed near random correlation.

What this likely means is that these factors have little or no bearing on the ranking of a web page in Google.

Meta Keywords

Chart: Meta Keywords Related Factors

Despite knowing that Google don’t take the meta keywords into account when ranking a web page, I decided to test related factors for two reasons: the first was to check whether there were penalties for using the tag, and the second was to have a handy benchmark for what can be considered a random correlation in this field of factors.

As we can see, having the keyword in the meta keywords is the 4th most positively correlated factor out of the 41; thus anyone trying to justify the importance (based on correlation data) of any other factor with a similar level of correlation is being silly.
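Another way to get a feel for what "random" looks like is a shuffle test: keep the ranking fixed, shuffle a factor across the pages so that any real link is destroyed, and see how large the correlation gets by chance alone. The sketch below does this for a single keyword with 100 results. The factor (present on 30 of 100 pages), the seed and the function names are all made-up assumptions for illustration; this is not a procedure taken from the study.

```python
import random
from statistics import mean, pstdev

def ranks(values):
    """Average ranks (ties share the mean of their positions); O(n^2) but short."""
    return [sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
            for x in values]

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def shuffled_rhos(position_ranks, factor_ranks, trials, rng):
    """Spearman's rho after repeatedly shuffling the factor across pages.

    Shuffling the values just permutes their ranks, so the rank vector is
    computed once and shuffled directly.
    """
    shuffled = list(factor_ranks)
    out = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        out.append(pearson(position_ranks, shuffled))
    return out

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 100                 # results per keyword, as in the study
position_ranks = ranks(list(range(1, n + 1)))
factor_ranks = ranks([1] * 30 + [0] * 70)  # hypothetical binary factor

null = shuffled_rhos(position_ranks, factor_ranks, trials=1000, rng=rng)
null_mean = mean(null)      # centre of the chance-only correlations
null_spread = pstdev(null)  # typical size of a chance-only correlation
```

Comparing a measured correlation against `null_spread` plays the same role as the meta keywords benchmark: it gives a yardstick for how far from zero a value has to be before it is worth discussing.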

There doesn’t seem to be any penalty incurred as a result of using the meta keywords tag, although I wouldn’t rule it out: abusing the tag is likely one of several ingredients Google may use as a signal of possible low-quality, over-optimized or spam content.

Meta Description

Chart: Meta Description Related Factors

Again, one of the great stalwarts of old SEO is shown to have little or no value to the ranking of a web page. But before you throw out the meta description tag, I must recognise that it has other value: it allows you to control (to some degree) the highly important snippet shown to users on Google SERPs.

Heading Tags

Chart: Heading Tags Correlation Data

The use of keywords in H1, H2, H3 and H4 tags is one of webmasters’ more popular factors in search engine optimization, despite lately being discounted as less important by a number of SEOs.

We see no significant correlation for these factors in our study and thus no reason to consider them important or vital to the successful SEO strategy utilised by the intelligent webmaster.

Images

Chart: Image Related Factors

One of my old personal favourites: keyword usage in img names, alts and titles has been consistently trending for quite a while now, primarily because these factors were relatively unpopular in the pre/early-Google heyday of SEO and so were left off the typical list of on page factors Google may be ignoring.

While these factors may be important in getting traffic from Google Images, it appears likely that they make no difference to search engine ranking.

Links

Chart: Links On Page Factors

While I would have liked to test some more advanced factors related to outbound and internal links, unfortunately that will have to wait until the second iteration of the study as I lacked the required computing power to test such factors.

For the factors I did test, some of which were fairly original and I was quite excited about, I once again saw close to random correlation.

Furthermore I can advise against using abnormal strategies in terms of anchor text or nofollow tags of outbound or internal links for the SEO benefit of the linking page. I recommend linking in the most user-friendly manner and using the nofollow tag for its intended purpose of flagging potentially editorially unsound links.

Unique Tags

Chart: Unique HTML Tags

I was also pretty interested to see the correlation for the presence of the canonical tag on a page, which I thought might be a good sign of an astute webmaster who is protecting against potential duplicate content penalties, but I was also worried about a possible cancelling effect caused by web pages groomed for Google utilising the tag the most. I saw no significant correlation, though I would still advise you to use the tag to protect against Google crawler errors.

Noframes and noscript tags are good indicators of user-conscious sites using iframes or JavaScript respectively. These factors are easy-to-test indicators of sites using these technologies. In the future I may look at different methods to test directly whether utilising iframes or JavaScript impacts ranking, but for now no definite conclusion can be drawn.

Rel=author and rel=me were new tags implemented to allow writers to identify their work on various sites and to allow Google to use their information to provide more relevant search results to users. In addition they are of course very closely related to a rather hot topic of late: Google+ profiles and their importance to SEO. The mere presence of these tags has no correlation, but perhaps in future studies the power of the linked Google+ profile may be a factor I will test.

Length

Chart: HTML Length

The length of the HTML and of the HTML within the <body> tag were the most highly correlated factors; in fact, with correlations of .12 they could be considered somewhat, if not hugely, significant.
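To put a figure of .12 in context, it helps to separate two questions: could a value this size arise by chance, and how much of the ranking does it account for? The back-of-envelope sketch below uses the large-sample approximation that, with no real association, Spearman's rho for n results has a standard deviation of roughly 1/sqrt(n - 1). Treating the published figure as a mean over independent keywords is an assumption made here for illustration, as are the function and variable names.

```python
import math

def null_standard_error(results_per_keyword, keywords=1):
    """Approximate SD of a mean Spearman rho when there is no real association.

    Uses SD(rho) ~ 1 / sqrt(n - 1) for a single keyword and assumes the
    keywords are independent when averaging (a simplification).
    """
    return 1.0 / math.sqrt((results_per_keyword - 1) * keywords)

rho = 0.12         # the HTML length correlation quoted above
n, k = 100, 12000  # study size from the recap; the exact usable counts are an assumption

z_single = rho / null_standard_error(n)     # one keyword on its own
z_pooled = rho / null_standard_error(n, k)  # a mean taken over all keywords
shared_rank_variance = rho ** 2             # share of rank variance accounted for
```

Under these assumptions a single keyword showing .12 is well within chance, while a mean of .12 over thousands of keywords is not. Even so, the squared value suggests the factor lines up with only a small share of the variation in rank.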

While these factors probably are not implemented within the algorithm, they are good signs of what Google is looking for: quality content, which in many cases means long or at least sufficiently lengthy pages.

Summary

The consistent lack of significant correlation for on page/HTML related factors, many of which are reputed to be highly important to SEO, is an indictment of some of the poor information available to SEOs.

You’re relevant, but are you quality?

One mitigating factor stopping us all abandoning the old school SEO strategies is a theory that may explain some of the low correlations. That is that once Google identifies a web page, based on some on page factors, as relevant to a keyword, they then may throw that to one side (or insignificantly weight the relevance level) and then look at other factors such as links, website authority, social media, etc as a measure of quality.

This would mean that it is still important (although less so) to use keywords in some of these areas, to make sure Google understands you are relevant to the query, and then to focus on other, more highly correlated factors.

This is of course merely a theory and not one that would require much implementation, assuming that if you are targeting a keyword you will already have quality content relevant to that keyword.

Tests on more results per query (e.g. the top 1,000 or 10,000 results per query), or on a set of less competitive keywords, may be more revealing about the viability of this theory. Without such tests or causal data I cannot be 100% sure in my devaluation of the importance of these factors, but the data available strongly suggests such a devaluation.
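The theory can be illustrated with a toy simulation. In the Python sketch below, every page gets a random relevance score and a random quality score; pages that pass a relevance gate are placed first, and within each group pages are ordered purely by quality. This two-stage model, the 5,000-page pool, the gate at zero and all names are assumptions invented for the illustration and say nothing about how Google actually works.

```python
import random

def rank_no_ties(values):
    """1-based ranks for values with no ties (smallest value gets rank 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman_no_ties(xs, ys):
    """Classic shortcut formula, valid only when neither list contains ties."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_no_ties(xs), rank_no_ties(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

rng = random.Random(42)  # fixed seed so the sketch is repeatable
pages = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5000)]  # (relevance, quality)

# Toy two-stage ranking: pages passing the relevance gate come first,
# then pages are sorted by quality within each group.
ordered = sorted(pages, key=lambda p: (p[0] > 0, p[1]), reverse=True)

def relevance_vs_rank(results):
    relevance = [p[0] for p in results]
    goodness = [-pos for pos in range(1, len(results) + 1)]  # higher = better position
    return spearman_no_ties(relevance, goodness)

rho_all = relevance_vs_rank(ordered)           # every page in the pool
rho_top100 = relevance_vs_rank(ordered[:100])  # only what a top-100 study would see
```

In this toy model relevance is strongly correlated with position across the whole pool, yet close to uncorrelated within the top 100, because every page that far up has already passed the gate. That is the pattern the theory predicts, which is why a test on deeper result sets would be informative.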

If you can manipulate it, Google probably aren’t using it

Arising from these results and the sustained trend of Google moving away from easily manipulated factors I have come to a pretty common sense and simple conclusion: that if a factor is directly and easily manipulated by a webmaster, the weight assigned to that factor is likely to be relatively low.

This would therefore presumably apply to Schema.org, an initiative I believe to be of high importance to search engine optimization. This rule of thumb may thus be proven incorrect or vindicated in the future; I guess we will just have to wait and see.

Trust the numbers, not the opinions

While I am sure that this post will be fairly controversial as it runs against many long-held beliefs within the industry, the findings have been backed up by the similar results seen in the SEOMoz study and by Google’s continued stressing of the importance of focussing on quality not arbitrary on page factors.

Thus we must learn to trust the numbers. While of course numbers aren’t perfect and I have outlined potential flaws, at least they aren’t open to manipulation by human emotions based on untested and unvindicated opinions that have outlasted their sell-by-date.

These results in particular show the need for emotionless analysis of SEO theories and the value, or lack thereof, of opinions and observations not backed by scientifically significant levels of data.

Was Matt Right?

How many times has Matt Cutts said not to focus on individual on page factors but to focus on creating a page of value for the user?

Many of us, myself included, were fooled into believing in the unquestionable value of these factors and saw this as a classic attempt to shape our ideas of “what SEO is”, with Cutts trying to stop people from manipulating weaknesses in the Google algorithm.

As it turns out he was probably speaking the gospel that I personally and many others needed, yet ignored.

The Self-Fulfilling Cycle of Industry Drivel

One clear observation arising from the above results is the need to question commonly-held beliefs not just within SEO but everywhere.

The problem with SEO is that search engines are always changing; theories established 3 years ago may be outdated and need continued testing and re-assessment.

But what we have seen within some areas of the industry is a belief becoming so ingrained in SEO’s psyche that it is near taboo to question it. It is indoctrinated into every individual entering the industry.

As a result, with no particular scientific proof, authors, including myself, wrote about the importance of such factors and about strategies to maximise gain from the manipulation of these factors.

This only further fuels the fire and increases the apparent importance of these arbitrary data points.

Essentially it’s a house of cards, and hopefully this data and further tests will prove to people that these factors just aren’t that important and, as a result, the house will tumble, although I suspect the beliefs of an industry are harder to change than this.

While I don’t rule out the viability of on page SEO, I believe that Google use much higher level factors than we within the industry have thought about or used. I am aware of enhanced relevancy algorithms such as TF*IDF and think these efforts at better understanding Google are admirable.
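For readers unfamiliar with the idea, TF*IDF (term frequency times inverse document frequency) weights a term by how often it appears in a page and discounts it by how many pages in the collection contain it. The Python sketch below shows one common textbook variant on three made-up page texts. It is only an illustration of the general technique, with invented names and data, and makes no claim about what any search engine actually computes.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Made-up page texts for illustration only.
docs = [
    "cheap flights to dublin",
    "cheap hotels in dublin",
    "dublin weather today",
]
w = tf_idf(docs)
```

With this variant a term that appears in every document, such as "dublin" here, gets a weight of zero, while a term unique to one page is weighted most heavily. This is a long way from simply counting keyword occurrences in a tag.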

On the whole I am fairly pessimistic about the value of on page SEO under current theories and strategies. I would suggest SEOs put no more than 5% of their time on the subject, much of which should be focussed on creating a structure that germinates well optimized content (similar to many of the WordPress features).

I would recommend focussing marketing budgets on creating unique, high quality content optimized for the user, and focussing the majority of the SEO effort on building a beautiful, fast website architecture, link building, social media exposure and other well correlated factors.

25 thoughts on “On Page Factors Correlation Data”

  1. joshua

    White hat SEOs are afraid to do anything else other than onpage SEO. Therefore, they have to perpetuate these onpage SEO myths, or they are out of a job.

    1. Marcos Lujan

      The top #100 results? Really? I’d love to see those keywords too. If you set up the apparatus wrong you won’t be able to draw a sound conclusion from your results.

      Keyword in title tag a tiny negative correlation? This is where experience trumps theory. The Title tag is one of the biggest factors. If you’ve ever changed a Title tag and seen rankings tank quickly you’ll know what I’m saying is right. If all you’re getting is -0.01 for Keyword in Title you’ve definitely set something up wrong.

      You might want to ask people who’ve changed CMS’s or Themes what happens when title tags go missing or become duplicate…

      1. Mark Collier

        Hi Marcos

        The apparatus has definitely been set up right, the method is described more in-depth in http://www.theopenalgorithm.com/about/ and http://www.theopenalgorithm.com/the-project/the-science-of-correlation-studies/

        While I appreciate that it may seem baffling that some factors are correlated differently to what you would expect, I myself have questioned and re-checked a number of results that from experience and what you hear in the media seem wrong, but they aren’t.

        There are a couple of possible reasons the results may seem wrong:

        1) Correlation isn’t causation and doesn’t necessarily prove anything, although it is a good guide.

        2) A tested factor may be present or not present in 99% + of tested pages, if positive or negative results are recorded for all pages then finding a significant correlation isn’t possible (although I’m not aware this has happened for any of the above tested factors [see the spreadsheet in the post]).

        3) Your original idea of the importance of the factor is incorrect/weighted incorrectly.

        While I admit title tags are touted as important the above data doesn’t back that up, but if other scientific tests prove this data wrong then that’s fair enough.

        1. Marcos Lujan

          If you could send me a small but significant subset of the keyword queries used I would appreciate it. It also would help if in this study you added a factor that correlates well such as PR to the results as a test control.

          1. Mark Collier

            Hi Marcos

            I will be publishing my code and probably the full dataset once I have posted the rest of the results from the study.

            I have already published some well correlated factors, see: www.theopenalgorithm.com/correlation-data/domain-name-seo/ and I’ll be publishing several even higher correlated factors in the coming weeks.

  2. David Sewell

    You are only comparing 2 variables at a time using Spearman’s correlation coefficient (onpage element vs rank). These onpage factors work in combination, not isolation.
    To quote you: “In fact the two factors with the highest sentiment in a recent Moz survey were; having the keyword present in the title tag and having it at the start of the title. ”
    The key to the sentiment in that statement is the word AND.
    Your test here has not explored the combination of (keywords in title) AND (proximity to start of title) vs rank.
    To do that, you’ll need linear regression or multiple correlation tests.
    Then I think you’ll see stronger correlations with rank and multiple variables…

    1. Mark Collier

      Hi David

      I totally agree and plan in the second iteration of the study to explore the combinations between factors. While I won’t pre-empt the data, I’m not sure there will be any change in the level of correlation for many factors.

    2. Odds87

      Really can’t stress David’s point enough. It’s called an algorithm for a reason: it’s the sum of many, many different variables. And how these variables interact and modify each other is something we’ll likely never know.

  3. rjonesx

    Part of the issue is that they are non-differentiating factors because “everyone is doing them”. If nearly all 10 have the title tag optimized, it will be hard to find a correlation. 

    1. Mark Collier

      It does appear that way to us in the SEO industry but if you stand back and look at the data, many webmasters aren’t optimizing these things. A great way to test this for yourself would be to take a random 20 keywords and look at the top 20 results, are they all doing SEO in these areas?

    2. Chris McGiffen

      Totally agree and have long thought that there must be another way to do these experiments, possibly comparing two groups – along the lines of comparing what isn’t ranking against what is. Still to work out the practicalities of it though.

      Another thought I had whilst reading this is that, since individual factors appear to have such little weighting, are there any particular combinations that result in stronger correlations?

  4. Mark Asciak

    I’d have to agree with what you’re saying. I’ve worked on sites where I have been very thorough with all the ‘standard’ on page elements; on other sites I’ve done the very basics so they are there for Google to see, and I’m seeing less of an advantage in wasting hours on the on page stuff. Either way the sites have ranked fine once some quality link building has been rolled out.

    I still think sticking the keywords in the page title and a heading along with the content is important so Google knows what the page is about, but anything more is a waste and as we all know there are now bigger dangers in over-optimization.

    1. Mark Collier

      Hey Mark

      That sounds like a pretty smart strategy. I agree that doing some simple on page SEO is probably required to let Google know what the page is about, but the majority of that should be taken care of by virtue of the fact that you’re writing the article about that keyword.

      Great to hear the data being backed-up in the real world!

  5. Anonymous

    Coming from an agency background I’ve interviewed many candidates for a variety of SEO positions and I would say the quality has got worse over the years not better. Finding an SEO who has a firm grasp of even the basics is difficult sometimes. I think this is due to the massive amount of misinformation and the self proclaimed ‘experts’. My motto has always been I’m never 100% sure it’s true unless I or some one I know has tested it.

    It’s not helped by this ‘big concept’ rubbish you get at conferences. Untested ideas and strategies by speakers who wish they were speaking at a TED talk. They need to be talking about how to identify, test and report. It’s these simple basics that should be the staple of the SEO’s tool belt that seem to have gone out the window. I think people assume that it’s all been figured out and we can just follow the herd and be fine.

    I commend this work as it’s actually trying to give the industry a much needed kick up the backside.

  6. SEOetc.

    I really liked the study (as a process) but may I say something that might sound arrogant. There are many pages on your site (actually the whole lot) I really find of great value. Why on earth had I never come across your site before?

    Seriously, and I might sound foolish, but I’ve been in the industry for 8 years and this is the first time I came across it.

    I find it very difficult for Google to “capture” the essence of value, whereas for relevance it’s pretty much spot on (not in some countries though. I moved to Turkey from the UK and the search eco-system here is VEEEEERY different). Anyway….

    Back to the quality issue. I feel like I’m trapped in a search bubble (the filter bubble is a great site by the way…) where quality is buried under relevance. But relevance nowadays is monopolised by “the big boys”, which have become so corporate that it feels like the TV back in the 80s.

    Maybe I’m getting away from the subject here but I’m genuinely annoyed by the inconsistency of the search engines. No, I don’t want a “cookbook” to do my job. I just want to be rewarded (my clients to rank) for doing the right thing.

    I have some hopes for Schema.org (I’ve been working on full implementations for 2 very large e-commerce sites) as long as they don’t abuse it. So far only the Reviews vocabulary is “paying back”… Let’s see…

    1. Mark Collier

      Hey man, thanks for commenting, glad you found the site helpful, it’s pretty new and I’m afraid I don’t blog very often (hoping to change that now, although I’ll never be a daily blogger).

      I’d love to hear more about how Schema.org works for you, it’s something of real interest to me but I’m not sure to what level the search engines are going to use it.

  7. BrianW

    I’m very hesitant about these results — if indeed almost every on-page factor has almost no correlation to ranking ability, then it would be just as easy to rank an unrelated page with no keywords in the title as a well-optimized page. We see literally every day that this is not true.

    Perhaps it would be useful to couple some of the more surprising correlations (like keywords in title don’t help rank at all) with some simple and easy to conduct testing.

    1. Mark Collier

      That’s true, of course there may be more factors that I haven’t tested. Plus there’s other relevancy factors such as anchor text, the content of social media shares, etc.

      In addition there is the possibility that Google narrows down the field by using on-page factors to determine relevance and then uses other factors to determine quality.

      There are many other possibilities and I’m in no doubt they use other on page factors, but they don’t seem to be the ones talked about in the SEO industry and thus the logical conclusion must be that our strategies based on these non-existent factors don’t work.

  8. Owen

    Hi Mark,
    I’ve been doing some thinking around these results and was wondering if you have taken into account a few factors and might consider some ideas:

    - The type of page which ranks well
    For instance, the homepage of a site will have the most ranking potential. The home page will most likely not carry certain elements such as an author tag. A blog post, which would have less ranking potential, would. Therefore, are you seeing pages with author tags typically ranking lower than those without?

    Some kind of analysis of URL depth may provide some additional insight here, with factors compared at different depths.

    - Factors in Combination
    Already mentioned so I won’t rehash it. Only would say that something that considers off page factors in combination with on page factors would be essential, rather than just on page in combination

    - Testing and Refining

    With a big enough sample size could you not include some seed pages which you control? You can weight these according to your findings. Now adjust the on page elements on your seed pages and work out the impact within your model. Now compare this to the actual impact in the SERPs. Repeating this process will allow you to refine your model.

    All that’s left to say is excellent work and thank you for taking on such a mammoth task. I have a feeling you may get some flak from some, but that’s only because you’re giving them a kick up the back side :)

    1. Mark Collier

      Hey Owen

      Great comment, here are my thoughts:

      Type of page – many of the URL depth and similar factors you suggest, I have tested and will publish the results for in the coming weeks. Others, such as identifying the type of page, are certainly excellent ideas, although building reliable artificial intelligence algorithms to accomplish this may be tricky; there certainly are existing APIs that could provide this data if needed. I may try testing some of these in the second iteration of the study.

      Factors in combination – totally agree, as I have stated below, I plan to test many combined factors in the next study. It was important for me to test individual factors first as I will have somewhat of an idea on the likely weighting of individual factors in a combination.

      Testing and refining – I have similar albeit somewhat larger ideas in regard to testing for causation. It’s something I have thought about and I would want a scientifically significant number of websites/web pages and significant infrastructure before embarking on these kinds of tests. I have a general idea for a similar test and certainly plan to implement it in the future.

      Great to see we agree on many aspects of the project, definitely the best analysis of any of the results I have published.

      Thanks

      Mark

  9. Winooski

    Mark, thanks for the heavy lifting, insightful commentary, and, above all, steadfast promotion of scientific investigation and methodology. In the spirit of scientific corroboration, it would be great to see others replicating your findings using the same data (or, even better, with similarly-collected datasets of their own), but this is a tremendous start.

    Oh, and…Happy Birthday!

  10. Andy Langton

    I’m all in favour of data driven and scientific approaches to SEO, but I can’t help but feel you don’t really understand the data you have – and thus your bold conclusions are questionable.

    By the time you’ve grabbed your data, Google has already performed a great deal of evaluation. Indeed, you’re skipping the entirety of the indexing stage. And at the point of results retrieval, Google has already selected the top results (probably 1000) to re-rank, so at best (and assuming you’re scanning the entire 1000 results) the correlation tells us about improving ranking at that stage only. If you aren’t scanning the top 1000, then your correlation results are fatally skewed towards top performers, and will thus reflect differentiating factors amongst a relatively similar group. Which of course, is where most SEOs know off-page factors are key.

    To use a very simplified example. Let’s say you correlate rankings with URLs that don’t have any readable text at all on them. I would expect you’ll find a very weak correlation, since Google dropped all those URLs long before they touched your dataset. But it would be unwise to conclude that including text is not a ranking factor and has no impact on rankings.

    At best, you might conclude that getting basics right is an easy ‘checkbox’ that almost all the top 1000 sites complete successfully. But that ignores the poor levels of optimisation on a typical website (one of those that didn’t make the cut, or your dataset).

    And even beyond that, this type of correlation analysis assumes a very basic keyword-matching approach to textual evaluation, which no SEO with decent on-page skills would be using. Think co-occurrence, not occurrences.

    Frankly, all the SEOs in the room will understand why you want to shout loudly about your findings, and use words like “drivel” to describe others – it’s not good science that gets press, right? But without qualifying your conclusions, it’s not good science either.

  11. Ewan Kennedy

    Hi Mark,

    Congratulations on how far you’ve got with the project so far. Testing data is always interesting and you never know when you start what you might unearth. I look forward to following your progress.

    In reply to your statement about on page factors not counting for much, I don’t think I’ve ever changed a title tag and not found an improvement in rankings a few days later. I’ve tested title tags ad nauseam and found that a well crafted title tag can not only cause (and I mean ‘cause’ with a >95% confidence level based on thousands of changes and watching the effect i.e. the ‘before’ and the ‘after’) improvements in existing rankings but also create new rankings.

    That said, it’s always good to stimulate vigorous debate and challenge dogma.

    I agree with your statement that on page elements that are easily manipulable are likely to have a lower level of importance. However, I would also argue that, although elements such as the title tag and H1 are clearly easily editable, they are not that manipulable because of their visibility. In other words, they are so visible that it places a limit on how far they can be stretched without destroying the page’s credibility and that acts as a self-balancing mechanism which in turn allows a high weighting to be placed on them, unlike for example certain invisible meta data.

    Good luck with the project. I’ll hook up on social media.