
The Science of Correlation Studies

The initial part of my project to bring more science to SEO is based on doing a correlation analysis on individual potential factors within the Google algorithm.

An incredibly important part of the project is making an impact on how SEOs and webmasters do business, and core to taking action based on correlation studies is understanding what a correlation study is, what its weaknesses are and what it’s good for.

Don’t worry if you don’t know anything about statistics, maths, science, SEO or programming; none of that is required to understand what a correlation is. Heck, I didn’t know what it was until a couple of months ago.

I’ll try to keep the jargon to a minimum and hopefully you will be an expert on Spearman (whoops, that’s jargon) by the end of the post.

A Relationship

A correlation is just a fancy name for a relationship. Basically what I am trying to prove is:

  • Is there a relationship between a factor and ranking well in Google?
  • If so, how big is the relationship?
  • And is it a positive relationship (helps rankings) or a negative one (lowers rankings)?

To be technical a correlation is a relationship between two variables.

In our case the two variables will always be the ranking of a page in Google and whatever factor we’re testing for.

Throughout this article I’ll use PageRank as our example factor.

So if I was trying to prove a relationship between PageRank and ranking in Google, I would take the top 10, 20, 50, 100 (whatever number) of results in Google and find the PageRank score for each of those results.

I would then calculate if there was a relationship, the size of the correlation and the type (positive or negative).

But how do you calculate that?

Spearman’s Rank Correlation Coefficient

Spearman’s Correlation Coefficient is one of many maths formulas used to calculate correlation.

For maths junkies, the reason I’m not using the others is that Spearman’s doesn’t assume a linear relationship between the variables. That just means that the type of data I’ll be testing may not be as well suited to the other formulas as it is to Spearman’s.

Here’s what the formula looks like:

ρ = 1 − (6 × Σd²) / (n(n² − 1))

where d is the difference between the two ranks for each result and n is the number of results.

I won’t go into what each part means, but if you’re interested here’s a great video that explains it for stats newbies.

The basic idea is that you feed your two variables into this formula and it gives you back a number between -1 and 1.

What the number symbolises is the actual relationship/correlation:

  • If the sign is negative, i.e. between 0 and -1, then the relationship is negative (hurts your rankings). For example, page loading time might show a negative correlation, because as loading time increases, ranking should decrease. If the sign is positive, i.e. between 0 and +1, then the relationship is positive, e.g. PageRank should have a positive correlation.
  • Think of it this way: if one increases and the other increases, the correlation should be positive, and if one increases while the other decreases, the correlation should be negative.
  • The closer you are to either positive or negative one, the stronger the relationship, i.e. a correlation of 0.4 is stronger than 0.25. The same goes for negative numbers: -0.5 means a negative correlation that would be worse for your site’s ranking than a -0.15 correlation.
  • If you get a correlation at or close to 0, it means there is little or no correlation.

To recap: the correlation will be between -1 and 1. A negative number means a negative relationship and vice versa. The closer the number is to either -1 or 1, the stronger the relationship. A correlation at or close to 0 means there is no relationship, or only a very weak one.

Example Spearman Calculation

In this (totally fictitious) example we are looking at the top 10 results in a Google search, and we have gone and found the PageRank for each of these results. Below is a table representing the two variables.

Ranking    PageRank

#1         6
#2         7
#3         4
#4         3
#5         5
#6         4
#7         3
#8         2
#9         1
#10        1

When you feed this information into the formula you get a correlation of 0.894 (remember, this is only an example).

This correlation would mean there is a very strong positive relationship between PageRank and ranking well in Google.
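If you want to check that 0.894 figure yourself, here’s a minimal Python sketch of the calculation above. It assumes tied PageRank values get the average of the rank positions they span, which is the usual convention and the one that reproduces 0.894:

```python
# A back-of-the-envelope check of the example above (plain Python, no libraries).
google_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # position in the search results
pagerank    = [6, 7, 4, 3, 5, 4, 3, 2, 1, 1]    # PageRank of each result

def average_ranks(values):
    """Rank values from highest to lowest (1 = highest), giving tied
    values the average of the positions they occupy."""
    order = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = order.index(v) + 1           # first (1-based) position of v
        count = order.count(v)
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

pr_ranks = average_ranks(pagerank)   # [2, 1, 4.5, 6.5, 3, 4.5, 6.5, 8, 9.5, 9.5]
n = len(google_rank)
d_squared = sum((g - p) ** 2 for g, p in zip(google_rank, pr_ranks))
rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))
print(round(rho, 3))                 # 0.894
```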

What I do is calculate the Spearman correlation across the top 100 results for each of 12,000+ searches, for whatever factor I am testing. I then get the mean (average) correlation, and that’s the number we’re interested in.

Gathering the Data

So we’re going to go a bit off track here and explain how I gather the data to be analysed.

The first thing I did was get myself a list of over 12,000 keywords: 800 from each of the 22 categories in the Google AdWords Keyword Tool, with the duplicates removed.

Then I wrote a program to fetch the top 100 results for each keyword, so approx. 1.2 million URLs. Then I removed Google News, Images, Videos, Maps and Products results, because they are generated by a slightly different algorithm and we can’t control them anyway.

Then I chose a factor to test, in this case PageRank, and figured out a way to get the desired data for it. For PageRank, I would write a program that requests a special URL for the PageRank of each page and stores the result in a database.

Then I take the PageRank ranking and the search engine ranking and calculate the Spearman correlation for each keyword. I store the correlation for each keyword, bring it into an Excel sheet and get the average.
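For the programmers, here’s a rough sketch of that last step in Python. The `results` dict is a hypothetical stand-in for the database described above, and I’m using `scipy.stats.spearmanr`, which handles tied ranks:

```python
# A sketch of the per-keyword correlation step. `results` is a hypothetical
# stand-in for the database: keyword -> list of (google_position, pagerank).
from statistics import mean
from scipy.stats import spearmanr

results = {
    "example keyword": [(1, 6), (2, 7), (3, 4), (4, 3), (5, 5),
                        (6, 4), (7, 3), (8, 2), (9, 1), (10, 1)],
    # ...one entry per keyword; roughly 12,500 in the real dataset
}

correlations = []
for keyword, rows in results.items():
    positions = [pos for pos, _ in rows]
    pageranks = [pr for _, pr in rows]
    # Negate positions so that ranking well (position 1) counts as "high";
    # a helpful factor then comes out as a positive correlation.
    rho, _pvalue = spearmanr([-p for p in positions], pageranks)
    correlations.append(rho)

print(mean(correlations))   # the mean correlation, the number we care about
```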

Obviously each factor requires a lot of thought, some programming and a ton of testing but that’s the basic process.

Weakness – Correlation is not Causation

Having a relationship doesn’t mean a factor is in the Google algorithm, nor does it mean that the factor will even indirectly increase rankings.

For example, there might be a positive relationship between Facebook likes of a page and ranking well. But it is possible (if not likely) that Google isn’t counting this factor in the algorithm, and that the positive correlation is there because pages that satisfy other factors (e.g. links, on-page factors, etc.) tend to get more Facebook likes.

There is also the possibility that the factor has a knock-on effect. For example: the more Facebook likes a page gets, the more people see it, the more normal links it gets, the more PageRank, the higher the ranking.

So it is possible to get a correlation without the factor being in the algorithm.

Another, possibly more relatable, example is that of a music artist. There would probably be a strong correlation between being a professional music artist and going to lots of high-profile parties. That doesn’t mean that if you start going to these parties you will suddenly have an amazing voice. This is a classic case of correlation, not causation. Going to the parties doesn’t cause you to be a successful singer; it’s just a by-product.

This is extremely important: correlation is only a guideline. A correlation needs to be viewed critically and tested by other means before we can say more definitively whether it is a factor in the great Google algorithm or not.

Counter Argument – SEO’s not About Causation

From a scientific point of view this is a flaw with correlation studies. They still hold serious scientific significance, but if your goal is to prove whether something is indeed a factor in the Google algorithm, then correlation studies are only a very good guideline.

But if your goal is to increase the ranking of a site or find out what it takes to rank higher, which is what SEOs are interested in, then correlation studies have greater value.

Take the Facebook likes example, where the increase in likes didn’t directly cause an increase in ranking but affected other factors which did. Does this mean that as an SEO you should ignore Facebook likes?

No, in this case it means that you should try to increase your Facebook likes to cause that knock-on effect.

So really it depends what your goal is.

If you are trying to figure out what it takes to rank high in Google, then they are very good (but not perfect). If your goal is to prove whether something is a factor in the Google algorithm, they’re fairly good but of less value.

 

I hope I did a good job of explaining correlation and that you understand it better now. The main takeaways: correlation is code for a relationship between two things, e.g. ranking and a factor such as PageRank; the correlation is between -1 and 1, with negative numbers meaning a negative correlation and vice versa; and the closer you get to either -1 or 1, the stronger the relationship.

And correlation results should be looked at as a great guide for how to do SEO and what’s important to focus on, but not as the holy grail; they should be tested further.

I believe this type of science and hard-hitting statistics in SEO will vastly improve the industry by providing credible data and information that’s far more reliable than guesses or observations of one or two websites. I’m looking forward to sharing my findings with you and seeing what interesting tests arise from the correlations.

What I Learnt About 342,740 Domains

I recently did a parse of 12,573 keywords, extracting the top 100 results per keyword on Google. And after cleaning up the data I was left with over 1.2 million web pages and 342,740 unique domains.

For the last week or so I have been looking for interesting information within this mountain of data.

I published data on top domains, sites, Google’s use of Images, Products and News results and some strange URLs I noticed.

This data is part of my project to bring more science to SEO, initially by doing a correlation study into Google’s algorithm.

Domain Data

I was looking into domain-related data and I spotted some interesting patterns, nothing groundbreaking, just some stuff you might find cool.

I should have some domain-related correlation data out later this week, so this is an insight into the domain dataset.

The domain name and the ending of that domain is a really important choice for a new webmaster. When you make that choice you’re choosing your brand for life, plus it’s a super important decision from an SEO point of view.

There are a couple of considerations to take into account: the user and the search engine. The correlation data I’ll show you later this week should take care of the SEO point of view.

But from a user’s point of view you obviously want a domain that’s memorable, easy to type and easy to link to. And that in and of itself is a big factor in SEO: if people are linking to the wrong domain you’re losing out on valuable link juice, if people can’t type it you’re losing out on type-in traffic, and if users can’t share it then say goodbye to some social media clicks.

Domain TLDs

The domain ending or TLD (Top Level Domain) is probably the most important part of a website’s address.

There is a technical difference between a TLD and a domain ending. For example .uk is a TLD but .co.uk is not, it’s technically a subdomain of the TLD. But webmasters and users don’t care about technical definitions, so I’m going to treat them as the same for the purposes of this article.

If you have a really catchy, social, SEO-perfect site name, it’s useless unless you have the right ending to go with that great name.

Turns out not that many people are typing www.greatname.washingtondc.museum.

I extracted the domain endings for all the sites in my data and collected a neat list of all the domain endings I could identify in these 1.2 million URLs.

I used a list of all known TLDs from the Mozilla crew (I think?), but I can’t find the link so if you know the list I’m talking about please post the link in the comments section.

Luckily I downloaded and cleaned up the list into a nicely formatted text file, so you can iterate through it and check for matches if you’re running tests yourself.

Update: Thanks to Kris Turkaly who left a link to the list in the comments: http://publicsuffix.org/list/
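If you’d rather not parse the list yourself, there are libraries built on it. Here’s a rough sketch using tldextract, a third-party Python library based on the public suffix list (the choice of library is mine, not necessarily what this project used):

```python
# A sketch of suffix extraction with tldextract, a third-party Python
# library built on the public suffix list linked above.
import tldextract

for url in ["http://www.example.com/", "http://blog.example.co.uk/post"]:
    parts = tldextract.extract(url)
    # parts.suffix is the "domain ending" in this post's sense, so co.uk
    # comes back whole rather than as a subdomain of uk.
    print(parts.domain, "-", parts.suffix)

# example - com
# example - co.uk
```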

After running my scripts and programs through the data it turns out there were 437 different domain endings in the dataset.

Thinking about it, that’s a pretty small number of TLDs for 1.2 million URLs, but as you will see there is huge dominance by just 3 domain endings.

I ranked them in order of the number of sites out of the 342,740 with that TLD. Here’s a handy Excel list of all 437 TLDs in that descending order.

And a nice graph of the top 5 domain TLDs:


It was hard to see some of the smaller TLDs, so here are the next 45, blown up and zoomed in (with .us repeated), i.e. the 5th-50th most popular extensions. Even within this subset there’s a really huge drop-off from the top of the list. Combine that with the top 5 and you see a gigantic dominance by the top 2 or 3 domain endings.

Of course this dataset isn’t designed to find the most popular TLDs, but it’s probably a pretty good indication of what users are used to.

Realistically .com and .org are the only global domain extensions you should be going for from a user’s point of view. And even if you own a .org you should be on the lookout for the .com variation.

Domain Length

I thought it would be interesting to see the distribution of domain lengths, so I counted the length in characters of all 342,740 sites without the domain ending.

Out of interest, the average domain name length was 14.75 characters, but the most common length was 8 characters, with around 1 in 12 sites being 8 characters long.


Here’s an Excel file based on the above graph with the domain lengths and number of domains with the corresponding length.
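The tally itself is simple once you have the names. Here’s a minimal sketch, with a stand-in list in place of the real 342,740 domains:

```python
# A minimal sketch of the length tally. `domains` is stand-in data for the
# real list of 342,740 domain names (endings already stripped).
from collections import Counter
from statistics import mean

domains = ["example", "theopenalgorithm", "google"]

lengths = [len(name) for name in domains]
print(mean(lengths))                     # average name length
print(Counter(lengths).most_common(1))   # the modal length and its count
```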

 

This post is pretty good at showing what users are used to and are likely to accept. Of course common sense is still required. For example, if you’re in Ireland, Irish users are very much at home with .ie domain names, so maybe grabbing that and the .com domain name might be a good idea.

Again, if you find a really great, catchy name that’s longer than normal, go for it, and if you find a short one that ticks all the boxes, go for it too.

Hope you enjoyed this post and stay tuned for some correlation data later this week.

Interesting Google Parse Data

I recently completed an automated search of Google for 12,573 keywords, extracting the top 100 results for each keyword: around 1.2 million web pages and 342,740 unique domains.

This is the dataset I am working off of for the initial round of The Open Algorithm Project.

Over the last two days I’ve been poring over and cleaning up the dataset, and I have seen some really interesting stuff. Things you don’t notice when you do individual searches, even if you dig into the HTML of a search result page.

It’s an eclectic mix of things that I and my programs spotted and that you might find interesting too.

Most Popular Domains

There were in total 342,740 unique domains. I’ve put together an Excel document with all the domains and the number of times each one showed up as a search result.

You can check if your site is in my dataset and contributing to the project.

Here’s my list of the 50 most popular sites in Google by the number of times they showed up in the search results. That’s the number of search results they accounted for, not the number of keywords they showed up for, i.e. a site could show up multiple times for one keyword.

Interestingly, the top 50 sites from the parse (which used US-based IP addresses) closely match the top 50 sites in the US as judged by Alexa.

Note: The counting of Google.com includes universal search, which accounted for 7,542 results.

View it full size here.

Most Popular Sites

I wasn’t surprised that Blogspot, Tumblr and WordPress did well in the top domains data. But I figured that a lot of the search results they were ranking for were on subdomains that weren’t really under their control. So I went back to the database and this time extracted the results down to the subdomain or domain they were hosted on, i.e. the site, not the domain.

Google are sort of half and half on how they treat subdomains. I suspect they have some algorithmic feature to distinguish between blogspot.com or wordpress.com style subdomains and blog.yoursite.com. They could be using more complicated algorithms, but I would imagine they are certainly counting the number of subdomains, and the diversity of content topics across those subdomains, as a general guide to what’s associated with a root domain and what’s not.

You can download the full list of 383,638 sites with the number of search result appearances from this Excel file. As you can see, compared to the top 50 domains, WordPress, Blogspot, Tumblr, etc. have taken a hit.

View it full size here.
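For anyone repeating this at home, here’s roughly what the site vs. domain split looks like in code, again using tldextract as a stand-in for my own scripts:

```python
# A sketch of the site-vs-domain split described above.
from urllib.parse import urlparse
import tldextract

url = "http://blog.example.com/2012/02/some-post.html"

site = urlparse(url).netloc                          # blog.example.com (the site)
domain = tldextract.extract(url).registered_domain   # example.com (the domain)
print(site, "is hosted on", domain)
```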

Universal Search

I thought it might be interesting to extract when, where and how often Google shows universal search results from Images, News and Products, which of course are all Google entities.

Image Search:

Number of times image results showed up (out of 12,573 keywords): 2,327

Modal/most common position: 1

Mean/average position: 8.01

Google News:

Number of times news results showed up (out of 12,573 keywords): 2,195

Modal/most common position: 11 (remember I extracted the top 100 results)

Mean/average position: 9.86

Products Search:

Number of times products search results showed up (out of 12,573 keywords): 3,020

Modal/most common position: 2

Mean/average position: 8.8
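For the curious, here’s the shape of the calculation behind those numbers, with stand-in position lists rather than the real extraction:

```python
# A sketch of the universal search tally. The position lists are stand-in
# data; the real ones come from the 12,573-keyword parse.
from statistics import mean, mode

universal_positions = {
    "images":   [1, 1, 8, 14],
    "news":     [11, 11, 9],
    "products": [2, 2, 10],
}

for result_type, positions in universal_positions.items():
    print(result_type,
          "| appearances:", len(positions),
          "| mode:", mode(positions),
          "| mean:", round(mean(positions), 2))
```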

Strange URLs

Update: It appears I was totally wrong on this one. While these strange URLs were indeed strange and did relate to a new Google feature, it wasn’t quite the feature I had thought. Seer Interactive reported a change to the way Google formats its search results that now uses these strange URLs.

Out of the approximately 1.2 million URLs I extracted, I got back 185 really weird ones.

You can view the 185 of them here: http://www.theopenalgorithm.com/media/2012/02/Weird-Google-Search-Result-URLs.xlsx

All 185 URLs redirected to what would be a normal search result, and only a handful of websites were redirected to from the 185:

http://www.ncbi.nlm.nih.gov

http://www.google.com/publicdata

http://www.google.com/finance

http://www.cricbuzz.com

http://www.carlingcup.premiumtv.co.uk

http://www.ncaa.com

http://en.uefa.com

This is pure speculation, but the lengthy parameters (examples in the Excel file) would suggest that Google is tracking the clicks on these results in some special way, potentially to test the reaction to a new algorithmic or other feature being trialled.

Interestingly, the CricBuzz URL redirects to www.cricbuzz.com/gl (gl presumably meaning Google), but not all clicks from Google to the CricBuzz site redirect to /gl. It seems as though CricBuzz are tracking clicks from specific types of Google search results (namely, live cricket scores); for what reason, other than internal tracking, is again only speculation.

All of the results with these strange URLs are in some way unique. Take the Google Public Data result; there’s a nice embedded graph:

Google Public Data search result

Or any of the NCBI results:

Google's treatment of health related search results

I’ve highlighted 3 unique features of this result.

#1 The snippet is pulled cleverly by Google directly from the page.

#2 The special health identifier image to the left of the result.

#3 The links to sections within the article.

Of the 185 strange URLs pulled from the 1.2 million, 170+ were from the NCBI site.

It would seem this site is being treated differently by Google. Maybe they recognise it as the expert in its field and are trialling some sort of improved indexing of expert health-related sites?

A number that sprang to mind was 1%: Steven Levy suggested in his book about Google, In The Plex, that about 1% of searches have some sort of trial or test running outside the regular algorithm. 185 out of 12,000+ searches isn’t far off 1%… food for thought.

Of course it could be something totally different, maybe a Googler or SEO can shed some light on the intriguing results.

I would be interested to hear whether other people doing Google search result parsing over a long period of time have seen these types of results, and whether they are in some way connected to a future feature launch within the search engine.

FTP and IP address

I wrote my programs to handle http:// and https:// web pages only. I had never noticed Google return any other type of page and assumed that’s all that would come back from my parse.

Turns out Google are happy to crawl and index more than just that. I’ve seen a couple of ftp:// pages returned. For example, search for “garmin” on Google.com, click through to around page 8-11 and you will see:

ftp://ftp-fc.sc.egov.usda.gov/WI/gistoolkit/garmin.pdf.

In fact out of 1.2 million URLs Google returned 22 ftp:// results, all but one of them were PDFs.

Here’s a list of the 22.
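Spotting these was as simple as bucketing the parse by URL scheme. A quick sketch, with two placeholder URLs standing in for the full list:

```python
# A quick sketch of the scheme check that surfaced the ftp:// results.
# `urls` stands in for the full list of ~1.2 million parsed result URLs.
from collections import Counter
from urllib.parse import urlparse

urls = [
    "http://example.com/page",
    "ftp://ftp-fc.sc.egov.usda.gov/WI/gistoolkit/garmin.pdf",
]

scheme_counts = Counter(urlparse(u).scheme for u in urls)
print(scheme_counts)   # in the real parse, ftp accounted for 22 results
```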

Notably, a number of these 22 documents appear to be private or intended only for a company intranet, so it’s impressive, if not worrying, that Google has found them.

Another strange one was a site returned only as an IP address with no domain name. Check it out: http://218.210.127.131/. The site’s domain name is realtek.com, but for some reason Google saw fit to return the IP instead of the domain name. Maybe because the site doesn’t work with www. in its domain name?

 

Hope you found the post interesting. If you spot anything fun or unusual in the data files I’ve linked to in this post, let me know in the comments section.