Category Archives: The Project

The Future of TheOpenAlgorithm

After an incredible few days at SearchLove I realised I kept getting the same question from SEOs who have been following TOA, “are you still working on that correlation thing?”

Shit, YES, I did take a break to work on some other stuff but I’ve been working my balls off learning SQL, taking courses on writing better code, writing scripts, reading research papers, designing experiments, plus I’ve been working on other cool projects.

Whoops, my bad. I guess I broke the cardinal rule of blogging, relationships and just about everything else: “keep your followers in the loop.”

I’ve known where TOA is going next for a few months now, but over the last couple of weeks I’ve ironed out the details and am ready to start gathering a massive dataset for the next iteration of the project.

In this post I’m hoping to answer that question. Essentially, I’m going to out myself to the industry, so if I don’t reach the targets set below you guys can hold me accountable.

What’s next

As everyone knows the ultimate goal for this project is to research more causal relationships between factors and ranking in Google and to create a model search engine algorithm.

I’ve done a lot of reading and emailing in the last couple of weeks and it seems like weighted regression using the pointwise learning to rank approach is my best shot at creating a successful model.

I’m sure this all sounds like gobbledygook to most people reading this post, because in truth some of it is still gobbledygook to me.

I’ve never taken a stats course, I’ve only run a couple of basic multiple linear regressions, I don’t really know anything about machine learning, and I’ve only just finished differential calculus in my last year of high school. But when you apply yourself and read these things a couple of times over, it starts to sink in.

Right now I know enough to understand what data needs to be gathered and how it needs to be analysed.

But once I have the dataset I’m going to try and find someone much better at this kind of analysis than me and run the regression together.

Because running a regression is computationally cheap (the expensive part is gathering the data), we can try several methods other than weighted regression that might work.
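To give a feel for what the pointwise approach means in practice, here’s a toy sketch (my own illustration, not the actual methodology): each search result becomes one training row, its factor values are used to predict its rank position, and rows can be weighted, e.g. so the top positions count more. All numbers and the single “pagerank” feature are made up.

```python
# Toy pointwise learning-to-rank via weighted linear regression.
# Everything here is illustrative, not the project's real setup.

def weighted_linear_fit(x, y, w):
    """Closed-form weighted least squares for y ~ a + b*x."""
    total = sum(w)
    xb = sum(wi * xi for wi, xi in zip(w, x)) / total
    yb = sum(wi * yi for wi, yi in zip(w, y)) / total
    num = sum(wi * (xi - xb) * (yi - yb) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xb) ** 2 for wi, xi in zip(w, x))
    slope = num / den
    intercept = yb - slope * xb
    return intercept, slope

# One query's top 10: a feature value per result, target = rank position (1 = best).
pagerank = [7, 6, 5, 5, 4, 3, 3, 2, 1, 1]
position = list(range(1, 11))
weights = [1 / p for p in position]  # emphasise getting the top spots right

a, b = weighted_linear_fit(pagerank, position, weights)
print(a, b)  # slope b should be negative: higher PageRank, better (lower) position
```

The real model would of course use many factors at once, but the weighting idea, caring more about errors at the top of the results, carries over directly.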

At this stage of the process I’ll become less like Steven Levitt and more like Stephen Dubner. If you don’t get this, you haven’t read the Freakonomics books (why not? they’re great).

P.S. if anyone knows Levitt feel free to drop him an email and let him know I’ve got a big dataset for him :)

There’s a 97% chance it’s going to fail

Ok, I just made that number up, but the chances of me getting a model that correlates above .65 (that’s my goal) with the Google algorithm are extremely low.

First, I can’t test things like user engagement, CTR, etc.

Second, their super-smart PhD-holding engineers have definitely come up with more advanced topic models than LDA.

Third, a couple of much smarter people in the SEO industry have tried similar methods before and gotten models not worth publishing.

Fourth, I’ve talked to some really, really smart guys in the last couple of days, and while all of them were supportive, none of them actually thought I would pull it off.

Fifth, weighted regression and pretty much any other type of regression is going to have its pitfalls.

Why bother

With all the odds stacked against me you are definitely wondering why I would bother spending my free time for the next few months running a project likely to fail.

  • Being the nerd that I am, I actually enjoy this stuff.
  • I set myself the goal and promised the SEO industry I would come up with a model search engine algorithm, so that’s what I’m going to do.
  • If the model fails, it still succeeds in that we can pretty much say that SEOs know almost nothing about the Google algorithm, so they should just be doing some RCS. Plus even if the model fails to reach its correlation target, I should still be able to answer some key questions, like how much social shares actually matter.
  • I will still be running correlations, which after Penguin and Panda might prove quite interesting, plus I’ll have the correlation data on the new factors I’m going to test (social!!).
  • That 3% chance of success (or whatever the number actually is) is like gold dust; if the model is successful there are unlimited ridiculously cool things to do with it.

What’s going to happen

I’m going to try not to get too technical or bogged down in the details here:

  1. I’m going to finish rewriting my code from the correlation study (when I first wrote the code I had only been programming for 3 months, so you can imagine how cringe-worthy it is when I look back at it).
  2. I’m (with your help) going to figure out what new factors I should test in this iteration that I didn’t test with the correlations, think social, more advanced topic modelling, anchor text, etc.
  3. I’m going to write and test the code to gather the data for these factors.
  4. I’m going to figure out what keywords to gather data for.
  5. I’m going to split these keywords up by industry and by likely user intent (navigational, informational, commercial and transactional), unfortunately I will have to classify intent by hand (that’ll be a fun weekend).
  6. I’m going to go back to our incredible, amazing data providers and ask for their support for the project one last time.
  7. I’ll run the scripts and gather the data. Not sure whether I’m going to include Bing here, I’d be happy to do it, if SEOs would find it useful (comment below) and I can get the data required.
  8. I’ll run some fun tests and publish the results. I think it would be interesting to know in what industries universal search is most prevalent, which industries use social the most, how well does the Bing algorithm correlate to Google’s, what domains show up most in search results, what individual URLs show up most in the results, etc.
  9. I’ll run and publish the correlations in the same way I did last time.
  10. I’ll create and publish some useful algorithms that might come in handy for SEOs or for future research e.g. can I create a model that accurately identifies query intent using the data at my disposal and my own evaluations of this intent.
  11. I’m going to find some much smarter people than myself to help me create the model.
  12. I’ll publish the model and the normalised coefficients (which will be most useful in determining the importance of each factor).

So that’s it really, I will do my best to blog about any major steps forward in the project, and I’ll definitely be Tweeting more often (probably the best place to follow exactly where I am with the project).

The Science of Correlation Studies

The initial part of my project to bring more science to SEO is based on doing a correlation analysis on individual potential factors within the Google algorithm.

An incredibly important part of the project is making an impact on how SEOs and webmasters do business. Core to taking action based on correlation studies is understanding what a correlation study is, what its weaknesses are and what it’s good for.

Don’t worry if you don’t know anything about statistics, maths, science, SEO or programming; it’s not required to understand what a correlation is. Heck, I didn’t know what one was until a couple of months ago.

I’ll try to keep the jargon to a minimum and hopefully you will be an expert on Spearman (woops that’s jargon) by the end of the post.

A Relationship

A correlation is just a fancy name for a relationship. Basically what I am trying to prove is:

  • Is there a relationship between a factor and ranking well in Google?
  • If so how big is the relationship?
  • And is it a positive relationship (helps rankings) or is it negative (lowers rankings)?

To be technical a correlation is a relationship between two variables.

In our case the two variables will always be the ranking of a page in Google and whatever factor we’re testing for.

Throughout this article I’ll use PageRank as our example factor.

So if I was trying to prove a relationship between PageRank and ranking in Google, I would go to Google, take the top 10, 20, 50 or 100 results (whatever number), and find the PageRank score for each of those results.

I would then calculate if there was a relationship, the size of the correlation and the type (positive or negative).

But how do you calculate that?

Spearman’s Rank Correlation Coefficient

Spearman’s Rank Correlation Coefficient is one of many maths formulas used to calculate correlation.

For maths junkies the reason I’m not using the others is because Spearman’s doesn’t assume a linear relationship between the variables. That just means that the type of data I’ll be testing may not be as well suited to the other formulas as Spearman’s.

Here’s what the formula looks like (for data without tied ranks):

ρ = 1 − (6 Σdᵢ²) / (n(n² − 1))

where dᵢ is the difference between the two ranks for each observation and n is the number of observations.

I won’t go into what each part means, but if you’re interested, here’s a great video that explains it for stats newbies.

The basic idea is that you feed your two variables into this formula and it gives you back a number between -1 and 1.

What the number symbolises is the actual relationship/correlation:

  • If the sign is negative, i.e. between 0 and -1, then the relationship is negative (hurts your rankings), e.g. page load time should show a negative correlation, because as load time increases, ranking should decrease. If it is positive, i.e. between 0 and +1, then the relationship is positive, e.g. PageRank should have a positive correlation.
  • Think of it this way: if one increases and the other increases, the correlation should be positive; if one increases while the other decreases, the correlation should be negative.
  • The closer you are to 1, either positive or negative, the stronger the relationship. I.e. a correlation of 0.4 is stronger than 0.25. The same goes for negative numbers: -0.5 means a negative correlation that would be worse for your site’s ranking than a -0.15 correlation.
  • If you get a correlation at or close to 0, it means there is little or no correlation.

To recap the correlation will be between -1 and 1. Negative number means negative relationship and vice versa. The closer the number is to either -1 or 1 the stronger the relationship. A correlation at or close to 0 means there is no or a very weak relationship.

Example Spearman Calculation

In this example (totally fictitious) we are looking at the top 10 results in a Google search. And we have gone and found the PageRank for each of these results. Below is the table representing the two variables.

Ranking   PageRank
#1        6
#2        7
#3        4
#4        3
#5        5
#6        4
#7        3
#8        2
#9        1
#10       1

When you feed this information into the formula you get a correlation of 0.894 (remember, this is only an example).

This correlation would mean there is a very strong positive relationship between PageRank and ranking well in Google.

What I do is calculate the Spearman correlation between ranking and whatever factor I am testing, over the top 100 results, for each of 12,000+ searches. I then get the mean (average) correlation, and that’s the one we’re interested in.
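As a concrete sketch of the example calculation above, here’s a small Python version. It uses the simplified Spearman formula, 1 − 6Σd²/(n(n² − 1)), which is only approximate when ties are present, and it ranks PageRank in descending order so that a high PageRank lines up with a good (low-numbered) ranking position:

```python
# Spearman correlation for the fictitious PageRank example above,
# using the simplified formula (approximate when ties are present).

def average_ranks(values, descending=False):
    """Rank values 1..n, giving tied values the average of their ranks."""
    idx = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(idx):
        j = i
        while j + 1 < len(idx) and values[idx[j + 1]] == values[idx[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based position of the tied block
        for k in range(i, j + 1):
            ranks[idx[k]] = avg
        i = j + 1
    return ranks

def spearman(rank_x, rank_y):
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))

positions = list(range(1, 11))             # Google positions #1-#10
pagerank = [6, 7, 4, 3, 5, 4, 3, 2, 1, 1]  # the table above
pr_ranks = average_ranks(pagerank, descending=True)
rho = spearman(positions, pr_ranks)
print(round(rho, 3))  # 0.894
```

With full tie correction (computing Pearson on the ranks, as stats packages do) the value comes out slightly lower, around 0.893; the simplified formula is what gives the 0.894 in the example.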

Gathering the Data

So we’re going to go a bit off track here and explain how I gather the data to be analysed.

The first thing I did was get myself a list of over 12,000 keywords, 800 from each of the 22 categories in the Google AdWords Keyword Tool, and remove the duplicates.

Then I wrote a program to go and get the top 100 results for each keyword so approx. 1.2 million URLs. Then I removed Google News, Images, Videos, Maps and Products results because they are generated by a slightly different algorithm and we can’t control them anyway.

Then I choose a factor to test, in this case PageRank, and figure out a way to get the desired data for it. For PageRank I would write a program that goes to the web, asks a special URL for the PageRank of each URL, and stores it in a database.

Then I take the PageRank ranking and the search engine ranking and calculate the Spearman for each keyword. I store the correlation for each keyword, bring it into an Excel sheet and get the average.

Obviously each factor requires a lot of thought, some programming and a ton of testing but that’s the basic process.
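The loop itself can be sketched roughly like this. Everything here is illustrative: fetch_pagerank is a stub standing in for the real lookup, the keywords and URLs are made up, and the table schema is my own invention.

```python
import sqlite3

def fetch_pagerank(url):
    """Stub standing in for the real PageRank lookup; returns a dummy 0-10 score."""
    return hash(url) % 11

conn = sqlite3.connect(":memory:")  # the real run would use a file on disk
conn.execute(
    "CREATE TABLE factor_data (keyword TEXT, rank INTEGER, url TEXT, pagerank INTEGER)"
)

serps = {  # keyword -> ranked list of result URLs (tiny made-up sample)
    "running shoes": ["http://example.com/shoes", "http://example.org/run"],
    "seo tools": ["http://example.net/tools", "http://example.com/seo"],
}
for keyword, urls in serps.items():
    for rank, url in enumerate(urls, start=1):
        conn.execute(
            "INSERT INTO factor_data VALUES (?, ?, ?, ?)",
            (keyword, rank, url, fetch_pagerank(url)),
        )
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM factor_data").fetchone()[0])  # 4
```

From a table like this you can pull the (rank, factor) pairs per keyword and run the Spearman calculation on each.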

Weakness – Correlation is not Causation

Having a relationship doesn’t mean a factor is in the Google algorithm nor does it mean that there will be some indirect increase in ranking.

For example there might be a positive relationship between Facebook likes of a page and ranking well. But it is possible (if not likely) that Google isn’t counting this factor in the algorithm and the positive correlation is there because pages that satisfy other factors e.g. links, on page factors, etc. may get more Facebook likes.

There is also a possibility that the factor has a knock on effect. For example the more Facebook likes a page gets, the more people see it, the more normal links it gets, the more PageRank, the higher the ranking.

So it is possible to get a correlation without the factor being in the algorithm.

Another, possibly more relatable, example is that of a professional music artist. There would probably be a strong correlation between being a professional music artist and going to lots of high-profile parties. That doesn’t mean that if you start going to these parties you will suddenly have an amazing voice. This is a classic case of correlation, not causation: going to the parties doesn’t cause you to be a successful singer, it’s just a by-product.

This is extremely important: correlation is only a guideline. A correlation needs to be viewed critically and tested by other means before we can say more definitively whether something is a factor in the great Google algorithm or not.

Counter Argument – SEO’s not About Causation

From a scientific point of view this is a flaw with the correlation study. It still holds serious scientific significance but if your goal is to prove whether a factor is indeed a factor in the Google algorithm then correlation studies are only a very good guideline.

But if your goal is to increase the ranking of a site or find out what it takes to rank higher, which is what SEOs are interested in, then correlation studies have greater value.

In the Facebook likes example, the increase in likes didn’t directly cause an increase in ranking, but it affected other factors which did. Does this mean that as an SEO you should ignore Facebook likes?

No, in this case it means that you should try and increase your Facebook likes to cause that knock on effect.

So really it depends what your goal is.

If you are trying to figure out what it takes to rank high in Google, then they are very good (but not perfect). If your goal is to prove whether something is a factor in the Google algorithm, they’re fairly good but of less value.


I hope I did a good job at explaining correlation and that you understand it better now. The main takeaways: correlation is code for a relationship between two things, e.g. ranking and a factor such as PageRank; the correlation is between -1 and 1, with negative numbers meaning a negative relationship and vice versa; and the closer you get to either -1 or 1, the stronger the relationship.

And then that correlation results should be looked at as a great guide for how to do SEO and what’s important to focus on but not the holy grail and should be tested further.

I believe this type of science and hard hitting statistics in SEO will vastly improve the industry by providing credible data and information that’s far more reliable than guesses or observations over one or two websites. I’m looking forward to sharing with you my findings and seeing what interesting tests arise from the correlations.

What I Learnt About 342,740 Domains

I recently did a parse of 12,573 keywords, extracting the top 100 results per keyword on Google. And after cleaning up the data I was left with over 1.2 million web pages and 342,740 unique domains.

For the last week or so I have been looking for interesting information within this mountain of data.

I published data on top domains, sites, Google’s use of Images, Products and News results and some strange URLs I noticed.

This data is part of my project to bring more science to SEO, initially by doing a correlation study into Google’s algorithm.

Domain Data

I was looking into domain related data and I spotted some interesting patterns, nothing ground breaking but just some stuff you might find cool.

I should have some domain related correlation data out later this week, so this is an insight into the domain dataset.

The domain name and the ending for that domain are really important choices for a new webmaster. When you make that choice you’re choosing your brand for life, plus it’s a super important decision from an SEO point of view.

There are a couple of considerations to take into account. The user and the search engine. The correlation data I’ll show you later this week should take care of the SEO point of view.

But from a user’s point of view you obviously want a domain that’s memorable, easy to type and easy to link to. And that in and of itself is a big factor in SEO. If people are linking to the wrong domain you’re losing out on valuable link juice. If people can’t type it then you’re losing out on type-in traffic, and if users can’t share it then say goodbye to some social media clicks.

Domain TLDs

The domain ending or TLD (Top Level Domain) is probably the most important part of a website’s address.

There is a technical difference between a TLD and a domain ending. For example, .uk is a TLD but co.uk is not; it’s technically a subdomain of the .uk TLD. But webmasters and users don’t care about technical definitions, so I’m going to treat them as the same for the purposes of this article.

If you have a really catchy, social, SEO perfect site name it’s useless unless you have the right ending to that great name.

Turns out not that many people are typing

I extracted the domain endings for all the sites in my data and collected a neat list of all the domain endings I could identify in these 1.2 million URLs.

I used a list of all known TLDs from the Mozilla crew (I think?), but I can’t find the link so if you know the list I’m talking about please post the link in the comments section.

Luckily I downloaded and cleaned up the list into a nicely formatted text file, so you can iterate through it and check for matches if you’re running tests yourself.

Update: Thanks to Kris Turkaly, who left a link to the list in the comments.

After running my scripts and programs through the data it turns out there were 437 different domain endings in the dataset.

Thinking about it, that’s a pretty small number of TLDs for 1.2 million URLs, but as you will see there is huge dominance by just 3 domain endings.

I ranked them in order of the number of sites out of the 342,740 that had that TLD. Here’s a handy Excel list of all 437 TLDs in that descending order.
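The extraction can be sketched like this: split each hostname on dots and match the longest suffix against a set of known endings (the set below is a tiny stand-in for the full list, and the URLs are made up):

```python
from collections import Counter
from urllib.parse import urlparse

# Tiny stand-in for the full Mozilla-style list of known domain endings.
KNOWN_ENDINGS = {"com", "org", "net", "edu", "gov", "info", "us", "ie", "uk", "co.uk"}

def domain_ending(host):
    """Return the longest known ending matching the hostname, or None."""
    parts = host.lower().split(".")
    for i in range(1, len(parts)):  # i=1 is the longest suffix, so "co.uk" beats "uk"
        candidate = ".".join(parts[i:])
        if candidate in KNOWN_ENDINGS:
            return candidate
    return None

urls = ["http://example.com/page", "http://example.co.uk/", "http://sub.example.org/x"]
counts = Counter(domain_ending(urlparse(u).hostname) for u in urls)
print(counts.most_common())
```

Run over all 1.2 million URLs, a Counter like this is what produces the ranked list of endings.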

And a nice graph of the top 5 domain TLDs:

(You can hover over each bar with your mouse and you’ll get the exact numbers)

It was hard to see some of the smaller TLDs, so here’s the next 45 (blown up and zoomed in) with .us repeated again, i.e. the 5th-50th most popular extensions. Even within this subset there’s a really huge drop-off from the top of the list. Combine that with the top 5 and you see a gigantic dominance by the top 2 or 3 domain endings.

Of course this dataset isn’t designed to find the most popular TLDs, but it’s probably a pretty good indication of what users are used to.

Realistically .com and .org are the only global domain extensions you should be going for from a user’s point of view. And even if you own a .org you should be on the lookout for the .com variation.

Domain Length

I thought it would be interesting to see the distribution of domain lengths, so I counted the length in characters of all 342,740 sites without the domain ending.

Out of interest, the average domain name length was 14.75 characters, but the most common length was 8 characters, with around 1 in 12 sites being exactly 8 characters long.

Again you can hover over each circle for exact stats.

Here’s an Excel file based on the above graph with the domain lengths and number of domains with the corresponding length.
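The counting behind those numbers is straightforward; here’s a sketch with a made-up handful of names (the real run used all 342,740 domain names with the ending stripped):

```python
from collections import Counter
from statistics import mean

# Made-up sample; the real dataset had 342,740 names.
names = ["openalgo", "example", "searchdata", "seo", "mysitehub", "blogspot"]
lengths = [len(n) for n in names]

avg = mean(lengths)
most_common_len, count = Counter(lengths).most_common(1)[0]
print(avg, most_common_len)
```

The same Counter, dumped to a file, is essentially the length-distribution table behind the graph.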


This post is pretty good at showing what users are used to and are likely to accept. Of course common sense is still required. For example, if you’re in Ireland, Irish users are very much at home with .ie domain names, so grabbing that and the .com domain name might be a good idea.

Again, if you find a really great, catchy name that’s longer than normal then go for it and if you find a short one that ticks all the boxes go for it too.

Hope you enjoyed this post and stay tuned for some correlation data later this week.

Interesting Google Parse Data

I recently completed an automatic searching of Google of 12,573 keywords. I extracted the top 100 results for those keywords, so around 1.2 million web pages and 342,740 unique domains.

This is the dataset I am working off of for the initial round of The Open Algorithm Project.

Over the last two days I’ve been poring over and cleaning up the dataset and I have seen some really interesting stuff. Things you don’t notice when you do individual searches, even if you dig into the HTML of a search result page.

It’s sort of an eclectic mix of interesting things that I and my programs spotted and thought you might find interesting too.

Most Popular Domains

There were in total 342,740 unique domains. I’ve put together an Excel document with all the domains and the number of times each one showed up as a search result.

You can check if your site is in my dataset and contributing to the project.

Here’s my list of the 50 most popular sites in Google by the number of times they showed up in the search results. That’s the number of search results they accounted for, not the number of keywords they showed up for i.e. a site could show up multiple times for one keyword.

Interestingly the top 50 sites from the parse (which used US based IP addresses) is very similar to the top 50 sites in the US as judged by Alexa.

Note: the counts include universal search results, which accounted for 7,542 results.

View it full size here.

Most Popular Sites

I wasn’t surprised by the fact that Blogspot, Tumblr and WordPress did well in the top domains data. But I figured that a lot of the search results they were ranking for were on subdomains that weren’t really under their control. So I went back to the database and this time extracted the results down to the subdomain or domain they were hosted on, i.e. the site, not the domain.

Google are sort of half and half on how they treat subdomains. I suspect they have some algorithmic feature to distinguish between Blogspot-style subdomains and ordinary ones. They could be using more complicated algorithms, but I would imagine they are at least counting the number of subdomains and the diversity of content topics across those subdomains as a general guide to what’s associated with a root domain and what’s not.

You can download the full list of 383,638 sites with the number of search result appearances from this Excel file. As you can see, compared to the top 50 domains, WordPress, Blogspot, Tumblr, etc. have taken a hit.

View it full size here.

Universal Search

I thought it might be interesting to extract when, where and how much Google shows Universal search results from images, news and products which of course are all Google entities.

Image Search:

Number of times image results showed up (out of 12,573 keywords): 2,327

Modal/most common position: 1

Mean/average position: 8.01

Google News:

Number of times news results showed up (out of 12,573 keywords): 2,195

Modal/most common position: 11 (remember I extracted the top 100 results)

Mean/average position: 9.86

Products Search:

Number of times products search results showed up (out of 12,573 keywords): 3,020

Modal/most common position: 2

Mean/average position: 8.8
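For each universal search type, those summary figures are just the mode and the mean of the positions at which the blocks appeared. A sketch with invented positions:

```python
from statistics import mean, mode

# Invented positions where (say) image blocks appeared across a few keywords.
image_positions = [1, 1, 4, 9, 1, 13, 27, 1, 8]

print(mode(image_positions), round(mean(image_positions), 2))  # 1 7.22
```

The real numbers above come from the same two calls run over every appearance in the 12,573-keyword parse.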

Strange URLs

Update: It appears I was totally wrong on this one. While these strange URLs were indeed strange and were related to a new Google feature, it wasn’t quite the feature I had thought. Seer Interactive reported a change to the way Google formats its search results that now uses these strange URLs.

Out of the approximately 1.2 million URLs I extracted, I got back 185 really weird ones.

You can view the 185 of them here:

All 185 URLs redirected to what would be a normal search result. There were only a handful of websites that were redirected to from the 185:

Pure speculation, but the lengthy parameters (examples in the Excel file) would suggest that Google is for some reason tracking the clicks on these results in some special way, potentially to track and test the reaction to a new algorithmic or other type of feature being trialled.

Interestingly, the CricBuzz URL redirects to a /gl URL (gl presumably meaning Google), but not all clicks from Google to the CricBuzz site redirect to /gl. It seems as though CricBuzz are tracking clicks from specific types of Google search results (namely live cricket scores); for what reason, other than internal analytics, is again only speculation.

All of the results with these strange URLs are in some way unique. Take the Google public data result, there’s a nice embedded graph:

Google Public Data search result

Or any of NCBI results:

Google's treatment of health related search results

I’ve highlighted 3 unique features of this result.

#1 The snippet is pulled cleverly by Google directly from the page.

#2 The special health identifier image to the left of the result.

#3 The links to sections within the article.

Of the 185 strange URLs pulled from the 1.2 million, 170+ were from the NCBI site.

It would seem as though this site is being treated differently by Google. Maybe they recognise it as the expert in its field and are trialling some sort of improved indexing of expert health related sites?

A number that sprang to mind was 1%. As suggested by Steven Levy in his book about Google, In The Plex, about 1% of searches have some sort of trial or feature being tested outside the regular algorithm. 185 out of 12,573 searches isn’t far off 1%… food for thought.

Of course it could be something totally different, maybe a Googler or SEO can shed some light on the intriguing results.

I would be interested to hear whether other people doing Google search result parsing over a long period of time have seen these types of results, and whether they are in some way connected to a future feature launch within the search engine.

FTP and IP address

I wrote my programs to handle http:// and https:// web pages only. I had never noticed Google return any other type of page and assumed that’s all that would come back from my parse.

Turns out Google are happy to crawl and index more than just that. I’ve seen a couple of ftp:// pages returned, e.g. search for “garmin”, click through to around page 8-11 and you will see:

In fact out of 1.2 million URLs Google returned 22 ftp:// results, all but one of them were PDFs.

Here’s a list of the 22.

Notably, a number of these seem to be meant to be private or only for a company intranet, so it’s impressive, if not worrying, that Google has found these documents.

Another strange one was a site returned only as an IP address with no domain name. For some reason Google saw fit to return the IP instead of the domain name, maybe because the site doesn’t work with www. in its domain name?


Hope you found the post interesting, if you spot anything fun or unusual in the data files I have linked to in this post let me know in the comments section.

Young Scientist 2011


For those of you who don’t know already, the idea for this blog is a result of the BT Young Scientist and Technology Exhibition 2011 (or BTYSTE for short).

I did a project for the prestigious science fair, the largest of its kind, which has been running for 47 years in Dublin, Ireland.

BT Young Scientist and Technology Exhibition Logo

The project entitled “Investigating the factors of a search engine algorithm” was relatively successful and proved an excellent starting ground for the now larger scale project.

And as you might have guessed, I tried to find as many as I could of the 200 main factors the search engines use to determine where any site ranks when you search for something on Google, Yahoo, Bing, Ask, etc.

I used a testing method that has since been called “reverse engineering the Google algorithm”, despite the fact that I didn’t know what reverse engineering was before coming up with it. To me it just seemed like the logical way to try to prove certain factors and come up with new ones.

I came up with a list of 157 factors that definitely are, probably are or might be factors, and I tried to test whether they were or weren’t factors and, in the process, find their weighting in the algorithms too.

The way I went about that was to take an individual factor, let’s say Page Speed (how quickly a page loads), and compare the load speed of the top 5 results of a Google search to that of the 25th-30th results, over 30 searches per factor.

The results I got back would be an average of how quickly a page loads for high ranking sites (the ones in the top 5) and an average Page Speed for the sites that didn’t rank so well, the low ranking sites in the 25-30 band.

I would compare these averages. If there was a difference between high ranking and low ranking sites, then that factor was impacting how pages ranked, and it could be confirmed with relatively high certainty as a factor. By comparing the size of that difference to the differences for other factors, you could also estimate the factor’s weighting.

The greater the difference the more impact it was having and therefore the greater the weighting.

Follow all that?

If you did, that’s great; if not, you’re probably in the majority, so I’ll summarize it. I created a testing method that could be used to confirm a search engine algorithmic factor, test a new one, return a result as to whether it was a factor, and then tell you how important that factor is in the algorithm.
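The core of that comparison can be sketched in a few lines (all numbers invented; “Page Speed” here is load time in seconds, so lower is better):

```python
from statistics import mean

# Invented load times (seconds) for one factor, one search.
top_results = [1.2, 0.9, 1.5, 1.1, 1.0]        # positions 1-5
low_results = [2.8, 3.1, 2.2, 2.9, 3.4, 2.6]   # positions 25-30

high_avg = mean(top_results)
low_avg = mean(low_results)
difference = low_avg - high_avg  # bigger gap = more apparent impact
print(round(high_avg, 2), round(low_avg, 2), round(difference, 2))
```

Repeated over 30 searches per factor, the size of that gap, relative to the gaps for other factors, was my rough proxy for a factor’s weighting.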

Of course there were a number of problems with this testing method.

  • The project was done over a space of 2 months with time also spent on preparing presentations, report books and a project diary as well as preparing to spend a week in a hall with 500 other projects presenting my idea to the general public. That meant I didn’t have as much time to focus on the testing part as I would have liked. As a result I only tested 20 factors using this method and came up with a list of 157 factors based on intuition and other lists available.
  • I didn’t know a programming language capable of doing the testing automatically, so I did the testing manually. I am currently learning Python, which will allow me to do this in the future. As a result I only tested these 20 factors over 300 web pages, which simply isn’t a large enough sample size.
  • The test was simply not scientific enough, I used no proven formula and took only a small sample size.


Despite all these problems, the week at BTYSTE was interesting and inspiring, with people even paying for me to email them the list of factors. Essentially, people found my half-hearted efforts at a project interesting, so I figured I would develop the project further.

In the 3 days I spent presenting the project to the public I talked to a number of SEOs and webmasters, and I forged relationships with a lot of important people who will be able to help me as I continue to improve the project.

Following the interest at BTYSTE I have been offered a number of jobs, and a Google employee and a high-level programmer have given me advice on how to improve the project.

As a result I have come up with a fully developed system for as accurately as possible testing search engine algorithmic factors.

All in all the BTYSTE was a great week, I enjoyed talking face to face with fellow SEOs and gained a lot of contacts and experience. It also has driven me on to continue with the project, bettering it and maybe even entering next year.