The Future of TheOpenAlgorithm

After an incredible few days at SearchLove I realised I kept getting the same question from SEOs who have been following TOA: “Are you still working on that correlation thing?”

Shit, YES. I did take a break to work on some other stuff, but I’ve been working my balls off learning SQL, taking courses on writing better code, writing scripts, reading research papers and designing experiments, plus I’ve been working on other cool projects.

Whoops, my bad. I guess I broke the cardinal rule of blogging, relationships and just about everything else: “keep your followers in the loop.”

I’ve known where TOA is going next for a few months now, but over the last couple of weeks I’ve ironed out the details and am ready to start gathering a massive dataset for the next iteration of the project.

In this post I’m hoping to answer that question, and essentially I’m going to out myself to the industry so that if I don’t reach the targets set below, you guys can hold me accountable.

What’s next

As everyone knows, the ultimate goal for this project is to research more causal relationships between factors and ranking in Google and to create a model search engine algorithm.

I’ve done a lot of reading and emailing in the last couple of weeks and it seems like weighted regression using the pointwise learning to rank approach is my best shot at creating a successful model.

I’m sure this all sounds like gobbledygook to most people reading this post, because in truth some of this is still gobbledygook to me.

I’ve never taken a stats course, I’ve only run a couple of basic multiple linear regressions, I don’t really know anything about machine learning and I’ve only just finished differential calculus in my last year of high school. But when you apply yourself and read these things a couple of times over, it starts to sink in.

Right now I know enough to understand what data needs to be gathered and how it needs to be analysed.

But once I have the dataset I’m going to try and find someone much better at this kind of analysis than me and run the regression together.

Because regressions are cheap to compute, we can also try several methods other than weighted regression and see which works best.
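For anyone wondering what “weighted regression using the pointwise approach” might look like in practice, here is a rough sketch in Python. It is only an illustration under stated assumptions, not the actual TOA code: in the pointwise approach every (keyword, URL) pair becomes one training row, the target is a score derived from the ranking position, and the weights make errors near the top of the results count for more. The factor names, the `11 - position` target, the `1 / position` weights and the toy numbers are all invented for the example.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix; some factors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_weighted_regression(
    rows: Sequence[Sequence[float]],
    targets: Sequence[float],
    weights: Sequence[float],
) -> List[float]:
    """Weighted least squares: minimise sum of w * (y - prediction)^2.

    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    design = [[1.0] + list(r) for r in rows]  # prepend intercept column
    k = len(design[0])
    # Normal equations: (X^T W X) beta = X^T W y
    xtwx = [
        [sum(w * x[i] * x[j] for x, w in zip(design, weights)) for j in range(k)]
        for i in range(k)
    ]
    xtwy = [
        sum(w * x[i] * y for x, y, w in zip(design, targets, weights))
        for i in range(k)
    ]
    return solve_linear_system(xtwx, xtwy)


def predict(beta: Sequence[float], row: Sequence[float]) -> float:
    """Score one URL; a higher score means the model expects a better rank."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


# Toy data, invented for illustration: [linking_domains, social_shares]
rows = [[120, 300], [80, 150], [95, 40], [30, 60], [10, 5]]
positions = [1, 2, 3, 4, 5]
targets = [11 - p for p in positions]   # assumed target: higher is better
weights = [1 / p for p in positions]    # assumed weights: top results matter more

beta = fit_weighted_regression(rows, targets, weights)
scores = [predict(beta, r) for r in rows]
```

Sorting the URLs for a keyword by `scores` gives the model's predicted ranking. A real run would have many more rows than factors and would probably lean on a stats package rather than hand-rolled linear algebra.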

At this stage of the process I’ll become less like Steven Levitt and more like Stephen Dubner. If you don’t get this, you haven’t read the Freakonomics books (why? They’re great).

P.S. if anyone knows Levitt feel free to drop him an email and let him know I’ve got a big dataset for him :)

There’s a 97% chance it’s going to fail

Ok, I just made that number up, but the chances of me getting a model that correlates above .65 (that’s my goal) with the Google algorithm are extremely low.

First, I can’t test things like user engagement, CTR, etc.

Second, Google’s super smart PhD-holding engineers have definitely come up with more advanced topic models than LDA.

Third, a couple of much smarter people in the SEO industry have tried similar methods before and gotten models not worth publishing.

Fourth, I’ve talked to some really, really smart guys in the last couple of days, and while all of them were supportive, none of them actually thought I would pull it off.

Fifth, weighted regression and pretty much any other type of regression is going to have its pitfalls.
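To make the .65 target concrete, here is a minimal sketch of how a model's output could be compared against Google's actual ordering. The post doesn't say which correlation measure the target refers to; this example assumes Spearman's rank correlation, and the positions and scores in it are made up.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) upwards; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up example: Google's positions for one keyword and a model's scores.
google_positions = [1, 2, 3, 4, 5]
model_scores = [9.1, 9.4, 6.0, 6.5, 2.2]  # higher score = better predicted rank

# Negate the scores so that "best" is the smallest value on both sides.
rho = spearman(google_positions, [-s for s in model_scores])
meets_target = rho > 0.65
```

In a real evaluation the correlation would be computed per keyword on data the model was not fitted on, then averaged across keywords.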

Why bother

With all the odds stacked against me, you are definitely wondering why I would bother spending my free time for the next few months running a project that is likely to fail.

  • Being the nerd that I am, I actually enjoy this stuff.


  • I set myself the goal and promised the SEO industry I would come up with a model search engine algorithm, so that’s what I’m going to do.


  • If the model fails, it still succeeds in that we can pretty much say that SEOs know almost nothing about the Google algorithm, so they should just be doing some RCS. Plus even if the model fails in reaching its correlation target I should still be able to answer some key questions, like how much social shares actually matter, etc.


  • I will still be running correlations, which after Penguin and Panda might prove quite interesting, plus I’ll have the correlation data on the new factors I’m going to test (social!!).


  • That 3% chance of success (or whatever the number actually is) is like gold dust: if the model is successful, there are unlimited ridiculously cool things to do with it.

What’s going to happen

I’m going to try not to get too technical or bogged down in the details here:

  1. I’m going to finish rewriting my code from the correlation study (when I first wrote the code I had only been programming for 3 months, so you can imagine how cringe-worthy it is when I look back at it).
  2. I’m (with your help) going to figure out what new factors I should test in this iteration that I didn’t test with the correlations. Think social, more advanced topic modelling, anchor text, etc.
  3. I’m going to write and test the code to gather the data for these factors.
  4. I’m going to figure out what keywords to gather data for.
  5. I’m going to split these keywords up by industry and by likely user intent (navigational, informational, commercial and transactional). Unfortunately I will have to classify intent by hand (that’ll be a fun weekend).
  6. I’m going to go back to our incredibly amazing data providers and ask for their support for the project one last time.
  7. I’ll run the scripts and gather the data. Not sure whether I’m going to include Bing here, I’d be happy to do it, if SEOs would find it useful (comment below) and I can get the data required.
  8. I’ll run some fun tests and publish the results. I think it would be interesting to know in which industries universal search is most prevalent, which industries use social the most, how well the Bing algorithm correlates with Google’s, which domains show up most in search results, which individual URLs show up most in the results, etc.
  9. I’ll run and publish the correlations in the same way I did last time.
  10. I’ll create and publish some useful algorithms that might come in handy for SEOs or for future research, e.g. can I create a model that accurately identifies query intent using the data at my disposal and my own evaluations of this intent?
  11. I’m going to find some much smarter people than myself to help me create the model.
  12. I’ll publish the model and the normalised coefficients (which will be most useful in determining the importance of each factor).
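A quick note on the “normalised coefficients” in step 12. The post doesn't say how they will be normalised; one common convention (assumed here) is to multiply each raw regression coefficient by the standard deviation of its factor and divide by the standard deviation of the target. That puts every factor on the same scale, so the sizes of the coefficients can be compared directly. The factor names and numbers below are hypothetical.

```python
from statistics import stdev
from typing import Dict, Sequence


def standardised_coefficients(
    coefficients: Dict[str, float],
    feature_columns: Dict[str, Sequence[float]],
    targets: Sequence[float],
) -> Dict[str, float]:
    """Rescale raw coefficients to 'standard deviations of target per
    standard deviation of factor', making them comparable across factors."""
    sd_target = stdev(targets)
    if sd_target == 0:
        raise ValueError("target has no variance; nothing to standardise against")
    return {
        name: coefficients[name] * stdev(column) / sd_target
        for name, column in feature_columns.items()
    }


# Hypothetical factors and raw coefficients, invented for illustration.
feature_columns = {
    "linking_domains": [120, 80, 95, 30, 10],
    "social_shares": [300, 150, 40, 60, 5],
}
raw_coefficients = {"linking_domains": 0.03, "social_shares": 0.004}
targets = [10, 9, 8, 7, 6]  # e.g. 11 - position, as an assumed target

importance = standardised_coefficients(raw_coefficients, feature_columns, targets)
# Sort by absolute size to list the factors from most to least influential.
ranked_factors = sorted(importance, key=lambda name: abs(importance[name]), reverse=True)
```

Standardised coefficients only describe the fitted model. If two factors are strongly correlated with each other, the split of importance between them can be unstable.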

So that’s it really. I will do my best to blog about any major steps forward in the project, and I’ll definitely be Tweeting more often (probably the best place to follow exactly where I am with the project).

5 thoughts on “The Future of TheOpenAlgorithm”

  1. axi

    3% with +60% correlation? Keep dreaming man!!! You don’t even know what you are talking about… You are nearer to 0.00000000000000001% of success than 3%. You know that Google inserts noise and uses other tactics to avoid this kind of correlation prediction? (Among thousands of factors, which require a lot of money, mathematical stuff etc. to be measured…)
    Why not try to predict the next lottery winner? …

    I’m not trying to discourage you… just some reality please. Don’t confuse people.
