Because TheOpenAlgorithm project is a science based project I have decided to follow recommended scientific protocol and keep a log book of my day-to-day activities on the project.
This is very different from my blog, in that it will contain very short, mostly boring logs of my activity for the day, rarely will it be the source of insight or genius but it will be handy for anyone who wants to see what I’m doing and the work involved.
It’s not beautifully written or edited for human consumption but if you want to know what goes on behind the scenes this is it.
I’ll be posting all the insights and findings on the blog.
April 2nd: Ran some scripts calling the SEOMoz API for the extra URL Metrics API calls they added a while back.
Wrote some functions analysing the gathered SEOMoz data to a further degree, looking beyond just the provided data to ratios/relationships between different data points, etc.
April 1st: Running the correlation calculation functions over the Diffbot, on page (HTML) and SEOMoz data.
Finishing off the data and conclusions pages on the GSF site.
Finalising the data visualisation and the spreadsheets.
After 40 hours straight of hard work, with 5 hours sleep, I have finally finished the majority of the required testing and computation and submitted the project to the Google Science Fair.
I am away on holidays from Thursday so I will write up my results and share them on the blog when I get back in approx. 2 weeks.
Some really interesting results, can’t wait to share them with you.
March 31st: Final day of crawling and scraping, I’ll also be simultaneously testing correlations. It looks like I’ll have all the data in by later tonight, whether I’ll have the time to get through it, computing all the correlations is another question, sounds like some late night coding is ahead.
Scripts running as I work on the GSF video. It’s looking quite nifty, but only 1/3rd of the way complete.
I now have all the factors that I have the correlations computed for uploaded and graphed on iCharts.net.
Diffbot and HTML scrapers finished and now preparing for calculating correlations.
Rewrote code to suit the given time frame and the staggering number of databases containing the Diffbot and HTML data, so that it computes the correlations of several factors at once. It will take several hours to get through the full dataset just for the Diffbot factors, but I have it running. I’ll get the HTML ones running now and then head to bed; hopefully there won’t be any errors and when I wake up all the remaining computations will be complete.
March 30th: Up early, made sure scripts got an extra 45 mins of work in, then I went to the office and got the Google + scripts running first as they need the proxies only accessible through the office network. Then I got the rest of the scripts up and running. Hoping the majority of them will be finished by this evening so I can start testing for correlations tonight.
The Diffbot calls will probably take until tomorrow, but if I have most of the other data available I should be working at 100% all night even without all the Diffbot data to test against.
Have to go home so I sent myself the spreadsheet with all the current correlations, in the next couple of hours I’ll get them up on iCharts ready to rock.
I now have the remaining Google + scrapers split into 4, quadrupling my processing speed. The Moz, HTML and Diffbot crawlers are all still running away.
Google + scrapers finished, I also tested whether having a .txt file is correlated with ranking.
Calculated the Spearman for a number of file types visible in the URL e.g. .php, .asp, etc. With the Google + scrapers finished I calculated the Spearman for the number of Google +’s for a web page.
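As a rough illustration of the kind of calculation described here, the sketch below computes Spearman's rank correlation between ranking position and a factor (for example a Google + count) for one keyword's results. The function names and the sample numbers are made up for illustration; this is not the project's actual code or data. Tied values get their average rank.

```python
def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns None when either variable is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: positions 1-5 for one keyword and a made-up count per result.
positions = [1, 2, 3, 4, 5]
plus_counts = [50, 40, 30, 20, 10]
rho = spearman(positions, plus_counts)
```

Because position 1 is the best ranking, a negative value here means that a higher count goes with a better ranking. Averaging this value over many keywords gives a mean Spearman for the factor.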
SEOMoz updated their rate limits to 10 calls/second, which is great, so I updated the scripts and got them up and running at the higher speed.
March 29th: Ran a script over all the HTML databases to find any URLs that weren’t crawled, for whatever reason. Collected the list into a text file and used existing code to go and fetch them, storing the HTML in a database. The remaining Diffbot functions are still running away as is the SEOMoz API script.
Both of the above should be done by late tomorrow and as I have a full day tomorrow, Saturday and Sunday to work on the project I hope to have another 100 or so factors tested by Sunday night.
While the remaining scripts are finishing I’ll finish off the Google Science Fair site, except for the pages requiring all the data and findings.
Finished another page on the site and still have the scripts running here, the Moz functions are about 15% done with over 130,000 URLs called to the URL Metrics API. The HTML fetchers which are calling the URLs that were left behind in the original crawl are nearly finished. The Diffbot calls are a little behind but when the HTML fetchers are finished by tomorrow I can step up the rate I call to Diffbot and have them done by Saturday afternoon. Then it will be late nights and no sleep till Sunday at 12:00, testing all the remaining factors.
March 28th: In the office. All the HTML fetcher scripts are now finished and I’m currently merging all the HTML databases so as to have one database that I can call to when I am testing the various factors.
Brought laptop home for analyses. Didn’t get rate limits extended, but followed up on the helpful suggestion that I should use post requests to send 10 URLs at a time to the SEOMoz API. Cloned the Moz Github Python code and got it running the way I wanted it to.
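Sending 10 URLs per request comes down to cutting the URL list into groups of 10. A minimal sketch of that batching step is below; the helper name is hypothetical and the actual POST to the API is deliberately left out, since the request format is not described in this log.

```python
def batches(items, size=10):
    """Yield consecutive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical URLs, only to show the grouping.
urls = ["http://example.com/page%d" % i for i in range(25)]
groups = list(batches(urls))
```

With 25 URLs this gives three groups of 10, 10 and 5, so 25 single calls become 3 requests.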
Now it’s just a case of letting rip and hoping it all runs smoothly.
March 26th and 27th: Computer still in office but I called over on Monday evening to check everything was running ok which it was. Worked on the site some more from home.
Contacted SEOMoz about getting my rate limits on the API removed so that I can do my calling in a shorter time-frame.
March 25th 2012: In the office, the HTML scrapers are almost finished and the SEOMoz API scripts have completed close to 1 million URLs since yesterday.
With a number of the HTML scrapers finished I’ve got the Google Plus scrapers and some Facebook scrapers back up and running again.
March 24th 2012: Big rush on for the GSF deadline, hope to finish the HTML and diffbot fetching by tomorrow and start the correlation analysis then. I also need to start calling to the SEOMoz API and really need to have that done by Tuesday to allow enough time for analysis.
I don’t think I’ll have enough time to do the social media factors by Sunday, but I’ll try and if not I can just publish that data after the deadline.
Tested the SEOMoz API again and following what was probably some time for my account to be upgraded I now have access to the Site Intelligence API.
Wrote a basic function to call the URL Metrics API and convert the response into a Python Dict object.
Started running the Moz API script. The API is by far the fastest I’ve called to yet, maybe the fact that it’s a Saturday helps, but my projection of being finished the Moz calls by Tuesday might have to be moved forward, assuming there are no errors. I also have to check out the other APIs as currently I’m only calling the URL metrics API.
Now have 5 Moz scripts running simultaneously.
Finished the “experiment” page on the GSF site.
Already retrieved 140,000 URLs from the URL metrics call on the SEOMoz API.
March 23rd 2012: Computer still in office, but I called in on the way back from training to get all the scripts up and running again. Most had stopped so I cut the less than critical ones like the w3 validator API functions and kept only the diffbot and HTML fetchers functions that I hope to have completed by next Friday.
March 22nd 2012: Couple of hours working on GSF video. I’m not a great video producer but hopefully the content will override the design.
March 21st 2012: In the office. Checked on the scripts, a number had stopped due to memory issues, so I got them up and running again. One of the HTML scrapers was finished and the others should be done either later today or early tomorrow.
Working on the Google Science Fair video, finished the script.
Got my access to the SEOMoz site intelligence API with 4,000,000 calls at my disposal in the next month, so I got reading the guides and testing out some code. URL normalization seems like it’s going to be an issue again, but I’m pumped to have access and hopefully get some really interesting data.
March 19th and 20th 2012: Laptop in the office so I didn’t do anything on gathering the data or finding results. But I got a ton done on the Google Science Fair site, with another 2 pages completed and the script for the video started.
March 18th 2012: In the office checking on the scripts that ran overnight, another 100,000 or so URLs downloaded and the social media and diffbot scrapers also seem to be working away fine.
Updated the scientific log to have the most recent day on top so now users don’t have to scroll to the bottom to find the latest news.
Cleaned up the on page factors functions so once all the HTML has been fetched I will be able to call to the functions and calculate Spearman straight away.
Worked on the GSF site and started developing a flow chart of the whole methodology and list of factors tested.
March 17th 2012: In the office checking up on the scripts. They ran relatively well, nearly half way through the HTML fetching and the Google Plus Count fetcher. The majority of the scripts stopped due to a memory issue, so I updated the programs and got them running again.
Followed up with SEOMoz about getting that API access. 13 days until Google Science Fair deadline, so I hope to get a ton of factors tested over this long-weekend.
March 15th and 16th 2012: Laptop in office so I did a bit of work on the Google Science Fair site on the desktop at home. Now done 3 out of seven of the required pages. I need my full set of results to complete the others.
March 14th 2012: Half-day from school today, so I head over to the office to check on the scripts. They’ve all run perfectly for the last couple of days and I now have 200,000+ webpages downloaded, the Google + counts of 100,000+ URLs and the text extracted from approx. 50,000 URLs.
One of the scripts crashed due to a database error, so I create a new database and restart the script from where it left off. The Facebook API fetcher seems super slow for some reason, I’ll investigate that later, the data has taken up 20gb to date.
Split the Diffbot calls in 4, quadrupling the speed at which Diffbot will be called and hopefully moving the results from the Diffbot function into a similar time frame to the other calls/data being downloaded to the hard-drive.
March 13th 2012: Didn’t have my own laptop, so I did some online stuff. I completed two of the 8 pages required for submission to the Google Science Fair and updated the site’s page names. All that’s left for submission is the remaining results, the 2 minute video and the other 6 pages.
March 12th 2012: Went over to the office again, I knew the scripts had crashed so I came up with a neat solution to stop it happening. It seems to be a temporary memory issue, so I implemented 100-second time stops every 250 web pages retrieved and saved the database into full memory during that period.
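A minimal sketch of a fetch loop that pauses every 250 pages is below. The `fetch`, `save` and `on_pause` callables are hypothetical stand-ins (the real scripts and their database handling are not shown in this log), and `sleep` is passed in so the loop can be tried without actually waiting.

```python
import time


def fetch_all(urls, fetch, save, pause_every=250, pause_seconds=100,
              sleep=time.sleep, on_pause=None):
    """Fetch each URL, pausing for pause_seconds after every pause_every pages."""
    for count, url in enumerate(urls, start=1):
        save(url, fetch(url))
        if count % pause_every == 0:
            if on_pause is not None:
                on_pause()  # e.g. flush or sync the database before resting
            sleep(pause_seconds)
```

The pause gives the process a regular point at which to write out what it holds before carrying on.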
Hopefully this will solve the issues I’ve been having.
March 11th 2012: Without the computer, which is at the office I can’t do much. But I went over in the evening to check the status of the programs.
All had crashed, so I fixed the errors and got them running again.
March 10th 2012: Headed over to my parents’ office to run my HTML fetcher and Diffbot scripts on the more friendly, faster internet connection.
Looking into getting myself a ton of proxies to do some scraping of social media sharing counts through the various social media APIs without running into limitation issues.
Bought 100 shared proxies from buyproxies.org, after reading a bunch of reviews they seem to be the top provider out there. Time to test them out now and see how we go.
Found a real goodie from the old Facebook REST API, link.getstats, which actually returns the breakdown between shares and likes and you can also send multiple URLs at one time, sweet!
Set up my proxies and implemented them using random selection in the Social Media API functions file with all my pre-written social media functions.
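Random proxy selection can be sketched as below. The addresses are placeholders and the helper name is made up. The returned dictionary has the scheme-to-proxy-URL shape that both the `requests` library (its `proxies=` argument) and `urllib.request.ProxyHandler` accept; proxy authentication is left out.

```python
import random


def pick_proxy(proxies, rng=random):
    """Choose one proxy at random and return a scheme -> proxy URL mapping."""
    address = rng.choice(proxies)
    proxy_url = "http://" + address
    return {"http": proxy_url, "https": proxy_url}


# Placeholder addresses, not real proxies.
PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
chosen = pick_proxy(PROXIES)
```

Picking a fresh proxy for each call spreads the requests across the pool, so no single address carries all the traffic to an API.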
Then I started running the Twitter and Google + functions and I’m just about to re-write the Facebook function to call to the old REST API. I’ll need to test the maximum number of URLs I can submit per call first though.
Got the Facebook and Google+ scripts running but the Twitter one was taking too long so I had to kill it. I’m leaving 8 scripts running overnight and I’ll revisit the Twitter issue with a new solution tomorrow.
I’m going home to whiteboard a solution to the Twitter issue now.
I should have 150,000 web pages downloaded by tomorrow evening and around 3 times that number of web pages with their social media data on my hard-drive.
March 9th 2012: Finished downloading and merging all the Link Research Tools reports.
With that data I wrote some scripts to analyse the Spearman for 24 factors all based on homepage level metrics. Some of them were highly correlated and I can’t wait to get my hands on the SEOMoz API and compare some of the domain level to page level social metrics and see if they match up.
Other old school factors were also highly correlated, and there were some surprising correlations on link destinations and their attributes within the homepage, again I am looking forward to hopefully verifying this correlation with similar findings on the page level data.
By the time I got to bed (6:30am) it was bright again, too busy testing/coding to notice, not a good sign.
March 7th 2012: Ran some tests overnight on the HTML retriever script that fetches the HTML of all pages in the database. It seems to time out/get some sort of block after making on average 700 requests to the world wide web for documents. I expect it is an ISP problem with some sort of rate limiting, so I will wait to the weekend, head over to the office and try the same scripts on a different internet connection.
Wrote and began running a Diffbot collector script, that retrieves the Diffbot extracted text and stores it in a database for each URL in the dataset. It will take a while to run and while I only have one call running at a time, I will leave it running 24/7 for the next couple of days and see where we’re at, if that’s not enough I’ll send multiple requests at a time.
Finished sending the remaining 100,000 or so domains to Link Research Tools API. The reports should be ready by morning when I’ll collect and analyse them.
March 4th 2012: Bought a 1TB external hard drive to store all the data I am downloading, in particular the HTML of all the URLs and also to have a portable backup of all the data. It cost €150 and considering it has about 5 times the storage of my computer and was incredibly easy to set-up it was well worth it.
Got home and set up 5 scripts that simultaneously scrape the HTML of all the 1.2 million web pages in the dataset and write them to separate databases. This minimises the effect of the time wasted in waiting for a response from the server to provide the page.
It’s fascinating to watch the program print out the URL it just retrieved every couple of seconds.
Ran the Link Research Tools recovery script and once that’s done I’ll submit the remaining URLs to the API and download the reports tomorrow afternoon.
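One simple way to divide a URL list between several scripts is to deal the URLs out round-robin, so each script gets its own share and writes to its own database. The sketch below shows only that split; the helper name is hypothetical and how the real scripts divided the work is not stated here.

```python
def shard(items, parts):
    """Split items into `parts` lists by dealing them out round-robin."""
    return [items[i::parts] for i in range(parts)]


# Hypothetical URLs split between 5 scripts.
urls = ["http://example.com/%d" % i for i in range(12)]
shares = shard(urls, 5)
```

Every URL lands in exactly one share, and the shares differ in size by at most one.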
March 3rd 2012: Ran some more page name and page path correlation calculations. Finished the remaining URL level keyword usage and other related factors correlation data tests.
Emailed Rand Fishkin about access to the amazing SEOMoz API for link based factors, he had already offered the API but we hadn’t got into the details so we finalised limitations details, etc and I’ll be ready to harvest some data from the API pretty soon.
Answered a bunch of questions and criticisms of the data on Google + and in the comments of the post. Some good comments, some less valid but again lots of support.
Ran a recovery on Link Research Tools files and wrote a script to get the remaining URLs requiring tests.
March 2nd 2012: Answered a bunch of comments and questions on my correlation data. Some interesting questions and great support.
March 1st 2012: Following and answering questions on the latest blog post, over 300 people have viewed it already. An incredible number of people interact with the post, if you count the number of comments, Tweets and other social media interaction nearly 20% of people who read the post do something with it which is pretty cool.
February 29th 2012: Re-tested more factors. Published post on the science behind a correlation study which is both an interesting post and a handy place to send people new to the site if they don’t know much about correlation studies. The post covers an explanation of correlation, a sample calculation, disadvantages of using Spearman/correlation and how I am scaling correlation to millions of web pages.
Spent a few hours writing my first correlation data release post. The post looks at 20 domain related factors with four graphs and 2,000+ words. Hopefully people will find it interesting and will enjoy the data.
February 28th 2012: Re-tested a dozen+ of domain related factors with the updated database, mildly different but more accurate findings.
Completed page listing all my supporters and published it.
February 27th 2012: Wrote a neat function that takes three modes of input from pre-written functions and does all the work to calculate the Spearman for that function.
Updated database to remove all the Universal, Google maps and YouTube videos from the database and now I’m updating the Spearman calculations for all the previously tested factors on the updated database.
February 26th 2012: Updated HTML fetcher script to automatically handle exceptions with things like too many redirects, connection errors, page not found, etc.
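A minimal sketch of that kind of exception handling, written against Python 3's standard `urllib` rather than the HTTP requests library mentioned elsewhere in this log. The function name and the error strings are made up. Every failure is turned into an error string instead of stopping the crawl.

```python
import urllib.error
import urllib.request


def fetch_html(url, opener=urllib.request.urlopen, timeout=30):
    """Return (html, error); exactly one of the two is None."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace"), None
    except urllib.error.HTTPError as exc:
        # Page not found, server errors, and redirect loops reported by urllib.
        return None, "http error %d" % exc.code
    except urllib.error.URLError as exc:
        # DNS failures, refused connections and similar.
        return None, "connection error: %s" % exc.reason
    except OSError as exc:
        # Timeouts and other socket-level problems.
        return None, "network error: %s" % exc
    except ValueError as exc:
        # Malformed URLs, e.g. a missing scheme.
        return None, "bad url: %s" % exc
```

`HTTPError` is caught before `URLError` because it is a subclass of it. The caller can log the error string against the URL and retry those URLs in a later pass.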
The data downloaded is looking like it will take up around 60gb of space on my hard drive which is around 40% of my disk space, so I’ll have to get my hands on an external hard drive. I currently have an 80gb one but my computer can’t find it and even with the many tutorials found online it still won’t set up properly.
Cleared out 25 gigs on my local disk.
February 25th 2012: Updated some old functions, spotted a hidden error in the name string extractor functions that was extracting the full path after the domain name vs. the path minus the page name which it should have been doing. Fixed that and then re-ran some of the tests I ran yesterday as they would have been impacted by this error.
Also I collected a list of all the URLs I extracted and then removed duplicates which will allow me to be more efficient in my calls to Diffbot and other APIs because I won’t be repeating calls for the same URLs.
There were 1,241,235 URLs, but when I removed the URLs that appeared more than once I was left with 1,024,033 URLs. This means that if I am using an API I will only need to make 1,024,033 calls vs 1.2 million which is a huge saving.
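Removing duplicates while keeping the first occurrence of each URL can be sketched in one line of Python 3. The helper name is hypothetical, and it treats URLs as duplicates only when the strings are identical.

```python
def unique_urls(urls):
    """Drop repeated URLs, keeping the first occurrence and the original order."""
    return list(dict.fromkeys(urls))


# The saving quoted above, worked through.
total, unique = 1241235, 1024033
saved_calls = total - unique               # 217,202 fewer calls
saved_percent = 100.0 * saved_calls / total  # roughly 17.5%
```

So the de-duplicated list cuts the number of API calls by 217,202, roughly 17.5%.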
Wrote a short script to get the HTML of all the URLs and save them to a database. It will probably take a week or so to run because the HTTP requests library has to go and get the web page and the HTML of that web page and then store it. Testing my computer to the max here and the relatively slow internet connection probably won’t help but thankfully the script doesn’t seem to stop me from working on other stuff on the computer which means it won’t impact me from doing other work.
February 24th 2012: Using code from yesterday and some reworking I tested for 16 factors and recorded the Spearman for each factor. Most of them are related to keyword use in the URL from page names to keywords in a subdomain.
Significant amount of tests run today, I’m finished testing the domain/URL related factors now I’ll move onto more dynamic factors to test for.
February 23rd 2012: Wrote scripts to calculate Spearman for 10 factors mostly relating to hyphens, underscores and numbers in URLs and their various parts e.g. the page name.
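Factors like these come down to counting characters in different parts of a URL. A minimal sketch using Python 3's standard `urllib.parse` is below; the function name and the exact split into parts are assumptions for illustration.

```python
from urllib.parse import urlparse


def url_character_counts(url):
    """Count hyphens, underscores and digits in the whole URL, domain and page name."""
    parsed = urlparse(url)
    page_name = parsed.path.rsplit("/", 1)[-1]  # text after the last slash

    def counts(text):
        return {
            "hyphens": text.count("-"),
            "underscores": text.count("_"),
            "digits": sum(ch.isdigit() for ch in text),
        }

    return {
        "url": counts(url),
        "domain": counts(parsed.netloc),
        "page_name": counts(page_name),
    }


stats = url_character_counts("http://my-site.example.com/blog_posts/top-10-tips_2012.html")
```

Each count then becomes one value per search result, ready to be ranked against position.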
Ran scripts and recorded data, interesting correlations for hyphens and numbers in URLs.
February 22nd 2012: Sent more requests to the Link Research Tools API.
Had a very good hour long interview/chat with Dr. Matt Peters, the in house data scientist at SEOMoz.
Calculated Spearman for page length in characters.
February 21st 2012: Published the domain data blog post with information on domain TLD’s and average lengths in characters.
Calculated Spearman for length of domain name, number of hyphens in domain name, is the keyword the first word in the domain name, number of trailing slashes and the number of characters in the breadcrumbs/name string.
Sent 150 requests to the Link Research Tools bulk link analysis tool. 1,000 domains per request. Currently I’m gathering data on all my base URLs i.e. either the domain or subdomain. Lots of interesting data can be mined using the Cemper/Link Research Tools API.
February 20th 2012: First day back in school after the mid-term, seriously futile occupation, thus didn’t get a lot done.
Small amount of editing done on the “10 Lessons Learnt” post and started a supporters page to recognise all the people who have given time and resources to the project.
Short script to run through two functions that determine a boolean value based on whether or not there is a partial match between a keyword and a domain name, and another function that determines the percentage ratio of this partial match. Both return False/0 if it is an exact match domain.
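The exact matching rules are not spelled out here, so the sketch below is one possible reading rather than the functions themselves. It assumes that a partial match means at least one word of the keyword appears inside the domain name, that the percentage is the matched keyword characters over the domain name length, and that an exact match is the keyword with its spaces removed. All names are hypothetical.

```python
def _words(keyword):
    return keyword.lower().split()


def is_exact_match(keyword, domain_name):
    """Assumed definition: keyword with spaces removed equals the domain name."""
    return "".join(_words(keyword)) == domain_name.lower()


def partial_match(keyword, domain_name):
    """True if some keyword word is in the name, False for exact match domains."""
    if is_exact_match(keyword, domain_name):
        return False
    name = domain_name.lower()
    return any(word in name for word in _words(keyword))


def partial_match_percentage(keyword, domain_name):
    """Matched keyword characters as a percentage of the name, 0 for exact matches."""
    name = domain_name.lower()
    if not name or is_exact_match(keyword, domain_name):
        return 0.0
    matched = sum(len(word) for word in set(_words(keyword)) if word in name)
    return min(100.0, 100.0 * matched / len(name))
```

For the keyword "cheap flights" and the name "cheapflightsnow", both words match, covering 12 of 15 characters, so the percentage is 80.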
Calculated mean spearman for the above two factors.
February 19th 2012: Touched up the “10 things I learnt” post and tested out some reports on the Link Research Tools API. Contemplating just running the program using their GUI seeing as the reports are merely initialised via the API and you have to go to your account to get the results.
This poses a problem in terms of scalability and also the programmatic problem (although not too difficult) of running scripts through their data files after download.
It would have been nicer for them to have an API you could just call and get a response to but it’s still pretty cool and very good value.
February 18th 2012: Worked on the methodology for requesting Link Research Tools API and wrote it up on my whiteboard. Dug deep into their literature and I’ll sign up for a free account just to test my knowledge and iron out any bugs because you get 250 queries/month to the API but you can send 1,000 URLs per query. So each query is very valuable and I don’t want to waste any of them.
It’s a really good API although the documentation could be slightly better. There are a lot of neat functions and calls I can make that will save me hard coding them. For example things like PageRank or Twitter/Facebook share counts can all be called through their API which is very handy.
Wrote the scripts and calculated Spearman for exact match .com, .org and .net domains and for just being a .com, .org, .net, .us or .info domain.
Furthered the domain related post and included the distribution of domain lengths. The average domain name was 14.75 characters long and the modal length was 8 characters.
Calculated Spearman for hyphenated exact match domains and .co, .org and .net hyphenated exact match domains.
Started post on 10 lessons I’ve learnt from the first parse.
Finished post on 10 lessons learnt, I’ll polish it up and check grammar tomorrow, it’s 4am now so time to get to bed.
February 17th 2012: Extracted the top domain TLD’s and ranked in order of number of sites in the dataset with them as their ending.
Started writing a post, graphs and all on these findings and other information on domains and TLD’s.
Wrote some scripts to store the domain ending data in the database for easy and further analysis.
More comments and emails to respond to.
February 16th 2012: Wrote some more scripts to analyse the dataset and read up on the Link Research Tools API. Had a good few emails, comments, etc. to respond to so spent some time doing that.
Did my first Spearman calculation on the data. Exact match domains. I’ll be doing further calculations and publishing the data in a post later in the week.
February 15th 2012: Finished off the blog post with some data analysis of top sites in the parse (as opposed to domains). Fiddled with some old scripts from the domain analysis and got the site data from the database, exported to Excel and some more beautiful data visualisation in Tableau.
Totally revamped the about page, now much more up to date and hopefully more interesting. Pretty much everything but the structure was changed, removed one video and included a handy Spearman example on the correlation between PageRank and ranking.
Published post and got an amazing reaction considering it was my first proper one to the blog. A number of top SEOs responded and a couple of hundred people had viewed the post in the first 2 hours.
Rand Fishkin CEO & Founder of industry leader SEOMoz got in touch and offered free access to their comprehensive link data API and a chance to talk with their in house SEO scientist who ran a similar study.
Bounced a few emails around with Matt (the SEO scientist) and arranged for a Skype meeting sometime next week.
Turns out that the strange URLs I spotted in my Google parse were in fact Google’s new SERP result URLs as spotted by Seer Interactive. Got in touch with Chris Le at Seer about it and sent over some questions on how to update the parser script and write a post about the new URLs.
February 14th 2012: Programmed that solution from last night and spent the next 3 hours looking over the parse data and doing further cleaning. 22 of the 1.2 million URLs were hosted on ftp://, a whole bunch of them redirected through www.google.com/url=, with a whole bunch of parameters in what seems to be an effort to track these results/clicks.
Lots of other interesting data/anomalies being spotted, recorded and sometimes cleaned out of the database. It’s important to get this dataset exactly right because it forms the basis for all future results.
Started writing a post on some of the interesting data I have found and what some of it could mean.
Created some short scripts to extract positions and total occurrences of Google news, images and product searches as part of universal search. Some small mathematical analysis and some Tableau graphs later and the blog post is nearly done.
All that’s left was to write a script that looks at the top domains and subdomains (separately) extracts all their occurrences and where they ranked each time they showed up and then stores that in a shelve database. Then I wrote another script that totted up total occurrences and found the top sites in the parse.
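The two steps can be sketched as below: collect every position at which each domain appeared, then total the occurrences and work out the mean position. This is an illustration with made-up names; it keeps everything in an in-memory dictionary rather than the shelve database mentioned above.

```python
from collections import defaultdict


def tally_positions(results):
    """results is an iterable of (domain, position) pairs from the parse."""
    positions = defaultdict(list)
    for domain, position in results:
        positions[domain].append(position)
    return positions


def top_sites(positions, limit=50):
    """Return (domain, occurrences, mean position) rows, most frequent first."""
    summary = [
        (domain, len(ranks), sum(ranks) / len(ranks))
        for domain, ranks in positions.items()
    ]
    summary.sort(key=lambda row: (-row[1], row[0]))
    return summary[:limit]


# Hypothetical parse rows.
rows = [("a.com", 1), ("b.com", 2), ("a.com", 3)]
ranking = top_sites(tally_positions(rows))
```

Ties in the number of occurrences are broken alphabetically so the output order is stable between runs.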
Some more graphs later and the blog post is full of interactive charts, graphs, etc. Completed blog post, with a graph of the top 50 domains by # of search results and Excel sheet of all 342,740 domains.
Caught a simple error and cleaned the database: for a lot of the keywords there was an h3 tag of class r containing related searches/results at the bottom of the SERP, that’s no good to me so these results were eliminated from the data set.
February 13th 2012: Playing around, testing and checking the Google parse data. Overall it seems the parse did a good job.
The program found it impossible to count to 100 with Google’s erratic counting and didn’t distinguish between different types of results, it just kept every h3, but I knew that when I programmed it and until Google explain how they count search results we won’t know any better.
Just cleaning up database files now, plenty of errors with weird results e.g. Google returned a web page with no domain name, just the IP address as the host, which confused my strip to base domain program to no end.
Spent an hour working on a program to extract data I wanted to analyse like if the result is a google.com URL and if so what position is it ranked, what domains rank most and what’s their mean position, etc.
Then just as I finished it up, I came up with a far more efficient and effective solution which I’ll programme in the morning.
February 12th 2012:
- 4 functions that integrated with the Diffbot API, that extract the raw text on a page, the server response time and any images or videos in the content area of the page.
- 3 beautiful functions to interact with the w3 standards API to check a web page’s HTML error and warning count and check its doctype.
- Wrote a function that feeds off a previous function that reduced a URL to that URL’s base domain including any subdomain, the new function removes down to the base domain name without the subdomain.
- Checks whether a site is in the Dmoz directory or not.
- Checks whether a given keyword is an exact match for any of the words extracted by the Diffbot API, another function checks for partial matches e.g. “manager” is in “managers” but the previous function would return False.
- Two keyword density functions on the back of a Diffbot call, one calculating the exact match keyword density i.e. what percentage of the words on the page are X, and another that includes plurals and variations of that word.
- Counts the number of words on a given page.
- Checks for the presence of images or videos on the page.
- Calculates the percentage distance to the first exact and partial keyword matches.
- Counts the number of HTML characters on a page.
- Finds the text to HTML ratio of a page.
- Percentage ratio of W3 warnings and errors on a page to HTML and the same ratio to the number of characters extracted by Diffbot’s Article API.
- In the process of writing a function that checks for the presence of bad words from the “official list of Google’s bad words” I caught an error with the HTTP requests library where it doesn’t recognise some ascii character returned from Diffbot. Put in an exception that switches to Urllib2 if this error is returned.
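Several of the on-page functions listed above come down to simple arithmetic on the extracted text. The sketch below illustrates four of them under stated assumptions: words are runs of letters, digits and apostrophes, a partial match means the keyword appears inside a word, and the names are invented for this example. The text is passed in directly; no Diffbot call is made.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def exact_keyword_density(text, keyword):
    """Percentage of words on the page that are exactly the keyword."""
    words = tokenize(text)
    if not words:
        return 0.0
    return 100.0 * words.count(keyword.lower()) / len(words)


def partial_keyword_density(text, keyword):
    """Percentage of words that contain the keyword, e.g. plurals."""
    words = tokenize(text)
    if not words:
        return 0.0
    key = keyword.lower()
    return 100.0 * sum(key in word for word in words) / len(words)


def distance_to_first_match(text, keyword):
    """How far into the text the first exact match sits, as a percentage."""
    words = tokenize(text)
    key = keyword.lower()
    for index, word in enumerate(words):
        if word == key:
            return 100.0 * index / len(words)
    return None  # keyword never appears


def text_to_html_ratio(text, html):
    """Characters of extracted text as a percentage of characters of HTML."""
    return 100.0 * len(text) / len(html) if html else 0.0
```

For the sentence "The manager met two managers and another manager" and the keyword "manager", the exact density is 25% (2 of 8 words) and the partial density is 37.5% (3 of 8 words).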
I also completed the Google parse, so during my Midterm week off I can get cracking on running some tests.
Installed Python on another laptop in the house so I have double computing power as a back up in case I don’t have the power/speed on my machine to run 1 million+ web pages through a couple of hundred tests.
February 8th 2012:
Wrote some more functions:
- Gathers all links in the body of the page.
- Extracts the outbound links from all the links gathered.
- Calculates the ratio between the number of nofollowed outbound links and total outbound links.
- Returns ratio of outbound to total links.
- Extracts anchor text of all links on page and all outbound links on page.
- Checks whether a given keyword is in outbound, internal or all anchor text.
- Calculates ratio between the number of occurrences of a keyword * length in characters of that keyword and the length in characters of internal, outbound and total anchor text.
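The link functions listed above can be sketched with Python 3's standard `html.parser`. This is an illustration, not the original functions. It collects every anchor in the document rather than only those in the body, and it treats a link as outbound when its host differs from the page's host, so `www.example.com` and `example.com` count as different sites. All names are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href, rel values and anchor text for every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            if attrs.get("href"):
                rel = (attrs.get("rel") or "").lower().split()
                self._current = {"href": attrs["href"], "rel": rel, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def link_stats(html, page_url):
    """Return link counts, ratios and outbound anchor text for one page."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc.lower()
    outbound = []
    for link in collector.links:
        target = urlparse(urljoin(page_url, link["href"]))
        if target.scheme in ("http", "https") and target.netloc.lower() != host:
            outbound.append(link)
    nofollowed = [link for link in outbound if "nofollow" in link["rel"]]
    total = len(collector.links)
    return {
        "total_links": total,
        "outbound_links": len(outbound),
        "outbound_ratio": len(outbound) / total if total else 0.0,
        "nofollow_ratio": len(nofollowed) / len(outbound) if outbound else 0.0,
        "outbound_anchor_text": [link["text"].strip() for link in outbound],
    }
```

Relative links such as `/about` are resolved against the page URL first, so they count as internal.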
February 5th 2012: After a couple of days off I’m back into the swing of things. I cleaned up the Google parser and removed some nice features such as random user agent strings and checking the time to increase the efficiency of the program because my computer simply isn’t powerful enough to do a full scrape and include those peripheries in a reasonable time frame.
This is a bit of a pity from the test’s point of view but won’t significantly impact results. In future tests, when I have access to more powerful computers I’ll be able to test and use these factors.
Wrote three functions to identify if a URL is a blog hosted on wordpress.com, blogger.com or tumblr.com. This will allow me to test which types of blogging software Google likes with some degree of accuracy. Obviously the types of blogs/sites on each platform is different but it should give us an overall idea of what Google likes. Unfortunately this doesn’t test for self hosted sites but that might be added at a later date.
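A check like this can be sketched by looking at the hostname of the URL. The version below folds the three checks into one hypothetical function and assumes that hosted blogs live on a subdomain of the platform; Blogger-hosted blogs are served from `blogspot.com`, so that is the suffix matched for blogger.com.

```python
from urllib.parse import urlparse

# Hostname suffix -> platform name. The mapping is an assumption for this sketch.
HOSTED_PLATFORMS = {
    "wordpress.com": "wordpress.com",
    "blogspot.com": "blogger.com",
    "tumblr.com": "tumblr.com",
}


def hosted_blog_platform(url):
    """Return the hosting platform for a URL, or None if it is not a hosted blog."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, platform in HOSTED_PLATFORMS.items():
        if host.endswith("." + suffix):
            return platform
    return None
```

Requiring a leading dot before the suffix avoids false matches on names such as `notwordpress.com` and skips the platform's own homepage.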
February 2nd 2012: Crawl went well last night, pity the internet connection went down half way through. For some reason this loss of connection caused the database holding the keywords and the data to become corrupted.
Asked question on StackOverflow, no response.
February 1st 2012: Wrote a clever Google Plus function to extract the number of Google +1s a URL has. I got around their API limits by calling their + button counter URL: “https://plusone.google.com/u/0/_/%2B1/fastbutton?count=true&url=”.
Spent a couple of hours debugging, probably bugging again and debugging the new bugs on the Google parser. Didn’t really get anywhere other than around in a circle but there seems to be a proxy authentication issue as the proxy doesn’t seem to be recognising my username and password or it doesn’t understand I’m trying to pass the authentication to it.
After 4 hours of frustrating testing, trying, Google blocking and proxy authentication I have finally got parsing Google with success. There were a few minor syntax bugs that during testing didn’t cause a problem, but you could tell from the results returned something was up.
But all fixed now, leaving the computer running in parent’s office overnight, so hopefully by tomorrow morning it will be done.
Got myself 10 new proxies so that I can crawl Facebook, Twitter and Google + APIs without any limitation issues.
January 31st 2012: Completed programs to pull the shares/likes of a URL and the # of Facebook comments for that URL via the Facebook Open Graph.
Required a bit of a workaround to convert Facebook’s returned data into a dict in Python so that I can quickly call the data I actually want. Implemented some error catchers but I have the feeling Facebook are going to provide some unique errors I haven’t accounted for.
Finished a slightly complicated Twitter API call function that calls the Twitter API for a URL’s Tweet count. But due to some poor URL normalisation by Twitter I had to integrate some basic URL normalisation into my program so that I get decent results from the API.
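The log doesn't say exactly which normalisation steps were used, so the following is only a guess at what "basic URL normalisation" could look like before a URL is sent to a share-count API: trim whitespace, add a missing scheme, lowercase the scheme and host, drop the fragment and make sure the homepage has a "/" path. The function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url):
    """Return a tidied version of url suitable for use as an API lookup key."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url          # assume plain HTTP when no scheme is given
    parts = urlsplit(url)
    host = parts.hostname or ""        # .hostname is already lowercased
    if parts.port:
        host = "%s:%d" % (host, parts.port)
    path = parts.path or "/"           # the homepage gets an explicit "/"
    # The fragment (#...) never reaches the server, so it is dropped.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))
```

The path and query string are left untouched because they can be case sensitive on the target server.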
January 30th 2012: Got full access to the AlchemyAPI today, 30,000 queries per day and their excellent support team is going to try and bump that up to 50,000. So I played around with that for a while and tried to figure out how best to use my queries.
Their vast number of services most of which could be tested for in some form or another left me wondering if there were competing services that do a similar job that would allow me to maximise the Alchemy queries while getting some of the comparable services elsewhere.
That’s when I came across Diffbot which by my reckoning and small testing seems to be the best text extractor API out there. Contacted support about getting educational access.
Mike from Diffbot got back to me and he kindly agreed to give me access to 1.25 million calls to their article API. He seemed like a good, very intelligent guy who was interested in the project.
I had some fun on the Facebook Open Graph it seems very easy to use, but whether the reliability of the data is great I’m not sure. Some of the data I pulled looked a little dodgy and I had only made 30 or so requests. I had been warned about some bad data from FB’s API and it seems like I may have to use proxies to gather useful data in short periods of time.
Contacted Stephen from www.getmelisted.net he’s one of the guys who help me out for free answering Python/programming questions I have. We chatted about my Google scraper and debugged the program to make it more efficient and less cumbersome, just some minor changes that might speed up performance.
One change I made was switching the proxy connection to come via the script through the http requests library vs. via my LAN settings as previously tried.
Wrote two functions to identify the most basic form of rel=author and rel=me markup on a page. Two more functions that return Boolean values based on whether the given URL contains the <noscript> or <noframes> tags.
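As a rough illustration of these four checks, here is a sketch using Python 3's built-in HTML parser. It only looks for rel="author" and rel="me" on a and link tags, which is my reading of "the most basic form" of the markup; the class and function names are made up.

```python
from html.parser import HTMLParser


class TagScanner(HTMLParser):
    """Records every tag name seen and every rel value on <a> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.tags = set()
        self.rels = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)
        if tag in ("a", "link"):
            rel = dict(attrs).get("rel") or ""
            self.rels.update(rel.lower().split())


def scan(html):
    scanner = TagScanner()
    scanner.feed(html)
    return scanner


def has_rel(html, value):
    """True if any <a> or <link> carries the given rel value, e.g. 'author'."""
    return value.lower() in scan(html).rels


def has_tag(html, tag):
    """True if the page contains the given tag, e.g. 'noscript' or 'noframes'."""
    return tag.lower() in scan(html).tags
```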
Completed function to integrate with AlchemyAPI, requesting content language for a URL and returning the language string if available otherwise returning None. Also wrote one to return the calculated overall sentiment of a page.
January 29th 2012: Diagnosed scraping problem, it seems to be a proxy problem, either it’s not doing its job or I was given an incorrect proxy IP. I checked my billing status and despite attempting to make 4,000 connections it only billed me for 65, which means something happened to the proxy connection after 65 attempts. But seeing as there was no apparent error from my side it must have been a bad proxy or something like that.
Found a great app (http://omnidator.appspot.com) that will help me parse Schema.org data from sites and check whether that correlates with ranking. Contacted owner about usage limits. Still in search of a similar app for earlier microdata formats that are still used and supported.
Wrote two functions to determine the length in characters of the full HTML of a page and the length in characters of the <body> of the page.
Downloaded a list of 450+ words Google deems “bad”, I’ll test to see if using any of these on your pages has a harmful effect on ranking.
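A bad word check along these lines could be as simple as the sketch below. The word list passed in is a placeholder (the real list isn't reproduced in this log), the function name is hypothetical, and it assumes matching on whole lowercase words is good enough.

```python
import re


def bad_words_found(text, bad_words):
    """Return the sorted list of bad words that appear as whole words in text."""
    words_on_page = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(words_on_page & {word.lower() for word in bad_words})
```

An empty list back means the page is clean, so bool(bad_words_found(...)) gives the True/False presence test.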
January 28th 2012: Here running the Google scrape as we speak. All seems to be going well although after implementing a time checker and recording the time for every search I do and also entering the user agent string for all searches done in the database the time it is taking to run is longer than anticipated.
At current rates it will take approximately 11 hours to run the full scrape. But it’s worth my Saturday because this scrape will be the basis for all the data and conclusions.
Wrote a number of functions:
- Checks for the presence of the rel=canonical tag and returns boolean value.
- Extracts the rel=canonical desired link if available.
- Three functions to return True or False based on whether the canonical tag contains www., https:// and/or http://
- The following functions for h1, h2, h3 and h4 tags: returns the contents of all the requested heading type tags on the page, checks if a given keyword is in any of those tags, returns the position in terms of closeness to top of page of a given keyword in the requested heading type (h1,h2,h3,h4), returns the number of the requested heading types on the page i.e. how many h1 tags are there on the page and the most interesting test in my opinion is the percentage ratio between the number of occurrences of the keyword in the heading*length of keyword divided by the length of all of the given heading tags on the page.
- Three keyword/image related functions all return boolean values for whether or not a keyword is in an image src, alt or title on a page.
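To give an idea of how the canonical and heading functions might fit together, here is a sketch built on Python 3's standard HTML parser. It is not the project's code: the names are hypothetical, and the ratio follows the description above (keyword occurrences multiplied by keyword length, divided by the total length of the heading text, as a percentage). The image functions are left out.

```python
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Collects the canonical link and the text of one heading type (h1..h4)."""

    def __init__(self, heading="h1"):
        super().__init__()
        self.heading = heading
        self.canonical = None
        self.headings = []
        self._buffer = None  # text of the heading currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and "canonical" in (attrs.get("rel") or "").lower().split():
            self.canonical = attrs.get("href")
        elif tag == self.heading:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.heading and self._buffer is not None:
            self.headings.append("".join(self._buffer).strip())
            self._buffer = None


def parse(html, heading="h1"):
    parser = HeadingParser(heading)
    parser.feed(html)
    return parser


def canonical_flags(html):
    """None if there is no canonical tag, else which prefixes it contains."""
    href = parse(html).canonical
    if href is None:
        return None
    href = href.lower()
    return {"www": "www." in href,
            "https": href.startswith("https://"),
            "http": href.startswith("http://")}


def keyword_heading_position(html, keyword, heading="h1"):
    """1-based position of the first heading containing the keyword, or None."""
    for position, text in enumerate(parse(html, heading).headings, start=1):
        if keyword.lower() in text.lower():
            return position
    return None


def heading_keyword_ratio(html, keyword, heading="h1"):
    """Percentage of heading text (in characters) taken up by the keyword."""
    texts = [text.lower() for text in parse(html, heading).headings]
    total = sum(len(text) for text in texts)
    hits = sum(text.count(keyword.lower()) for text in texts)
    return 100.0 * hits * len(keyword) / total if total else 0.0
```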
Computer storage running out as a result of the large amounts of data I am downloading, I’m going to have to buy an external drive.
Ran 4,000 keywords through the scraper and all seemed fine, my progress reporting seemed stable and in working order but when I checked the database later, after the first 200 keywords, it seemed to stop writing the results to the database.
As of yet it’s unclear what happened. I’m looking at a couple of possibilities and trying to plug any problems in the program, but it doesn’t appear to be programmatic because it worked for the first couple of hundred and all through the scrape it printed what it was supposed to.
January 27th 2012: Ran some testing on the Google parser, all went well on about 400 keywords tested. Still problems with ISP so I’ll have to run the full scraper tomorrow.
January 25th 2012: Downloaded 17,600 keywords via the Google Adwords Keyword Tool. That’s 800 for each of the 22 categories, I downloaded into separate CSVs with the columns competition, global monthly searches and local monthly searches (USA) in exact match search mode.
I then combined the CSV files into one file using a Command Prompt trick and removed the  around each keyword and removed rows containing duplicate keywords. Following the cull and a number of factors not quite providing those 800 keywords there were 12,579 keywords left.
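The same merge and de-duplication could also be done in Python rather than with the Command Prompt. The sketch below is only an illustration: the column name "Keyword" and the characters stripped from around each keyword (square brackets and double quotes) are assumptions, not details from the actual export files.

```python
import csv


def merge_keyword_rows(sources, wrappers='[]"'):
    """Merge CSV sources into one list of rows, keeping the first copy of each keyword.

    sources  -- iterable of open files (or file-like objects) with a header row
    wrappers -- characters stripped from both ends of each keyword (assumed)
    """
    seen = set()
    merged = []
    for source in sources:
        for row in csv.DictReader(source):
            keyword = row["Keyword"].strip().strip(wrappers).strip().lower()
            if keyword and keyword not in seen:
                seen.add(keyword)
                row["Keyword"] = keyword
                merged.append(row)
    return merged
```

With real files, sources would be something like a list of open(path, newline="") handles, one per category CSV.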
The distribution of those keywords by Global Monthly searches:
Been testing and researching all day, mostly trying to figure out how best to store 12,579 search results and easily add data points for each result. I estimate that I will have around 300 million data points in my initial research study so efficient storage and lookup is crucial.
The best solution I have come up with is Shelve, which seems like an easy to use library that stores the data in a dict-like manner, which is exactly what I want.
January 26th 2012: Implemented user agent switching into my Google parsing script. So now every time I do an automatic search the program randomly chooses from a list of 11 user agent strings.
I chose the 11 as one from each of the top web browsers as judged by http://user-agent-string.info. If a browser had a second or third version e.g. IE 7 and 8 in the list then I used two strings from Internet Explorer.
This should make my parsing look even more human to Google.
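The switching itself only needs a random choice per search. In the sketch below the user agent strings are placeholders rather than the 11 real browser strings, and the function name is made up; the returned dict is in the shape that HTTP libraries generally accept as request headers.

```python
import random

# Placeholder strings only; the real list held one string per browser version.
USER_AGENTS = [
    "ExampleBrowser/1.0 (placeholder A)",
    "ExampleBrowser/2.0 (placeholder B)",
    "OtherBrowser/7.0 (placeholder C)",
]


def random_headers(user_agents=USER_AGENTS):
    """Build request headers with a user agent picked at random for this search."""
    return {"User-Agent": random.choice(user_agents)}
```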
Signed-up to the TrustedProxies Google Extractor proxy plan. I tested the service with the free trial and I was pleased with the results. They seem like a good company with decent customer support and relatively low pricing. Plus the proxy was more than easy to use, just requiring basic HTTP authentication.
Created a database of Python dicts containing a keyword and its global and local monthly search volumes. These dicts will then hold the top 100 results in Google for each keyword either by tomorrow or Saturday when I get parsing Google.
I used Shelve to create the database, it was extremely easy to use and very easy to learn how to use, as essentially it is the same as creating dicts in Python, except storing them permanently in a .db file.
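For anyone who hasn't used it, here is a small sketch of the kind of Shelve database described above, written for Python 3. The field names ("global", "local", "results") and function names are my own choices for illustration, not the ones in the project.

```python
import shelve


def save_keyword(db_path, keyword, global_searches, local_searches):
    """Store one keyword with its search volumes and an empty results list."""
    with shelve.open(db_path) as db:
        db[keyword] = {"global": global_searches,
                       "local": local_searches,
                       "results": []}


def add_results(db_path, keyword, urls):
    """Attach the ranked result URLs to a keyword that is already stored."""
    with shelve.open(db_path) as db:
        record = db[keyword]
        record["results"] = list(urls)
        # Reassign the whole record: changes made to a fetched dict are not
        # written back unless the shelf is opened with writeback=True.
        db[keyword] = record
```

The reassignment at the end is the one gotcha: editing the dict you get back from the shelf does not change what is stored on disk.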
While I didn’t get too many new ideas for factors to test, it was useful in seeing how Barry himself gets ahead of the news curve and can predict likely Google action.
For example Barry mentioned how Matt Cutts mentioned on Twitter his anger at content farms a few months before the initial Panda update and other little tips for picking up stories and ideas like that.
January 24th 2012: got back to Barry Schwartz with a time to do our interview.
Updated all the functions I had written to make them case agnostic i.e. making everything (URL, keywords) lowercase. Python treats strings in different cases as different strings, so that potential error is fixed.
Wrote a few title tag and on page factor related functions:
- Fetches HTML of a given URL.
- Another that retrieves the contents of the title tag from the HTML as a string.
- Returns True or False based on whether a page has a title tag or not.
- Gives back the length in characters of the title.
- Checks if a given keyword is in the title tag or not.
- Checks if the keyword is the first string in the title.
- Calculates the percentage ratio of keyword use to title size in characters.
- Returns the proximity of the first occurrence of the keyword to the start of the title in characters.
- Counts the number of keyword occurrences in the title.
- Checks for a meta description tag on the page.
- If there is one, returns the contents of that meta tag.
- Returns the length of the content of the meta description tag in characters.
- Checks if a given keyword is in that description tag.
- Calculates the proximity of the first occurrence of a given keyword to the start of the meta description tag in characters.
- Using that data I wrote another function that returns a boolean value for whether or not a given meta description starts with a given keyword.
- Checks for the presence of a meta keywords tag.
- Two length related functions for the keywords tag, one that counts the number of keywords in the tag and another that counts the number of characters.
- Two checks as to whether a given keyword is in the meta keywords tag, one checks against a string of the keywords tag, so if plurals or variations of spelling are present it will return True. The other checks against the list of keywords in the meta keywords tag and if one element in the list is the same as the given keyword then it returns True, otherwise False.
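Most of the title and meta functions above boil down to pulling a string out of the HTML and running the same handful of keyword checks on it. A sketch of that idea using Python 3's standard library follows; the names are hypothetical, and the percentage ratio assumes the same formula as the other ratio tests (occurrences multiplied by keyword length, divided by the length of the text).

```python
from html.parser import HTMLParser


class HeadTags(HTMLParser):
    """Collects the first <title> and all <meta name=... content=...> tags."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def parse_head(html):
    """Return (title or None, dict of meta name -> content)."""
    parser = HeadTags()
    parser.feed(html)
    parser.close()
    title = parser.title.strip() if parser.title is not None else None
    return title, parser.meta


def keyword_stats(keyword, text):
    """Case-insensitive keyword checks on a title or meta description."""
    keyword, text = keyword.lower(), text.lower()
    count = text.count(keyword)
    return {"present": count > 0,
            "starts_with": text.startswith(keyword),
            "proximity": text.find(keyword),  # characters from the start, -1 if absent
            "count": count,
            "ratio_pct": 100.0 * count * len(keyword) / len(text) if text else 0.0}


def keyword_in_meta_keywords(keyword, content):
    """Return (loose, strict): substring match vs exact match on a list element."""
    keyword, content = keyword.lower(), content.lower()
    loose = keyword in content
    strict = keyword in [item.strip() for item in content.split(",")]
    return loose, strict
```

The loose check returns True for plurals and spelling variations, while the strict one only matches a whole entry in the comma-separated list.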
January 23rd 2012: interview with Kevin Gibbons founder of SEO Optimise. Some interesting thoughts one in particular around the types of links you receive e.g. directory, news, etc.
Also David Davis from Redfly sent back his answers on some questions I emailed him on the project and factors worth testing.
Both seemed pretty positive towards social influence, which means I have to contact Gnip for about the 5th time. They both also seemed relatively sceptical about the value of correlation studies beyond the major factors, that may be because of previous studies, misconceptions or because many have been heralded as fact as opposed to an excellent guide.
Some tough factors like CTR in search results and personalisation factors were mentioned, but it’s unlikely that I will have the resources to gather this data in the initial study. Long-term they are both excellent ideas.
The great guys at Link Research Tools sent over my API details, which they are providing for free to help out with the project. They seem to provide some very interesting data, I’m particularly interested in their industry and city specific link profiles.
That’s Gnip out, didn’t realise it was purely for realtime data, I’ll have to go directly to the APIs for historical data, but from the helpful Gnip customer support service it appears that the number of queries I am requesting will likely be extremely high. Might have to look at using proxies to get around API limit traps, which isn’t something I want to do but might be required.
Called AlchemyAPI to follow up on previous email, left a message.
Wrote 3 subdomain related functions:
- Checks is there a subdomain in a given URL.
- Counts the number of them e.g. http://level3.level2.level1.domainname.com = 3.
- Checks if a given keyword is in each subdomain level and returns a dictionary containing the subdomain levels and whether the level contains the given keyword or not.
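A sketch of the three subdomain functions, under two assumptions of mine: the root domain (e.g. domainname.com) is already known from a separate function, and a leading "www" is not counted as a subdomain. Levels are numbered outwards from the root domain, matching the level1/level2/level3 example above. All names are hypothetical.

```python
from urllib.parse import urlparse


def hostname(url):
    """Lowercased host of a URL, tolerating a missing scheme."""
    if "://" not in url:
        url = "http://" + url
    return urlparse(url).hostname or ""


def subdomain_labels(url, root_domain):
    """Subdomain parts of the host, left to right, ignoring a leading www."""
    host = hostname(url)
    root = root_domain.lower()
    if not host.endswith("." + root):
        return []
    labels = host[:-len(root) - 1].split(".")
    if labels[0] == "www":
        labels = labels[1:]
    return labels


def has_subdomain(url, root_domain):
    return len(subdomain_labels(url, root_domain)) > 0


def count_subdomains(url, root_domain):
    return len(subdomain_labels(url, root_domain))


def keyword_in_levels(keyword, url, root_domain):
    """Map each subdomain level (1 = next to the root domain) to True/False."""
    labels = reversed(subdomain_labels(url, root_domain))
    return {level: keyword.lower() in label
            for level, label in enumerate(labels, start=1)}
```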
Caught a big error with my function that strips a URL down to the root domain. It ignored subdomains and left them included with the root domain result. I had used the function in so many other functions that it was best to simply write a fix and implement it in the affected functions vs. fixing the original problem and rewriting all programs that used the function as it was.
January 21st 2012: Developed a number of functions mostly looking at URL/domain/page level factors like length and characters/keywords within those areas:
- Function that splits a given URL and takes out the page name as a string. Handy for use in other page name related factors.
- Is the page a query/dynamically created page, function that checks if there is a “?” in the page name.
- Is there an underscore in the page name.
- Is there a hyphen in the page name.
- Extracts all characters after the first “/”
- Checks for hyphens after the first “/”
- 3 functions to check whether there are numbers in domain, page name or after the first “/”
- 2 functions to check if a URL contains either https:// or www., particularly interested in seeing if having an SSL certificate i.e. “https://” in your URL impacts ranking, it will also be interesting to see if this is related to the type of query, for example if it was a transactional query https might be more important than a content related one.
- Count the number of characters after the first “/” after the domain name.
- Checks if a given keyword is in a web page’s name and another to return the percentage ratio between the given keyword and page name. These functions include checks with hyphens and underscores as separators.
- Checks whether the keyword is present after the first “/” after the domain name.
- Finds the page level at which a keyword is present and returns a list containing the numerical values of those levels e.g. keyword = hello, test.com/page/hello/hello123…. would return [2,3]
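A few of the URL functions above, sketched with Python 3's urllib.parse. This is an illustration rather than the project's code, the names are made up, and "page level" is taken to mean the position of a path segment counting from the domain, which reproduces the [2,3] example.

```python
from urllib.parse import urlparse


def split_url(url):
    """Parse a URL, tolerating a missing scheme."""
    if "://" not in url:
        url = "http://" + url
    return urlparse(url)


def page_name(url):
    """Last path segment, ignoring any trailing '/' ('' for a homepage)."""
    return split_url(url).path.rstrip("/").rsplit("/", 1)[-1]


def is_dynamic(url):
    """True if the URL carries a '?' query string."""
    return "?" in url.split("://", 1)[-1]


def has_hyphen_in_page_name(url):
    return "-" in page_name(url)


def has_underscore_in_page_name(url):
    return "_" in page_name(url)


def has_digit(text):
    """Works on a domain, a page name or everything after the first '/'."""
    return any(char.isdigit() for char in text)


def keyword_levels(url, keyword):
    """Path levels (1 = first segment after the domain) containing the keyword."""
    segments = [s for s in split_url(url).path.lower().split("/") if s]
    return [level for level, segment in enumerate(segments, start=1)
            if keyword.lower() in segment]
```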
January 20th 2012: After a day off yesterday, I’m back looking forward to getting a ton done over the weekend.
I chatted with Christoph from http://www.linkresearchtools.com via Skype and he kindly agreed to give me free access to their API at 200,000 domains per month for two months. Hopefully I will be able to condense my queries into a shorter time frame but it was a great gesture and that combined with a number of other companies who are willing to provide free access to data, APIs, services and expertise is really heartening.
In addition Christoph had some interesting ideas on factors worth testing for, in particular testing whether a domain is likely a brand name or not. The Alchemy API I mentioned before may be able to cover this but early tests suggest mixed results so I’ll look around for other options.
In addition Christoph stressed the importance of Google’s separation of sites into particular categories, I had already planned to look at the site’s category as a potential ranking factor and down the line looking at differences in industries over a number of factors.
But it seems likely that I will have access to the necessary data and hopefully time to compare SERPs and types of domains by industry so I plan to publish that data in my first round of research.
It now seems that my research will have to be a more encompassing study looking at the various different Google algorithms based on different markets and countries as opposed to looking at just google.com results. Whether this is possible to test for and whether it fits into scientific guidelines remains to be seen.
January 18th 2012: Contacted a few dozen top SEO experts requesting an interview with them to discuss the project and what potential factors would be worth testing, two positive responses within 10 minutes of delivery. A further two positive responses already, some big names too (I won’t publish the names until they have completed the interview for privacy reasons).
Fixed the proxy bug with the help of http://dcnetworks.ie, as suspected our ISP (Imagine) was blocking the proxy so as a temporary solution I am using a 3G connect card to run tests via proxies and when I want to run large data collection I will have to connect to another network, most likely my parents office network.
Developed six more domain name related functions:
- Checks whether the keyword is the first in the domain name (doesn’t count exact match domain names).
- Counts the number of hyphens in the domain name, returns a numerical value to this effect.
- Counts the number of characters in a domain name (not including domain TLD or start e.g. http:// or www.)
- Checks whether the homepage canonical (canonical function not implemented yet) ends with a “/”
- Finds the depth of a page on a site, this function counts the number of trailing “/” after a domain name (doesn’t count homepages) thus determining how deep on the site the page is.
- Calculate the length of the page name. By looking at all the characters after the last “/” after taking away any trailing “/” you can calculate the length (in characters) of the page.
January 17th 2012: fully debugged the domain TLD function. Now the script generates a Python list from the .txt file previously containing the list of all domain TLDs, making the script more reliable.
Sent email to Adwords API support as they still haven’t approved my request, they responded looking for more information on what I’m going to do with the API and I emailed back.
Completed two functions which check for an exact match domain and a hyphenated exact match domain.
You provide both functions with a keyword phrase and a URL and it returns either “True” or “False” depending on whether the domain is exact match or not and whether the domain is an exact match when the keyword phrase is hyphenated.
Developed two derivative functions from the exact match function one which checks for a partial match and another that calculates the ratio of that match to the overall length of the domain, mitigating against any disproportionate results.
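The exact match family of functions could look something like this. The sketch takes the bare domain name (no http://, www. or TLD) as input, on the assumption that a separate function has already stripped the URL down; the function names are hypothetical, and an exact match also counts as a partial match with a ratio of 1.0.

```python
def keyword_forms(keyword):
    """Return the keyword phrase squashed together and joined with hyphens."""
    words = keyword.lower().split()
    return "".join(words), "-".join(words)


def exact_match(keyword, domain_name):
    """'seo tools' matches the domain name 'seotools'."""
    return domain_name.lower() == keyword_forms(keyword)[0]


def hyphenated_exact_match(keyword, domain_name):
    """'seo tools' matches the domain name 'seo-tools'."""
    return domain_name.lower() == keyword_forms(keyword)[1]


def partial_match(keyword, domain_name):
    """True if either form of the keyword appears anywhere in the domain name."""
    name = domain_name.lower()
    return any(form in name for form in keyword_forms(keyword))


def match_ratio(keyword, domain_name):
    """Length of the matched keyword divided by the length of the domain name."""
    name = domain_name.lower()
    for form in keyword_forms(keyword):
        if form and form in name:
            return len(form) / len(name)
    return 0.0
```

The ratio is what stops a short keyword buried in a long domain name from scoring the same as a near-exact match.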
Signed up for the SEOMoz free API, the best API I have come across yet, so easy and simple to use as well as providing great data.
January 16th 2012: Looking at how to decide what the domain TLD (ending) is. The majority of examples online seem to do a search against a list of domain TLDs like this one. But it would seem more reliable to strip each URL of everything after the first “/” and everything before the first “.” and then compare the remaining characters against the same list, if no match was found you would simply move on to the next “.” e.g. this would happen with a subdomain.
Here’s an example of how my system would work:
http://science.theopenalgorithm.com/not-a-real-page would become:
http://science.theopenalgorithm.com (now let’s take away everything up to the first “.”)
.theopenalgorithm.com (now checks against our list of TLDs, no match, onto the next “.”)
Lovely idea in theory, needs some coding to complete…
Loving this: http://www.alchemyapi.com/ for language detection and page categorization and some other potential factors.
Just finished a lovely little Python function to determine the domain TLD (ending) of any domain. Basically you feed it a URL it strips away any http://, https:// or www. and also anything after the first “/”, then strips the URL of any white space. Then it takes all the characters after the first “.” and cross checks against a list of 4000+ domain TLDs.
Split the TLD function into two functions one which strips the function down to just the domain name without www., http:// or https:// and the other still finds the domain TLD.
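Here is a rough sketch of that two-function split. The TLD set is a tiny made-up sample standing in for the full list of 4000+ endings, and the function names are hypothetical. The loop drops everything up to the next “.” and checks what remains against the list, moving on to the following “.” when there is no match.

```python
# Tiny sample for illustration; the real script loads the full list from a .txt file.
SAMPLE_TLDS = {"com", "org", "net", "ie", "co.uk"}


def strip_to_host(url):
    """Remove http://, https://, www., surrounding white space and any path."""
    url = url.strip().lower()
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[len("www."):]
    return url.split("/", 1)[0]


def domain_tld(url, tlds=SAMPLE_TLDS):
    """Return the domain ending (e.g. 'com' or 'co.uk'), or None if not in the list."""
    rest = strip_to_host(url)
    while "." in rest:
        rest = rest.split(".", 1)[1]   # drop everything up to the next "."
        if rest in tlds:
            return rest
    return None
```

Because the longest remaining string is checked first, example.co.uk comes back as co.uk rather than uk, and a subdomain such as science.theopenalgorithm.com simply takes one extra pass through the loop.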
Working on but not completed a function to decide if a URL was hosted on a domain with an exact match to a given keyword/phrase.
January 15th 2012: Contacted Gnip about hopefully getting free access to their social media API which brings together many social signals and data points that I’ll use to determine correlations. I had read in the past they give free access to worthwhile projects, so fingers crossed.
They seem to be the industry leaders, and learning to deal with only one social API, versus separate ones for Twitter, FB Open Graph, Google +, etc., would save a lot of time; my limited expertise with APIs would probably lead to less than perfect data if I used each one directly. In addition they have partnerships with a number of social media leaders, and judging by some of my past experience with the Twitter API the data I will get should be far more reliable than the raw APIs from companies whose main goal is not building user-friendly APIs.
Now looking at requesting PageRank values from Google, I wrote a script to do this a few months ago, but unfortunately it appears they have updated the URL parameters. You now have to provide a hash value (ch=) when querying toolbarqueries.google.com, which is a bit of a pain but a fun challenge, looking for a compatible script to produce my hash values and then I’m all set for calculating PageRank’s correlation.
PageSpeed data looks much easier to gather, the PageSpeed API looks comprehensive at a glance. Testing begins…
Damn, the PageSpeed API has a 250 quota per day limit, you can ask for more but 6 million requests in a day is probably not going to be approved :)
The Pingdom API could do the trick, the usage limits are mildly restrictive but seems to be accurate http://www.pingdom.com/services/api-documentation-rest
Just found a beautiful PageRank checker http://pagerank.phurix.net with a very effective Python script for retrieving PR, no APIs, just a couple of clean functions.
January 14th 2012: Trying to make HTTP requests to Google search results via the Big-G extractor, which is essentially a maze of proxies trying to make my SERP scraping look natural to Google.
After exhausting myself attempting to get through the proxy I posted my problem to Stack Overflow. The only times I could get through to Google were when the Requests library ignored my proxies and did the request itself. Of course my momentary excitement plummeted when I saw my broadband provider in the posted headers, meaning that I would be getting results personalised to my geographic area.
Somebody is blocking my proxy connection, because it seems to keep timing out. Getting ‘proxy-connection’: ‘close’ back in my headers returned by Requests, it seems that either Google, Requests or Imagine (my ISP) are blocking the connection to the proxy.
Updated my Adwords API application to include a credit card, hopefully that will speed up the application process.
It now appears that the issue with connecting to the proxy has to do with my local settings, I tried configuring the proxy with Chrome and Firefox via my “Lan settings” and no dice, the proxy didn’t ask for my authentication so I contacted support to see what the issue might be, I’m guessing something to do with my network or ISP settings.
11th January 2012: Having only decided to start a proper logbook today, I figure I’ll update it to where the project is at.
This should be one of my longer logs as it covers a lot of what I have done in the last few weeks.
Currently I have over 300 ideas for potential factors worth testing, some are untestable in the near future, others have been tested in similar studies before and many are relatively new ideas and have not been tested in correlation studies. I have them all written down on scraps of paper here in my room and as notes on my iPhone, probably not a great idea but it would take too long to digitize them.
I have devised a relatively complete strategy for how to tackle the problem of scraping Google search results. Using a combination of the Python Requests module, private proxies, time stops, user agent changes, HTTP header customisation and some HTML parsing with Beautiful Soup I am confident that I will be able to parse Google, Bing, etc undetected while preserving the integrity of the results.
I have completed multiple tests without the proxies and user agent changes. I received free trial access to a proxy service I am considering today and have played around with it for an hour or so. Having problems with some Requests and other errors, so I contacted Stephen Young, who helped me out with some of the bugs I was having with proxy browsing, but there still remain some errors to iron out.
Website www.theopenalgorithm.com has been set up and is ready to handle the content production and traffic when I publish the results.
Still have doubts as to the theme eDegree that I am using, as some of the back-end is shaky and images (particularly important when displaying lots of data) aren’t very well supported. Due to the effort of creating and customizing the site I won’t be changing in the short run, but I hope to change to a cleaner theme/skin in a few months’ time.
Seriously considering applying for The Google Science Fair with this project as no major work has started and it launches tomorrow.