I recently completed an automated crawl of Google for 12,573 keywords. I extracted the top 100 results for each keyword, which gave me around 1.2 million web pages across 342,740 unique domains.
This is the dataset I am working off of for the initial round of The Open Algorithm Project.
Over the last two days I’ve been poring over and cleaning up the dataset, and I have seen some really interesting stuff: things you don’t notice when you do individual searches, even if you dig into the HTML of a search result page.
It’s an eclectic mix of things that I and my programs spotted and thought you might find interesting too.
Most Popular Domains
There were in total 342,740 unique domains. I’ve put together an Excel document with all the domains and the number of times each one showed up as a search result.
You can check if your site is in my dataset and contributing to the project.
Here’s my list of the 50 most popular sites in Google by the number of times they showed up in the search results. That’s the number of search results they accounted for, not the number of keywords they showed up for, i.e. a site could show up multiple times for one keyword.
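To make the difference between the two counts concrete, here is a minimal sketch. The `(keyword, position, domain)` row layout and the sample rows are hypothetical; the post does not describe the real database schema.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows: (keyword, position, domain).
# The real dataset's layout is an assumption made for illustration.
results = [
    ("cheap flights", 1, "expedia.com"),
    ("cheap flights", 2, "expedia.com"),
    ("cheap flights", 3, "kayak.com"),
    ("flight deals", 1, "kayak.com"),
    ("flight deals", 2, "expedia.com"),
]


def count_appearances(rows):
    """Count every search result a domain accounts for (the metric used in the list)."""
    return Counter(domain for _keyword, _position, domain in rows)


def count_keywords(rows):
    """Count the distinct keywords each domain shows up for at least once."""
    keywords_by_domain = defaultdict(set)
    for keyword, _position, domain in rows:
        keywords_by_domain[domain].add(keyword)
    return {domain: len(kws) for domain, kws in keywords_by_domain.items()}


appearances = count_appearances(results)
keywords = count_keywords(results)

# Print domains from most to fewest appearances, with their keyword counts.
for domain, total in appearances.most_common():
    print(domain, total, keywords[domain])
```

In this made-up sample, expedia.com accounts for three results but only two keywords, because it shows up twice for one of them.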
Interestingly, the top 50 sites from the parse (which used US-based IP addresses) are very similar to the top 50 sites in the US as judged by Alexa.
Note: The count for Google.com includes universal search results, which accounted for 7,542 results.
View it full size here.
Most Popular Sites
I wasn’t surprised that Blogspot, Tumblr and WordPress did well in the top domains data. But I figured that a lot of the search results they were ranking for were on subdomains that weren’t really under their control. So I went back to the database and this time extracted the results down to the subdomain or domain they were hosted on, i.e. the site, not the domain.
Google are sort of half and half on how they treat subdomains. I suspect they have some algorithmic feature to distinguish blogspot or wordpress.com style subdomains from blog.yoursite.com. They could be using more complicated algorithms, but I would imagine they are at least counting the number of subdomains, and the diversity of content topics across those subdomains, as a general guide to what’s associated with a root domain and what’s not.
You can download the full list of 383,638 sites with the number of search result appearances from this Excel file. As you can see, compared to the top 50 domains, WordPress, Blogspot, Tumblr, etc. have taken a hit.
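For anyone who wants to try the same site vs. domain split, here is a rough sketch. It is not the code I ran: the URLs are hypothetical, and `root_domain_of` uses a naive "last two labels" rule that is only an assumption for illustration (it gets suffixes like .co.uk wrong, so a real pipeline would need the Public Suffix List).

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Return the full host a URL is served from, e.g. 'alice.blogspot.com'."""
    return (urlsplit(url).hostname or "").lower()


def root_domain_of(url):
    """Naive root domain: the last two labels of the host.

    Illustrative assumption only. This is wrong for suffixes such as
    .co.uk; a real pipeline would consult the Public Suffix List.
    """
    labels = site_of(url).split(".")
    return ".".join(labels[-2:])


# Hypothetical result URLs.
urls = [
    "http://alice.blogspot.com/post",
    "http://bob.blogspot.com/recipes",
    "http://blog.yoursite.com/news",
    "http://www.yoursite.com/",
]

by_domain = Counter(root_domain_of(u) for u in urls)
by_site = Counter(site_of(u) for u in urls)

# Number of distinct hosts seen under each root domain, the kind of
# signal speculated about above.
hosts_per_domain = defaultdict(set)
for u in urls:
    hosts_per_domain[root_domain_of(u)].add(site_of(u))

print(by_domain)
print(by_site)
print({d: len(h) for d, h in hosts_per_domain.items()})
```

Counted by domain, blogspot.com gets credit for both blogs; counted by site, each blog stands on its own, which is why the hosted-blog platforms drop down the list.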
View it full size here.
I thought it might be interesting to extract when, where and how often Google shows Universal search results from images, news and products, which of course are all Google entities.
Number of times image results showed up (out of 12,573 keywords): 2,327
Modal/most common position: 1
Mean/average position: 8.01
Number of times news results showed up (out of 12,573 keywords): 2,195
Modal/most common position: 11 (remember I extracted the top 100 results)
Mean/average position: 9.86
Number of times products search results showed up (out of 12,573 keywords): 3,020
Modal/most common position: 2
Mean/average position: 8.8
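The three figures reported for each result type can be computed along these lines. This is a minimal sketch with made-up positions, assuming one recorded position per keyword where that result type appeared.

```python
from statistics import mean, multimode


def summarize_positions(positions):
    """Summarize where a universal result type appeared.

    `positions` is assumed to hold one rank (1-100) per keyword for which
    that result type showed up; this layout is an assumption.
    """
    return {
        "appearances": len(positions),
        # multimode returns every most-common value, so ties are visible.
        "modes": multimode(positions),
        "mean": round(mean(positions), 2),
    }


# Hypothetical positions, not taken from the real dataset.
sample_positions = [1, 1, 4, 11, 23]
summary = summarize_positions(sample_positions)
print(summary)
```

A low mode with a noticeably higher mean, as with the image results, just means the block usually sits at the top but occasionally turns up much deeper in the results.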
Update: It appears I was totally wrong on this one. While these strange URLs were indeed strange and were related to a new Google feature, it wasn’t the feature I had in mind. Seer Interactive reported a change to the way Google formats its search results, which now uses these strange URLs.
Out of the approximately 1.2 million URLs I extracted, I got back 185 really weird ones.
You can view all 185 of them here: http://www.theopenalgorithm.com/media/2012/02/Weird-Google-Search-Result-URLs.xlsx
All 185 URLs redirected to what would be a normal search result. The 185 redirected to only a handful of websites:
This is pure speculation, but the lengthy parameters (examples in the Excel file) would suggest that Google is for some reason tracking the clicks on these results in some special way, potentially to track and test the reaction to a new algorithmic or other type of feature being trialled.
Interestingly, the CricBuzz URL redirects to www.cricbuzz.com/gl (gl presumably meaning Google), but not all clicks from Google to the CricBuzz site redirect to /gl. It seems as though CricBuzz are tracking clicks from specific types of Google search results (namely, live cricket scores), though for what reason, other than internal use, is again only speculation.
All of the results with these strange URLs are in some way unique. Take the Google public data result, which has a nice embedded graph:
Or any of the NCBI results:
I’ve highlighted 3 unique features of this result.
#1 The snippet is pulled cleverly by Google directly from the page.
#2 The special health identifier image to the left of the result.
#3 The links to sections within the article.
Of the 185 strange URLs pulled from the 1.2 million, 170+ were from the NCBI site.
It would seem as though this site is being treated differently by Google. Maybe they recognise it as the expert in its field and are trialling some sort of improved indexing of expert health related sites?
A number that sprang to mind was 1%. As Steven Levy suggests in his book about Google, In The Plex, about 1% of searches have some sort of trial or something being tested outside the regular algorithm. 185 results across 12,573 searches works out to about 1.5%, which isn’t far off 1%…food for thought.
Of course it could be something totally different. Maybe a Googler or SEO can shed some light on the intriguing results.
I would be interested to hear if other people doing Google search result parsing over a long period of time have seen these types of results, and whether they are in some way connected to some future feature launch within the search engine.
FTP and IP address
I wrote my programs to handle http:// and https:// web pages only. I had never noticed Google return any other type of page and assumed that’s all that would come back from my parse.
Turns out Google are happy to crawl and index more than just that. I’ve seen a couple of ftp:// pages returned, e.g. search for “garmin” on Google.com, click through to around page 8-11 and you will see:
In fact, out of 1.2 million URLs, Google returned 22 ftp:// results, and all but one of them were PDFs.
Notably, a number of these seem to be meant to be private or only for a company intranet, so it’s impressive, if not worrying, that Google has found these documents.
Another strange one was a site returned only as an IP address with no domain name. Check it out: http://18.104.22.168/. The site’s domain name is realtek.com, but for some reason Google saw fit to return the IP instead of the domain name. Maybe because the site doesn’t work with www. in its domain name?
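If you are writing your own parser and want to flag these oddballs up front, something like the following would do it. This is a sketch rather than the code behind my parse; apart from the IP address URL above, the sample URLs are hypothetical.

```python
import ipaddress
from collections import Counter
from urllib.parse import urlsplit


def scheme_of(url):
    """Return the URL scheme, e.g. 'http', 'https' or 'ftp'."""
    return urlsplit(url).scheme.lower()


def is_ip_host(url):
    """Return True when the host part is a bare IP address, not a domain name."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


# Sample URLs: only the last one comes from the post, the rest are made up.
urls = [
    "http://www.example.com/",
    "https://example.org/page",
    "ftp://ftp.example.com/manual.pdf",
    "http://18.104.22.168/",
]

scheme_counts = Counter(scheme_of(u) for u in urls)
unexpected = [u for u in urls if scheme_of(u) not in ("http", "https")]
ip_hosted = [u for u in urls if is_ip_host(u)]

print(scheme_counts)
print(unexpected)
print(ip_hosted)
```

Checking the scheme before fetching means a program written for http:// and https:// can log anything else for a manual look instead of failing on it.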
I hope you found the post interesting. If you spot anything fun or unusual in the data files I have linked to in this post, let me know in the comments section.