Interesting Google Parse Data

I recently completed an automated parse of Google for 12,573 keywords. I extracted the top 100 results for each keyword, which came to around 1.2 million web pages across 342,740 unique domains.

This is the dataset I am working off of for the initial round of The Open Algorithm Project.

Over the last two days I’ve been poring over and cleaning up the dataset, and I have seen some really interesting stuff: things you don’t notice when you do individual searches, even if you dig into the HTML of a search result page.

What follows is an eclectic mix of things that I and my programs spotted and thought you might find interesting too.

Most Popular Domains

There were in total 342,740 unique domains. I’ve put together an Excel document with all the domains and the number of times each one showed up as a search result.

You can check if your site is in my dataset and contributing to the project.

Here’s my list of the 50 most popular sites in Google by the number of times they showed up in the search results. That’s the number of search results they accounted for, not the number of keywords they showed up for, i.e. a site could show up multiple times for one keyword.
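The difference between counting results and counting keywords can be sketched in a few lines of Python. This is only an illustration, not the code used for the parse: the `rows` sample and the `count_appearances` name are made up, and it assumes each result has already been reduced to a (keyword, domain) pair.

```python
from collections import Counter

def count_appearances(rows):
    """Return (results_per_domain, keywords_per_domain) for (keyword, domain) rows."""
    # Every row is one search result, so this counts total appearances.
    results_per_domain = Counter(domain for _, domain in rows)
    # De-duplicate (keyword, domain) pairs so each keyword counts once per domain.
    keywords_per_domain = Counter(domain for _, domain in set(rows))
    return results_per_domain, keywords_per_domain

# Hypothetical sample: one domain appears twice for the same keyword.
rows = [
    ("garmin", "garmin.com"),
    ("garmin", "garmin.com"),
    ("garmin", "amazon.com"),
    ("gps watch", "garmin.com"),
]
results, keywords = count_appearances(rows)
print(results.most_common(2))  # [('garmin.com', 3), ('amazon.com', 1)]
print(keywords["garmin.com"])  # 2
```

The list in this post corresponds to the first counter: a domain gets credit for every result, even when several of them come from the same keyword.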

Interestingly, the top 50 sites from the parse (which used US-based IP addresses) are very similar to the top 50 sites in the US as judged by Alexa.

Note: The count for Google.com includes universal search, which accounted for 7,542 results.

View it full size here.

Most Popular Sites

I wasn’t surprised that Blogspot, Tumblr and WordPress did well in the top domains data. But I figured that a lot of the search results they were ranking for were on subdomains that weren’t really under their control. So I went back to the database and this time extracted the results down to the subdomain or domain they were hosted on, i.e. the site, not the domain.

Google are sort of half and half on how they treat subdomains. I suspect they have some algorithmic feature to distinguish blogspot.com or wordpress.com style subdomains from blog.yoursite.com. They could be using more complicated algorithms, but I would imagine they are at least counting the number of subdomains, and the diversity of content topics across those subdomains, as a general guide to what’s associated with a root domain and what’s not.

You can download the full list of 383,638 sites with the number of search result appearances from this Excel file. As you can see, compared to the top 50 domains, WordPress, Blogspot, Tumblr, etc. have taken a hit.
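Here is a rough sketch of the domain vs. site distinction. It is not the extraction code used for the parse. The URLs are hypothetical, the function names are made up, and the "last two labels" rule for finding the root domain is a simplifying assumption that breaks on suffixes like .co.uk (a real implementation would consult something like the Public Suffix List).

```python
from collections import Counter
from urllib.parse import urlsplit

def site_of(url):
    """The 'site': the full hostname the page is served from, lowercased."""
    return (urlsplit(url).hostname or "").lower()

def naive_domain_of(url):
    """Rough 'domain': the last two labels of the hostname (assumption, see above)."""
    labels = site_of(url).split(".")
    return ".".join(labels[-2:])

# Hypothetical search result URLs.
urls = [
    "http://alice.blogspot.com/post-1",
    "http://bob.blogspot.com/post-2",
    "http://www.blogspot.com/",
]
by_domain = Counter(naive_domain_of(u) for u in urls)
by_site = Counter(site_of(u) for u in urls)
print(by_domain)  # all three results credited to blogspot.com
print(by_site)    # one result each for three separate sites
```

Grouping by site spreads a hosting platform's results across its individual blogs, which is why those platforms drop down the list.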

View it full size here.

Universal Search

I thought it might be interesting to extract when, where and how often Google shows Universal search results from images, news and products, which of course are all Google entities.

Image Search:

Number of times image results showed up (out of 12,573 keywords): 2,327

Modal/most common position: 1

Mean/average position: 8.01

Google News:

Number of times news results showed up (out of 12,573 keywords): 2,195

Modal/most common position: 11 (remember I extracted the top 100 results)

Mean/average position: 9.86

Products Search:

Number of times products search results showed up (out of 12,573 keywords): 3,020

Modal/most common position: 2

Mean/average position: 8.8
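For anyone wanting to reproduce these kinds of figures, here is a small sketch. The appearance counts and keyword total come from the numbers above. The `positions` list is hypothetical and only there to show how a modal position of 1 can sit alongside a much higher mean. `multimode` needs Python 3.8 or later.

```python
from statistics import mean, multimode

TOTAL_KEYWORDS = 12_573  # from the post

# Counts reported above for each universal search type.
appearances = {"images": 2_327, "news": 2_195, "products": 3_020}
share = {kind: n / TOTAL_KEYWORDS for kind, n in appearances.items()}
for kind, s in share.items():
    print(f"{kind}: {s:.1%} of keywords")

# Hypothetical positions, NOT taken from the dataset.
positions = [1, 1, 4, 11, 23]
modal_positions = multimode(positions)  # most common value(s)
mean_position = mean(positions)         # pulled upward by the deep results
print(modal_positions, mean_position)
```

A handful of results far down the top 100 is enough to drag the mean well away from the mode.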

Strange URLs

Update: It appears I was totally wrong on this one. While these strange URLs were indeed strange and were about a new Google feature, it wasn’t quite the feature I had in mind. Seer Interactive reported a change to the way Google formats its search results that now uses these strange URLs.

Out of the approximately 1.2 million URLs I extracted, I got back 185 really weird ones.

You can view the 185 of them here: http://www.theopenalgorithm.com/media/2012/02/Weird-Google-Search-Result-URLs.xlsx

All 185 URLs redirected to what would be a normal search result. There were only a handful of websites that were redirected to from the 185:

http://www.ncbi.nlm.nih.gov

http://www.google.com/publicdata

http://www.google.com/finance

http://www.cricbuzz.com

http://www.carlingcup.premiumtv.co.uk

http://www.ncaa.com

http://en.uefa.com

Pure speculation, but the lengthy parameters (examples in the Excel file) would suggest that Google is for some reason tracking the clicks on these results in some special way, potentially to track and test the reaction to a new algorithmic or other type of feature being trialled.

Interestingly, the CricBuzz URL redirects to www.cricbuzz.com/gl (gl presumably meaning Google), but not all clicks from Google to the CricBuzz site redirect to /gl. It seems as though CricBuzz are tracking clicks from specific types of Google search results (namely, live cricket scores), though for what purpose beyond internal use is again only speculation.

All of the results with these strange URLs are in some way unique. Take the Google public data result, which has a nice embedded graph:

Google Public Data search result

Or any of the NCBI results:

Google's treatment of health related search results

I’ve highlighted 3 unique features of this result.

#1 The snippet is pulled cleverly by Google directly from the page.

#2 The special health identifier image to the left of the result.

#3 The links to sections within the article.

Of the 185 strange URLs pulled from the 1.2 million, 170+ were from the NCBI site.

It would seem as though this site is being treated differently by Google. Maybe they recognise it as the expert in its field and are trialling some sort of improved indexing of expert health related sites?

A number that sprang to mind was 1%. As suggested by Steven Levy in his book about Google, In The Plex, about 1% of searches have some sort of trial or test running outside the regular algorithm. 185 out of roughly 12,000 searches isn’t far off 1%… food for thought.

Of course it could be something totally different. Maybe a Googler or SEO can shed some light on these intriguing results.

I would be interested to hear if other people doing Google search result parsing over a long period of time have seen these types of results, and if they are in some way connected to some future feature launch within the search engine.
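The back-of-the-envelope figure can be checked directly. This sketch divides the 185 strange URLs by the 12,573 keywords. It assumes each strange URL came from a different keyword, which the data above doesn't confirm, so treat the result as an upper bound on the share of searches affected.

```python
strange_urls = 185
keywords = 12_573

# Upper bound: assumes no keyword returned more than one strange URL.
share = strange_urls / keywords
print(f"{share:.2%}")  # 1.47%
```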

FTP and IP address

I wrote my programs to handle http:// and https:// web pages only. I had never noticed Google return any other type of page and assumed that’s all that would come back from my parse.

Turns out Google are happy to crawl and index more than just that. I’ve seen a couple of ftp:// pages returned e.g. search for “garmin” on Google.com, click through to around page 8-11 and you will see:

ftp://ftp-fc.sc.egov.usda.gov/WI/gistoolkit/garmin.pdf.

In fact, out of 1.2 million URLs, Google returned 22 ftp:// results, and all but one of them were PDFs.

Here’s a list of the 22.

Notably, a number of these seem to be meant to be private or only for a company intranet, so it’s impressive, if not worrying, that Google has found these documents.

Another strange one was a site returned only as an IP address with no domain name. Check it out: http://218.210.127.131/. The site’s domain name is realtek.com, but for some reason Google saw fit to return the IP instead of the domain name. Maybe because the site doesn’t work with www. in its domain name?
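If you are writing your own parser, a check along these lines would flag both kinds of oddity before they trip up code that expects http:// and https:// only. This is a sketch rather than the code used for the parse. The `host_is_ip` helper is a made-up name, and the sample URLs are ones mentioned in this post.

```python
import ipaddress
from collections import Counter
from urllib.parse import urlsplit

def host_is_ip(url):
    """True if the URL's host is a bare IPv4/IPv6 address rather than a domain name."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

urls = [
    "ftp://ftp-fc.sc.egov.usda.gov/WI/gistoolkit/garmin.pdf",
    "http://218.210.127.131/",
    "http://www.ncaa.com",
]
schemes = Counter(urlsplit(u).scheme for u in urls)
ip_hosts = [u for u in urls if host_is_ip(u)]
print(schemes)   # one ftp result, two http results
print(ip_hosts)  # ['http://218.210.127.131/']
```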

 

Hope you found the post interesting. If you spot anything fun or unusual in the data files I have linked to in this post, let me know in the comments section.

20 thoughts on “Interesting Google Parse Data”

      1. Joe Hall

         Oh ok cool….I know if you aren’t using a decent proxy service they will poison your data with un-sanitized results.

        On a completely unrelated note, look at that stupid Christmas avatar I am using. I really should update my disqus settings.

        1. Mark Collier

          They seem to be a quality service, I was referred to them by a friend and did some testing myself and it seems to be reasonable data.

          + the data above makes sense and it correlates well with the Alexa top sites which isn’t the holy grail, but certainly a good omen.

          I must get myself an avatar like that.

  1. Keenan Steel

    Good stuff, Mark. I do not believe the NCBI results are part of a test, but rather the way that Google displays information that Google pulls and uses to answer. You see the same URL referral string for “define [word]” or “weather [zip code]”. Interestingly, this is handled differently for a straight-up question like “who is the president of the united states”.

    In the case of medical/scholarly results, I believe that the trigger is uncommon medical jargon and clinical terms. PubMed has a special place in the health research sector; I’m not sure whether Google gives it preference specifically, or if any credible scholarly source could do the same with encyclopedia-like definition pages following a template.

    Very interesting work; I look forward to hearing more from you.

    1. Mark Collier

      Hey Keenan,

      Thanks for taking a look at the data and for your interesting thoughts. I agree that these URLs are seen elsewhere and you make a good point that it’s most likely not part of an individual test.

      But I still get the feeling that they are doing extra tracking here, maybe not to launch new stuff but to refine the old/get feedback.

      The UEFA one was particularly strange because there didn’t seem to be anything particularly unusual with the result, it just seemed to be a regular result (although it did have sitelinks under the title).

  2. Anonymous

    Hey Mark – I just ran across your “little” project.  Pretty cool.  More than that, it gets me really excited to see someone your age tackling a project of this size.  Hats off to you!  I can’t wait to see what comes of it.

    Also, thanks for the tip on trustedproxies.com.  That may come in handy at some point when we want to spin up an automated way to check various results.  

    1. Mark Collier

      Hi Jon

      The parse was done over the last week, with most of it done last Sunday. Remember that being in 25% of search results means being 1 of 100 results. I haven’t got the data on product searches collected as one. At some point in the future I’ll post the full database or a file containing the info you want and I’ll let you know.

  3. Michael Martinez

    I am curious about how you selected the keywords.  If you’re using popular keywords from the AdWords suggestion tool your analysis will be biased toward heavily optimized/competitive queries rather than natural queries.  That was the mistake SEOmoz repeatedly made with their correlation studies.  You need as random a selection of keywords as possible.

    1. Mark Collier

      I did use the Google Adwords Keyword Tool.

      I selected 800 keywords (that’s the max given by GAKT but some of the categories didn’t return the full 800) from each of the 22 categories, removed duplicates and that became 12,576.

      I feel I have a relatively fair distribution across query competitiveness although I would have preferred some more long tail queries which I may aim for in future parses.

      Here’s the keyword distribution by search volume and # of keywords in that volume: http://www.theopenalgorithm.com/media/2012/01/keywords-data-table.png

  4. Anonymous

    Hey Mark,

    I’ve been thinking about your project some more…

    The more I think about it, the cooler I think it is.  I’d love to load your test data up into my local database and play with it a bit.  Are you willing to share the raw data for the 1.2M results?

    If I learn anything interesting from running sql queries against it I’d be happy to share what I learn.

    1. Mark Collier

      Really good idea. 

      I’m entering this project as part of the Google Science Fair 2012 (deadline 1st of April). So I want to have the first set of results in by then.

      But after that I plan to run a larger parse of more keywords so I can segregate them by category, intent, search volume, etc. So I and others can run more interesting tests based on these categories and the overall algorithm.

      But I only want to open up the data when I have the time and man power to manage contributions and have a significant enough dataset to open up the data to the public.

      I wasn’t expecting such a strong and positive reaction until I published my initial results so I want to do that first and then start notching it up.

      When I have published my initial findings and tested various models I am going to publish a full larger dataset and smaller datasets divided in the above categories.

      I’ll also put up tasks I’m working on and others are working on so there aren’t duplicate tests running at the same time.

      I’ll publish a post on it at the time I post the data.

      Great idea.

      1. Anonymous

        Sounds great.  Focusing on the initial objective is always a good thing (Google Science Fair deadline in this case).  :-)

        Posting an even larger raw dataset at some point would be awesome.  But I’d really enjoy just poking around the initial 1.2M records.  I totally understand though if you want to do your initial analysis before making the raw dataset public.  If I can be of help in any way, just give me a holler.  twitter: @ronclabo

        1. Mark Collier

          Yeh cool man,

          I’m going to get my data and results published for the Google Science Fair deadline and then 2 weeks after hopefully I’ll publish the original 1.2m dataset and a larger one.

    1. Mark Collier

      Yo dude,

      I do plan to run regular (quarterly/bi-yearly) tests to verify the results and see the evolution in the Google algorithm.

      But I hope to advance the tests, to look at more factors over larger data sets.

      Plus rather than just look at correlations I hope to prove causation. That’s a really hard thing to do on a scientific level.

      First you have to get a scientifically significant dataset i.e. sign up a bunch of websites to give you access to their site and test things out on their domain.

      Plus you have to make sure you’re only testing one variable/factor at a time and you have to give each factor enough time for Google to notice the change and you have to track all this.

      Not to mention the fact that if you sign up 1,000 websites for the experiment you have to be able to make changes to those 1,000 sites instantly.

      How do you reduce the loading time of 1,000 sites? It’s a tough but not impossible challenge that a lot of in-house SEOs at big firms overcome every day. So it’s really just a case of learning what needs to be done.

  5. عبد الإله خياطي

    Hello there,
    Please, I need some help if you can, of course.
    I’m working on a project and I need a dataset of snippets from Google. I tried to find the data but I couldn’t find it anywhere. Can you please tell me how I can download it or just make it myself?
    I will appreciate your help.
    Regards.

