What I Learnt About 342,740 Domains

I recently ran 12,573 keywords through Google, extracting the top 100 results per keyword. After cleaning up the data I was left with over 1.2 million web pages and 342,740 unique domains.

For the last week or so I have been looking for interesting information within this mountain of data.

I published data on top domains, sites, Google’s use of Images, Products and News results and some strange URLs I noticed.

This data is part of my project to bring more science to SEO, initially by doing a correlation study into Google’s algorithm.

Domain Data

I was looking into domain-related data and I spotted some interesting patterns, nothing groundbreaking but just some stuff you might find cool.

I should have some domain-related correlation data out later this week, so this is an insight into the domain dataset.

The domain name and the ending for that domain is a really important choice for a new webmaster. When you make that choice you're choosing your brand for life, plus it's a super important decision from an SEO point of view.

There are a couple of considerations to take into account: the user and the search engine. The correlation data I’ll show you later this week should take care of the SEO point of view.

But from a user’s point of view you obviously want a domain that’s memorable, easy to type and easy to link to. And that in and of itself is a big factor in SEO. If people are linking to the wrong domain you’re losing out on valuable link juice. If people can’t type it then you’re losing out on type-in traffic, and if users can’t share it then say goodbye to some social media clicks.

Domain TLDs

The domain ending or TLD (Top Level Domain) is probably the most important part of a website’s address.

There is a technical difference between a TLD and a domain ending. For example, .uk is a TLD but .co.uk is not; it’s technically a second-level domain under the .uk TLD. But webmasters and users don’t care about technical definitions, so I’m going to treat them as the same for the purposes of this article.

If you have a really catchy, social, SEO-perfect site name, it’s useless unless you have the right ending to go with that great name.

Turns out not that many people are typing www.greatname.washingtondc.museum.

I extracted the domain endings for all the sites in my data and collected a neat list of all the domain endings I could identify in these 1.2 million URLs.

I used a list of all known TLDs from the Mozilla crew (I think?), but I can’t find the link so if you know the list I’m talking about please post the link in the comments section.

Luckily I downloaded and cleaned up the list into a nicely formatted text file so you can iterate through and check for matches if you're running tests yourself.

Update: Thanks to Kris Turkaly who left a link to the list in the comments: http://publicsuffix.org/list/
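If you want to try the matching yourself, here is a minimal sketch of one way to do it in Python: check each host against a set of known endings and keep the longest one that matches, so that .co.uk wins over .uk. The names `KNOWN_ENDINGS` and `split_host` are made up for this example, the six endings are just a stand-in for the full list, and the sketch ignores the wildcard and exception rules the real list contains. It is not the exact script used for this post.

```python
# Hypothetical stand-in for the cleaned-up list of endings; the real list is far longer.
KNOWN_ENDINGS = {"com", "org", "uk", "co.uk", "ie", "us"}


def split_host(host, endings=KNOWN_ENDINGS):
    """Return (rest_of_host, ending) for the longest known ending, or (host, None)."""
    labels = host.lower().rstrip(".").split(".")
    # Start with the whole host and drop one leading label at a time,
    # so the longest candidate ending is tested first.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in endings:
            return ".".join(labels[:i]), candidate
    return host, None


print(split_host("www.example.co.uk"))
print(split_host("example.com"))
```

Testing the longest candidate first is the design choice that matters here: checking only the last label would file every .co.uk site under .uk.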

After running my scripts over the data it turns out there were 437 different domain endings in the dataset.

Thinking about it, that’s a pretty small number of TLDs for 1.2 million URLs, but as you will see there is huge dominance by just 3 domain endings.

I ranked them in order of number of sites out of the 342,740 that had that TLD. Here’s a handy Excel list of all 437 TLDs in that descending order.
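For readers who want to build a similar ranking from their own data, a rough sketch follows. It assumes you already have (domain, ending) pairs, one per URL, and it counts each domain only once however many URLs it contributed. The function name `rank_endings` and the sample pairs are invented for illustration; they are not taken from the dataset.

```python
from collections import Counter


def rank_endings(domain_ending_pairs):
    """Rank endings by number of unique domains, most common first."""
    # A set drops repeats, so a domain with many URLs is counted once.
    unique_pairs = set(domain_ending_pairs)
    counts = Counter(ending for _, ending in unique_pairs)
    # Sort by count (descending), then alphabetically to break ties predictably.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented sample: "a.com" appears twice but should count once.
sample = [
    ("a.com", "com"),
    ("a.com", "com"),
    ("b.com", "com"),
    ("c.org", "org"),
]
for ending, total in rank_endings(sample):
    print(ending, total)
```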

And a nice graph of the top 5 TLDs:

(You can hover over each bar with your mouse and you’ll get the exact numbers)

It was hard to see some of the smaller TLDs, so here’s the next 45 (blown up and zoomed in), with .us repeated for reference, i.e. the 5th-50th most popular extensions. Even within this subset there’s a really huge drop-off from the top of the list. Combine that with the top 5 domain endings and you see a gigantic dominance of the top 2 or 3.

Of course this dataset isn’t designed to find the most popular TLDs, but it probably gives a pretty good idea of what users are used to.

Realistically .com and .org are the only global domain extensions you should be going for from a user’s point of view. And even if you own a .org you should be on the lookout for the .com variation.

Domain Length

I thought it would be interesting to see the distribution of domain lengths, so I counted the length in characters of all 342,740 sites without the domain ending.

Out of interest, the average domain name length was 14.75 characters, but the most common length was 8 characters, with around 1 in 12 sites being exactly that long.
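If you want to compute the same kind of figures for your own list, something like the sketch below would do it. It assumes the names have already had their domain ending stripped; whether a leading www or other subdomains should count towards the length is a choice you have to make yourself. The function name `length_stats` and the five sample names are made up for the example.

```python
from collections import Counter
from statistics import mean


def length_stats(names):
    """Return (average length, most common length, share of names at that length)."""
    lengths = [len(name) for name in names]
    distribution = Counter(lengths)
    # Pick the length with the highest count; on a tie, prefer the shorter length.
    most_common, count = max(
        distribution.items(), key=lambda item: (item[1], -item[0])
    )
    return mean(lengths), most_common, count / len(lengths)


# Invented sample names, endings already removed.
sample_names = ["google", "bbc", "wikipedia", "facebook", "example1"]
average, most_common, share = length_stats(sample_names)
print(average, most_common, share)
```

An average sitting well above the most common length, as in the numbers above, is the pattern you would expect when a minority of very long names pulls the mean upwards.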

Again you can hover over each circle for exact stats.

Here’s an Excel file based on the above graph with the domain lengths and number of domains with the corresponding length.

 

This post is pretty good at showing what users are used to and are likely to accept. Of course common sense is still required. For example, if you're in Ireland, Irish users are very much at home with .ie domain names, so grabbing both that and the .com domain name might be a good idea.

Again, if you find a really great, catchy name that’s longer than normal then go for it and if you find a short one that ticks all the boxes go for it too.

Hope you enjoyed this post and stay tuned for some correlation data later this week.

11 thoughts on “What I Learnt About 342,740 Domains”

    1. Mark Collier

      Thanks Alexsandro,

      Yes I’m using a Python script that I wrote and the http://www.TrustedProxies.com Google Extractor proxy service.

      I looked at the Google Custom Search API, but it was quite expensive, there are quite a few restrictions, it might not be as reliable in terms of matching normal search results, etc. so I decided to use the Python script.

        1. Mark Collier

          I’m planning a post about how to do an automated crawl of Google. In fact I had recorded a video but now that Google have changed their SERP URLs (seerinteractive.com/blog/google-scraper-in-google-docs-update) I have to update the script and post.

          When I have that done I’ll post it online with a general methodology for scraping Google so users coding in other languages can use the same process.

          1. Alexsandro

            Ok, thank you.
            So you took this JavaScript, which runs under Google Docs, and converted it to Python. Nice!!

            I will convert it to C#.

            I was not sure if people were crawling Google's results pages.

            But, seeing this, you are crawling results, good!

            Now, I am thinking about performance. I know, it depends on our and TrustedProxies' internet connection quality. I will look into this.

            Thanks Mark, see ya!

          2. Mark Collier

            Great stuff. When you have your conversion complete, would you mind emailing me? Then when I post the Python script and the video I can include the C# version.

            It would be great for readers who don’t know Python.

            I’d be more than happy to credit you with the script and link to your site.

            Thanks

  1. Kris Turkaly

    Hey Mark, the TLD list I believe you mention is http://publicsuffix.org/. I’m currently working on an Excel plugin to parse URL strings – it’s amazing that there’s (to my knowledge) no way to parse out TLDs without comparing the string to a known list. But anyhow, enjoyed the post (my first time here) and will be back to read more.

    1. Mark Collier

      Hey Kris, 

      Yes, very well spotted on the TLD list. That’s the one I used, and I’ve updated the post accordingly.

      I agree, it's sort of strange there’s no other/better way to separate them.

      Feel free to Tweet me the Excel plugin when it's done and I’ll reTweet it.

      Glad to hear you like the blog.

      Thanks

      Mark

  2. Rick Noel, eBiz ROI, Inc.

    Great post Mark. I love it when practitioners apply science to SEO. I appreciate you sharing your data sets also. It will be interesting to see how the new gTLDs impact SEO when they begin implementation later this year. For now, .com seems to rule. Thanks for sharing.

    1. Mark Collier

      Thanks Rick,

      Yes .com does seem to dominate at the moment. I’ll post some correlation data supporting this on Monday. Certainly will be interesting to see how the new TLDs fare.
