Most SEOs understand that on page factors have and will continue to decline in importance within the Google algorithm and SEO.
We all know that Google takes no notice of the meta keywords tag, and little notice of other tags, markups and HTML structures which has been backed up with my data.
But “in content” factors i.e. ranking signals related to the actual text/content on a page, would appear to be a separate matter.
Spamming the meta description doesn’t hurt a user’s experience but spamming the actual content the user sees is detrimental to the user and therefore not worth doing, right? Surely damaging the user’s experience isn’t worth those few extra visitors from Google, seen as they will probably convert less due to the comparatively less helpful content?
It sounds good and makes sense, and as a result we in the SEO community have come to the conclusion that while Google might ignore those other on page factors, they probably have some really smart methods to figure out what a page is about and the quality of the content on that page.
I decided to look at 5 really basic factors that you would think may feature in some element of the Google algorithm or would be closely related to a ranking factor.
While I understand there are better and more advanced methods (LDA, TF*IDF) for comparing the content on a page to a given keyword or judging the quality of a given piece of content I tested these really basic factors to judge Google’s likely weighting of “in content” factors.
I may test more advanced topic modelling algorithms/factors in the future but for now I have stuck to some old information retrieval reliables.
Special thank you to Mike Tung of Diffbot for providing me with free access to their article API which is undoubtedly the best text extraction service/algorithm out there. And as if the solidify that point congrats to the team on their recent 2 million dollar funding round.
This data is based on a dataset of over 1.2 million web pages from over 12,000 unique keywords and the correlations are derived from Spearman’s Rank Correlation Coefficient.
Images and Videos
Having images and videos on a page is generally accepted to be good for the user. What I was interested in, was seeing whether this translated into increased Google ranking or not, and apparently it doesn’t.
Assuming Google likes pages with images and videos (large but reasonable assumption), this is quite a good test of how low-level Google are willing to go to promote pages that are in line with what they want to see i.e. are Google willing to reward these pages in the algorithm or do they prefer the great PageRank algorithm to help them identify what users like and want.
Page Contains Bad Words
This is another fascinating test, I checked whether a page contains any bad word from this list.
I thought that using these naughty words would surely relegate you from the search results or at least lower your ranking, but the data doesn’t support that.
Unfortunately there is a cautionary note, Google likely run a somewhat more advanced algorithm over content, instead of checking for just the presence of these bad words they probably look at their likely intent and as a result either ban a page or give it no penalty.
A news article quoting a foul mouthed sports star shouldn’t be banned from the search results because of its harmless and informative intent.
Because of this the pages that weren’t excluded from the search results and therefore likely received no penalty were the only ones to show up in my dataset.
As a result it would be unfair to draw conclusions regarding Google’s implementation of bans/penalties towards pages using these bad words.
Is the Keyword Even in the Content?
Most of us would think that this test is a sure thing. Forget keyword density, but surely having the keyword in the content of a page is absolutely vital to that page ranking well for that keyword but again the data says otherwise.
How can this be? The most basic and obvious steps for checking whether a page is about a keyword is to check whether the page contains that keyword, how else would Google narrow down their massive index into something more manageable?
Well there is anchor text, meta description, title tags and many other areas that Google may look at to check whether a page is about a keyword or not.
But what this astoundingly low correlation suggests is not only that Google likely doesn’t implement such a factor (when ranking pages) but also that Google probably isn’t using other super-advanced topic modelling algorithms, as most of these algorithms are based on the assumption that the keyword is in the content and all of them are based on the assumption that there is textual content.
Distance to 1st Keyword Match
I was a little more sceptical about this factor correlating well and rightly so. This old-school factor might have been in use in the days of Alta-Vista but most of us would agree its not so likely to be around any more.
While other topic modelling algorithms might correlate higher than the above factors most of them are based on the simple assumption that a page contains the keyword you are trying to model for and that there is textual content, which are dangerous assumptions for Google to make.
Nobody can make a blanket statement like, “Google don’t analyse what you write and don’t care what’s in the content of a page” but the data does point us in that direction.
If you are to accept the theory that Google don’t take such a low-level and specific view of pages or at least don’t weight such a view very highly then it is easy to come up with reasonable justification for that theory.
For example if Google takes such a low-level view then how does it understand infographics or video blogs? How will such an algorithm scale with the web as it evolves further into a multi-media and not just a textual internet?
I don’t believe the data is in anyway conclusive and I do believe that other “clever” topic modelling algorithms may correlate well with ranking but whether or not that means Google implements such factors within their algorithm is another debate.
What I will say is that I believe Google most likely take a much higher-level view of pages than we think, using links, social media, PageRank and other more scalable factors to determine the relevance and quality of a web page rather than looking at on page or in content factors.
As a result I would recommend that all webmasters create content, title tags and web pages with the user and not the search engine in mind and optimize for other far more scalable factors like link building, social media, page loading speed, etc.