WARNING: this is super geeky, but you do need to know this, so here goes: Canonicalization is a URL convention that exposes a flaw in the way modern search engines index websites. If you learn to exploit the flaw, the page rank and traffic for both your website and blog can soar; if not, your website can flounder. This first part will explore the flaw and some of its impacts. The second part will go into the gory detail of how the wrong canonicalization can literally kill a website and how to prevent that.
What is Canonicalization (C14N)
So, beyond a serious point scorer in Scrabble, what exactly is it and why do I care? Canonicalization, according to Wikipedia, is the process of converting data that has more than one possible representation into a "standard" canonical representation. A more concise description of how it relates to the web is Matt Cutts's explanation: canonicalization is the process of picking the best url when there are several choices, and it usually refers to home pages.
C14N is an issue – it is a source of confusion to search engines
Let’s jump right into an example. Search engines read the following urls as if they are totally different websites:
http://technorati.com/
http://www.technorati.com/
You see the exact same thing when you go to these different urls, right?
Now run a “site:” against each of those in Google:
Here’s the link to my results:
Site:http://technorati.com/
You get 436,000 page results
Site:http://www.technorati.com/
You should get 299,000 page results
As you can see, Google was clearly confused by the small difference in the canonicalization of those 2 urls. They should have returned the same number of results, but they didn’t: while many of the pages in the two sets of results had the exact same content, Google saw them as two different pages on two different sites. This shows that what we as users know to be one website, Google believes is two.
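For the geeks who want to see what "picking one form" looks like, here is a minimal Python sketch that folds the www and non-www spellings of a url into a single canonical one. The function name canonicalize and the choice of www as the preferred form are my own assumptions for illustration; this is not how Google does it, just the idea in miniature.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url, prefer_www=True):
    """Return url with a lowercase host in the preferred www/non-www form."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Strip a leading "www." so both spellings start from the same bare host.
    bare = host[4:] if host.startswith("www.") else host
    host = "www." + bare if prefer_www else bare
    if parts.port:
        host = f"{host}:{parts.port}"
    # Treat an empty path as the home page.
    path = parts.path or "/"
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))

a = canonicalize("http://technorati.com/")
b = canonicalize("http://www.technorati.com/")
print(a == b)  # True: both spellings collapse to one url
```

Whichever form you prefer matters less than picking one and using it everywhere.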
What is the impact of inconsistent canonicalization?
Because you have 2 different sets of indexed pages for the same site, some traffic will go to the www address while other traffic will go to the non-www addy. What this shows is that subdomains matter. You need to keep your blog and website on a common subdomain to keep all pages and traffic in a place where Google and Alexa can index and measure it.
Duplicate content filters and canonicalization
Think about it. If Google thinks these results come from two different sites, how do you think the duplicate content filters will respond? Exactly: when you splinter your subdomains and serve the same content on each of them (it is inadvertent, it just happens), it can trip Google’s duplicate content filters. Penalized for your own content on your own site. That stings, doesn’t it?
Non-uniform URLs caused by C14N mean you have two sites with different traffic and page rank stats fighting one another in the SERPs
I think no one will dispute the fact that page rank and SERP position are related to inbound and outbound links, so what happens when you have a www domain and a non-www domain? You will get a certain number of links to one and also links to the other, but it is still the same website. Since there is an imbalance in linking, you will get different page ranks for the same site.
Consider the cost: a user does a search for “the next big thing” and Google’s index has this listed in two different places, as demonstrated by running the site: query, so you are now essentially fighting yourself for a search result. Wouldn’t it have been better if they were always in the one index?
BTW – As nice as it may be, your web server is not terribly smart, and it reports the hits to www and non-www as two separate domains, so now you have to weed through the logs to find your true hits.
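One way to get your true hits is to fold the two hostnames together before you count. Here is a minimal Python sketch, assuming you have already pulled (host, path) pairs out of your logs; the merge_hits name and the example.com sample data are made up for illustration, and real log formats vary by server.

```python
from collections import Counter

def merge_hits(hits):
    """Tally hits per page, treating www.example.com and example.com as one site."""
    totals = Counter()
    for host, path in hits:
        host = host.lower()
        # Drop a leading "www." so both hostnames land in the same bucket.
        if host.startswith("www."):
            host = host[4:]
        totals[(host, path)] += 1
    return totals

# Hypothetical sample: the home page was hit once on each hostname.
sample = [
    ("www.example.com", "/"),
    ("example.com", "/"),
    ("example.com", "/blog"),
]
totals = merge_hits(sample)
print(totals[("example.com", "/")])  # 2
```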
Now, let’s recap: search engines treat sub-domains differently, and in a way that can cause uneven traffic and a lack of visitors to the primary site, AKA www. So, a search engine sees your one site as two different ones. So what? They are both still you, right? Hold up: it means you have two different page ranks and two different sets of traffic statistics.
OK, let’s stop there… For all of you that actually got down to this part, I am holding up my Secret Decoder Dork Ring and saying Wonder Geeks Activate!
In the next part, I’ll go through what all this really means, how it can kill your website if you have a blog on another subdomain, and how to fix it easily.
Mary holds up her decoder and says "Wonder Geeks Activate!" as I stay tuned... Luckily I have read Matt Cutts's explanation a few times so I am not totally lost here....