Finding and fixing canonicalization problems

by ian on July 14, 2010

Canonicalization is a really big word for a really simple concept: Every page on your web site should have a single Uniform Resource Locator, or URL. Canonicalization means ‘having one address and only one address for any one page of my web site’.

An example

This is a URL for a page on GibbleGibbet.com:
www.gibblegibbet.com

This is another URL for the same page:
www.gibblegibbet.com/index.html

And yes, this is yet another URL for the same page:
gibblegibbet.com

Link to your home page using all three different URLs, and you create a canonicalization problem.

So what’s the big deal? To you and me, those are all one page. They’re just tiny differences in the URL, right? No big deal. We understand they’re the same thing.

Alas, computer software – including search engines – isn’t as smart as you and I. It can only read the different URLs and take us at our word: If we use different URLs, we must mean different pages.

A search engine sees each of these URLs, crawls each one as a separate page on the web, and gives each one its own separate authority and relevance score. That causes a whole host of problems:

  1. The search engine won’t rank all three pages. There are only ten spots on the first page of search results, and search engines want to make good use of them. So they’ll pick one page and rank it.
  2. But, a search engine will still crawl all three page versions on your site. That uses up time the search engine could use crawling unique pages, so you end up with fewer ranking pages, and shallower indexing of your site.
  3. Worst of all, you lose link authority. If one blogger links to ‘gibblegibbet.com’, another links to ‘www.gibblegibbet.com’ and another links to ‘www.gibblegibbet.com/index.html’, they’re each effectively ‘voting’ for a different page. So, instead of 3 votes for my home page, I’ve got one vote each for three different pages. That’s a huge waste.
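
To make the vote splitting concrete, here’s a rough sketch in Python (my choice for illustration, not something the site depends on). The canonical_home helper is hypothetical and only knows about the two quirks in this example: a missing ‘www.’ and a trailing ‘/index.html’.

```python
from collections import Counter

# The three inbound links from the example, one per blogger.
inbound_links = [
    "gibblegibbet.com",
    "www.gibblegibbet.com",
    "www.gibblegibbet.com/index.html",
]

def canonical_home(url):
    """Map the home page variants from this example onto one address.

    Hypothetical helper: it only handles a missing 'www.' and a
    trailing '/index.html', nothing else.
    """
    url = url.rstrip("/")
    if url.endswith("/index.html"):
        url = url[: -len("/index.html")]
    if not url.startswith("www."):
        url = "www." + url
    return url

# How a search engine counts the links as written: one vote per URL.
votes_as_crawled = Counter(inbound_links)

# How the votes would add up if every link used a single address.
votes_if_canonical = Counter(canonical_home(u) for u in inbound_links)

print(votes_as_crawled)
print(votes_if_canonical)
```

The first tally has three entries with one vote apiece. The second has a single entry holding all three votes.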

So canonicalization is important.

Common causes of canonicalization problems

There are lots of ways to create canonicalization higgledy-piggledy:

  • Inconsistent home page linking. Your site probably ‘lives’ at ‘www.site.com’. If you link to it from your global navigation using ‘www.site.com/index.html’ or similar, search engines see two different URLs for the same home page.
  • Session IDs. Many shopping carts and web applications put extra stuff into a URL, like ‘;jsessionid=a830b4a3debd7d753a62?t=0&idx=0’. That’s to keep track of each individual’s shopping cart or session, and keep their data separate from anyone else using the site. But the ID changes with every visit. So, if Googlebot comes to your site 3 times, it will see a different jsessionid every time for the same page. That creates three different URLs.
  • Tracking URLs. If you add something like ?nav=topbar to all links in the top bar of your web site, so that you can track where folks click, you’ve created a huge canonical problem. Search engines will index two versions of those pages: One without the nav parameter, and one with.
  • Affiliate programs. You give your affiliates links like www.mysite.com?affiliate=12345. They all link to your site using their unique codes. Search engines find them all and index them as distinct pages.
  • Capitalization. If you inconsistently capitalize file names on your site, that can create multiple URLs. For example, www.gibblegibbet.com/Index.html versus www.gibblegibbet.com/index.html.
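  • Mixed use of http and https. If both ‘http://www.site.com/’ and ‘https://www.site.com/’ serve the same page, search engines see two different URLs.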
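
Here’s a minimal Python sketch of what folding all of those variations back into one URL might look like. Everything in capitals is an assumption for an imaginary site: the preferred scheme and host, the list of ‘noise’ parameters, and the default document name. It also assumes every URL you feed it belongs to that one site and includes the scheme, and that your server treats paths as case-insensitive (if it doesn’t, drop the lowercase step).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical settings; adjust for your own site.
PREFERRED_SCHEME = "http"
PREFERRED_HOST = "www.gibblegibbet.com"
NOISE_PARAMS = {"nav", "affiliate", "referrer"}  # tracking and affiliate codes
DEFAULT_DOCUMENT = "index.html"

def canonicalize(url):
    """Return one normalized URL for any variant of the same page."""
    parts = urlsplit(url)

    # Session IDs: drop ';jsessionid=...' style path parameters.
    path = parts.path.split(";", 1)[0]

    # Capitalization: only safe on case-insensitive servers (assumption).
    path = path.lower()

    # Inconsistent home page linking: '/index.html' becomes '/'.
    if path.endswith("/" + DEFAULT_DOCUMENT):
        path = path[: -len(DEFAULT_DOCUMENT)]
    if not path:
        path = "/"

    # Tracking URLs and affiliate programs: strip the noise parameters.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
    ]

    # Mixed http/https and www/non-www: force one scheme and one host.
    return urlunsplit((PREFERRED_SCHEME, PREFERRED_HOST, path, urlencode(kept), ""))

for variant in (
    "http://gibblegibbet.com",
    "https://www.gibblegibbet.com/Index.html",
    "http://www.gibblegibbet.com/?affiliate=12345",
):
    print(variant, "->", canonicalize(variant))
```

A function like this is handy for auditing a list of crawled URLs. It isn’t a fix by itself: the site still has to stop producing the variants.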

Detecting canonicalization problems using Google Webmaster Tools

A canonicalization problem will almost always cause duplicate content on your site. You can use Google Webmaster Tools to find duplicate content, and then figure out whether the cause is canonical.

  1. In Webmaster Tools, click ‘Diagnostics’, ‘HTML Suggestions’.
  2. Check for pages with duplicate title tags. If you see a list like this…
    Duplicate content showing up in the title tag report

  3. Google’s detecting duplicate title tags on your site. Assuming you’ve used unique title tags on your site, canonicalization is the most likely cause of these duplicates.
  4. For each duplicate title tag, click the ‘+’ sign. If you see two URLs that are really similar…
    Addresses that nearly match.

  5. …Then click each page URL and view the pages. If they’re the same, it’s a canonicalization problem:
    Matching page content means canonicalization problems
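
If you’d rather check outside of Webmaster Tools, the same idea can be sketched in a few lines of Python: group pages by title tag and flag any title that shows up under more than one URL. The crawled list is made-up sample data standing in for whatever your own crawler produces.

```python
from collections import defaultdict

# Hypothetical crawl output: (URL, title tag text) pairs.
crawled = [
    ("http://www.mysite.com/products/", "Products | My Site"),
    ("http://www.mysite.com/products/?referrer=homepage", "Products | My Site"),
    ("http://www.mysite.com/about/", "About | My Site"),
]

def duplicate_titles(pages):
    """Return {title: [urls]} for every title used by more than one URL."""
    by_title = defaultdict(list)
    for url, title in pages:
        by_title[title.strip()].append(url)
    return {title: urls for title, urls in by_title.items() if len(urls) > 1}

for title, urls in duplicate_titles(crawled).items():
    print(title)
    for url in urls:
        print("   ", url)
```

As with the report in Webmaster Tools, a duplicate title is only a hint. You still have to open the URLs and compare the pages.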

So, now you’ve found the problem. Time to fix it.

Fixing canonicalization problems

Here are 6 options:

1: Just fix it

The best way to fix canonicalization problems is to fix them.

If you link to your home page 4 different ways, pick one and make your links consistent.

If you added query strings like ?link=1234 all over your site so that you could track clicks, get rid of them. Use something like event tracking in Google Analytics, instead.

Got session IDs all over the place? Get rid of them, and use cookie variables. Repair whatever it is that’s creating multiple URLs for one page of content.

This is hard work. Doing most things right involves hard work. The payoff, though, is that you don’t have to depend on weird, semi-supported tags like rel=canonical or huge webs of complex 301 redirects.

And, if you really fix the problem, then the fix scales: New pages and content will behave themselves, and you’ll have less work in the long run.
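
To find the inconsistent links in the first place, something like this Python sketch could help. It scans a chunk of HTML for links that point at a home page variant other than the one you picked. CANONICAL_HOME and HOME_VARIANTS are assumptions for an imaginary mysite.com, so swap in your own.

```python
from html.parser import HTMLParser

# Hypothetical values: the one home page URL you picked, and the
# variants you want to hunt down and replace.
CANONICAL_HOME = "http://www.mysite.com/"
HOME_VARIANTS = {
    "http://www.mysite.com",
    "http://www.mysite.com/index.html",
    "http://mysite.com",
    "http://mysite.com/",
    "http://mysite.com/index.html",
}

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def inconsistent_home_links(html):
    """Return every link that uses a non-canonical home page URL."""
    collector = LinkCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if href in HOME_VARIANTS]

sample = (
    '<a href="http://www.mysite.com/">Home</a> '
    '<a href="http://mysite.com/index.html">Home</a> '
    '<a href="/products/">Products</a>'
)
print(inconsistent_home_links(sample))
```

Run it over your templates or rendered pages, then fix each link it reports so it points at CANONICAL_HOME.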

Good: Works forever. Makes your site well-coded. Builds good karma. Won’t fail when the search engines buy each other or change their minds about standards or whatever.
Bad: Requires higher thought.

2: Robots META tag

You can use the robots META tag to hide all but one version of the guilty pages.

Say you’ve got a canonicalization problem that looks like this, where all of those URLs go to the exact same page:

http://www.mysite.com/products/
http://www.mysite.com/products/?referrer=homepage
http://www.mysite.com/products/?referrer=catpage

You can fix the problem by telling search engines to ignore the page at all but the first URL. Add this in the <head> element:

<meta name="robots" content="noindex,nofollow">

Important: You will need to use some kind of conditional logic in your code to only show that robots tag when there’s a ‘referrer’ parameter in the URL. Here’s what it’d look like in plain English:

IF there’s a thing called “referrer” in the URL, then insert <meta name="robots" content="noindex,nofollow"> in the page.

And in PHP:

// Only output the robots tag when 'referrer' appears in the query string
if (isset($_GET['referrer'])) {
    echo '<meta name="robots" content="noindex,nofollow">';
}

Without the conditional logic, you’ll hide every instance of the page, including the nice short one.
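
If your site isn’t built on PHP, the same conditional logic carries over to pretty much any language. Here’s a rough Python version, written as a plain function so it’s easy to test. The function name is made up, and how you’d wire it into your templates depends on your framework.

```python
from urllib.parse import urlsplit, parse_qs

ROBOTS_TAG = '<meta name="robots" content="noindex,nofollow">'

def robots_meta_for(url):
    """Return the robots tag for URLs carrying 'referrer', else ''."""
    # keep_blank_values makes a bare '?referrer=' count as well.
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return ROBOTS_TAG if "referrer" in query else ""

for url in (
    "http://www.mysite.com/products/",
    "http://www.mysite.com/products/?referrer=homepage",
    "http://www.mysite.com/products/?referrer=catpage",
):
    print(url, "->", robots_meta_for(url) or "(no tag)")
```

The nice short URL gets no tag, so it stays visible. The two ‘referrer’ versions get hidden.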

Good: Easy. Appeals to the spaghetti programmer in me.
Bad: Somehow, there’s always one case you miss. Only works on dynamic sites.

3: Use robots.txt

Continuing the example from above, you could use wildcard pattern matching (robots.txt patterns aren’t true regular expressions) to tell search engines not to crawl any URL that includes “referrer”.

Something as simple as this might do the trick:

User-agent: *
Disallow: /*?referrer=
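
Before you trust a rule like that, test it. Python’s built-in urllib.robotparser doesn’t handle wildcards, so this sketch translates the pattern by hand, assuming the common behavior where ‘*’ matches any run of characters, a trailing ‘$’ anchors the end of the URL, and matching starts at the beginning of the path. Both function names are hypothetical.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumes the usual wildcard rules: '*' matches any sequence of
    characters and a trailing '$' means 'the URL must end here'.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(pattern, path_and_query):
    """True if the Disallow pattern matches from the start of the path."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None

rule = "/*?referrer="
for path in (
    "/products/",
    "/products/?referrer=homepage",
    "/products/?color=red&referrer=homepage",
):
    print(path, "->", "blocked" if is_blocked(rule, path) else "allowed")
```

Note the last case: the rule only catches ‘referrer’ when it’s the first parameter in the query string. If it can show up later in the URL, you’d need a second rule for ‘&referrer=’ as well.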

Good: It’s so easy. One little line in the robots.txt file and you’re all set. Sweet!

Bad: If done wrong, may cause your site to fall into a black hole.

4: Use 301 redirects or URL rewriting

If the problem stems from inconsistent linking practices, you can use a 301 redirect to fix it. Say all four of these URLs lead to your home page:

http://www.mysite.com/
http://www.mysite.com/index.html
http://mysite.com/index.html
http://mysite.com

Set up a 301 redirect or, if you’re using Apache, set up URL rewrites (here’s how) from each of the 3 URLs you don’t want indexed to the one that you do. When search engines visit your site, they’ll scoot over to the correct page and index that one.

They’ll even apply most of the link authority from the incorrect URLs to the correct one. This is also your best bet if external sites are linking to the wrong home page URL.
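
If you go the scripting route instead of server configuration, the redirect logic might look something like this Python sketch. It’s a bare WSGI app using only the standard library. CANONICAL_HOST is an assumption, the Host header is assumed to carry no port number, and for brevity the sketch drops any query string when it redirects.

```python
# Hypothetical preferred host for the example site.
CANONICAL_HOST = "www.mysite.com"

def redirect_target(host, path):
    """Return the URL to 301 to, or None if the request is already canonical."""
    new_path = "/" if path == "/index.html" else path
    if host.lower() == CANONICAL_HOST and new_path == path:
        return None
    return "http://" + CANONICAL_HOST + new_path

def app(environ, start_response):
    """Minimal WSGI app: redirect the wrong URLs, serve the right one."""
    target = redirect_target(
        environ.get("HTTP_HOST", ""), environ.get("PATH_INFO", "/")
    )
    if target:
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"home page"]

for host, path in (
    ("www.mysite.com", "/"),
    ("www.mysite.com", "/index.html"),
    ("mysite.com", "/index.html"),
    ("mysite.com", "/"),
):
    print(host + path, "->", redirect_target(host, path))
```

Only the first request is served as-is. The other three get a 301 pointing at the one URL you want indexed.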

Good: Easy (if you have server access). Approved by all search engines. Can also be done using a scripting language like PHP. Works for external links to your site, too.
Bad: Tedious, if you have a lot of inconsistent linking. Requires server access (or a programmer).

5: Webmaster tools

Google Webmaster Tools will let you exclude parameters in the toolset. Log into Google Webmaster Tools, then go to Settings and click ‘Adjust parameter settings’. Using the ‘referrer=’ example, you’d do this:

Excluding a URL parameter in Google Webmaster Tools

Voila. Googlebot will strip out those URL attributes. You can also set your preferred domain on the same screen:

Setting the preferred domain in Google Webmaster Tools

Good: It’s easy.
Bad: May not prevent canonicalization problems. Not supported by Yahoo! or Bing.

6: Use rel=canonical

Finally, you can use the rel=canonical attribute. The big three search engines have all said they support the rel=canonical attribute as a way to tell them which version of a URL is the correct one.

If you had multiple links to your home page, like this:

http://www.mysite.com/
http://www.mysite.com/index.html
http://mysite.com/index.html
http://mysite.com

…but you want ‘www.mysite.com/’ to be the ‘real’ URL, then you’d put the following tag in the <head> area of your home page:

<link rel="canonical" href="http://www.mysite.com/" />

The search engines supposedly take that link tag and forward all link authority and relevance to the canonical version of the page.
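
Since the tag has a way of going missing, it’s worth checking for it now and then. Here’s a small Python sketch that pulls the canonical URL out of a page’s HTML. The class and function names are mine, and fetching the page is left up to you.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])

def canonical_urls(html):
    """Return all canonical URLs declared in the page (ideally exactly one)."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals

page = '<head><link rel="canonical" href="http://www.mysite.com/" /></head>'
print(canonical_urls(page))
```

An empty list means the tag is gone. More than one entry means something is declaring conflicting canonical URLs, which is its own problem.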

Good: It’s pretty straightforward.
Bad: In testing, I’ve yet to see 100% consistent support across Yahoo!, Bing and Google. Plus, no one ever seems to get it right. And, developers have a nasty habit of removing the tag later on.

My favorite option: Just fix it

Don’t resort to all of the band-aid fixes I’ve listed if you can simply fix the problem (option 1). It may take longer. It may make your developers hate you. But it’s the one method I’ve seen work 100% of the time. Only use the other 5 methods if/when you must.

I know this is a convoluted topic. If you’re better at listening/watching than reading, here’s a video that explains all of this, over again:

Canonicalization is a simple concept that’s hard to explain. This video defines it, and shows a few fixes.
