Recently on this blog I went through the process of moving from a dynamic URL structure to a static-looking URL structure (i.e. example.com?p=100 to example.com/foo/). Along the way I learned a little more about how WordPress works and discovered a way to use this to create duplicate content on someone else’s WordPress blog.
Before we get into the technical nitty-gritty (we’ll get there, I promise), let’s take a look at duplicate content. Duplicate content comes in two distinct flavors, internal and external. External duplicate content is the same copy existing on more than one domain; that’s not what we’re talking about in this post. The second type is internal duplication: the same content on more than one page of your website (i.e. example.com/foo/ and example.com/bar/). According to the Google Webmaster Guidelines that’s not a good thing:
Don’t create multiple pages, subdomains, or domains with substantially duplicate content.
Now that we’ve got that covered, let’s get into the details of how to get it done. For an example we’re going to use Google’s very own spam assassin, Matt Cutts. Let’s take a look at this URL
In it we see Matt dressed as Inigo Montoya, off to fight evil spammers. However, the exact same content can be found on these URLs
OK, let’s take one step back. WordPress has a feature that allows you to change the dynamic URLs you normally serve into more search-engine-friendly URLs. However, when you activate the feature there isn’t a way to turn off the dynamic URL structure, so both versions keep working. So what’s the deal, is this really bad? For the answer let’s look to Matt Cutts’ blog:
Canonicalization is the process of picking the best url when there are several choices, and it usually refers to home pages. For example, most people would consider these the same urls:
But technically all of these urls are different. A web server could return completely different content for all the urls above. When Google “canonicalizes” a url, we try to pick the url that seems like the best representative from that set.
Q: So how do I make sure that Google picks the url that I want?
A: One thing that helps is to pick the url that you want and use that url consistently across your entire site. For example, don’t make half of your links go to http://example.com/ and the other half go to http://www.example.com/ . Instead, pick the url you prefer and always use that format for your internal links.
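To make that advice a little more concrete, here is a rough Python sketch of what "pick one URL and stick to it" could look like for the www versus non-www case. The preferred host, the alias set, and the `canonicalize` function are all my own illustrative assumptions; this is not how Google or WordPress does it internally.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the site owner has decided that the www host is the preferred one.
PREFERRED_HOST = "www.example.com"
# Assumption: every host name that serves the same site.
HOST_ALIASES = {"example.com", "www.example.com"}


def canonicalize(url: str) -> str:
    """Rewrite a URL so that it uses the preferred host and a non-empty path."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in HOST_ALIASES:
        host = PREFERRED_HOST
    # An empty path and "/" point at the same home page, so always emit "/".
    path = parts.path or "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))
```

Running every internal link through something like this before it is written into a template would keep half of them from pointing at one host and half at the other.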
All right, so you overly nit-picky readers will say, hey GW, isn’t this a canonicalization problem and not a duplicate content one? Yeah, I guess you could say so, but would you have read this far if the post was titled “How to Create a Canonicalization Problem on Someone Else’s WordPress Blog”?
Now, I’m pretty sure folks at Google are able to sort out that example.com?p=100 and example.com/?p=100&c=more are the same URL. They may even be able to sort out that example.com/?p=100 and example.com/index.php?p=100 are the same, although I wouldn’t put it into practice on a website I cared about. However, I think showing them http://www.mattcutts.com/blog/my-name-is-inigo/ and http://www.mattcutts.com/blog/index.php?p=75 can create problems.
So now that I’ve linked to Matt’s blog and caused the spiders to find duplicate content, will his site sink to the nether regions of the supplemental index, never to be seen again? I think there’s a little more to it than that. To see any real difference you’d have to link to more than one page in that manner, and to get the job done right you’d want to make sure every page existed under as many URLs as possible. That should take anyone with any sort of programming skills about 15 minutes, even if they typed with one hand tied behind their back, so it’s not something only the elite black hat programmers have access to.
Here’s another thing: some of you may have already inadvertently done this to yourselves. If you were serving dynamic URLs, switched to static URLs, and didn’t use any mod_rewrite rules, your content is probably sitting out there under two URLs now.
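If you want to defend against this, the idea is to answer every dynamic URL with a permanent redirect to its static twin. The Python below is only a sketch of that mapping logic, not a mod_rewrite rule and not anything WordPress ships; the `PERMALINKS` table and the `redirect_target` function are hypothetical names I made up, using the example.com?p=100 to example.com/foo/ pair from the top of the post.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical lookup table: post ID -> static permalink path.
PERMALINKS = {100: "/foo/"}


def redirect_target(url, permalinks=PERMALINKS):
    """Return the permalink path a dynamic ?p=ID URL should 301 to, or None."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("p")
    if not values:
        return None  # no post ID in the query string, nothing to redirect
    try:
        post_id = int(values[0])
    except ValueError:
        return None  # p= was present but not a number
    # Unknown IDs also yield None so the caller can fall through to a 404.
    return permalinks.get(post_id)
```

A request handler would call `redirect_target` first and, whenever it returns a path, send a 301 to that path instead of rendering the page a second time. Extra parameters such as `c=more` are ignored, so those variants collapse onto the same permalink too.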
So is this really a problem, or have I just created this big bogeyman? Well, since Matt’s on vacation we can’t expect a clarification from him on the issue. However, we’ll try dropping a link to the Google Sitemaps Blog and invoke the name of Adam Lasnik to see if he wanders by and can shed some light on the issue.
- Google Webmaster Guidelines
- Matt Cutts: » SEO advice: url canonicalization
- mod_rewrite Cheat Sheet – Cheat Sheets – ILoveJackDaniels.com
- URL Rewriting | redirecting URLs with Apache’s mod_rewrite
- Duplicate Content Observation