How To Figure Out What Parts of Your Website Aren’t Being Crawled

When Google took away the supplemental index last year, they killed one of the key diagnostic tools in the SEO’s toolbox: the ability to identify which parts of a site were unimportant (and being infrequently crawled) in a search engine’s eyes. However, with the use of some structured text and clever searches, I’m going to give you back that valuable information.

To tell what parts of a site Google (or any search engine for that matter) thinks are important, what you need is a way of date tagging the last time a search engine visited/indexed the page. It’s not important to know that it was crawled at 16:42:03 on September 8, 2008; it’s important to know that it’s been over 30/60/90 days since a search engine visited that page. What you need to do is put a month and year time stamp somewhere on every page and keep it current, so it always shows the present month. I recommend the footer, since proximity isn’t a factor. I’m also going to recommend keeping punctuation and special characters out of the time stamp; in my experience Google gets a bit unpredictable when you introduce those elements. I’d suggest you aim for something simple like “Sep 2008”. Next you need to add something that will only appear on your site. The most unique thing will be the site’s proper name (again, omit any punctuation or special characters). So you’ll end up with something like “Joes Widget World Sep 2008”.
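If your pages are built by a script, the stamp can be generated rather than typed by hand. Here’s a minimal sketch in Python, assuming a hypothetical helper called `crawl_stamp` that you would call from your footer template. The function name and the fixed month list are my own choices for illustration, not part of any particular CMS.

```python
import re
from datetime import date

# Fixed English abbreviations so the stamp does not change with server locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def crawl_stamp(site_name: str, today: date) -> str:
    """Build a footer stamp like 'Joes Widget World Sep 2008'."""
    # Drop punctuation and special characters; keep letters, digits, spaces.
    cleaned = re.sub(r"[^A-Za-z0-9 ]", "", site_name)
    # Collapse any double spaces left behind by removed characters.
    cleaned = " ".join(cleaned.split())
    return f"{cleaned} {MONTHS[today.month - 1]} {today.year}"

print(crawl_stamp("Joe's Widget World", date(2008, 9, 8)))
```

In a live template you’d pass `date.today()` instead of a fixed date, so the stamp rolls over on its own each month.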

Once you’ve got that in place, the next thing to do is wait … at least two full months before you’ll get any good data … yes, really. Let’s assume you put this change into place today; then on December 1st you’d go to Google (or any other search engine) and type in the following query: [“Joes Widget World Sep 2008”] (minus the brackets but with the quotes). The search engine of your choice will then spit out a list of pages with an exact match of the phrase Joes Widget World Sep 2008; in other words, a list of pages that haven’t been crawled since September of 2008, over 60 days ago … hopefully you just had a lightbulb moment …
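As the months pile up you’ll have more than one old stamp to search for. The sketch below lists the quoted queries worth running on a given day, assuming a stamp only counts as stale once at least two full calendar months have passed since its month ended. The `stale_queries` name and the `min_full_months` parameter are hypothetical, made up for this example.

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def stale_queries(site_name, first_stamp, today, min_full_months=2):
    """Return exact-match queries for every stamp month that is old enough."""
    # Count months on a single scale so year boundaries are easy to cross.
    first_idx = first_stamp.year * 12 + first_stamp.month - 1
    # Newest stamp month with min_full_months complete months after it.
    last_idx = today.year * 12 + today.month - 1 - min_full_months - 1
    queries = []
    for idx in range(first_idx, last_idx + 1):
        year, month0 = divmod(idx, 12)
        # Straight double quotes force an exact phrase match.
        queries.append(f'"{site_name} {MONTHS[month0]} {year}"')
    return queries

for q in stale_queries("Joes Widget World", date(2008, 9, 1), date(2009, 2, 1)):
    print(q)
```

Run on December 1st, 2008 with the same starting month, it returns only the Sep 2008 query, which matches the example above.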

One of the problems with this method is that if you have pages that aren’t being crawled now, it may be a while before they are crawled with the new date stamp phrase. Unfortunately there isn’t a 100% foolproof method for telling the search engines to deep crawl and re-index your whole site. The best recommendation I have is to create a complete sitemap and resubmit it. It’s not foolproof or 100% effective, but in many cases it will help.
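If you don’t already have a sitemap, a bare-bones one is easy to produce. This is a rough sketch that assumes you can get a full list of your URLs from somewhere (the two example.com addresses are placeholders) and writes them out in the sitemaps.org XML format. The `build_sitemap` name is mine, and a real sitemap may also want optional fields that I’ve left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a minimal XML sitemap listing every URL in `urls`."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Placeholder URLs; swap in the real list of pages on your site.
pages = ["http://example.com/", "http://example.com/widgets/blue"]
print(build_sitemap(pages))
```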

Once you have identified what pages aren’t being crawled, what do you do with that information? The first thing I’d look at is whether the page is valuable or important. Sometimes site owners or publishers create pages that were important at the time but are now useless. For those pages I’d merge or delete them, making sure to 301 them properly. If a page has value but isn’t being crawled, the next thing I’d do is look to update the copy and freshen it up a bit. Once that’s done I’d put a link to the page (at the same URL) on a what’s new/updated/revised/changed page (that’s hopefully linked to from your homepage). If you’re using WordPress, something like the recently updated code would automate the process. Moving the page closer to the homepage should get it indexed again. The next thing you should do is look for ways to increase the internal linking to that page. Those steps should help keep the page in the index. I’d suggest keeping a log of your actions so you can see what’s going on over time. If you find that the same pages keep reappearing for these old date stamp searches, it’s probably an indication that there is something wrong with your architecture or internal linking.
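The log doesn’t need to be fancy. As a sketch, suppose you note down each stale URL along with the date you ran the check; the snippet below then picks out the pages that showed up in more than one check. The log entries and the `repeat_offenders` name are invented for illustration.

```python
from collections import defaultdict

def repeat_offenders(log, min_checks=2):
    """Return URLs that appeared as stale in at least `min_checks` checks."""
    seen = defaultdict(set)
    for check_date, url in log:
        # A set ignores duplicate entries from the same check date.
        seen[url].add(check_date)
    return sorted(url for url, dates in seen.items() if len(dates) >= min_checks)

# Hypothetical log: (date the search was run, stale URL it returned).
log = [
    ("2008-12-01", "/widgets/blue"),
    ("2008-12-01", "/about/history"),
    ("2009-01-01", "/widgets/blue"),
]
print(repeat_offenders(log))
```

Pages that keep turning up in that list are the ones to check for architecture or internal linking problems.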

It would be really cool if Google Alerts supported monthly frequency, so you could put the phrase in for every month of the year, automate the process, and work smart, not hard, but that’s currently not an option.

