When you are reviewing a website, whether for your own projects or for a client project, one of the important areas to review is crawlability. In this post I’d like to talk about some of the ways you can look for and diagnose crawling issues.
The first step to diagnosing a crawling problem is to use a simple [site:example.com] search and compare how many pages you really have with how many Google thinks you have. Bear in mind that this number is an estimate. What you are trying to do is get a rough idea of how many pages Google knows about, as Matt Cutts recently discussed in a Webmaster Central video.
If you have several hundred or several thousand pages but Google only shows 100, then you have a problem. Depending on how large the site is, an estimate that lands within 10-30% of your real page count is a good rule of thumb.
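As a rough illustration of that comparison, here is a minimal Python sketch. The function names are hypothetical, and the 30% tolerance is just one reading of the rule of thumb above, so adjust it to the size of your site.

```python
def index_coverage(actual_pages: int, indexed_estimate: int) -> float:
    """Return the share of your real pages that the site: search reports."""
    if actual_pages <= 0:
        raise ValueError("actual_pages must be positive")
    return indexed_estimate / actual_pages


def looks_undercrawled(actual_pages: int, indexed_estimate: int,
                       tolerance: float = 0.30) -> bool:
    """Flag the site when the estimate falls short by more than the tolerance.

    The 0.30 default is an assumption taken from the upper end of the
    10-30% rule of thumb, not a number Google publishes.
    """
    return index_coverage(actual_pages, indexed_estimate) < 1.0 - tolerance


# A site with 1,000 real pages where Google only shows 100.
print(index_coverage(1000, 100))
print(looks_undercrawled(1000, 100))
print(looks_undercrawled(1000, 800))
```

Because the site: count is only an estimate, treat the result as a prompt to dig deeper rather than a verdict.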
The second thing to look at is Webmaster Central. If you submit a sitemap, Google tells you how many URLs you submitted and how many are in the index. The closer those numbers are, the better. Don’t worry if it’s not a 100% match, because sometimes you include pages in your sitemap that get blocked at the page level with a robots meta tag. At this point, you are just concerned with gross numbers.
If things are radically out of whack, you can download a table of pages in the index from Webmaster Central and diagnose on a page-by-page level to see what is or isn’t in the index.
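To make the page-by-page comparison concrete, here is a minimal Python sketch. It assumes you have your sitemap as XML text and have reduced the downloaded table of indexed pages to a plain list of URLs. The function names and sample URLs are hypothetical, and the actual export format from Webmaster Central may differ.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol for <urlset>, <url>, and <loc>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text: str) -> set[str]:
    """Collect every <loc> value from a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def missing_from_index(submitted: set[str], indexed: list[str]) -> list[str]:
    """Return submitted URLs that do not appear in the indexed list."""
    return sorted(submitted - set(indexed))


# Hypothetical sample data for illustration only.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
  <url><loc>http://example.com/archive/2009</loc></url>
</urlset>"""
sample_indexed = ["http://example.com/", "http://example.com/about"]

missing = missing_from_index(sitemap_urls(sample_sitemap), sample_indexed)
print(missing)
```

Remember that some of the missing URLs may be pages you blocked on purpose with a robots meta tag, so filter those out before you start worrying.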
Next, you want to try and do a full crawl of the website using something like Xenu. While it’s usually used to check for broken links, in the process it does crawl the website. If you have a large website, you are going to want to limit the crawling.
Another product that I like to use is Website Auditor. One of the interesting things about using Website Auditor is that you can specify crawling depth, which is how deep you want a crawl to go. Start at the homepage and go only one level. Run it again, this time with 2 levels, then 3. Additionally, use your Webmaster Central report on most linked pages (think of them as link hubs). If your important pages aren’t within 2-3 pages of linking hubs on your website, you will have problems. IMHO it’s more important than ever to cultivate deep linking and to use that deep linking to spread your link equity, inbound trust, and authority wisely around your website.
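If you already have a crawl export, you can check that 2-3 level rule yourself. The sketch below is an assumption-laden illustration, not how Website Auditor works internally: it takes a hypothetical link graph as a dictionary of page to outgoing links, runs a breadth-first search from your homepage or link hubs, and reports important pages that sit deeper than a chosen limit.

```python
from collections import deque


def click_depths(links: dict[str, list[str]], starts: list[str]) -> dict[str, int]:
    """Breadth-first search: fewest link hops from any start page to each page."""
    depths = {start: 0 for start in starts}
    queue = deque(starts)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def too_deep(links: dict[str, list[str]], starts: list[str],
             important: list[str], max_depth: int = 3) -> list[str]:
    """Important pages further than max_depth hops away, or not reachable at all."""
    depths = click_depths(links, starts)
    return sorted(p for p in important if depths.get(p, float("inf")) > max_depth)


# Hypothetical site structure for illustration only.
site = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/c": ["/d"],
    "/d": ["/e"],
}
print(click_depths(site, ["/"]))
print(too_deep(site, ["/"], ["/c", "/e", "/orphan"]))
```

Passing your most linked pages as `starts` instead of just the homepage measures distance from the hubs, which is closer to what matters for crawling.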
Pages that have the most links are going to get crawled more frequently. Pages that have the most trust and authority are going to get crawled most often. Pages that are linked to from those linking hubs, or trusted and authoritative hubs, will get crawled next most frequently. At each step away from the linking hubs, or authority points, crawl frequency will decrease; think of it like a classic PageRank model.
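To get a feel for that model, here is a simplified PageRank calculation over internal links only. It is a sketch under stated assumptions: the link graph is hypothetical, the 0.85 damping factor is the commonly cited textbook value, and the scores are an analogy for crawl priority, not what Google actually computes for your site.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Simplified PageRank by repeated redistribution of scores along links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly across the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical internal link graph: "hub" is linked from two pages,
# while nothing links to "deep".
graph = {"hub": ["a", "b"], "a": ["hub"], "b": ["hub"], "deep": ["a"]}
scores = pagerank(graph)
for page, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(page, round(score, 3))
```

In this toy graph the hub comes out on top and the page nobody links to ends up at the bottom, which mirrors how crawl frequency drops off as you move away from your link hubs.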
Ideally, what you want to do is get a sample of pages from different levels and determine their crawl frequency rates. Programs like Website Auditor will do this for you; however, you are very likely to trip Google’s automated query blocker, which means that, if you are going to use it, you’ll have to have someone sit there and solve CAPTCHAs for a few hours. A second method is to use an outsourcing service (I use ODesk) and have them do it for you. Send them several hundred URLs in a spreadsheet and explain to them how to check the cache date and enter it in the spreadsheet. You should do some spot checking when it comes back and try to find a handful of people you can trust. Use them on a regular basis.
So how do you know you’re in trouble? Do your important pages have crawl dates older than 60 days? Are there entire sections that you think are important that aren’t getting crawled as frequently as you want?
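Once the spreadsheet of cache dates comes back, a few lines of Python can pull out the stale pages for you. This is a minimal sketch that assumes the sheet was saved as CSV with two columns named `url` and `cache_date` (in YYYY-MM-DD format); those column names, the function name, and the sample rows are all hypothetical.

```python
import csv
import io
from datetime import date, datetime


def stale_pages(csv_text: str, today: date,
                max_age_days: int = 60) -> list[tuple[str, int]]:
    """Return (url, age_in_days) for pages cached more than max_age_days ago."""
    reader = csv.DictReader(io.StringIO(csv_text))
    stale = []
    for row in reader:
        cached = datetime.strptime(row["cache_date"], "%Y-%m-%d").date()
        age = (today - cached).days
        if age > max_age_days:
            stale.append((row["url"], age))
    # Oldest cache dates first, since those are the most worrying.
    return sorted(stale, key=lambda item: item[1], reverse=True)


# Hypothetical spreadsheet contents for illustration only.
sheet = "url,cache_date\n/a,2024-01-01\n/b,2024-03-20\n/c,2023-11-15\n"
report = stale_pages(sheet, today=date(2024, 4, 1))
print(report)
```

Grouping the stale URLs by site section afterwards makes it easier to spot whole areas that aren’t getting crawled as often as you want.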
If you find that the site you are working on has crawling issues, look for ways to flatten out the site hierarchy. I talked about this in How do You Archive Posts on a High Volume Website. Look at your pages that are linking hubs: are you using them wisely by interlinking to other content? I talked about this in How to Silo Your Website: The Content. Breadcrumbs are another key tactic for interlinking. Look into ways to rotate some of that older content onto your homepage (see how to make your homepage more dynamic). Lastly, look for ways to better use your link equity by performing a content audit and killing, removing, or updating old, outdated, or unimportant sections.
So what are the takeaways from this post?
- Take a quick estimate of how well crawled your site is
- Look at the pages in the index using Webmaster Central
- Identify link hubs on your website
- Try test crawling to different depths
- Check cache dates across a section of URLs for your website
- Identify trouble spots, flatten site architecture, improve interlinking, and trim down unimportant pages/sections