In recent years it has become increasingly difficult to hide information (including data, pages, new domains, and entire websites) on the internet. When that information is discovered, the results can range from mildly embarrassing to completely giving away a business plan. In this post I’ll talk about some of those holes and how to protect yourself.
When you register a new domain, especially under one of the popular top-level domains (TLDs) such as .com, .net, or .org, it gets recorded in numerous places. One of the most visible is AboutUs.org (see my page). For many non-branded URLs, the AboutUs.org page usually ranks in the top 10 and presents an opportunity to engage in some reputation management. Since you can edit your listings on AboutUs.org, you can also engage in some low-level link building. Unless you have edited it, AboutUs.org will scrape one of your pages by default. You can always go back and edit the resulting entry and add in links. All of the links from AboutUs.org are nofollowed (it’s that SEO Blackhole thing); however, when the AboutUs.org page gets scraped, the nofollow attribute doesn’t always get applied. These aren’t going to be high-powered links, but they will add up over time.
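If you want to check whether a copy of your listing kept or dropped the nofollow, a few lines of Python will do it. This is only a minimal sketch: the `LinkAuditor` class, the `audit_links` helper, and the sample markup are all made up for illustration, and it assumes you already have the page's HTML in a string.

```python
from html.parser import HTMLParser


class LinkAuditor(HTMLParser):
    """Collect every <a href> on a page and note whether it is nofollowed."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (href, is_nofollow) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel can hold several space-separated tokens, e.g. "nofollow noopener"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def audit_links(html):
    """Return (href, is_nofollow) for each link found in the HTML string."""
    auditor = LinkAuditor()
    auditor.feed(html)
    return auditor.links


# Hypothetical markup: the first link as the original listing might show it,
# the second as a scraper might republish it without the rel attribute.
sample = (
    '<a href="http://example.com/" rel="nofollow">original</a>'
    '<a href="http://example.com/">scraped copy</a>'
)
for href, nofollowed in audit_links(sample):
    print(href, "nofollow" if nofollowed else "followed")
```

Run it against the original listing and against any scraped copies you find, and the links that lost their nofollow stand out.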
Additionally, whenever you register a new domain, that registration info is captured by services like DomainTools.com. One of the services DomainTools offers is the ability to see archived whois info. If the domain has changed ownership and the registration info was updated, that information will be available for a price. DomainTools also lets you search for identical or similar registration info. You can search by a person’s name or company name and (again for a fee) get a list of other domains registered with similar info. You can try to block and obfuscate that data with private registrations, but I’ve seen some evidence that having public whois data can help rankings … so you have to decide which is more valuable.
I’ve mentioned numerous times that in most cases it’s not worth wasting your time chasing down people who are scraping your content. In fact, in most cases you are better off just inserting links to yourself and letting the scraping continue. If you are selling information and someone is copying you, or if you are being outranked, then it’s worth fighting; otherwise it’s like herding cats. In some cases this scraping is legit and will turn up in odd places. For example, in my post telling people not to link to public testing, I was doing a little testing myself (c’mon, you have to appreciate the humor of testing in a post about testing). I inserted this phrase in the meta description:
constagagulation is fun constagagulation is nice constagagulation for everyone, constagagulation is fun constagagulation is nice constagagulation for everyone
Google ignored the meta description and didn’t index it. Topsy.com, however, came by, scraped the page, and published the meta description, which Google then indexed. What’s the lesson here? Any information that’s publicly accessible on the internet has the potential to be copied, duplicated, and reproduced. Don’t count on spiders/bots being polite and respecting Disallow directives in your robots.txt file. I know more than one person who looks for blocked directories as a place to start looking for secret information. If you want to keep information from being copied, you need to put it behind a password.
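To see why a robots.txt file makes a poor hiding place, here is a rough sketch of how little work it takes to turn one into a list of places to go poking around. The `disallowed_paths` function and the directory names in the sample are hypothetical, and the parsing is deliberately simplistic (it ignores which user agent each rule belongs to).

```python
def disallowed_paths(robots_txt):
    """Return the paths named in Disallow lines, in order, without duplicates."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        # An empty Disallow value blocks nothing, so it is skipped.
        if field.strip().lower() == "disallow" and value and value not in paths:
            paths.append(value)
    return paths


# Hypothetical robots.txt content; the directory names are invented.
sample = """\
User-agent: *
Disallow: /private-reports/   # hypothetical directory
Disallow: /beta-site/
Disallow:
"""
print(disallowed_paths(sample))
```

Anyone can fetch the file, so every path you list there is a path you have announced. If the content matters, password protection is the control that actually keeps people out.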
Social Media and Social Engineering
In my opinion, one of the unintended outcomes of social media has been the connection of social data points. Sites now actively try to link as much information about you as possible, using your public profile to flesh out your social graph. Using common things like Webmaster Central verification files, Google Analytics IDs, and AdSense publisher IDs, it’s becoming increasingly easy for even a novice to determine what sites you own or are connected to in some way. To avoid this, there are two choices: 1. build everything in isolation and hope that no one notices the times you do mistakenly cross the streams; or 2. mix the information with misinformation so that it becomes increasingly difficult to separate truth from fiction. We no longer live in a world where “what happens in Vegas stays in Vegas.” In the world today, “what happens in Vegas ends up on Twitter and Facebook.” Once it’s set free in the wild ‘net, you can’t pull it back, so choose your friends wisely and hope they are smart enough to know when not to cross the line.
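To show how easy the connecting-the-dots part is, here is a small sketch that pulls analytics and ad IDs out of page source and groups sites that share them. Everything in it is an assumption for illustration: the function names, the example domains, the sample snippets, and the ID patterns themselves (classic "UA-" style Google Analytics IDs and "pub-" style AdSense publisher IDs).

```python
import re
from collections import defaultdict

# Assumed ID shapes: Google Analytics property IDs ("UA-<account>-<property>")
# and AdSense publisher IDs ("pub-<digits>"). Adjust if the formats differ.
GA_ID = re.compile(r"\bUA-(\d+)-\d+\b")
ADSENSE_ID = re.compile(r"\bpub-\d{10,20}\b")


def extract_ids(html):
    """Return the set of account-level identifiers found in a page's source."""
    # Keep only the account part of the GA ID so sibling properties match.
    ids = {"UA-" + account for account in GA_ID.findall(html)}
    ids.update(ADSENSE_ID.findall(html))
    return ids


def shared_ids(pages):
    """Map each identifier to the sites using it, keeping IDs seen on 2+ sites."""
    sites_by_id = defaultdict(set)
    for site, html in pages.items():
        for identifier in extract_ids(html):
            sites_by_id[identifier].add(site)
    return {i: sorted(s) for i, s in sites_by_id.items() if len(s) > 1}


# Hypothetical page sources; in practice these would be fetched HTML.
pages = {
    "site-a.example": "ga('create', 'UA-1234567-1'); google_ad_client = 'ca-pub-1234567890123456';",
    "site-b.example": "ga('create', 'UA-1234567-2');",
    "site-c.example": "google_ad_client = 'ca-pub-1234567890123456';",
}
print(shared_ids(pages))
```

In this made-up example, site-a ties the other two together: it shares an analytics account with site-b and a publisher ID with site-c. That is the "crossing the streams" mistake in action, and it only takes one page to make it.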