Though many of us web designers and developers don’t realize it, many of our page hits come from robots, usually known as web crawlers or “bots”, which index our pages for search engines and the like.
Bots like these are (generally) harmless, and indeed we designers should be glad for the innocuous presence of anything that helps our pages get better known. According to this article, they can help validate links and HTML, and they can monitor changes in your web page so that the search results which turn up your site are as up to date as possible.
You do, however, have to watch out for the few malicious bots that, according to this site about web robot abuse, look to harvest email addresses and other sensitive information for spam; some bots might even be programmed to take down your server with specialized attacks.
To help you design and develop pages with bots and their indexing habits in mind, I have written the following short guide with plenty of sources to find more information. How you want to handle bots on your page depends entirely on how you view them. For example:
If You Don’t Mind Bots at All…
…then you don’t really have to do anything about your webpages. Just keep designing and coding as you have been, and bots will happily index and keep up with all your pages on your site(s).
I personally don’t mind them coming to my site; anything that helps my pages show up in Google Search is a happy thing. But lately, since I’ve been using my domain for a bit of personal storage as well as for officially published projects (losing data will make you paranoid), I’ve found myself thinking about how to keep bots out of certain folders. Thus, the section below:
If You’d Like to Keep Bots Out of Certain Folders…
This is where the humble robots.txt file comes into play. This file, which has to sit in the top level (root) of your site so that bots can find it at /robots.txt, is a list of instructions for robots to follow while visiting your site. Its most important use is to tell bots to stay out of certain files and folders that you’d rather not have indexed.
So, for instance, if I wanted to keep bots out of a folder called “mine”, I’d write this little tidbit in my own robots.txt file:
User-agent: *
Disallow: /mine/
(The “User-agent: *” bit tells every robot that visits, “This applies to you, so listen carefully.”)
If, however, I wanted to disallow access to two folders, “mine” and “yours”, I’d need to write my robots.txt file this way:
User-agent: *
Disallow: /mine/
Disallow: /yours/
For everything you want to keep bots out of, you have to write a specific Disallow line. Kind of like keeping kids out of the various cabinets in your house–if you don’t tell them specifically they can’t get into it…well… LOL
And just like some kids deliberately disobey your pleas to keep out of certain drawers and cabinets, there are some bots who will ignore your robots.txt file completely; the file is a polite request, not a lock. The only way to be completely sure a bot isn’t indexing your stuff is to take it off your site (or put it behind a password)…sad.
Even more awesomely in-depth information about the robots.txt file can be found at RobotsTxt.org (very well-explained and detailed!).
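If you’d like to double-check your rules before trusting them, here is a minimal sketch in Python that uses the standard library’s urllib.robotparser module to read the two-folder robots.txt shown earlier. The blocked_paths helper, the “ExampleBot” name, and the sample paths are all made up for illustration, and the result only tells you how a well-behaved bot would read the file, not what a rude one will do.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt, paths, user_agent="ExampleBot"):
    """Return the paths that a rule-following bot would be told to skip.

    "ExampleBot" is a hypothetical bot name used only for this sketch.
    """
    parser = RobotFileParser()
    # parse() takes the file's contents as a list of lines
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]


# The same rules as the two-folder example in this post
rules = """\
User-agent: *
Disallow: /mine/
Disallow: /yours/
"""

# Hypothetical paths, just to have something to check
sample_paths = ["/index.html", "/mine/notes.txt", "/yours/photo.jpg"]
print(blocked_paths(rules, sample_paths))
```

With these rules, the two files inside /mine/ and /yours/ should come back as blocked, while /index.html stays open to bots.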
If You Just Want Bots to Keep Out Altogether…
Of course, if you’ve had a bad experience with phishing and spamming bots, you might just want them all to buzz off. For this, there is the robots meta tag (courtesy of this page):
<meta name="robots" content="noindex,nofollow">
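Put this meta tag in the <head> section of a page, and it’s a virtual “Keep Out” sign for bots that visit that page. The tag only covers the page it appears on, so it has to go on every page you want left alone (to turn bots away from the whole site in one go, a robots.txt file with “Disallow: /” does the job). Only do this, however, if you’re sure you don’t want those pages to be indexed, not even by the “good” bots.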
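To see whether a page really carries the tag, here is a small sketch in Python built on the standard library’s html.parser module. The RobotsMetaFinder class and the robots_directives function are names I made up for this example, and the page string is a stand-in for your own HTML, so treat it as a starting point and not a finished checker.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Attribute values can be None when a tag has no "=value" part
        if (attr_map.get("name") or "").lower() != "robots":
            return
        content = attr_map.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.append(directive)


def robots_directives(html):
    """Return a list such as ["noindex", "nofollow"] for the given page."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


# Stand-in page for illustration only
page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print(robots_directives(page))
```

One thing to watch for: the attribute values need plain straight quotes. Curly quotes pasted in from a word processor don’t count as quotes in HTML, so a tag written with them is unlikely to be recognized.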
Summary
Bots are an often-forgotten portion of web development, but with all the various search engines (and spammers, unfortunately) out there on the big ole Internet, we developers have to at least take them into consideration. I hope you’ve gained some insight into how to either welcome bots to portions of your sites or keep them out!
Learn More about All Sorts of Web Bots
InternetOfficer.com has a complete list of web robots for your perusal, so if you check your web stats and see a number of hits from several strange names, you can check this list and see if a bot’s been visiting you.
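If your web stats let you export the visitor names (the User-Agent strings), a rough first pass at spotting bots can be sketched in Python like this. The likely_bots function and the BOT_HINTS word list are my own guesses, not an official method, and the sample hits are typed in by hand for illustration; a bot that hides its name will slip right past, so a proper list of robots is still the better reference.

```python
from collections import Counter

# Assumed heuristic: many (not all) bots include one of these words
BOT_HINTS = ("bot", "crawler", "spider")


def likely_bots(user_agents):
    """Count hits from User-Agent strings that look like bots."""
    counts = Counter()
    for agent in user_agents:
        lowered = agent.lower()
        if any(hint in lowered for hint in BOT_HINTS):
            counts[agent] += 1
    return counts


# Hand-typed sample hits, for illustration only
hits = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]

for agent, count in likely_bots(hits).items():
    print(count, agent)
```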