After installing zbblock, I took a look at all the bots visiting my site. Some are obviously welcome: Google, Bing and Yahoo, to name but a few. Others are not so welcome. They eat up your bandwidth and give nothing back in return. For example, Yandex is a Russian search engine. It is very unlikely that someone in Russia would stumble across my page and become a regular visitor, so there is no benefit for me to be on the Yandex search engine.
Another type of bot crawls your pages to analyse your links, then tries to sell that information back to you. As a hobbyist, I am not interested in paying for that sort of analysis, particularly as Google Analytics provides similar information for free.
The third sort of unwanted bot is the sinister one, looking for exploits to attack, but that is beyond the scope of this article.
All bots should obey robots.txt. This is a file created by the webmaster telling these bots where they are not allowed to go.
Allow full access
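For reference, the conventional way to give every bot full access is a single record with an empty Disallow rule:

```
User-agent: *
Disallow:
```

The opposite, "forbid all", differs by one character: `Disallow: /`.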
My problem started when I wanted to block some of these bots. The first thing to remember is that robots.txt does not block bots; it is only an advisory for well-behaved bots to read. The second is that the specification is not great, and some bots may not understand complicated syntax. I started to add a few unwanted bots, but could not find any definitive way to format the file.
If I wanted to ban a list of bots do I…
There did not seem to be anyone prepared to commit themselves on this. My answer is that how multiple entries are interpreted depends on the bot. NosyBot may obey the first one, AnnoyingBot may misread it and index the site.
At this point, I decided that a dynamic robots.txt may be the best way, providing only the basic “allow all” or “forbid all” depending on whether the requesting bot is detected as unwanted. This ensures the bot will understand the syntax if it chooses to obey.
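To make the question concrete, here is a rough illustration of two layouts one might choose between. The exact formatting is an assumption for illustration only, using the NosyBot and AnnoyingBot names as placeholders. The first gives each bot its own record, separated by a blank line:

```
User-agent: NosyBot
Disallow: /

User-agent: AnnoyingBot
Disallow: /
```

The second stacks several User-agent lines over one shared rule:

```
User-agent: NosyBot
User-agent: AnnoyingBot
Disallow: /
```

A careful parser should treat both the same way, but a sloppy one may only pick up one of the names.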
// Save file as botcheck.php
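A minimal sketch of what such a script might look like. The `$unwanted` list and its bot names are hypothetical placeholders, and matching on a case-insensitive substring of the User-Agent header is an assumption; adjust both to whatever turns up in your own logs.

```php
<?php
// botcheck.php: serve an "allow all" or "forbid all" robots.txt per user agent.

// Hypothetical list of unwanted bots, matched case-insensitively as
// substrings of the User-Agent header. Replace with names from your logs.
$unwanted = array('NosyBot', 'AnnoyingBot');

// Some clients send no User-Agent at all; treat that as an empty string.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

$blocked = false;
foreach ($unwanted as $name) {
    if (stripos($agent, $name) !== false) {
        $blocked = true;
        break;
    }
}

// robots.txt should be served as plain text.
header('Content-Type: text/plain; charset=utf-8');

echo "User-agent: *\n";
// An empty Disallow allows everything; "Disallow: /" forbids the whole site.
echo $blocked ? "Disallow: /\n" : "Disallow:\n";
```

Either way the bot only ever sees a two-line file, which is about as simple as robots.txt syntax gets.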
This creates the robots.txt dependent on the user agent. An eye should be kept on the logs to make sure good bots are not rejected and bad ones do not sneak through. Bear in mind that this does not stop bots; it only tells the bot whether it is welcome or not. Even an unwelcome bot is free to visit whatever pages it likes.
The final step is to get the webserver to return this script when robots.txt is requested. One way is to configure Apache to treat files with a txt extension as PHP scripts, but that may be a bit extreme. The method I chose was a mod_rewrite rule in .htaccess.
RewriteEngine On
RewriteRule ^robots\.txt$ /botcheck.php [L]
And that's it. robots.txt is now dynamically served according to the requesting HTTP user agent.