I've seen a lot of 'bad' (i.e. non-working) 'robots.txt' files lately, so I want to give a few tips.
'robots.txt' is a file that 'well-behaved' (not all!) spiders and search engines respect and check for. It is used to specify what you DON'T want the spiders to see.
Here are some guidelines:
1. The filename MUST be 'robots.txt'.
EXACTLY that. All lowercase, plural (i.e., NOT 'robot.txt' or 'Robots.txt').
2. The file should have Unix-style line endings (i.e. linefeed, "\n").
NOT Mac (CR), NOT DOS (CRLF). The original standard itself accepts all three, but LF is the safest choice. If you work on Mac or Windows, use an editor that lets you save in 'UNIX mode'.
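If your editor can't save with LF endings, a few lines of script can convert the file afterwards. The following is only a minimal sketch in Python; the function name `to_unix_line_endings` is made up for this example.

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Convert DOS (CRLF) and old Mac (CR) line endings to LF."""
    # Replace CRLF first so the lone-CR pass does not double the line breaks.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# A file saved on Windows: every line ends in CRLF.
dos_text = b"User-agent: *\r\nDisallow: /admin/\r\n"
print(to_unix_line_endings(dos_text))
# prints b'User-agent: *\nDisallow: /admin/\n'
```

To convert a real file, read it in binary mode ('rb'), pass the bytes through the function, and write the result back in binary mode ('wb') so that Python does not translate the line endings again.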
3. The file MUST be in your web root.
If you put it in, say, '/catalog/', no bot will ever see or check it. ALWAYS put it in your web root, i.e. "http://www.mydomain.com/robots.txt".
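To see where a spider will look, you can derive the 'robots.txt' address from any page address on the site. This is a small sketch using Python's standard `urllib.parse`; the helper name `robots_url` is an assumption of this example, and the domain is the placeholder from the text above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the web-root robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.mydomain.com/catalog/index.php?cPath=1"))
# prints http://www.mydomain.com/robots.txt
```

Whatever directory the page lives in, the result always points at the web root, which is why a file in '/catalog/' is never found.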
4. Comment lines.
You CAN have comment lines. They start with a '#' in column ONE. Be careful NOT to separate too much using empty lines (see rule 9)!
CODE
   # BAD: Not starting in column 1.
CODE
# GOOD: Start in column 1.
In theory, it IS allowed to put comments on a 'Disallow' or 'User-agent' line like 'User-agent: Googlebot #this is Google'. DON'T USE IT! It is bad practice, and some spiders will misinterpret it and spider what they weren't supposed to.
CODE
# BAD: Comments on the same line.
User-agent: Googlebot # Google
CODE
# GOOD: Comments on separate lines.
# Google
User-agent: Googlebot
5. White space.
In theory, it IS allowed to use white space (i.e., empty lines or indentation by blanks or tabs). DON'T DO IT! Some spiders will misinterpret it. ALWAYS start comments, 'User-agent:' and 'Disallow:' in column 1. And DON'T use tabs but blanks instead.
CODE
# BAD: Indentation (not starting in column 1).
   User-agent: Googlebot
   Disallow: /admin/
CODE
# GOOD: Start in column 1, use ONE blank after the ':'.
User-agent: Googlebot
Disallow: /admin/
6. User-agent: [spider's name]
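Type the field name EXACTLY like this; what follows the ':' is the spider's name. The original standard treats the field name as case-insensitive, but this exact spelling is the safest. You CAN put '*' to make it target ALL spiders.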
CODE
# BAD: all lowercase
user-agent: *
# BAD: all uppercase
USER-AGENT: *
# BAD: no '-', 'Agent' has uppercase 'A'
User Agent: *
CODE
# GOOD: (Google)
User-agent: Googlebot
# GOOD: (ALL spiders)
User-agent: *
7. Disallow: [path/filename to exclude]
Use ONLY ONE path and/or filename per line, i.e. NOT "Disallow: /cgi-bin /stats"!
CODE
# BAD: Multiple paths/files on one line
Disallow: /cgi-bin /stats
CODE
# GOOD: One definition per line
Disallow: /cgi-bin
Disallow: /stats
Disallow works like an 'automatic wildcard' (without '?' or '*') by matching from the left, i.e. "Disallow: /help" would match the DIRECTORY "/help", the directory "/helpfiles", the FILE "/help.htm", the FILE "/helpfile.php" and so on.
So if you want to exclude a complete directory but NOT files with a similar name (i.e. you want to exclude the '/catalog/elmar/' directory but NOT 'elmar_start.php'), it is good practice to write it like "Disallow: /catalog/elmar/" (with ending '/').
CODE
# GOOD: Disallow directory 'elmar' but not 'elmar_start.php'
Disallow: /elmar/
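The left-to-right matching from rule 7 can be tried out with Python's standard `urllib.robotparser`. This is just a sketch of how one parser behaves, not a statement about every spider; the helper `is_allowed` and the spider name 'ExampleBot' are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, path, agent="ExampleBot"):
    """Parse the given robots.txt lines and ask whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, path)


# Without the trailing '/', the rule matches everything that starts with '/help'.
broad = ["User-agent: *", "Disallow: /help"]
# With the trailing '/', only the directory itself is matched.
narrow = ["User-agent: *", "Disallow: /help/"]

for path in ["/help/index.htm", "/helpfiles/a.htm", "/help.htm", "/helpfile.php", "/index.htm"]:
    print(path, is_allowed(broad, path), is_allowed(narrow, path))
```

With the 'broad' rule only '/index.htm' comes back as allowed. With the 'narrow' rule everything except '/help/index.htm' is allowed, which is the effect of the ending '/'.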
8. There is NO 'Allow:'!
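The original robots exclusion standard only defines 'Disallow:'. Some search engines (Google, for one) understand 'Allow:' as an extension, but you can't rely on every spider doing so. If you want to allow anything, you must disallow the rest and put an empty 'Disallow:' at the end!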
CODE
# BAD: (intended: disallow 'Jane' but not 'John')
Disallow: /Jane
Allow: /John
CODE
# GOOD: (disallow 'Jane', allow all the rest)
Disallow: /Jane
Disallow:
9. NEVER have a BLANK LINE BETWEEN 'User-agent:' and its corresponding 'Disallow:' lines!
Some spiders will misinterpret this as permission to spider your whole site. You CAN have comment lines in between.
CODE
# BAD: Blank line between 'User-agent:' and 'Disallow:'
# This should exclude Google
User-agent: Googlebot

# And here we say which to exclude
Disallow: /
# Result: Some spiders will instead assume they're ALLOWED your whole site!
CODE
# GOOD: NO blank lines between 'User-agent:' and corresponding 'Disallow:'
# This should exclude Google
User-agent: Googlebot
# And here we say which to exclude
Disallow: /
# Result: Google will be kept from spidering your whole site.
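Rule 9 is easy to reproduce. As far as I can tell, the parser in Python's standard `urllib.robotparser` is one of those that drop a 'User-agent:' line when a blank line follows it, so it makes a handy test bed. Treat this as a sketch of one parser's behaviour; the helper name `google_may_fetch` is made up for the example.

```python
from urllib.robotparser import RobotFileParser


def google_may_fetch(robots_txt: str) -> bool:
    """Return True if 'Googlebot' may fetch /index.html under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", "/index.html")


# BAD: blank line between 'User-agent:' and 'Disallow:'.
with_blank_line = "User-agent: Googlebot\n\nDisallow: /\n"
# GOOD: only a comment line in between.
with_comment = "User-agent: Googlebot\n# And here we say which to exclude\nDisallow: /\n"

print("blank line  :", google_may_fetch(with_blank_line))
print("comment line:", google_may_fetch(with_comment))
```

The version with the blank line should report that Googlebot MAY fetch the page, i.e. the 'Disallow: /' is lost, while the version with the comment line keeps Googlebot out.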
10. Always go from 'more specific' to 'less specific'!
Start with the most specific rules, then go to the least specific. This means the part for 'User-agent: *' should come LAST in your 'robots.txt'! The reason: If a spider sees 'User-agent: *' FIRST, it might stop scanning since it's one of 'all spiders', so it won't bother to check whether the rest of your file addresses it specifically!
CODE
# BAD: Spider might not honor this
# Allow everything to all other spiders
User-agent: *
Disallow:
# Disallow Google
User-agent: Googlebot
Disallow: /
CODE
# GOOD: First do the specifics, then the 'rest of them'
# Disallow Google
User-agent: Googlebot
Disallow: /
# Allow everything to all other spiders
User-agent: *
Disallow:
11. Use a 'robots.txt' validator.
Everyone makes mistakes. It's good practice to check your file with a validator.
Here's a good one (has some examples even):
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
And here's one that checks on even more potential problems:
http://tool.motoricerca.info/robots-checker.phtml
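If you want a quick local check before using one of the online validators, the guidelines above are simple enough to script. What follows is only a rough sketch in Python that tests a handful of the rules from this post (line endings, column 1, tabs, same-line comments, field spelling, one path per line, blank lines after 'User-agent:', and 'User-agent: *' last). The names `lint_robots` and `GOOD_FIELDS` are made up for this example, and it is no replacement for a real validator.

```python
from typing import List

# Field names spelled exactly as recommended in the guidelines above.
GOOD_FIELDS = ("User-agent", "Disallow")


def lint_robots(text: str) -> List[str]:
    """Return a list of warnings; an empty list means no rule was tripped."""
    warnings = []
    if "\r" in text:
        warnings.append("file: contains CR characters, save it with LF line endings")
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    after_user_agent = False  # True while the last field seen was 'User-agent'
    seen_wildcard = False     # True once 'User-agent: *' has appeared
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # Rule 9: blank line after 'User-agent:' with more content following.
            if after_user_agent and any(rest.strip() for rest in lines[number:]):
                warnings.append(f"line {number}: blank line after 'User-agent:'")
                after_user_agent = False
            continue
        if line[0] in " \t":
            warnings.append(f"line {number}: does not start in column 1")
        if "\t" in line:
            warnings.append(f"line {number}: contains a tab, use blanks")
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # comment lines may sit between the fields
        if "#" in stripped:
            warnings.append(f"line {number}: comment on the same line as a field")
            stripped = stripped.split("#", 1)[0].strip()
        name, separator, value = stripped.partition(":")
        value = value.strip()
        if not separator or name not in GOOD_FIELDS:
            # Catches 'user-agent', 'User Agent', 'Allow' and anything else.
            warnings.append(f"line {number}: unexpected field {name!r}")
            after_user_agent = False
            continue
        if name == "User-agent":
            if seen_wildcard and value != "*":
                warnings.append(f"line {number}: specific spider listed after 'User-agent: *'")
            seen_wildcard = seen_wildcard or value == "*"
            after_user_agent = True
        else:
            if len(value.split()) > 1:
                warnings.append(f"line {number}: more than one path on a 'Disallow:' line")
            after_user_agent = False
    return warnings


bad = "user-agent: *\nDisallow: /cgi-bin /stats\nUser-agent: Googlebot\n\nAllow: /John\n"
for warning in lint_robots(bad):
    print(warning)
```

The sample input trips four of the checks: the lowercase field name, two paths on one line, the blank line after 'User-agent:', and the 'Allow:' line.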
12. One more tip: Search engines get clever.
If you really have to run a lot of sites, don't give them all the same 'robots.txt'. Hey, comparing these files is FAST and makes it VERY easy for SEs to find out if they're all the same... and then they might start assuming they're being tricked and rank you down... ;-)
Here's an example 'robots.txt':
CODE
# osCommerce robots.txt
# Currently disallow all shop stuff to the Google Image bot
# Mainly image hunters anyway, they eat up bandwidth...
User-agent: Googlebot-Image
Disallow: /cgi-bin/
Disallow: /usage/
Disallow: /catalog/
# ALL search engine spiders/crawlers (put at end of file)
User-agent: *
Disallow: /cgi-bin/
Disallow: /usage/
Disallow: /catalog/admin/
Disallow: /catalog/download/
Disallow: /catalog/elmar/
Disallow: /catalog/pub/
Disallow: /catalog/account.php
Disallow: /catalog/advanced_search.php
Disallow: /catalog/checkout_shipping.php
Disallow: /catalog/create_account.php
Disallow: /catalog/login.php
Disallow: /catalog/password_forgotten.php
Disallow: /catalog/popup_image.php
Disallow: /catalog/shopping_cart.php
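To check what a file like the example above actually does for different spiders, you can feed it to Python's standard `urllib.robotparser`. This sketch uses a shortened copy of the example; the spider name 'ExampleBot' is made up, and other parsers may of course behave differently.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the example 'robots.txt' above.
EXAMPLE = """\
User-agent: Googlebot-Image
Disallow: /cgi-bin/
Disallow: /usage/
Disallow: /catalog/
User-agent: *
Disallow: /cgi-bin/
Disallow: /usage/
Disallow: /catalog/admin/
Disallow: /catalog/login.php
"""

parser = RobotFileParser()
parser.parse(EXAMPLE.splitlines())

checks = [
    ("Googlebot-Image", "/catalog/index.php"),  # whole shop closed to the image bot
    ("ExampleBot", "/catalog/index.php"),       # open to everyone else
    ("ExampleBot", "/catalog/login.php"),       # excluded for all spiders
]
for agent, path in checks:
    print(agent, path, parser.can_fetch(agent, path))
```

The image bot is refused for the shop page, any other spider may fetch it, and nobody gets the login page.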
Have fun! And happy 'spidering'...
Matthias





