Robots.txt - Robots.txt is a text (not HTML) file you put on your site to tell search robots which pages you would like them not to visit. Robots.txt is by no means mandatory for search engines, but in general search engines obey what they are asked not to do. It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall or a kind of password protection); putting up a robots.txt file is something like putting a note "Please, do not enter" on an unlocked door: you cannot prevent thieves from coming in, but the good guys will not open the door and enter. That is why, if you have really sensitive data, it is too naive to rely on robots.txt to protect it from being indexed and displayed in search results.
Simply put, one way to let search engines know which files and folders on your Web site to avoid is the Robots meta tag. But since not all search engines read meta tags, the Robots meta tag can simply go unnoticed. A better way to inform search engines of your wishes is to use a robots.txt file.
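To see how a crawler interprets such a file, here is a minimal sketch using Python's standard-library robots.txt parser; the rules and URLs are hypothetical, and `can_fetch()` answers the question "may this robot visit this URL?":

```python
from urllib import robotparser

# Hypothetical robots.txt content: block every robot from /tmp/.
rules = """\
User-agent: *
Disallow: /tmp/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# can_fetch(useragent, url): True if the URL is allowed for that robot.
print(rp.can_fetch("mybot", "http://example.com/index.html"))     # allowed
print(rp.can_fetch("mybot", "http://example.com/tmp/cache.html")) # disallowed
```

Note that, just like the note on the unlocked door, this only tells you what a *well-behaved* robot would do; nothing forces a crawler to consult the file at all.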
Meta Robots Tags, About Robots.txt and Search Indexing Robots

Entry:
User-agent: *
Disallow:
Meaning: Because nothing is disallowed, everything is allowed for every robot.

Entry:
User-agent: mybot
Disallow: /
Meaning: The mybot robot may not index anything, because the root path (/) is disallowed.

Entry:
User-agent: *
Allow: /
Meaning: For all user agents, everything is allowed.

Entry:
User-agent: BadBot
Allow: /About/robot-policy.html
Disallow: /
Meaning: The BadBot robot can see the robot policy document, but nothing else. All other user agents are by default allowed to see everything. This only protects a site if "BadBot" follows the directives in robots.txt.

Entry:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private
Meaning: In this example, all robots can visit the whole site, with the exception of the two directories mentioned and any path that starts with "private" at the host root directory, including items in privatedir/mystuff and the file privateer.html.

Entry:
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /*/private/*
Meaning: The blank line indicates a new "record" - a new user agent command. All other robots can see everything except any subdirectory named "private" (using the wildcard character).

Entry:
User-agent: WeirdBot
Disallow: /links/listing.html
Disallow: /tmp/
Disallow: /private/

User-agent: *
Allow: /
Disallow: /temp*
Allow: *temperature*
Disallow: /private/
Meaning: This keeps WeirdBot from visiting the listing page in the links directory, the tmp directory and the private directory. All other robots can see everything except the temp directories or files, but should crawl files and directories named "temperature", and should not crawl private directories. Note that the robots will use the longest matching string, so temps and temporary will match the Disallow, while temperatures will match the Allow.
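The per-robot record behavior in the BadBot example above can be checked with Python's standard-library parser. One caveat: `urllib.robotparser` implements the original 1994 exclusion rules and does not understand path wildcards such as `/*/private/*`, so this sketch uses a plain `/private/` prefix instead; the robot names and URLs are hypothetical.

```python
from urllib import robotparser

# Two records: BadBot is shut out entirely; everyone else
# only loses the /private/ tree.
rules = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("BadBot", "http://example.com/index.html"))       # False
print(rp.can_fetch("OtherBot", "http://example.com/index.html"))     # True
print(rp.can_fetch("OtherBot", "http://example.com/private/a.html")) # False
```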
Bad Examples - Common Wrong Entries

Use one of the robots.txt checkers to see if your file is malformed.

Entry:
User-agent: googlebot
Disallow /
Meaning: NO! This entry is missing the colon after Disallow.

Entry:
User-agent: sidewiner
Disallow: /tmp/
Meaning: NO! Robots will ignore misspelled User Agent names (it should be "sidewinder"). Check your server logs for User Agent names and consult the published listings of User Agent names.

Entry:
User-agent: MSNbot
Disallow: /PRIVATE
Meaning: WARNING! Many robots and web servers are case-sensitive, so this path will not match any root-level folders named private or Private.

Entry:
User-agent: *
Disallow: /tmp/

User-agent: Weirdbot
Disallow: /links/listing.html
Disallow: /tmp/
Meaning: Robots generally read from top to bottom and stop when they reach something that applies to them, so Weirdbot would probably stop at the first record, *. If there is a specific User Agent block, robots don't check the * (all user agents) block, so any general directives should be repeated in the special blocks.
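The first two mistakes above (a missing colon, a misspelled robot name) are easy to catch mechanically. Here is a minimal, hypothetical lint pass in that spirit; the `KNOWN_AGENTS` list is an assumption for illustration, not an official registry:

```python
# Hypothetical known-robot list; in practice, check your server logs
# and published User Agent listings.
KNOWN_AGENTS = {"*", "googlebot", "msnbot", "sidewinder"}

def lint_robots(text):
    """Flag lines with no colon and User-agent names not in KNOWN_AGENTS."""
    problems = []
    for n, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if ":" not in line:
            problems.append(f"line {n}: missing colon: {line!r}")
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "user-agent":
            agent = value.strip().lower()
            if agent not in KNOWN_AGENTS:
                problems.append(f"line {n}: unknown user agent {agent!r}")
    return problems

bad = ("User-agent: googlebot\nDisallow /\n"
       "User-agent: sidewiner\nDisallow: /tmp/")
for problem in lint_robots(bad):
    print(problem)  # flags the missing colon and the "sidewiner" typo
```

A real checker would also validate field names and paths, but even this small pass catches both of the "NO!" entries in the table above.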