Learning Online

Robots.txt

The Robot Exclusion Standard is also known as the Robots Exclusion Protocol or robots.txt protocol. It is a convention for advising cooperating web crawlers and other web robots which parts of an otherwise publicly viewable website they should not access. Robots are often used by search engines to categorize and archive websites, or by webmasters to proofread source code. The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.

Example:

This example tells all robots that they may visit all files: the wildcard * matches every robot, and the empty Disallow directive blocks nothing:

User-agent: *
Disallow:

The same result can be accomplished with an empty or missing robots.txt file.
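
For experimenting with such rules programmatically, Python's standard library provides the urllib.robotparser module, which implements this protocol. The sketch below (the site example.com and the user-agent name "MyBot" are placeholders) feeds the allow-all rules above to the parser and confirms that any URL may be fetched:

import urllib.robotparser

# Parse the allow-all rules shown above directly, without any network access.
rules = """
User-agent: *
Disallow:
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# An empty Disallow directive blocks nothing, so every path is allowed.
print(parser.can_fetch("MyBot", "https://example.com/any/page.html"))  # True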

This example tells all robots to stay out of a website:

User-agent: *
Disallow: /

This example tells all robots not to enter three directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

This example tells all robots to stay away from one specific file:

User-agent: *
Disallow: /directory/file.html

Note that all other files in the specified directory will be processed.
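
The same standard-library parser can be used to verify this behaviour; example.com and "MyBot" are again placeholders:

import urllib.robotparser

rules = """
User-agent: *
Disallow: /directory/file.html
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# Only the listed file is blocked; a sibling file in the same directory is still allowed.
print(parser.can_fetch("MyBot", "https://example.com/directory/file.html"))   # False
print(parser.can_fetch("MyBot", "https://example.com/directory/other.html"))  # True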

This example tells a specific robot to stay out of a website:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /

This example tells a specific robot not to enter one specific directory:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /private/

Example demonstrating how comments can be used:

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # matches all bots
Disallow: / # keep them out

It is also possible to list multiple robots, each with its own rules. The actual robot string is defined by the crawler. A few operators, such as Google, run several crawlers with distinct user-agent strings, which allows a webmaster to deny access to only a subset of their services by targeting the appropriate user-agent string.

Example demonstrating multiple user-agents:

User-agent: googlebot        # all services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # on everything

User-agent: *                # all robots
Disallow: /something/        # on this directory
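
As a rough check of how different user-agents are treated, the same rules can be fed to urllib.robotparser (example.com and "SomeOtherBot" are placeholders). Note that this simple parser matches user-agent groups less strictly than the protocol requires of real crawlers, where the most specific matching group wins, so the googlebot-news case is not queried here:

import urllib.robotparser

rules = """
User-agent: googlebot
Disallow: /private/

User-agent: googlebot-news
Disallow: /

User-agent: *
Disallow: /something/
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# googlebot is covered by its own group, which blocks only /private/.
print(parser.can_fetch("googlebot", "https://example.com/private/data.html"))       # False
# Any other robot falls back to the * group, which blocks only /something/.
print(parser.can_fetch("SomeOtherBot", "https://example.com/private/data.html"))    # True
print(parser.can_fetch("SomeOtherBot", "https://example.com/something/page.html"))  # False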

 

Note: The robots.txt file must be properly formatted; a malformed file may be ignored by crawlers.
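
In practice a crawler fetches robots.txt from the root of the site before crawling any pages. A minimal sketch of that workflow with the same standard-library parser (example.com and "MyBot" are placeholders, and the site must be reachable for read() to succeed):

import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # the file lives at the site root
parser.read()                                     # fetch and parse it over HTTP

# Ask before fetching a page with our (hypothetical) user-agent.
if parser.can_fetch("MyBot", "https://example.com/some/page.html"):
    print("allowed to crawl this page")
else:
    print("blocked by robots.txt")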
