What Is Robots.txt For A Web Site
Robots.txt: This article gives a detailed overview of the robots.txt file for a website, including its function and structure. Robots.txt is also known as the Robots Exclusion Standard (RES), the Robots Exclusion Protocol (REP), or the robots.txt protocol.
It is good when search engines regularly visit your website (web crawling/web indexing) to collect data. But sometimes we don’t want a spider to crawl some of our web pages, either to avoid duplicate content or for some other reason. Website owners may also want to keep printable versions of pages or other sensitive data from being crawled. In that case we create a robots exclusion file (robots.txt).
What is Robots.txt: Robots.txt is a standard that websites use to communicate with web crawlers (web robots). This robots exclusion protocol (REP) tells a web crawler which pages you would like it not to visit and which pages it may crawl.
Robots.txt is a plain text file (not HTML) placed on the website. It is not mandatory, but well-behaved search engines generally respect what they are asked not to do. It applies only to web spiders, and following it is voluntary. It is not a firewall or a kind of password protection for your website.
If a website has no robots.txt file, a spider will treat all of its web pages as open to crawling. The location and structure of robots.txt are very important. If you get either one wrong, the file is of no use. So pay special attention to the location and structure of the robots.txt file.
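To illustrate how a cooperative crawler consults these rules before requesting a page, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the crawler name ExampleBot, and the example.com URLs are assumptions made up for illustration; they do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def allowed(user_agent, url):
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a made-up crawler name.
print(allowed("ExampleBot", "http://example.com/index.html"))        # True
print(allowed("ExampleBot", "http://example.com/print/page1.html"))  # False
```

Nothing forces a crawler to run a check like this, which is why robots.txt should not be relied on as access control.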
Location of the robots.txt file
Robots.txt file location: The robots exclusion protocol (REP) file, robots.txt, must be in the main (root) directory of the website. Otherwise the web crawler will not find it, because a search robot does not search the entire website for a robots.txt file. So put it in the main directory (i.e. [webdomain]/robots.txt). If the crawler doesn’t find the file there, it will assume that the website has no robots.txt file and crawl the entire website.
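For the website http://rocknik.com/mihir/, the robots.txt file belongs at the root of the host, http://rocknik.com/robots.txt. Crawlers do not look for it inside a subdirectory such as http://rocknik.com/mihir/robots.txt.
To create a robots.txt file, just follow the structure described below.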
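Since crawlers request the file from the root of the host, the address they check can be derived from any page URL by keeping the scheme and host and replacing the path. A small sketch of that derivation follows; the helper name robots_url is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the root-level robots.txt address for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; drop the page's own path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://rocknik.com/mihir/"))  # http://rocknik.com/robots.txt
```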
Structure Of Robots.txt File in Web Site
Robots.txt File Structure: The structure of a robots.txt file is pretty simple (and barely flexible). You can add an endless list of user agents and disallowed files and directories. Basically, each group of rules starts with a User-agent: line naming a crawler, followed by one or more Disallow: lines, each naming one file or directory path.
User-agent lines name the search engine crawlers that a group of rules applies to. Disallow lines list the files and directories excluded from crawling. To add a comment line for yourself, just put a “#” symbol (without quotes) at the beginning of the line.
# All user agents are disallowed to see the /temp directory.
User-agent: *
Disallow: /temp/
Note: You can add any number of user-agent, disallow, and comment lines to the robots.txt file. When adding multiple entries, take special care to avoid mistakes.
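As a rough illustration of how several user-agent groups and comment lines combine, the sketch below feeds a small file to Python's standard urllib.robotparser and asks which paths each crawler may fetch. The second group, the crawler names ExampleBot and OtherBot, the /drafts/ path, and the example.com URLs are assumptions added for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: the first group mirrors the /temp rule described above;
# the second group is a hypothetical addition for one named crawler.
ROBOTS_TXT = """\
# All user agents are disallowed to see the /temp directory.
User-agent: *
Disallow: /temp/

# Extra restrictions for one (made-up) crawler.
User-agent: ExampleBot
Disallow: /temp/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler without its own group falls back to the "*" group.
print(parser.can_fetch("OtherBot", "http://example.com/temp/a.html"))      # False
print(parser.can_fetch("OtherBot", "http://example.com/drafts/a.html"))    # True
# The named crawler is matched against its own group.
print(parser.can_fetch("ExampleBot", "http://example.com/drafts/a.html"))  # False
```

Lines starting with "#" are ignored by the parser, so comments can be used freely to document each group.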
Common Mistakes In Creating A Robots.txt File
- Wrong placement of the robots.txt file. Always place it in the main directory.
- Omitting the hyphen between user and agent (i.e. user agent:). This is wrong; the correct form is user-agent:
- Take care when placing slashes and colons.
- Use the correct URL path for each directory.
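Some of the mistakes above can be caught mechanically. The following is a minimal sketch of such a check, not a complete validator. The function name lint_robots_txt is hypothetical, and it assumes only the two fields covered in this article (User-agent and Disallow), so other fields that real files may contain would be reported as unknown.

```python
# Only the two fields discussed in this article; this is an assumption
# of the sketch, not a full list of what crawlers understand.
KNOWN_FIELDS = {"user-agent", "disallow"}


def lint_robots_txt(text):
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing colon")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field}'")
        elif field.lower() == "disallow" and value and not value.startswith("/"):
            problems.append(f"line {number}: path should start with '/'")
    return problems


BAD = """\
User agent: *
Disallow temp/
Disallow: temp/
"""

for problem in lint_robots_txt(BAD):
    print(problem)
```

On the sample above, the check flags the missing hyphen in "User agent", the missing colon, and the path without a leading slash. Wrong placement of the file cannot be detected from its contents and still has to be checked by hand.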