You have probably heard that Google crawls websites across the Internet and then displays their pages on its search results page.
The robots.txt file tells Google and other crawlers which parts of a site they may crawl and which they should skip. In this guide, we have shared everything you need to know about robots.txt, with examples and an explanation of how it works.
What is Robots.txt?
Robots.txt is a text file that webmasters create to guide web robots (usually search engine robots) on how to crawl pages on their domain. In other words, a robots.txt file is a set of instructions for bots.
Most websites publish a robots.txt file at the root of their domain.
The robots.txt file is part of the Robots Exclusion Protocol (REP). It governs how robots crawl web pages, find and index content, and serve that content to the people who are looking for it.
Robots.txt helps search engine bots understand which URLs they should (and should not) crawl on your site.
Why is Robots.txt Important?
There are three main reasons why robots.txt can help you:
- Maximize Crawl Budget: The crawl budget is the number of pages on your site that Google's bots crawl within a given timeframe. Robots.txt can block unimportant or duplicate pages so that crawlers focus on the pages that matter to you, making the most of your crawl budget and increasing your visibility.
- Avoid Indexing Resources: Robots.txt can keep crawlers away from resources such as PDFs or images. For pages that must stay out of the index entirely, a noindex directive or password protection is the more reliable tool. You can check the indexed pages in Google Search Console to see whether bots are crawling the pages you actually want indexed.
- Block Non-important Pages: Some pages on your website should not appear in Google search results, such as a login page or a staging/test version of the site. Using robots.txt, you can block these pages so crawlers focus only on the pages that matter (see the example after this list).
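To make this concrete, here is a minimal sketch of such a file; the /login/ and /staging/ paths are hypothetical placeholders for whatever unimportant pages your own site has:
User-agent: *
Disallow: /login/
Disallow: /staging/
Sitemap: https://www.example.com/sitemap.xml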
Robots.txt Examples
Some examples of robots.txt are:
User-agent: Googlebot
Disallow: /nogooglebot/
User-agent: *
Allow: /
Sitemap: http://www.example.com/sitemap.xml
In this example, Googlebot may not crawl anything under /nogooglebot/, while every other bot is allowed to crawl the whole site, and the sitemap location is declared. Similarly, for the Bing search engine, a rule looks like this:
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
This syntax tells Bingbot not to crawl the page at /example-subfolder/blocked-page.html.
It's worth noting that each subdomain needs its own robots.txt file.
For example, while www.cloudflare.com has its own file, all Cloudflare subdomains (blog.cloudflare.com, community.cloudflare.com, and so on) require their own as well.
How Does a Robots.txt File Work?
The robots.txt file can be used to implement several Search Engine Optimization (SEO) techniques, such as keeping certain pages out of the crawl or disallowing bots from all or specific parts of the site.
The standard is particularly valuable for sites that wish to prevent automated crawlers, such as search and page-ranking software, from indexing their content.
A robots.txt file is a plain text file with no HTML markup. It is hosted on the web server, just like the other files on your website.
It can be accessed by entering the homepage URL followed by /robots.txt. A general example is https://www.xyz.com/robots.txt.
Because the file isn't linked anywhere else on the site, visitors are unlikely to come across it, but most web crawler bots will look for it before indexing the rest of the site.
A good bot, such as a Google crawler or a news feed bot, will read the robots.txt file first before examining any other pages on a site and obey the instructions.
A malicious bot, by contrast, will either ignore the robots.txt file or parse it to locate the pages you have tried to block.
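To illustrate the sequence, here is a minimal sketch of how a polite crawler could check robots.txt before fetching a page, using Python's built-in urllib.robotparser module; the site URL and the "MyCrawler" user agent are made-up placeholders:
from urllib import robotparser

# Download and parse the site's robots.txt (hypothetical URL)
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# A well-behaved bot asks before fetching each page
page = "https://www.example.com/nogooglebot/page.html"
if rp.can_fetch("MyCrawler", page):
    print("robots.txt allows this URL; crawling it")
else:
    print("robots.txt disallows this URL; skipping it")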
Robots.txt Blocking
Robots.txt contains instructions that inform robots of blocking rules on an otherwise crawlable website indexed by Googlebot; those rules are intended to prevent crawlers from accessing pages with certain content.
It does not affect regular users or bots that simply browse sites without crawling them.
Over time, websites have used this method to block crawlers from mobile pages, JavaScript files, and other parts of their sites while still allowing some elements, such as images.
What Protocols are Used in a Robots.txt File?
A protocol is a format for transmitting instructions or orders in networking. Robots.txt files employ a variety of protocols. The primary protocol is known as the Robots Exclusion Protocol.
It instructs bots on which web pages and resources to avoid.
The sitemaps protocol is another protocol that is used for robots.txt files. This can be thought of as a protocol for robot inclusion.
Sitemaps inform web crawlers about which pages they can access. This helps ensure that a crawler bot does not overlook any crucial pages.
What is Sitemap?
A sitemap is an XML file that tells web crawlers which pages on your site to crawl. It lists the URLs on your website along with related information, such as when each page was last updated.
This file helps search engines index those pages correctly, so that visitors can find them quickly through search engines or other navigation systems.
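For reference, a minimal sitemap might look like the sketch below; the URL and date are placeholders:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
</urlset>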
What is a User-Agent?
A user agent is the name a crawler uses to identify itself, and robots.txt rules are grouped under User-agent lines that state which crawlers each group applies to. For example, the following rules disallow the entire site for every crawler, regardless of user agent:
User-agent: *
Disallow: /
This will prevent any web crawler from accessing your site. It is most often used by websites with privacy concerns that do not want their content or their users' data exposed to crawlers.
Common search engine bot user agent names include:
Google:
- Googlebot
- Googlebot-Image (for images)
- Googlebot-News (for news)
- Googlebot-Video (for video)
Bing:
- Bingbot
- MSNBot-Media (for images and video)
Baidu:
- Baiduspider
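A rule group can also target one of these user agents specifically. The sketch below, with a hypothetical /photos/ path, blocks only Google's image crawler while leaving the rest of the site open to everyone else:
User-agent: Googlebot-Image
Disallow: /photos/

User-agent: *
Allow: /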
What is a .txt File?
TXT is a file extension for plain text files and is supported by many text editors.
There is no single fixed definition of a text file, although several popular formats exist, including ASCII (a cross-platform format) and ANSI (used on DOS and Windows platforms). TXT is an abbreviation for "text", and its MIME type is text/plain.
In the robots.txt text file, each rule specifies a pattern of URLs that may be accessed by all crawlers or only by the crawlers named in the rule group.
Each line consists of a field name, a colon, and a value (for example, Disallow: /private/); blank lines are ignored and can be used to separate groups of rules, and paths are interpreted relative to the root of the host the file is served from.
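As a small sketch of that field-and-value pattern (the /search/ path is a hypothetical placeholder), note that lines starting with # are treated as comments:
# Keep crawlers out of internal search results
User-agent: *
Disallow: /search/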
Are Web Robots the Same as Robots.txt?
No. Web robots (crawlers) are the programs that visit your site, while robots.txt is simply the file that gives them instructions. Note that some search engines may not support robots.txt directives at all.
The instructions in robots.txt files cannot compel crawlers to visit your site; it is up to the crawler to follow them. In contrast, Googlebot and other well-known web crawlers follow the rules in a robots.txt file.
How to Implement Robots.txt?
A robots.txt file can be implemented in nearly any text editor. Notepad, TextEdit, vi, and emacs, for example, may all generate legitimate robots.txt files.
The following are the rules to follow when creating a robots.txt file:
- The file name should be robots.txt.
- A robots.txt file can be used to restrict access to subdomains (for example, https://website.example.com/robots.txt) or non-standard ports.
- A robots.txt file must be in UTF-8 format (which includes ASCII). Google may reject characters outside the UTF-8 range, which can invalidate your robots.txt rules.
- The robots.txt file is placed in the root directory of the website host to which it applies. To control crawling for URLs under https://www.example.com/, the robots.txt file must be found at https://www.example.com/robots.txt. It cannot be placed in a subdirectory (e.g., https://example.com/pages/robots.txt). Once it is uploaded, you can verify that it is being served correctly, as shown in the sketch after this list.
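Here is a minimal sketch of such a check in Python, assuming the hypothetical domain www.example.com:
from urllib import request

# Fetch robots.txt from the root of the (hypothetical) domain
url = "https://www.example.com/robots.txt"
with request.urlopen(url) as resp:
    body = resp.read().decode("utf-8")  # the file must decode cleanly as UTF-8
    print(resp.status)  # expect 200 if the file is being served
    print(body)         # the rules crawlers will actually see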
Limitations of a Robots.txt File
Here are the limitations of the robots.txt file:
- Not all search engines support robots.txt: Robots.txt files cannot compel crawlers to follow their rules; it is up to each crawler to obey them. While Googlebot and other trustworthy web crawlers will follow the instructions in a robots.txt file, other crawlers may not.
- Different crawlers interpret syntax differently: Although reputable web spiders adhere to the directives in a robots.txt file, each crawler may interpret the directives differently.
- A disallowed page can be indexed if it is linked to other pages: While Google will not crawl or index content that a robots.txt file has restricted, it may find and index a disallowed URL if it is linked from other locations on the internet.
As a result, the URL address and perhaps other publicly available information such as anchor text in links to the page may still appear in Google search results.
FAQ
Q1. Where does robots.txt go on a site?
Ans: A robots.txt file must be placed in the root directory of your website (for example, https://www.example.com/robots.txt). Crawlers only look for it there, so a copy placed in a subdirectory will not be found.
Q2. Is a robots.txt file necessary?
Ans: The short answer is no. A robots.txt file isn't necessary for a website. If a bot visits your website and there is no robots.txt file, it will simply crawl and index pages as it normally would. A robots.txt file is only needed if you want more control over what gets crawled.
Q3. Is robots.txt safe?
Ans: The robots.txt file is not a security risk in and of itself, and its proper use can represent good practice for non-security reasons. You should not expect that all web robots will follow the instructions in the file.
Q4. Is it illegal to access robots.txt?
Ans: Accessing a robots.txt file itself is not illegal; it is a publicly available file that acts as an implied license from the website owner describing what may be crawled. However, if you are aware of a site's robots.txt rules and continue to scrape disallowed pages without permission, that could be seen as unauthorized access.
Q5. What is crawl-delay in robots.txt?
Ans: The crawl-delay directive is a way to tell crawlers to slow down so that the webserver isn't overloaded.
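As a rough sketch, a crawl-delay rule might look like this; the 10-second value is only an illustration, and not every crawler honors the directive (Google, for example, ignores it):
User-agent: Bingbot
Crawl-delay: 10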
Conclusion
Robots.txt is a simple file that has a lot of power. If you know how to use it well, it can help SEO. Creating the right type of robots.txt means that you are improving your SEO and user experience as well.
If you allow bots to crawl the right things, they will be able to present your content in the SERPs the way you want it to be seen.
If you want to learn more about SEO and other important ranking factors, check out the other Scalenut blogs.