Robots.txt might sound simple, but in reality it isn't.
It may be just a plain text file, but writing it and uploading it to your web server to instruct search engine bots is not a trivial task.
You have to know the proper syntax and directives to get the most out of it.
Even a tiny mistake can deindex your entire website, or index files that you don't want to appear in the Search Engine Results Pages (SERPs).
Well… in this guide, you'll learn how to create a Robots.txt file that instructs search engine and crawling bots on how they should treat your website's files.
This guide will walk you through everything you need to know, from the definition of Robots.txt to uploading it to your server, along with some alternatives to Robots.txt that you might need, depending on your scenario.
You can use this navigation menu to jump right into your pain points.
- Robots.txt Definition
- Different Types of Robots.txt Directives
- How to Allow Crawling of the Entire Website
- How to Disallow Crawling of the Entire Website
- How to Allow Crawling Only by Some Specific Bots
- How to Block Access to Some Specific Folders
- How to Block Access to Some Specific Files
- How to Throttle Crawl Rate to Reduce Server Load
- How to Create a Robots.txt File
- How to Upload a Robots.txt File
- How to Use Meta Robots Tag to Stop Indexing HTML Files
- How to Use X-Robots-Tag HTTP Header to Stop Indexing Non-HTML Files
So let’s get started with the definition:
What is a Robots.txt File?
A Robots.txt file, which implements what is known as the Robots Exclusion Protocol, is a plain text file with a set of instructions (called directives) for web crawling robots (typically search engine robots) about how they should crawl or scan a website's files and folders.
Robots.txt is a handy way to stop crawling of the web pages that you don't want to appear in the search results.
But it is not a reliable way to stop indexing: some of your disallowed pages might still get indexed if other websites link to them (which I'll discuss later). The primary purpose of the Robots.txt file is to prevent unnecessary server load imposed by crawling bots.
Once you know the real basics of creating a Robots.txt file, it’s very easy and less time consuming to write your own masterpiece to instruct the web crawling robots.
Note: A web crawling robot is software designed to discover links on the web, typically run by search engines like Google, Bing, Yahoo, and Ask (although there are also spam bots out there, such as email address harvesters). Web crawling robots are also known as crawling bots, web crawling spiders, web spiders, or sometimes simply bots.
Now let me walk you through the basics of writing a Robots.txt file.
Robots.txt Basics: Understanding the Syntax
First things first, a Robots.txt file or Robots Exclusion Protocol instructs web crawling bots on what pages & files they should and shouldn’t crawl.
But before you begin learning directives …
Here's the very basic syntax of a Robots.txt file: each line takes the form `field: value`, where the field is the name of the directive and the value is what the rule is about or applies to.
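For instance, assuming a hypothetical directory path, a complete rule looks like this:

```
User-agent: Googlebot
Disallow: /private/
```

Here `User-agent` and `Disallow` are the fields, while `Googlebot` and `/private/` are their values.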
Now, let’s dig deeper into the usage of directives.
Different Types of Robots.txt Directives
As you've already seen, a Robots.txt file consists of field elements and values.
The field elements (that is, the directives) can be divided into four main categories.
And here they are:
The Start-of-Group element is the directive that initiates a group of directives for crawling bots. The only Start-of-Group element is `User-agent`: the name of the web crawling software that the rule applies to. Its value is either an asterisk (*) sign or the name of a specific bot.
The group-member elements are those that follow the `User-agent` directive. Without the `User-agent` field, the group-member elements won't work, and vice versa.
There are three widely used group-member elements.
And here they are:
- Disallow: The `Disallow` element indicates the files and folders that you don't want crawled by the specified user agents. Its value is written as a relative URL path (using forward slashes).
- Allow: The `Allow` element indicates the files and folders of a disallowed directory that you do want crawled by the specified user agents. Its value is also written as a relative URL path. Not every search engine supports this directive, but Google, Bing, Yahoo, and Ask do.
- Crawl-delay: The `Crawl-delay` directive tells search engines how long to wait before crawling the next webpage, file, or folder. Its value is a number, which indicates seconds. The `Crawl-delay` directive is not supported by Google; instead, Google offers a separate settings section in which the crawl rate can be configured.
A non-group element is a standalone element that doesn't require any preceding field or directive. The following is the only non-group element:
- Sitemap: This directive states the location of your website's XML sitemap. It's always placed at the end, after all the other directives.
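Assuming a placeholder domain, a `Sitemap` line looks like this:

```
Sitemap: https://example.com/sitemap.xml
```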
You can also add comments in a Robots.txt file to provide short explanations or to separate sets of directives. To write a comment, start the line with the octothorpe sign (#), like the following:
```
# This is an example comment. You can write anything you want here.
```
Remember that comments are ignored by the crawlers.
Apart from these field elements, there are two wildcard characters you'll find in field values.
Let's see what they are:
Wildcards for Field Values:
- An Asterisk Sign (*): Matches any number and type of characters.
- Dollar Sign ($): Matches the end of a URL, such as .html, .pdf, or .php.
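As a sketch (the paths here are made up for illustration), the two wildcards are used like this:

```
User-agent: *
# Asterisk: blocks /print/, /print-archive/, and anything else starting with "print"
Disallow: /print*/
# Dollar sign: blocks only URLs that end exactly in .pdf
Disallow: /*.pdf$
```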
Now, let’s move further to learn how you can use the above field elements to instruct search engine bots.
Allow Crawling of the Entire Site
If you want your entire site to be crawled by all kinds of web robots, type in the following directive:
```
User-agent: *
Disallow:
```
The asterisk (*) sign in the `User-agent` field signifies that the rule applies to every kind of bot, and the empty `Disallow` field signifies that there's no restriction on crawling: the robots can crawl all of the website's links.
Stop Crawling of the Entire Website
As I already told you above, the value of the `Disallow` directive is written using forward slashes.
A single forward slash (/) represents the root of the domain, so putting a single forward slash in the value field means the rule applies to your entire website. No exceptions.
If you don't want your site to be crawled at all, type in the following directives.
```
User-agent: *
Disallow: /
```
The single forward slash in the `Disallow` directive above indicates that no content can be crawled.
Allow Crawling only by Specific Bots
It is possible to allow access for only one or a few specific bots and block all other bots from accessing your content.
To allow only Googlebot and block all other web spiders:
```
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
```
To allow a few known-bots and disallow all other unknown bots:
```
User-agent: Googlebot
User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
```
Or you can define each group record individually instead, like this:
```
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
```
Block Specific Folders or Directories
The most significant thing to note is that URLs in Robots.txt are written as relative paths, meaning you don't have to write out the whole URL that you want to block.
Here’s how you do it:
Suppose you want to block crawling of a directory called folder-1, which is at http://example.com/folder-1/
You then need to insert the name of that directory between two forward slashes.
```
User-agent: *
Disallow: /folder-1/
```
Now suppose you want to block two folders of the root directory…
Then the set of directives will look like the following:
```
User-agent: *
Disallow: /folder-1/
Disallow: /folder-2/
```
Note: You don't need to list the folders that you do want crawled. The bots only take the disallowed folders into account and assume all unmentioned folders are allowed to be crawled.
Now, in case you want to allow access to a subfolder of a blocked folder…
Here’s how you instruct in that particular case:
```
User-agent: *
Disallow: /folder-1/
Allow: /folder-1/subfolder-2/
Disallow: /folder-2/
Allow: /folder-2/subfolder-3/
```
Note: Write the Disallow directive before the Allow directive. This is not strictly necessary, but it's good practice.
Block Specific Files
If you want to block some specific files or pages, such as login pages, exclusive pages, or other files that you don't want indexed in search results…
Put their URL paths like the following:
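For instance, assuming hypothetical page names, the rules could look like this:

```
User-agent: *
Disallow: /login.html
Disallow: /exclusive-offer.html
```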
Remove the .html if you are hiding the extension of your webpage files.
To block files that end with a specific file extension, first type *, then the file extension, followed by a $ sign.
You can also block file extensions without adding the dollar sign ($) at the end. That will additionally block crawling of URLs with parameters, but it may also block file names like /news.pdf.html, which is very unlikely to occur.
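Here's what both variants look like, using .pdf as the example extension:

```
User-agent: *
# Blocks only URLs that end exactly in .pdf
Disallow: /*.pdf$
# Blocks .pdf anywhere in the URL, including URLs with parameters
Disallow: /*.pdf
```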
Throttle Crawl Rate of Files & Directories
Sometimes bots crawl multiple files and directories simultaneously, which can consume a lot of server resources. In that particular case…
You can use the `Crawl-delay` directive to instruct crawling bots how many seconds to wait between crawling one page and the next.
This directive is especially useful for websites with a huge number of pages that publish and update content frequently.
Only use it if you really need it, especially if certain search engine bots are using a lot of your server's resources.
Here’s how you use it:
```
User-agent: *
Disallow: /folder-name/
Crawl-delay: 10
```
The `Crawl-delay` directive should sit last, after all the other directives of a group.
The directive above instructs bots to wait 10 seconds between each crawl.
Do note that the `Crawl-delay` directive is not supported by Google. Instead, Google provides a Crawl Rate feature in Search Console where you can set the velocity of crawling.
But always keep in mind that you should use crawl-delay or crawl-rate functionality only when search engines are really slowing down your server.
The `Crawl-delay` directive is supported by Bing, Yahoo, and Yandex, and you can specify it for each crawler individually.
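For example (the delay values here are arbitrary), you can declare a different delay per crawler:

```
User-agent: bingbot
Crawl-delay: 5

User-agent: Yandex
Crawl-delay: 10
```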
How to Create a Robots.txt File
Before you hop into crafting a Robots.txt file, check whether you already have one.
Add “/robots.txt” at the end of your domain name (like https://example.com/robots.txt) and hit return.
You'll either get a valid Robots.txt file, a blank file, or a 404 error page.
If you already have one, it's likely the default file created by your CMS at installation time.
If you don't have one, you'll need to create it.
In either case, you'll probably need to create or modify your Robots.txt file.
Here’s how you make it:
Open a plain text editor on your PC (like Notepad) and save with UTF-8 or ASCII character encoding (other encodings are not supported). Avoid using a word processor (like MS Word), as it might insert unsupported characters that can cause your Robots.txt file to be misinterpreted.
Now, as an example, I'm going to write rules for a WordPress-powered site that block two paths and allow one.
Here’s how it looks:
```
# The following rules apply to every crawler.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /newsletter.html

Sitemap: https://makemebait.com/sitemap.xml
```
Now, you might wanna know that…
Why have I blocked two locations but allowed one specific location inside a blocked one?
I've blocked the folder "/wp-admin/"
Because it is only used for logging in to the WordPress backend, and letting robots crawl it would be a waste of server resources.
Also, it has nothing to do with any searcher’s intent.
I've allowed the "admin-ajax.php" file of the blocked wp-admin folder, because WordPress themes and plugins often load front-end features through it, and blocking it can prevent crawlers from rendering pages properly.
I've disallowed the newsletter.html page
Because that page is linked from a CTA button and is exclusive to users who have just signed up for the newsletter subscription. The page is, in fact, a thank-you page (a subscription confirmation page) to greet a new member, and hence it has nothing to do with any searcher's intent.
Apart from these, you can also block other files and directories if you need to.
Use your own judgment, and take the necessary action.
And once you finish creating the file, save it, name it robots.txt (all letters must be lowercase), and upload it to your server's root folder via cPanel or your hosting account's file manager. You can also log in with an FTP client, like FileZilla.
Here’s how easy it is to upload it:
How to Upload a Robots.txt File to Your Website’s Root Directory
Log in to your File Manager account (here I’m accessing it via cPanel).
Locate the public_html folder. In cPanel dashboard it can be found in the left vertical menu.
The files in that directory will be like the following:
And upload the Robots.txt file you’ve just created by clicking on the Upload button in the horizontal menu.
Now append /robots.txt at the end of your domain name like example.com/robots.txt. And you’ll be able to see your live robots file on your website.
But… here a problem arises.
The problem is that your disallowed pages could still get indexed if some websites link to them.
To avoid this issue, you can add a `Noindex` directive to the robots.txt file. The syntax is to add the `Noindex` line after the corresponding `Disallow` line:

```
User-agent: *
Disallow: /newsletter.html
Noindex: /newsletter.html
```

This method was something of an open secret among webmasters, but Google never officially documented it and has announced it will stop honoring it, so you should not rely on it.
So here are two alternative ways to stop indexing of your disallowed pages:
Use Robots Meta Tag to Stop Indexing HTML Files
The easiest way to implement this tag is by using Yoast SEO WordPress plugin.
To do this at the individual page level, scroll down below the content editor of the page or post that you want to noindex, and click on the Advanced Settings gear icon in the Yoast SEO block.
Then select No under "Allow search engines to show this Page in search results?". This adds a noindex directive to that page, and it won't appear in search results.
If you also don't want any links on that page to be followed by crawlers, add the nofollow meta robots directive by clicking No under "Should search engines follow links on this Page?".
If you don’t wanna use any plugin or want to do it manually…
Add the following HTML robots meta tag after the opening <head> tag of the page that you want to noindex:

```
<meta name="robots" content="noindex">
```
If you want to add the nofollow directive along with noindex, do the following:

```
<meta name="robots" content="noindex, nofollow">
```
Note: Replace robots with the name of the crawler if you want to give this instruction to one specific bot.
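For example, since googlebot is the documented user-agent name for Google's crawler, this tag applies noindex only to Google:

```
<meta name="googlebot" content="noindex">
```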
Now, if you want to add noindex to some non-HTML files…
The robots meta tag can't help you.
So, you need to use X-Robots-Tag HTTP Header instead of Robots Meta Tag.
Here’s how you do it:
Use X-Robots-Tag to Stop Indexing Non-HTML Files
Using the X-Robots-Tag is a bit different from using the robots meta tag.
The robots meta tag is an HTML element, so you can only implement it in HTML-based files or pages, whereas the X-Robots-Tag is an HTTP header directive that can be applied to any type of file, and even to whole folders.
Suppose you want to add noindex to PDF files…
Add the following directive by editing your site's .htaccess file, which can be found in the root directory (the public_html folder) of your site. Make sure you back up the .htaccess file before you start editing.
```
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex"
</FilesMatch>
```
Or follow the screenshot below to add the above code into the .htaccess file:
Now, in case you want to noindex a whole directory or folder of your website…
Then do the following:
To noindex a folder, you need to add a separate .htaccess file to that respective folder or directory.
First, create a simple text file on your local computer.
And… add the following code into that text file.
```
Header set X-Robots-Tag "noindex"
```
If you also want to add nofollow directive, then add the following code instead of the above code.
```
Header set X-Robots-Tag "noindex, nofollow"
```
Now, save the file, name it .htaccess and upload it to the respective folder.
To noindex another folder, create an identical .htaccess file and upload it to that respective directory.
Now you know how to use Robots.txt, Meta Robots Tag and even X-Robots-Tag HTTP Headers!
Don’t forget to give yourself a pat on the back.
- A Robots.txt file is just a set of directives to reduce server load, and respecting those directives is up to the crawlers.
- The `User-agent` directive indicates which crawling bots the rule applies to.
- Use the `Disallow` directive to define the files and folders that you don't want crawled by a given search engine.
- Use the `Allow` directive to define the files and folders of a disallowed directory that you do want crawled.
- Put the `Allow` directive after the `Disallow` directive.
- The asterisk (*) sign signifies any character, words, or numbers. Whereas the dollar sign ($) signifies the end of the URL (right after the end of the file extension name).
- Don’t use Robots.txt file to hide URLs from the search results. Instead, use Meta Robots Tag or X-Robots-Tag HTTP Header to hide them.
- Always put the `Sitemap` element at the end of the robots.txt file.
Did you make your Robots.txt file?