What Is a Robots.txt File?
A robots.txt file is a set of instructions that tell search engines which pages to crawl and which to avoid, guiding crawler access but not necessarily keeping pages out of Google’s index.
A robots.txt file looks like this:
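A minimal example, using placeholder paths and a hypothetical sitemap URL:

```
User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
```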
Robots.txt files may seem complicated, but the syntax is straightforward.
Before explaining the details of robots.txt, let’s clarify how robots.txt differs from other terms that sound similar.
Robots.txt files, meta robots tags, and x-robots tags all guide search engines in handling site content, but they differ in their level of control, where they’re placed, and what they control.
Consider these specifics:
- Robots.txt: This file is placed in the website’s root directory and provides site-wide instructions to search engine crawlers on which areas of the site they should and shouldn’t crawl
- Meta robots tags: These tags are snippets of code in the head section of individual webpages and provide page-specific instructions to search engines on whether to index (include in search results) and follow (crawl the links on) each page
- X-robots-tags: These code snippets are used primarily for non-HTML files, such as PDFs and images, and are implemented in the file’s HTTP header
Further reading: Meta Robots Tag & X-Robots-Tag Explained
Why Is Robots.txt Important for SEO?
A robots.txt file is important for SEO because it helps manage web crawler activity to prevent crawlers from overloading your site and from crawling pages not intended for public access.
Below are a few reasons to use a robots.txt file:
1. Optimize Crawl Budget
Blocking unnecessary pages with robots.txt allows Google’s web crawler to spend more crawl budget (the number of pages Google will crawl on your site within a certain timeframe) on pages that matter.
Crawl budget can vary based on your site’s size, health, and number of backlinks.
If your website has more pages than its crawl budget, important pages may fail to get indexed.
Unindexed pages won’t rank, which means you’ve wasted time creating pages that users never see in search results.
2. Block Duplicate and Non-Public Pages
Not all pages are meant for inclusion in the search engine results pages (SERPs), and a robots.txt file lets you block these private pages from crawlers.
Consider staging sites, internal search results pages, duplicate pages, or login pages. Some content management systems handle these internal pages automatically.
WordPress, for example, disallows the login page “/wp-admin/” for all crawlers.
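The relevant lines in a typical default WordPress robots.txt look like this (WordPress also allows one admin file that front-end features rely on):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```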
3. Hide Resources
Robots.txt lets you exclude resources like PDFs, videos, and images from crawling if you want to keep them private or have Google focus on more important content.
How Does a Robots.txt File Work?
A robots.txt file tells search engine bots which URLs to crawl and (more importantly) which URLs to avoid crawling.
When search engine bots crawl webpages, they discover and follow links. This process takes them from one site to another across various pages.
If a bot finds a robots.txt file, it reads that file before crawling any pages.
The syntax is straightforward. You assign rules by identifying the user-agent (the search engine bot) and specifying directives (the rules).
You can use an asterisk (*) to assign directives to all user-agents at once.
For example, the following instruction allows all bots except DuckDuckGo to crawl your site:
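Such an instruction could look like the block below (DuckDuckBot is DuckDuckGo’s crawler; every other bot falls under the * group):

```
User-agent: DuckDuckBot
Disallow: /

User-agent: *
Allow: /
```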
Semrush bots crawl the web to gather insights for our website optimization tools, such as Site Audit, Backlink Audit, and On Page SEO Checker.
Semrush bots respect the rules in your robots.txt file, meaning they won’t crawl your site if you block Semrush bots from crawling.
However, blocking Semrush bots limits the use of certain Semrush tools.
For example, if you block the SiteAuditBot from crawling your site, you can’t audit your site with the Site Audit tool. This tool helps analyze and fix technical issues on your website.
If you block the SemrushBot-SI from crawling your site, you can’t use the On Page SEO Checker tool effectively.
As a result, you lose the opportunity to generate optimization ideas that could improve your webpages’ rankings.
How to Find a Robots.txt File
Your robots.txt file is hosted on your server, just like other files on your website.
You can view any website’s robots.txt file by typing the site’s homepage URL into your browser and adding “/robots.txt” at the end.
For example: “https://semrush.com/robots.txt.”
Examples of Robots.txt Files
Here are some real-world robots.txt examples from popular websites.
YouTube
YouTube’s robots.txt file tells crawlers not to access user comments, video feeds, login/signup pages, and age verification pages.
The rules in YouTube’s robots.txt file discourage indexing of user-specific or dynamic content that doesn’t help search results and may raise privacy concerns.
G2
G2’s robots.txt file tells crawlers not to access sections with user-generated content, like survey responses, comments, and contributor profiles.
The rules in G2’s robots.txt file help protect user privacy by restricting access to potentially sensitive personal information. The rules also prevent attempts to manipulate search results.
Nike
Nike’s robots.txt file uses the disallow directive to block crawlers from accessing user-generated directories, like “/checkout/” and “*/member/inbox.”
The rules in Nike’s robots.txt file prevent sensitive user data from appearing in search results and reduce opportunities to manipulate SEO rankings.
Search Engine Land
Search Engine Land’s robots.txt file uses the disallow tag to discourage indexing of “/tag/” directory pages, which often have low SEO value and can cause duplicate content issues.
The rules in Search Engine Land’s robots.txt file encourage search engines to focus on higher-quality content and optimize the site’s crawl budget, something especially important for large websites like Search Engine Land.
Forbes
Forbes’s robots.txt file instructs Google not to crawl the “/test/” directory, which likely contains testing or staging environments.
The rules in Forbes’s robots.txt file prevent unfinished or sensitive content from being indexed, assuming it’s not linked to from elsewhere.
Explaining Robots.txt Syntax
A robots.txt file consists of one or more directive blocks, with each block specifying a user-agent (a search engine bot) and providing “allow” or “disallow” instructions.
A simple file might look like this:
User-agent: Googlebot
Disallow: /not-for-google

User-agent: DuckDuckBot
Disallow: /not-for-duckduckgo

Sitemap: https://www.yourwebsite.com/sitemap.xml
The User-Agent Directive
The first line of each directive block specifies the user-agent, which identifies the crawler.
For example, use these lines to prevent Googlebot from crawling your WordPress admin page:
User-agent: Googlebot
Disallow: /wp-admin/
When multiple directives exist, a bot may follow the most specific applicable block.
Imagine you have three sets of directives: one for *, one for Googlebot, and one for Googlebot-Image.
If the Googlebot-News user agent crawls your site, it will follow the Googlebot directives.
However, the Googlebot-Image user agent will follow the more specific Googlebot-Image directives.
The Disallow Robots.txt Directive
The disallow directive lists parts of the site a crawler shouldn’t access.
An empty disallow line means no restrictions exist.
For example, the rule below allows all crawlers to access your entire site:
User-agent: *
Allow: /
To block all crawlers from your entire site, use the block below:
User-agent: *
Disallow: /
The Allow Directive
The allow directive permits search engines to crawl a subdirectory or specific page, even in an otherwise disallowed directory.
For example, use the rule below to prevent Googlebot from accessing all blog posts except one:
User-agent: Googlebot
Disallow: /blog
Allow: /blog/example-post
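You can check how rules like these resolve with Python’s built-in robots.txt parser. This is an illustrative sketch with made-up paths; note that Python’s parser applies the first matching rule in file order, so the allow line comes first here (Google instead applies the most specific matching rule):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block /blog except one post
rules = """\
User-agent: Googlebot
Allow: /blog/example-post
Disallow: /blog
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The excepted post is crawlable; other blog URLs are not
print(parser.can_fetch("Googlebot", "https://example.com/blog/example-post"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/blog/other-post"))    # False
```

URLs not matched by any rule, such as the homepage, are allowed by default.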
The Sitemap Directive
The sitemap directive tells search engines (specifically Bing, Yandex, and Google) where to find your XML sitemap, a file that lists all the pages you want search engines to index.
The sitemap directive looks like this:
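For example, with a placeholder sitemap URL:

```
Sitemap: https://www.yourwebsite.com/sitemap.xml
```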
Including a sitemap directive in your robots.txt file is a quick way to share your sitemap.
However, you should also submit your XML sitemap directly to search engines via their webmaster tools to speed up crawling.
The Crawl-Delay Directive
The crawl-delay directive tells crawlers how many seconds to wait between requests, which helps avoid server overload.
Google no longer supports the crawl-delay directive. To set crawl rates for Googlebot, use Google Search Console.
Bing and Yandex do support the crawl-delay directive.
For example, use the rule below to set a 10-second delay after each crawl action:
User-agent: *
Crawl-delay: 10
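Python’s built-in robots.txt parser reads this directive too, which is handy for verifying the value a polite crawler would see. An illustrative sketch with placeholder rules:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 10
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Any bot falling under the * group sees a 10-second delay
print(parser.crawl_delay("bingbot"))  # 10
```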
Further reading: 15 Crawlability Problems & How to Fix Them
The Noindex Directive
A robots.txt file tells search engines what to crawl and what not to crawl but can’t reliably keep a URL out of search results, even if you use a noindex directive.
If you use noindex in robots.txt, the page can still appear in search results without visible content.
Google never officially supported the noindex directive in robots.txt and confirmed so in September 2019.
To reliably exclude a page from search results, use a meta robots noindex tag instead.
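The standard tag, placed in a page’s head section, looks like this:

```html
<meta name="robots" content="noindex">
```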
How to Create a Robots.txt File
Use a robots.txt generator tool to quickly create a robots.txt file.
Follow these steps to create a robots.txt file from scratch:
1. Create a File and Name It Robots.txt
Open a .txt document in a text editor or web browser.
Name the document “robots.txt.”
You can now start typing directives.
2. Add Directives to the Robots.txt File
A robots.txt file contains one or more groups of directives, and each group includes multiple lines of instructions.
Each group begins with a user-agent and specifies:
- Who the group applies to (the user-agent)
- Which directories (pages) or files the agent should access
- Which directories (pages) or files the agent shouldn’t access
- A sitemap (optional) to tell search engines which pages and files you deem important
Crawlers ignore lines that don’t match the above directives.
Imagine you don’t want Google to crawl your “/clients/” directory because it’s for internal use only.
The first group in your file would look like this block:
User-agent: Googlebot
Disallow: /clients/
You can add more instructions for Google after that, like the one below:
User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google
Then press enter twice to start a new group of directives.
Now imagine you want to prevent access to the “/archive/” and “/support/” directories for all search engines.
A block preventing that access looks like this:
User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google

User-agent: *
Disallow: /archive/
Disallow: /support/
Once you’re finished, add your sitemap:
User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google

User-agent: *
Disallow: /archive/
Disallow: /support/

Sitemap: https://www.yourwebsite.com/sitemap.xml
Save the file as “robots.txt.”
3. Upload the Robots.txt File
After saving your robots.txt file, upload the file to your website so search engines can find it.
The process of uploading your robots.txt file depends on your hosting environment.
Search online or contact your hosting provider for details.
For example, search “upload robots.txt file to [your hosting provider]” for platform-specific instructions.
Below are some links explaining how to upload robots.txt files to popular platforms:
After uploading, confirm that the file is accessible and that Google can read it.
4. Test Your Robots.txt File
First, verify that anyone can view your robots.txt file by opening a private browser window and entering your robots.txt URL.
For example, “https://semrush.com/robots.txt.”
If you see your robots.txt content, test the markup.
Google provides two testing options:
- The robots.txt report in Search Console
- Google’s open-source robots.txt library (advanced)
Use the robots.txt report in Search Console if you are not an advanced user.
Open the robots.txt report.
If you haven’t connected your website to Search Console, add a property and verify site ownership first.
If you already have verified properties, select one from the drop-down after opening the robots.txt report.
The tool reports syntax warnings and errors.
Edit errors or warnings directly on the page and retest as you go.
Changes made within the robots.txt report aren’t saved to your site’s live robots.txt file, so copy and paste corrected code into your actual robots.txt file.
Semrush’s Site Audit tool can also check for robots.txt issues.
Set up a project and run an audit.
When the tool is ready, navigate to the “Issues” tab and search for “robots.txt.”
Click “Robots.txt file has format errors” if it appears.
View the list of invalid lines.
Click “Why and how to fix it” for specific instructions.
Check your robots.txt file regularly. Even small errors can affect your site’s indexability.
Robots.txt Best Practices
Use a New Line for Each Directive
Place each directive on its own line to ensure search engines can read them and follow the instructions.
Incorrect example:
User-agent: * Disallow: /admin/
Disallow: /directory/
Correct example:
User-agent: *
Disallow: /admin/
Disallow: /directory/
Use Each User-Agent Only Once
List each user-agent once to keep the file organized and reduce the risk of human error.
Confusing example:
User-agent: Googlebot
Disallow: /example-page

User-agent: Googlebot
Disallow: /example-page-2
Clear example:
User-agent: Googlebot
Disallow: /example-page
Disallow: /example-page-2
Writing all directives under the same user-agent is cleaner and helps you stay organized.
Use Wildcards to Clarify Directions
Use wildcards (*) to apply directives broadly.
To prevent search engines from accessing URLs with parameters, you could technically list them out one by one.
However, you can simplify your directions with a wildcard.
Inefficient example:
User-agent: *
Disallow: /shoes/vans?
Disallow: /shoes/nike?
Disallow: /shoes/adidas?
Efficient example:
User-agent: *
Disallow: /shoes/*?
The above example blocks all search engine bots from crawling all URLs under the “/shoes/” subfolder that contain a question mark.
Use ‘$’ to Indicate the End of a URL
Use “$” to indicate the end of a URL.
To block search engines from crawling all files of a certain type, using “$” helps you avoid listing the files individually.
Inefficient:
User-agent: *
Disallow: /photo-a.jpg
Disallow: /photo-b.jpg
Disallow: /photo-c.jpg
Efficient:
User-agent: *
Disallow: /*.jpg$
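Google-style matching with “*” and “$” can be sketched in Python by translating a rule’s path into a regular expression. This is an illustrative model of the matching semantics, not Google’s actual implementation:

```python
import re

def rule_to_regex(path_pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex.

    '*' matches any run of characters; '$' anchors the end of the
    URL path; everything else is matched literally.
    """
    regex = "^"
    for ch in path_pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    return re.compile(regex)

# Disallow: /*.jpg$  -> matches any URL path ending in .jpg
jpg_rule = rule_to_regex("/*.jpg$")
print(bool(jpg_rule.match("/photo-a.jpg")))      # True
print(bool(jpg_rule.match("/photo-a.jpg?v=2")))  # False

# Disallow: /shoes/*?  -> matches /shoes/ URLs containing a question mark
param_rule = rule_to_regex("/shoes/*?")
print(bool(param_rule.match("/shoes/nike?color=red")))  # True
print(bool(param_rule.match("/shoes/nike")))            # False
```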
Use “$” carefully because mistakes can lead to unintended unblocking.
Use ‘#’ to Add Comments
Add comments by starting a line with “#”. Crawlers ignore anything that begins with a hash.
For example:
User-agent: *
#Landing Pages
Disallow: /landing/
Disallow: /lp/
#Files
Disallow: /files/
Disallow: /private-files/
#Websites
Allow: /website/*
Disallow: /website/search/*
Developers sometimes add humorous comments using hashes since most users never see the file.
For example, YouTube’s robots.txt file reads: “Created in the distant future (the year 2000) after the robotic uprising of the mid 90’s which wiped out all humans.”
And Nike’s robots.txt reads “just crawl it” (a nod to its “just do it” tagline) and features the brand’s logo.
Use Separate Robots.txt Files for Different Subdomains
Robots.txt files only control crawling on the subdomain where they reside, which means you may need multiple files.
If your site is “domain.com” and your blog is “blog.domain.com,” create a robots.txt file for both the domain’s root directory and the blog’s root directory.
5 Robots.txt Mistakes to Avoid
When creating your robots.txt file, watch out for the following common mistakes:
1. Not Including Robots.txt in the Root Directory
Your robots.txt file must be located in your site’s root directory to ensure search engine crawlers can find it easily.
For example, if your site’s homepage is “www.example.com,” place the file at “www.example.com/robots.txt.”
If you put it in a subdirectory, like “www.example.com/contact/robots.txt,” search engines may not find it and could assume you haven’t set any crawling instructions.
2. Using Noindex Instructions in Robots.txt
Don’t use noindex instructions in robots.txt; Google doesn’t support the noindex rule in the robots.txt file.
Instead, use meta robots tags (e.g., <meta name="robots" content="noindex">) on individual pages to control indexing.
3. Blocking JavaScript and CSS
Avoid blocking access to JavaScript and CSS files via robots.txt unless necessary (e.g., restricting access to sensitive data).
Blocking crawling of JavaScript and CSS files makes it difficult for search engines to understand your site’s structure and content, which can harm your rankings.
Further reading: JavaScript SEO: How to Optimize JS for Search Engines
4. Not Blocking Access to Your Unfinished Site or Pages
Block search engines from crawling unfinished versions of your site to keep them from being found before you’re ready (also use a meta robots noindex tag for each unfinished page).
Search engines crawling and indexing an in-development page may lead to a poor user experience and potential duplicate content issues.
Using robots.txt keeps unfinished content private until you’re ready to launch.
5. Using Absolute URLs
Use relative URLs in your robots.txt file to make it easier to manage and maintain.
Absolute URLs are unnecessary and can cause errors if your domain changes.
❌ Example with absolute URLs (not recommended):
User-agent: *
Disallow: https://www.example.com/private-directory/
Disallow: https://www.example.com/temp/
Allow: https://www.example.com/important-directory/
✅ Example with relative URLs (recommended):
User-agent: *
Disallow: /private-directory/
Disallow: /temp/
Allow: /important-directory/
Keep Your Robots.txt File Error-Free
Now that you understand how robots.txt files work, you should make sure yours is optimized. Even small mistakes can affect how your site is crawled, indexed, and displayed in search results.
Semrush’s Site Audit tool makes analyzing your robots.txt file for errors easy and provides actionable recommendations to fix any issues.