What is Robots.txt?
Robots.txt is a plain text file on your website that asks web crawlers to crawl, or not crawl, portions of your website.
The robots.txt file specifies which parts of your website should be crawled by spiders, or web crawlers, and it can set different rules for different spiders.
Googlebot is an example of a spider. Google deploys it to crawl the web and record information about websites so Google knows how to rank them in search results.
Using a robots.txt file with your website is a web standard (the Robots Exclusion Standard). Spiders look for the robots.txt file in the root directory (or main folder) of your website. This text file is always named “robots.txt”. You can find your robots.txt file by going to:
yourwebsite.com/robots.txt
Most mainstream spiders comply with the directives in robots.txt files, but nefarious spiders may not. Keep in mind that the contents of a robots.txt file are publicly available, and its rules are requests rather than enforcement. You can attempt to ban unwanted spiders by editing the .htaccess file associated with your website, as in the sketch below.
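As a minimal sketch, assuming an Apache server with mod_rewrite enabled and an unwanted crawler that identifies itself as “BadBot” (a placeholder name), a few lines in .htaccess can refuse its requests outright:

# Send a "403 Forbidden" response to any request whose
# User-Agent header contains "BadBot" (case-insensitive).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]

Unlike robots.txt, this blocks the bot at the server level, so it works even against crawlers that ignore your directives (as long as they don’t disguise their user-agent string).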
It’s important that marketers check their robots.txt file to make sure search engines are invited to crawl important pages. If you ask search engines not to crawl your website, your website won’t appear in search results.
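For example, the two lines below are all it takes to ask every crawler to stay away from your entire site, so it’s worth confirming they don’t appear in your file unintentionally:

User-agent: *
Disallow: /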
You can also use the robots.txt file to show spiders where to find a sitemap of your website, which can make your content more discoverable.
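For example, a single Sitemap line anywhere in the file points crawlers to it (assuming, as a placeholder, that your sitemap lives at the root of yourwebsite.com):

Sitemap: https://yourwebsite.com/sitemap.xml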
You can also specify a crawl-delay, or how many seconds robots should wait between requests. Some websites may need this setting if bots are eating up bandwidth and causing the website to load slower for human visitors.
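For example, the lines below ask all crawlers to wait ten seconds between requests. Note that support varies: crawlers such as Bingbot honor Crawl-delay, while Googlebot ignores it.

User-agent: *
Crawl-delay: 10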
An Example Robots.txt File
Here is what might appear in a robots.txt file:
User-agent: *
Disallow: /ebooks/*.pdf
Disallow: /staging/
User-agent: Googlebot-Image
Disallow: /images/
Here is what each line means in plain English.
User-agent: * — The first line says that the rules that follow apply to all web crawlers; the asterisk is a wildcard meaning every spider in this context.
Disallow: /ebooks/*.pdf — In conjunction with the first line, this line asks all web crawlers not to crawl any PDF files in the ebooks folder of this website. This means search engines won’t include these direct PDF links in search results.
Disallow: /staging/ — In conjunction with the first line, this line asks all crawlers not to crawl anything in the staging folder of the website. This can be helpful if you’re running a test and don’t want the staged content to appear in search results.
User-agent: Googlebot-Image — This line starts a new group of rules that applies to only one specific crawler, Google’s image crawler. Each spider identifies itself with a different “user-agent” name.
Disallow: /images/ — In conjunction with the line immediately above it, this asks the Google image crawler not to crawl anything in the images folder. Note that because Googlebot-Image has its own group of rules, it follows only this rule and ignores the rules listed under User-agent: * above.
Robots.txt Resources
- Robots.txt FAQ
- Google’s Robots.txt Guide
- Robots.txt File Generator
- Understanding and Editing Your Robots.txt File
- What’s the best way to block bad robots?
Synonyms
- Robots Exclusion Standard