Are you blocking important pages from being crawled? The robots exclusion standard allows webmasters to block crawlers from certain pages/sections of their site, but used incorrectly it can do a great deal of harm to your site. Tom Williams is here to provide more information.
What is the Robots Exclusion Standard & Why Use it?
The Robots Exclusion Standard is a protocol that provides a way of communicating with web crawlers or robots to inform them of any pages/sections on a website that they aren’t allowed to crawl or visit.
The protocol also has a mechanism for specifying ‘inclusion’ as well as ‘exclusion’.
Three primary methods of using this protocol:
- Robots.txt File
- Robots Meta NoIndex tag
- Robots Meta NoFollow Tag
A couple of examples of when this protocol should be used:
- To block web crawlers from crawling your website while it is on a staging server
- To keep content you don’t want in search engines out of their results; internal search pages are a prime example
It’s important to note that incorrectly blocking access to important pages can harm a website, and organic rankings may drop as a result. This is because you’re telling the search engine not to crawl or index a specific page or section of your website.
If a page is blocked via robots.txt, the search engine can’t crawl it and so won’t know what the page is about; if it is blocked via meta noindex, the page will drop out of the index altogether.
How to Use The Robots Exclusion Protocol
As detailed above, there are three main ways to use the robots exclusion protocol. Let’s take a look at each one of these in more detail:
1. Robots.txt File
A robots.txt file is placed in the top-level directory of a web server – http://www.example.com/robots.txt. This is the first place a web crawler will visit when crawling a website.
Using the robots.txt file tells search engines not to crawl the specified URLs, but they may still index a blocked URL and show it within the search results, for example when other pages link to it.
A few common examples of how to use the file are shown below:
Block all web crawlers from all content
The “User-agent: *” means this section applies to all robots. The “Disallow: /” tells the robot that it should not visit any pages on the site.
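The rule pair described above can be written out and checked with Python's standard urllib.robotparser module. This is only a minimal sketch: the example.com URLs and the crawler name "AnyBot" are placeholders, not names from the original article.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content that blocks every crawler from every URL.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BLOCK_ALL.splitlines())

# "AnyBot" is a placeholder name; the * group applies to every crawler.
home_allowed = parser.can_fetch("AnyBot", "http://www.example.com/")
page_allowed = parser.can_fetch("AnyBot", "http://www.example.com/blog/post.html")
print(home_allowed, page_allowed)
```

Both checks report that the crawler may not fetch the URL, which is why this rule belongs on a staging server and not on a live site.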
Block crawling of a specific folder
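A folder-level rule might look like the sketch below. The folder name /example-folder/ and the crawler name "AnyBot" are assumptions made up for illustration; the check uses Python's standard urllib.robotparser, which matches rules by simple path prefix.

```python
from urllib.robotparser import RobotFileParser

# "/example-folder/" is a made-up path used only for illustration.
FOLDER_RULES = """\
User-agent: *
Disallow: /example-folder/
"""

parser = RobotFileParser()
parser.parse(FOLDER_RULES.splitlines())

# Anything under the folder is blocked; the rest of the site is not.
inside = parser.can_fetch("AnyBot", "http://www.example.com/example-folder/page.html")
outside = parser.can_fetch("AnyBot", "http://www.example.com/about.html")
print(inside, outside)
```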
Block a specific web crawler from crawling a specific folder
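To target one crawler, name it in the User-agent line. In this sketch Googlebot is used as the named crawler and Bingbot as a comparison, both chosen for illustration, and /example-folder/ is a hypothetical path.

```python
from urllib.robotparser import RobotFileParser

# Only the named crawler is given a rule; the folder path is hypothetical.
CRAWLER_FOLDER_RULES = """\
User-agent: Googlebot
Disallow: /example-folder/
"""

parser = RobotFileParser()
parser.parse(CRAWLER_FOLDER_RULES.splitlines())

url = "http://www.example.com/example-folder/page.html"
named_crawler = parser.can_fetch("Googlebot", url)
other_crawler = parser.can_fetch("Bingbot", url)
print(named_crawler, other_crawler)
```

Crawlers that are not named, and have no * group to fall back on, are treated as allowed.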
Block a specific web crawler from crawling a specific web page
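A page-level rule simply gives the full path to the page. As a sketch, with a hypothetical page path and Googlebot chosen as the example crawler:

```python
from urllib.robotparser import RobotFileParser

# The page path below is invented for this example.
CRAWLER_PAGE_RULES = """\
User-agent: Googlebot
Disallow: /example-folder/blocked-page.html
"""

parser = RobotFileParser()
parser.parse(CRAWLER_PAGE_RULES.splitlines())

blocked_page = parser.can_fetch(
    "Googlebot", "http://www.example.com/example-folder/blocked-page.html"
)
sibling_page = parser.can_fetch(
    "Googlebot", "http://www.example.com/example-folder/other-page.html"
)
print(blocked_page, sibling_page)
```

Only the listed page is blocked; other pages in the same folder stay crawlable.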
Allow crawling of everything
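An empty Disallow value means nothing is blocked. The sketch below checks this with Python's urllib.robotparser; "AnyBot" and the URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow value blocks nothing, so every URL may be crawled.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ALLOW_ALL.splitlines())

everything_allowed = parser.can_fetch("AnyBot", "http://www.example.com/any/page.html")
print(everything_allowed)
```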
Most crawlers will view your robots.txt file first, so referencing your sitemap here (with a Sitemap line giving its full URL) will guide them through your website’s structure more efficiently.
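As a sketch of a sitemap reference, the robots.txt below adds a Sitemap line and reads it back with site_maps(), which is available in Python 3.8 and later. The sitemap URL is an assumed example location.

```python
from urllib.robotparser import RobotFileParser

# The sitemap URL is an assumed example location.
WITH_SITEMAP = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(WITH_SITEMAP.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```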
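To block access to all URLs that include a question mark (?), you could use the entry “Disallow: /*?” under the relevant User-agent line.
You can use the $ character to match the end of the URL. For instance, to block any URLs that end with .html, you could use the entry “Disallow: /*.html$”. The * and $ characters are pattern-matching extensions supported by the major search engines, but not every crawler understands them.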
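Python's built-in urllib.robotparser only does prefix matching, so the sketch below uses a small hand-written matcher to show how the * and $ characters behave. The function names are hypothetical, and this is a simplified illustration, not the matching code of any real crawler.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern using * and $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "match anything".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern, path):
    """True if the Disallow pattern matches the URL path (and query string)."""
    return robots_pattern_to_regex(pattern).match(path) is not None


# Disallow: /*?  blocks any URL containing a question mark.
print(is_blocked("/*?", "/search?q=shoes"), is_blocked("/*?", "/about"))

# Disallow: /*.html$  blocks only URLs that end with .html.
print(is_blocked("/*.html$", "/page.html"), is_blocked("/*.html$", "/page.html?x=1"))
```

Without the trailing $, a pattern matches any URL that starts with it, which is why /page.html?x=1 escapes the second rule.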
2. Block with Meta NoIndex
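This is done within the source code of a specific page, using the tag below in the page’s head section. Adding noindex to the robots.txt file was never an official part of the standard, and Google stopped honouring it in 2019, so don’t rely on it.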
<meta name="robots" content="noindex">
Using this directive informs search engines they can visit the specified page but aren’t allowed to display it within the search results.
For example, you don’t necessarily want internal search results pages ranking within a search engine but still want the page strength to be passed through them.
This method differs from using the robots.txt file: with noindex, Google won’t display the specified page within the search results, whereas a search engine may still index a page that is blocked via the robots.txt file.
If both a noindex tag and a robots.txt block apply to the same page, the robots.txt block wins in practice. The crawler has been told it can’t crawl the page, so it never sees the noindex tag within the source code.
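The interaction between the two can be sketched as follows. This is a simplified model of a well-behaved crawler, not how any particular search engine is implemented; the class and function names, the /search/ path and the "AnyBot" crawler are all assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def crawl_outcome(robots_txt, user_agent, url, page_html):
    """Return what a compliant crawler learns about the page (simplified)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        # The page is never downloaded, so any noindex tag in it goes unseen.
        return "not crawled; noindex never seen"
    meta = MetaRobotsParser()
    meta.feed(page_html)
    if "noindex" in meta.directives:
        return "crawled; noindex seen"
    return "crawled; indexable"


PAGE = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
URL = "http://www.example.com/search/shoes"

print(crawl_outcome("User-agent: *\nDisallow: /search/", "AnyBot", URL, PAGE))
print(crawl_outcome("User-agent: *\nDisallow:", "AnyBot", URL, PAGE))
```

If the goal is to remove a page from the index, the page has to stay crawlable so that the noindex tag can be read.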
3. Block by Nofollow Links
Placing a nofollow tag on a page prevents crawlers from following links on this page.
Search engines may still be able to discover pages that have a nofollow directive associated with them via links from other pages.
Also, if a page on a website has a good level of strength being passed to it and a nofollow tag is added, this strength will not be passed on through the page’s links and is lost.
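To illustrate what a page-level nofollow means to a compliant crawler, the sketch below collects a page's links and discards them all when the robots meta tag contains nofollow. The class and function names and the sample pages are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects hrefs and notes whether the page carries a robots nofollow."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            self.links.append(attr["href"])
        elif tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = (attr.get("content") or "").lower()
            if "nofollow" in [d.strip() for d in content.split(",")]:
                self.nofollow = True


def links_to_follow(page_html):
    """Return the links a compliant crawler would queue from this page."""
    collector = LinkCollector()
    collector.feed(page_html)
    return [] if collector.nofollow else collector.links


BODY = '<body><a href="/a">A</a><a href="/b">B</a></body>'
NOFOLLOW_PAGE = '<head><meta name="robots" content="nofollow"></head>' + BODY
NORMAL_PAGE = "<head></head>" + BODY

print(links_to_follow(NOFOLLOW_PAGE), links_to_follow(NORMAL_PAGE))
```

The linked pages can still be discovered through links on other pages that do not carry the directive.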
Robots Exclusion Standard Summary
Things to Note
- These commands are only directives and may not be honoured
- Blocking access to important pages on a website will cause harm and organic rankings may drop as a result
- Different crawlers may interpret the commands differently
- Malicious crawlers are likely to ignore your exclusion protocols
- Each subdomain on a root domain uses its own robots.txt file