Are you blocking important pages from being crawled? The robots exclusion standard allows webmasters to block crawlers from certain pages or sections of their site, but when used incorrectly it can cause a great deal of harm to your site. Tom Williams is here to provide more information.
The Robots Exclusion Standard is a protocol that provides a way of communicating with web crawlers or robots to inform them of any pages/sections on a website that they aren’t allowed to crawl or visit.
This protocol also provides a mechanism for specifying ‘inclusion’ (such as referencing a sitemap) as well as ‘exclusion’.
There are three primary methods of using this protocol: the robots.txt file, the meta noindex tag and the nofollow tag.
A couple of examples of when this protocol should be used:
It’s important to note that incorrectly blocking access to important pages on a website can cause harm, and organic rankings may drop as a result. This is because you’re informing the search engine not to crawl or index a specific page or section of your website.
Therefore, the search engine won’t know what the page is about, as it’s unable to crawl the page (blocked via robots.txt), or your website will drop out of the index altogether (blocked via meta noindex).
As detailed above, there are three main ways to use the robots exclusion protocol. Let’s take a look at each one of these in more detail:
A robots.txt file is placed in the top-level directory of a web server - http://www.example.com/robots.txt. This is the first place a web crawler will visit when crawling a website.
Using the robots.txt file informs search engines not to crawl the provided URL, but they may still index the page within the search results (for example, if it is linked to from elsewhere).
A few common examples of how to use the file are shown below:
Block all web crawlers from all content
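For example, a robots.txt file that blocks every crawler from the entire site looks like this:

```
User-agent: *
Disallow: /
```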
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
Disallow indexing of a specific folder
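A sketch of such an entry, using /example-folder/ as a placeholder directory name:

```
User-agent: *
Disallow: /example-folder/
```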
Block a specific web crawler from indexing a specific folder
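For instance, to stop only Google’s crawler (user agent Googlebot, used here as an example) from accessing a placeholder folder:

```
User-agent: Googlebot
Disallow: /example-folder/
```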
Block a specific web crawler from indexing a specific web page
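Similarly, a single page can be disallowed for one crawler; /example-page.html is a placeholder path:

```
User-agent: Googlebot
Disallow: /example-page.html
```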
Allow indexing of everything
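An empty Disallow line means nothing is blocked, so all crawlers can access everything:

```
User-agent: *
Disallow:
```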
Most crawlers will view your robots.txt file first, so referencing a sitemap here will guide them through your website’s structure more efficiently.
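Assuming a sitemap sitting at the root of the example domain, the reference is a single line in robots.txt:

```
Sitemap: http://www.example.com/sitemap.xml
```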
To block access to all URLs that include a question mark (?), you could use the following entry:
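This uses the * wildcard, which is supported by the major search engines such as Google and Bing, though it is not part of the original standard:

```
User-agent: *
Disallow: /*?
```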
You can use the $ character to specify matching the end of the URL. For instance, to block any URLs that end with .html, you could use the following entry:
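Again assuming a crawler that supports pattern matching, the entry would be:

```
User-agent: *
Disallow: /*.html$
```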
This can either be done within the robots.txt file or within the source code of a specific page.
<meta name="robots" content="noindex">
Using this directive informs search engines they can visit the specified page but aren’t allowed to display it within the search results.
For example, you don’t necessarily want internal search results pages ranking within a search engine but still want the page strength to be passed through them.
This method differs from using the robots.txt file: with a noindex tag, Google won’t display the specified page within the search results, whereas a search engine may still index a page blocked via the robots.txt file.
If both a noindex tag and a robots.txt disallow directive are in use, the robots.txt file takes precedence. This is because the crawler is being told it can’t crawl the given page, so it is never able to find the noindex tag within the source code.
Placing a nofollow tag on a page prevents crawlers from following links on this page.
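The tag is placed in the page’s head section, in the same way as the noindex tag shown earlier; the two values can also be combined:

```
<meta name="robots" content="nofollow">
<meta name="robots" content="noindex, nofollow">
```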
Search engines may still be able to discover pages that have a nofollow directive associated with them via links from other pages.
Also, if a page on a website has a good level of strength being passed to it and a nofollow tag is added, this strength will be lost.
Things to Note
Worried your Robots.txt could be blocking important pages from being crawled? Ask the experts. Contact our technical SEO team today.