Are you blocking important pages from being crawled? The robots exclusion standard allows webmasters to block crawlers from certain pages or sections of their site, but used incorrectly it can do a great deal of harm to your site’s organic visibility. Tom Williams is here to provide more information.

What is the Robots Exclusion Standard & Why Use it?

The Robots Exclusion Standard is a protocol that provides a way of communicating with web crawlers or robots to inform them of any pages/sections on a website that they aren’t allowed to crawl or visit.

The protocol also provides a mechanism for specifying ‘inclusion’ (such as Allow directives and sitemap references) as well as ‘exclusion’.

There are three primary methods of using this protocol:

  1. Robots.txt File
  2. Robots Meta NoIndex Tag
  3. Robots Meta NoFollow Tag

A couple of examples of when this protocol should be used:

  • Block web crawlers from crawling your website while it’s on its staging server
  • Keep content on your website that you don’t want in search engines out of their results
    • Internal search pages are a prime example (see the sketch below)
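
As a minimal sketch of that internal search example, assuming a site whose search results pages live under a hypothetical /search/ path, the robots.txt rule could look like this:

User-agent: *
Disallow: /search/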

It’s important to note that incorrectly blocking access to important pages on a website will harm its organic visibility, and rankings may drop as a result. This is because you’re telling the search engine not to crawl or index a specific page or section of your website.

Either the search engine won’t know what the page is about because it can’t crawl it (blocked via robots.txt), or the page will drop out of the index altogether (blocked via meta noindex).

How to Use The Robots Exclusion Protocol

As detailed above, there are three main ways to use the robots exclusion protocol. Let’s take a look at each one of these in more detail:

1. Robots.txt File

A robots.txt file is placed in the top-level directory of a web server – http://www.example.com/robots.txt. This is the first place a web crawler will visit when crawling a website.

A Disallow rule in the robots.txt file tells search engines not to crawl the specified URL, but they may still index the page and show it within the search results if it is linked to from elsewhere.

A few common examples of how to use the file are shown below:

Block all web crawlers from all content

User-agent: *
Disallow: /

The “User-agent: *” means this section applies to all robots. The “Disallow: /” tells the robot that it should not visit any pages on the site.

Block all web crawlers from crawling a specific folder

User-agent: *
Disallow: /folder/

Block a specific web crawler from crawling a specific folder

User-agent: Googlebot
Disallow: /folder/

Block a specific web crawler from crawling a specific web page

User-agent: Googlebot
Disallow: /folder/blocked-page.html

Allow crawling of everything

User-agent: *
Disallow:

Or

User-agent: *
Allow: /
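
Block a folder but allow a specific page within it

Disallow and Allow rules can also be combined to open up a single page inside an otherwise blocked folder. Support for Allow varies between crawlers (Google and Bing honour it), so treat this as a sketch rather than a universal rule; “allowed-page.html” is a hypothetical filename:

User-agent: *
Disallow: /folder/
Allow: /folder/allowed-page.html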

Sitemap Directive

User-agent: *
Disallow:
Sitemap: http://www.example.com/non-standard-location/sitemap.xml

Most crawlers will check your robots.txt file first, so referencing your XML sitemap here helps guide them through your website’s structure more efficiently.

Wildcards

To block access to all URLs that include a question mark (?), you could use the following entry:

User-agent: *
Disallow: /*?

You can use the $ character to specify matching the end of the URL. For instance, to block any URLs that end with .html, you could use the following entry:

User-agent: Googlebot
Disallow: /*.html$

2. Block with Meta NoIndex

This can be done either with a Noindex directive in the robots.txt file (an unofficial directive that not all search engines support) or with a meta robots tag in the source code of a specific page.

User-agent: Googlebot
Noindex: /page-two/

<meta name="robots" content="noindex">

Using this directive informs search engines they can visit the specified page but aren’t allowed to display it within the search results.

For example, you don’t necessarily want internal search results pages ranking within a search engine but still want the page strength to be passed through them.
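
For that scenario the two directives are often combined (a sketch; ‘follow’ is the default behaviour anyway, so stating it simply makes the intent explicit):

<meta name="robots" content="noindex, follow">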

This method differs from using the robots.txt file: with a noindex tag in place, Google won’t display the specified page within the search results, whereas a page blocked via the robots.txt file may still be indexed and appear there.

If both a noindex tag and a robots.txt Disallow rule are in place for the same page, the robots.txt rule takes precedence. Because the crawler is told it can’t crawl the page, it never finds the noindex tag within the source code, so the page may still appear in the search results.
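
As a sketch of that conflict, using the hypothetical /page-two/ URL from above: with the robots.txt rule below in place, the meta tag on the page is never read, so to deindex the page you would remove the Disallow line and let the meta tag do the work.

User-agent: *
Disallow: /page-two/

<meta name="robots" content="noindex">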

3. Block by Nofollow Links

Placing a nofollow tag on a page instructs crawlers not to follow any of the links on that page.
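
At page level this is done with a meta tag; an individual link can also be nofollowed with a rel attribute. Both snippets below are illustrative markup rather than anything specific to your site:

<meta name="robots" content="nofollow">

<a href="/folder/blocked-page.html" rel="nofollow">Example link</a>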

Search engines may still be able to discover pages that have a nofollow directive associated with them via links from other pages.

Also, if a page has a good level of link strength being passed to it and a nofollow tag is added, that strength won’t be passed on through its links and is effectively lost.

Robots Exclusion Standard Summary

  • Robots.txt Disallow – stops a page being crawled, but it may still be indexed and appear in search results
  • Meta noindex – the page can be crawled, but won’t be shown in the search results
  • Meta nofollow – the page can be crawled and indexed, but the links on it won’t be followed

Things to Note

  • These commands are only directives and may not be honoured
  • Blocking access to important pages on a website will cause harm and organic rankings may drop as a result
  • Different crawlers may interpret the commands differently
  • Malicious crawlers are likely to ignore your exclusion protocols
  • Each subdomain on a root domain uses its own separate robots.txt file (see the example below)
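
For example, these would be treated as two entirely separate files, each applying only to its own subdomain (the hostnames are illustrative):

http://www.example.com/robots.txt
http://blog.example.com/robots.txt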

Worried your Robots.txt could be blocking important pages from being crawled? Ask the experts. Contact our technical SEO team today.


About the author:

Tom joined ClickThrough in 2011. Since then, he has developed an expertise in the technical side of search engine optimisation. He’s Google Analytics-qualified, and in his current role as Digital and Technical Executive, carries out monthly SEO activities and provides technical consultancy for several of the company’s largest accounts.