Web crawlers, also known as robots or spiders, are automated scripts used by search engines and other entities to scan your web content. This guide outlines best practices for protecting your website from unwanted crawlers while still keeping your site discoverable on search engines.
What is a web crawler?
A web crawler is an automated script deployed by search engines or other entities that crawls every publicly accessible page on your website. As it crawls, it collects your site's data and uses it according to the crawler's purpose, which depends on the type of crawler it is. That brings us to the kinds of crawlers that are out there, which we'll cover briefly before showing how to limit or block specific crawlers.
Search Engine Crawler - These crawlers index and rank your website based on the content discovered while crawling it.
AI Crawler - These crawlers collect the information on your website to train AI models and expand the answers those tools can give their users.
Malicious Crawler - Attackers use crawlers for various tasks, including scanning for vulnerabilities they can use to attack your website or performing brute-force login attempts.
Archival Crawler - Some entities attempt to archive all web content for historical purposes, such as the Wayback Machine (Internet Archive).
Social Media Crawler - These crawlers scan your website's content so the platform knows how to display a preview (such as a Twitter card) when someone links to your website.
Note that there are other types of crawlers, but these are the most common ones you will see when reviewing your website's access logs.
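For reference, a crawler identifies itself through the user-agent string it sends with each request. A typical access-log entry might look something like the line below (the IP address, date, and path are purely illustrative; the user-agent shown is the one Googlebot publicly documents):

203.0.113.7 - - [10/Apr/2024:09:15:32 +0000] "GET /about/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"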
Are there any web crawlers I do not want to block?
When choosing which crawlers to block, think about what is important for your website. For example, if SEO and your Google or Bing page rankings matter to you, you may want to allow Googlebot and Bingbot to crawl your website. If your social media presence is important, you may want to allow Twitterbot or FacebookBot to crawl your website.
Even when allowing specific crawlers, it is important to slow down the rate at which they crawl your site so they do not impact the performance of the server and, in turn, your website. Several of the most common legitimate web crawlers hit your website one to two times per second. Depending on the type of content being served and the resources allocated to the server, this can degrade server performance and contribute to slower website load times.
Using robots.txt to slow down or block crawlers
When inspecting your website's access logs, you may notice some crawlers attempting to load a robots.txt file at the root of the website. Crawlers load this file periodically, and it gives them specific rules to follow so they know how they should crawl your website. This is known as the Robots Exclusion Protocol (REP) and is used to avoid server overload and to keep specific content on your site from being crawled.
Where do I put the file?
The robots.txt file must be placed at the root of your website for crawlers to find it; for example, if your site is https://example.com, crawlers will request https://example.com/robots.txt. Placing the file in any other directory means crawlers will not find it, and no rules or limitations will be applied.
What do I put inside of the file?
There are several directives that can be used to control these crawlers. It is important to note that not all crawlers support every directive, and some crawlers do not support robots.txt at all. Let's go over each of the available directives and then look at some example scenarios:
User-agent - The name of the crawler that the following rules should apply to.
Crawl-delay - The number of seconds the crawler should wait between page requests.
Disallow - Tells the crawler not to crawl specific folder(s). You can add multiple Disallow lines to block access to several paths.
Allow - Tells the crawler it is allowed to access and crawl specific folder(s). You can add multiple Allow lines to allow access to several paths.
Below are some examples of how to use these directives in your robots.txt file:
Examples:
The first example below gives every crawler a crawl-delay of 15 seconds. However, when Googlebot crawls the website, it will pick up the more specific crawl-delay (10 seconds) set for its user-agent.
User-agent: *
Crawl-delay: 15

User-agent: Googlebot
Crawl-delay: 10
The second example builds on the first by telling all user-agents not to crawl anything inside the /includes/ or /secure/ folders. Once again, we override the defaults for the Googlebot user-agent so that it is allowed to crawl the /includes/ folder. Keep in mind that a crawler only follows its most specific matching group and does not inherit rules from the * group, so to keep Googlebot out of the /secure/ folder we repeat that Disallow line in the Googlebot group.
# All crawlers are disallowed from crawling the "includes" and "secure" directories.
User-agent: *
Crawl-delay: 15
Disallow: /includes/
Disallow: /secure/

# Googlebot may crawl the "includes" directory but is still kept out of "secure".
# The Disallow for /secure/ is repeated here because Googlebot only follows this
# group and ignores the * group above.
User-agent: Googlebot
Crawl-delay: 10
Disallow: /secure/
Allow: /includes/
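As a third sketch, tying back to the earlier question of which crawlers you may not want to block: if you only want a couple of trusted crawlers to crawl the site (Googlebot and Bingbot are used here purely as examples) and want every other robots.txt-compliant crawler kept out entirely, the file could look like this:

# Block every crawler that does not have a more specific group below.
User-agent: *
Disallow: /

# Allow Googlebot and Bingbot to crawl the entire site.
# An empty Disallow line means nothing is off-limits.
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

Remember that this only affects crawlers that honor robots.txt; ill-behaved crawlers need to be blocked at the server or firewall level, as covered later in this guide.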
How long after I change the file does it take effect?
It is a common misconception that changes to the file take effect immediately after you create or modify it. In reality, the crawler has no idea the file has been modified or created, so it will continue crawling until it either finishes crawling the website or re-checks the robots.txt file, whichever comes first. How often a crawler re-checks the file is determined by its creator, so you may not see a major improvement right away.
It is best to allow 24-48 hours after making changes to the file for them to take effect, although for some crawlers it may only take a few minutes or hours. It is also worth noting that some crawlers, such as Googlebot, allow you to request a re-crawl of your website, which essentially restarts the crawl and causes robots.txt to be reloaded by the crawler.
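If you want to confirm the file is being served where crawlers expect it, you can fetch it yourself. The sketch below assumes a command line with curl installed, and example.com stands in for your own domain:

# Request the robots.txt file the same way a crawler would.
# Replace example.com with your own domain.
curl -i https://example.com/robots.txt

A 200 response with your rules in the body means crawlers can see the file; a 404 means it is missing or in the wrong location.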
What do I do for crawlers that do not respect robots.txt?
If a web crawler does not respect robots.txt rules, there is a good chance you will want to block that crawler's user-agent altogether. The method for blocking it varies depending on your OS and/or web server and is covered in the sections below. You can also block specific user-agents directly within Cloudflare if your website is behind their web application firewall (WAF - included in their free plan).
Blocking user-agent in Windows IIS
If your website is on a Windows server running the IIS web server and you need to block specific user-agents, this can be done easily through the IIS GUI or directly in your website's web.config file. We'll walk through the steps for both methods:
Method: IIS GUI
1. Login to the server via Remote Desktop Protocol (RDP).
2. Open IIS Manager.
A) An easy way to open IIS Manager is through the Run dialog: type inetmgr and click OK.
B) Another easy way to open IIS Manager is to open Administrative Tools (Windows Start Menu > Administrative Tools) and then double-click 'Internet Information Services (IIS) Manager'.
C) Typically, on a Windows server with IIS installed you will find IIS Manager pinned to the start menu as well.
3. Determine if you want to block the user-agents on the server level (all sites) or a specific site, then click on the appropriate node. For example, if you want the rule to apply to all websites then click on the server node (usually the name of the server). If you want the rule to apply to a specific website, then expand the 'Sites' node and click on the website in question.
4. Double-click on the feature labeled 'URL Rewrite' found within the IIS section. You can also filter for this if needed using the search box.
5. In the right-side actions pane, click the option labeled 'Add Rule(s)...', then choose Blank Rule under the Inbound rules, and then click OK.
6. You should now be creating a new URL rewrite rule. Enter a name for the rule such as 'Block bad user-agents'.
7. Under the Match URL section choose the following settings:
- Requested URL: Matches the pattern
- Using: Regular Expressions
- Pattern: .*
The above pattern will match any URL under your website(s). If you want to block crawlers only for a specific folder, adjust the pattern accordingly (for example, a pattern such as ^secure/.* would match only requests under a /secure/ folder).
8. Under the Conditions section choose the following options:
- Condition Input: {HTTP_USER_AGENT}
- Check if input string: Matches the pattern
- Pattern: Use one of the patterns below depending on your specific needs (blocking multiple crawlers or just one). Note that when blocking multiple crawlers, you can add more as needed by separating each name with a pipe ( | ) character.
.*(facebookexternalhit|PetalBot|GPTBot).*
.*facebookexternalhit.*
9. Under the Actions section choose the following options:
- Action Type: Custom Response
- Status Code: 403
- Substatus Code: 0
- Reason: Forbidden
- Error Description: Access denied
10. To save the rule, click 'Apply' in the actions pane. We recommend testing the website afterwards to make sure it is working as expected, as a malformed rule can break the website(s). If needed, select the rule in the GUI and edit it to correct any syntax errors, or delete it.
Method: Updating the web.config directly
1. Open or create the web.config file at the root of your website.
2. If the web.config file already contains configuration sections, copy only the <rewrite> block from the code below and place it under the existing <system.webServer> section. Otherwise, place the entire contents below into the web.config, and modify the user-agents in the rule as needed.
<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <rule name="BlockMultipleBots" stopProcessing="true">
          <match url=".*" />
          <conditions>
            <add input="{HTTP_USER_AGENT}" pattern=".*(facebookexternalhit|PetalBot|GPTBot).*" />
          </conditions>
          <action type="CustomResponse" statusCode="403" subStatusCode="0" reason="Forbidden" description="Access denied." />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>
Blocking user-agent in Apache (cPanel, Virtualmin, etc.)
If your website is on a Linux server running Apache web server and you need to block specific user-agents, this can be done easily through your website's .htaccess file by following the below steps:
1. Open or create the .htaccess file at the root of your website. Note that you may need to enable viewing of hidden files (in cPanel's File Manager, click Settings, check the box for showing hidden files, and save).
2. Once you have the file open, add the below to the top of the file:
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit|PetalBot|GPTBot) [NC]
RewriteRule .* - [F,L]
</IfModule>
3. Change the names of the bots as necessary or add more by separating with the pipe ( | ) character.
4. Save your changes and test the website to ensure it is still working as expected.
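A quick way to confirm the block is working (an illustrative check; replace example.com and the user-agent names with your own values) is to spoof a blocked crawler with curl and compare the responses:

# Request the site while pretending to be a blocked crawler; expect HTTP 403.
curl -I -A "GPTBot" https://example.com/

# Request the site with a normal browser user-agent; expect HTTP 200.
curl -I -A "Mozilla/5.0" https://example.com/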
Blocking user-agent in Cloudflare
Having your website behind Cloudflare is a great way to reduce the load on your server, as well as potentially speed up your users' requests to the website, but what if you want to block specific user-agents from accessing your website? This can be done easily using their web application firewall (WAF) custom rules. Please see the steps below:
1. Login to your Cloudflare account for the domain in question. If you have multiple domains under the account, select the domain you wish to manage the rules for.
2. Expand the 'Security' section, then click on the option labeled 'WAF'.
3. If not already there, click on the Custom Rules link in the WAF navigation tabs.
4. Click the + Create rule button to create a new WAF rule.
5. Give your rule a good name that lets you know what the rule is intended for.
6. In the 'When incoming requests match...' section, choose the following options:
- Field: User Agent
- Operator: contains
- Value: Enter the name of the crawler's user-agent you wish to block.
7. In the 'Then take action...' section choose the following option:
- Choose action: Block
8. Click the 'Deploy' button to save your new firewall rule. Be sure to test the website afterwards to make sure it is still working as expected.
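If you prefer to build the rule in Cloudflare's Expression Editor instead of the dropdown fields, the equivalent expression looks roughly like the sketch below (the crawler names are just examples; adjust them to the user-agents you want to block):

(http.user_agent contains "GPTBot") or
(http.user_agent contains "PetalBot") or
(http.user_agent contains "facebookexternalhit")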
Additional Note: Cloudflare also offers bot-fighting modes that will automatically create firewall rules on your behalf when enabled. You can view and change these options within the Security > Bots section of the Cloudflare dashboard.