How to block bad bots or bad web-crawlers?


A web crawler is a program or automated script that visits websites in a methodical, automated manner. This process is called web crawling or spidering. For more detail, you can read "What is a web crawler?".

Webmasters welcome well-known bots (such as Google, Bing, Yahoo, and Alexa) crawling their websites, but they do not want bad robots or harmful spammers.

Here I will introduce some strategies to block bad bots.

(1) Use robots.txt

Search engines will look for a special file called robots.txt before spidering your site. The robots.txt file is created specifically to give directions to web crawlers/spiders/robots.

Which bots are bad and which are good? It depends on your website's target audience. If your website is for English-speaking visitors, then you may want to block these search engine crawlers:

Search Engine   Country   User Agent
Baidu           China     Baiduspider
360.cn          China     360Spider
Yandex          Russia    Yandex
Naver           Korea     NaverBot, Yeti
Goo             Japan     moget, ichiro
Rediff          India     RedBot

To block the search engines above, add the following code to the robots.txt file (if you don't have a robots.txt file, create one). Remember to use a plain-text editor such as Windows Notepad to edit the robots.txt file.

      
User-agent: Baiduspider
Disallow: /

User-agent: 360Spider
Disallow: /

User-agent: Yandex
Disallow: /

User-agent: NaverBot
User-agent: Yeti
Disallow: /

User-agent: moget
User-agent: ichiro
Disallow: /

User-agent: RedBot
Disallow: /     

When you are done, save the robots.txt file and upload it to your website's root directory.

The robots.txt file is useful for polite bots, but spammers are generally not polite, so they tend to ignore it. It is still worth having a robots.txt file, since it helps with the polite bots. However, be careful not to block the wrong path, as that can stop the good bots from crawling content that you actually want them to crawl.

Some bots simply ignore it. Some malicious bots crawl from IP addresses spread across botnets of hundreds to millions of infected devices around the globe. For example, the aggressive 360Spider makes so many requests per second that it can overload the server, and it simply ignores the robots.txt file. The robots.txt file will not stop that bot.
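
To see why robots.txt only restrains well-behaved crawlers, here is a small sketch of how a polite bot checks robots.txt before fetching a page, using Python's standard urllib.robotparser module (the site URL is just a placeholder):

from urllib import robotparser

# A polite crawler downloads robots.txt and honours its rules.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()

# The crawler asks permission before each request; a bad bot simply
# skips this step and fetches the page anyway.
if rp.can_fetch("Baiduspider", "https://example.com/some-page"):
    print("Allowed to crawl")
else:
    print("Blocked by robots.txt")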

(2) Use .htaccess and HTTP_USER_AGENT

The .htaccess file is a file on the server (located under the domain root) that can be used to control access to your website and some of its behavior. Use a text editor to create or edit your .htaccess file and add the following rules as you need.

Block a single IP address
deny from 123.123.123.123
Block a range of IP addresses

You can leave off the last octet(s) of an IP address to block everything in that range. The following example would block 123.123.123.0 - 123.123.123.255:

deny from 123.123.123

You can also use CIDR (Classless Inter-Domain Routing) notation for the IPs. For instance, 123.123.123.0/24 would block the range 123.123.123.0 - 123.123.123.255, and 123.123.123.0/18 would block the range 123.123.64.0 - 123.123.127.255:

deny from 123.123.123.0/24
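
If you want to double-check what a given CIDR block actually covers before adding a deny rule, here is a quick sketch using Python's standard ipaddress module (the addresses are just the examples from above):

import ipaddress

# strict=False masks off any host bits, so 123.123.123.0/18 is treated
# as the containing /18 network, 123.123.64.0/18.
for cidr in ("123.123.123.0/24", "123.123.123.0/18"):
    net = ipaddress.ip_network(cidr, strict=False)
    print(f"{cidr}: {net.network_address} - {net.broadcast_address} "
          f"({net.num_addresses} addresses)")

# Output:
# 123.123.123.0/24: 123.123.123.0 - 123.123.123.255 (256 addresses)
# 123.123.123.0/18: 123.123.64.0 - 123.123.127.255 (16384 addresses)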
Block bad users based on their user-agent string:

The following code turns on the Apache RewriteEngine; the next line then looks at the user-agent string of each request. If any of the names listed in the pattern (Baiduspider, 360Spider, Yandex, NaverBot, Yeti, moget, ichiro, RedBot) appears anywhere in that string, the RewriteRule takes the original request and turns it into a 403 Forbidden response via the R=403 flag:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(Baiduspider|360Spider|Yandex|NaverBot|Yeti|moget|ichiro|RedBot).*$ [NC]
# ISSUE 403 / Serve Errordocument
RewriteRule .* - [R=403,L]

Explanation:

  1. RewriteEngine on: turns on the URL rewriting engine that processes each incoming request according to the conditions and rules that follow.
  2. RewriteCond: this condition tests the HTTP_USER_AGENT header of the request; only when it matches the pattern is the RewriteRule on the next row executed.
  3. The various User Agents to be blocked from access are listed in the expression.
  4. The alternatives are combined with the "|" (OR) operator inside the pattern.
  5. "NC": "no case" - case-insensitive matching.
  6. The caret "^" anchors the pattern at the start of the User-Agent string; because it is followed by ".*", the listed names may appear anywhere in the string, not only at the beginning.
  7. RewriteRule: specifies the rule that is executed when the condition matches. It takes two parameters: the first is a regular expression matched against the requested URL (".*" matches every URL), and the second is the substitution ("-" means the URL is left unchanged).
  8. "[R=403,L]": the R=403 flag makes Apache answer with a 403 Forbidden response instead of serving the page, and the L flag stops further rule processing.

Blocking by user-agent is not fool-proof either, because spammers often impersonate browsers and other popular user agents (such as the Google bots). In fact, spoofing the user agent is one of the easiest things a spammer can do.
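
To illustrate how trivial spoofing is (and, conversely, how you can test that your .htaccess rule really returns a 403), here is a small sketch using the third-party requests library; the URL and user-agent strings are placeholders:

import requests

URL = "https://example.com/"  # placeholder: your own site

# Any client can claim to be any crawler simply by setting a header.
for user_agent in ("Baiduspider", "Mozilla/5.0 (compatible; Googlebot/2.1)"):
    response = requests.get(URL, headers={"User-Agent": user_agent}, timeout=10)
    print(f"{user_agent!r} -> HTTP {response.status_code}")

# With the rewrite rule above in place, the Baiduspider request should
# print "HTTP 403" while an unlisted user agent is served normally.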

  

(3) Bot traps

This is probably the best way to protect yourself from bots that are not polite and that don't correctly identify themselves with the User-Agent. There are at least two types of traps:

The robots.txt trap (which only works if the bot reads the robots.txt): dedicate an off-limits directory in the robots.txt and set up your server to block the IP address of any entity which tries to visit that directory.

Create "hidden" links in your web pages that also lead to the forbidden directory and any bot that crawls those links AND doesn't abide by your robots.txt will step into the trap and get the IP blocked.

A hidden link is one which is not visible to a person, such as an anchor tag with no link text. Alternatively, you can put text in the anchor tag but make the font very small and set the text color to match the background color so that humans can't see the link. The hidden-link trap can catch any non-human bot, so I'd recommend combining it with the robots.txt trap so that you only catch the bad bots.
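
As a rough sketch of how such a trap could be wired up, the following uses the third-party Flask framework; the /trap/ path, the hidden link, and the blocklist file are hypothetical names chosen for illustration. The /trap/ path would also be disallowed in robots.txt, and in practice you would feed the collected IPs into your .htaccess or firewall rules:

from flask import Flask, abort, request

app = Flask(__name__)
BLOCKLIST = set()                # IPs that walked into the trap
BLOCKLIST_FILE = "trap_ips.txt"  # hypothetical file you later turn into deny rules

@app.before_request
def refuse_trapped_ips():
    # Anything that previously hit the trap is refused outright.
    if request.remote_addr in BLOCKLIST:
        abort(403)

@app.route("/trap/")
def trap():
    # This path is disallowed in robots.txt and only reachable through a
    # hidden link, so any visitor here is almost certainly a bad bot.
    BLOCKLIST.add(request.remote_addr)
    with open(BLOCKLIST_FILE, "a") as fh:
        fh.write(request.remote_addr + "\n")
    abort(403)

@app.route("/")
def index():
    # The anchor has no visible text, so humans never follow it.
    return '<p>Welcome!</p><a href="/trap/"></a>'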

(4) Verifying bots

The above steps will probably help you get rid of 99.9% of the spammers, but there might be a handful of bad bots that impersonate a popular bot (such as Googlebot) AND abide by your robots.txt. Those bots can eat up the number of requests you've allocated for Googlebot and may cause you to temporarily disallow Google from crawling your website. In that case you have one more option: verify the identity of the bot. Most major crawlers (the ones you'd want to be crawled by) provide a way to identify their bots; here is Google's recommendation for verifying theirs: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
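
Google's recommended check is a reverse DNS lookup on the requesting IP followed by a forward lookup on the resulting host name. A minimal sketch of that check in Python, assuming the googlebot.com/google.com domains from Google's published guidance (the example IP is only illustrative; adapt the domains for other crawlers):

import socket

def is_real_googlebot(ip_address: str) -> bool:
    """Reverse-resolve the IP, check the host name belongs to Google,
    then forward-resolve the name and confirm it maps back to the IP."""
    try:
        host, _aliases, _ips = socket.gethostbyaddr(ip_address)   # reverse DNS
    except (socket.herror, socket.gaierror):
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return socket.gethostbyname(host) == ip_address           # forward DNS
    except (socket.herror, socket.gaierror):
        return False

# A request claiming to be Googlebot either passes both lookups or is
# treated as an impostor and blocked by IP.
print(is_real_googlebot("66.249.66.1"))  # illustrative address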

Any bot that impersonates another major bot and fails verification can be blocked by IP. That should probably get you closer to preventing 99.99% of the bad bots from crawling your site.
