Web crawlers visit websites every day. It is not widely known that more than half of all web traffic comes from robots, not from real users. Browsing through server logs, one encounters a great variety of robots performing activities both friendly and evil: scanning pages for email addresses, searching for vulnerabilities, collecting data about the site, and indexing it in search engines (like Googlebot). Some robots scrape content from pages and republish it at their own address, which is bad for our site. So there is a need for blocking robots…
Blocking robots
Many of these robots can be blocked. This gives you a handful of benefits: less spam, a safer website, less stolen content. The amount of data transferred from the hosting account also decreases, as blocked robots no longer consume the transfer.
Blocking robots – how is it done?
There are several ways to block robots. You can block robots in robots.txt, but aggressive robots bypass this file, so another method is better: blocking robots by their agent name at the web server level. This way, a robot using any banned user agent is simply blocked and receives the HTTP 403 code – access forbidden. Of course, some robots impersonate search engines or browsers and nothing can be done there, but many robots use standard Perl or Java user agent names and are easy to block. Our blocking method is fairly universal and works with the .htaccess files of most web hosting providers. In fact, it is just 3 long lines in .htaccess that do the blocking, which does not burden the server very much.
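A minimal sketch of the technique (with hypothetical bot names, not the full list published below) looks like this:

```apache
# Sketch: block requests whose User-Agent matches any banned name.
# "badbot" and "evilscraper" are placeholder names for illustration;
# the full list later in this article works the same way, just longer.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} badbot|evilscraper [NC]
RewriteRule .* - [F]
```

The [NC] flag makes the match case-insensitive, and [F] returns the 403 Forbidden response.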
How have we compiled the unwanted robots list?
Our experience shows that a viable way to block robots by user agent is to list the unwanted robots explicitly. We have processed the logs of various websites from the last 10 years with regular expressions. Then we wrote a special program that extracted distinctive parts of the robot names so that the list stays as short as possible. The result is a list of over 1800 robots we do not want. This list is constantly in use on various sites and is updated. Internet search engines such as Google, Bing, Yandex and Yahoo, as well as social networking sites such as Twitter and Facebook, have been removed from the blocking list, as we consider these bots useful.
All other robots are on the list. Because web servers do not allow such long lines in .htaccess, the file has been split into fragments containing about 500 robot names each. We have installed this file on many websites and have had no problems with it at any of the hosting companies.
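The splitting step can be sketched in a few lines (a simplified illustration with hypothetical bot names; the real generator we used is more involved):

```python
# Sketch: turn a list of bot-name fragments into .htaccess rule lines,
# escaping regex metacharacters and splitting every 500 names so no
# single RewriteCond line grows too long for the web server.
import re

def build_rules(names, chunk_size=500):
    rules = []
    for i in range(0, len(names), chunk_size):
        chunk = [re.escape(n) for n in names[i:i + chunk_size]]
        pattern = "|".join(chunk)
        rules.append(
            "RewriteCond %{HTTP_USER_AGENT} " + pattern + " [NC,OR]"
        )
    if rules:
        # The last condition must not carry the OR flag.
        rules[-1] = rules[-1].replace(" [NC,OR]", " [NC]")
        rules.append("RewriteRule .* - [F]")
    return rules

rules = build_rules(["badbot", "evil-scraper", "spam.bot"])
print(rules[0])
```

With 1800+ names this produces the three long condition lines you see at the end of the article, plus the final RewriteRule.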
Does blocking robots affect your search engine rankings?
Blocking robots in this form has no effect on SEO. We have a bunch of #1-ranked pages that use this robot-blocking technique. Google uses its own robot and also sometimes masquerades as popular browsers such as Chrome or Firefox to detect cloaking. None of these robots are blocked by our file. As far as we know, Google does not use other robot names that would appear on our list.
How to block robots on your website?
At the end of this article there is a text to paste at the beginning of the .htaccess file of your website. Do not change anything in this text; every space and every line break is important. You cannot add comments either. Just select all the text below and paste it at the top of your .htaccess file, right after the “RewriteEngine On” directive (if you have this directive in .htaccess). Additional note: in the file there must be a space before the entries [NC,OR] and [NC]. Sometimes this space disappears when you paste into a new file; without it, blocking robots will not work, although the server will not report any error.
How to check if we are blocking robots correctly?
The Xenu Link Sleuth program is on the blocking list. After uploading the .htaccess file to the server, you should no longer be able to load your page with Xenu; a “forbidden request” error appears instead. If this is the case, blocking robots works properly.
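You can also simulate the check offline. The server matches the User-Agent string case-insensitively against the pattern, so a quick way to see whether a given agent would be caught is to run the same match yourself (a sketch using only a tiny excerpt of the list, not the complete pattern):

```python
# Sketch: emulate the server-side user-agent match for a few entries.
# "xenu", "httrack" and "wget" are excerpts from the real list; this
# regex is NOT the complete pattern, just enough for a demonstration.
import re

BLOCKED = re.compile(r"xenu|httrack|wget", re.IGNORECASE)  # excerpt only

def would_block(user_agent):
    """Return True if the user agent matches the (excerpted) block list."""
    return BLOCKED.search(user_agent) is not None

print(would_block("Xenu Link Sleuth/1.3"))
print(would_block("Mozilla/5.0 (Windows NT 10.0) Chrome/63.0"))
```

From the command line, `curl -A "Xenu Link Sleuth" -I https://your-site.example/` should likewise return HTTP 403 once the rules are installed.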
Copyright for blocking robots
The following list and the way to block the robots are provided completely free, in the hope that there will be less spam on the Internet. If the list works for you and you are happy with it, do not forget to like or share this page on Facebook or Twitter.
Complaints ?
If the blocking robots list does not work properly, please inform us via the comments. Thank you for any suggestions regarding the list. Periodically, an updated version will appear here; the frequency will depend on the number of newly released spambots. New list releases will be announced on Twitter, so it is worth following us! We also invite you to use our services, and we will block the bad bots for you!
Update Dec 2017
Update – unblocked some Opera browsers. They have “edition” in the user agent string.
#bad bots start #programmed by tab-studio.com public version 2017.12 #1 new rule every 500 entries RewriteCond %{HTTP_USER_AGENT} \ 12soso|\ 192\.comagent|\ 1noonbot|\ 1on1searchbot|\ 3de\_search2|\ 3d\_search|\ 3g\ bot|\ 3gse|\ 50\.nu|\ a1\ sitemap\ generator|\ a1\ website\ download|\ a6\-indexer|\ aasp|\ abachobot|\ abonti|\ abotemailsearch|\ aboundex|\ aboutusbot|\ accmonitor\ compliance\ server|\ accoon|\ achulkov\.net\ page\ walker|\ acme\.spider|\ acoonbot|\ acquia\-crawler|\ activetouristbot|\ ad\ muncher|\ adamm\ bot|\ adbeat\_bot|\ adminshop\.com|\ advanced\ email\ extractor|\ aesop\_com\_spiderman|\ aespider|\ af\ knowledge\ now\ verity\ spider|\ aggregator:vocus|\ ah\-ha\.com\ crawler|\ ahrefs|\ aibot|\ aidu|\ aihitbot|\ aipbot|\ aisiid|\ aitcsrobot/1\.1|\ ajsitemap|\ akamai\-sitesnapshot|\ alexawebsearchplatform|\ alexfdownload|\ alexibot|\ alkalinebot|\ all\ acronyms\ bot|\ alpha\ search\ agent|\ amerla\ search\ bot|\ amfibibot|\ ampmppc\.com|\ amznkassocbot|\ anemone|\ anonymous|\ anotherbot|\ answerbot|\ answerbus|\ answerchase\ prove|\ antbot|\ antibot|\ antisantyworm|\ antro\.net|\ aonde\-spider|\ aport|\ appengine\-google|\ appid\:\ s\~stremor\-crawler\-|\ aqua\_products|\ arabot|\ arachmo|\ arachnophilia|\ archive\.org\_bot|\ aria\ equalizer|\ arianna\.libero\.it|\ arikus\_spider|\ art\-online\.com|\ artavisbot|\ artera|\ asaha\ search\ engine\ turkey|\ ask|\ aspider|\ aspseek|\ asterias|\ astrofind|\ athenusbot|\ atlocalbot|\ atomic\_email\_hunter|\ attach|\ attrakt|\ attributor|\ augurfind|\ auresys|\ autobaron\ crawler|\ autoemailspider|\ autowebdir|\ avsearch\-|\ axfeedsbot|\ axonize\-bot|\ ayna|\ b2w|\ backdoorbot|\ backrub|\ backstreet\ browser|\ backweb|\ baidu|\ bandit|\ batchftp|\ baypup|\ bdfetch|\ becomebot|\ becomejpbot|\ beetlebot|\ bender|\ besserscheitern\-crawl|\ betabot|\ big\ brother|\ big\ data|\ bigado\.com|\ bigcliquebot|\ bigfoot|\ biglotron|\ bilbo|\ bilgibetabot|\ bilgibot|\ bintellibot|\ bitlybot|\ bitvouseragent|\ 
bizbot003|\ bizbot04|\ bizworks\ retriever|\ black\ hole|\ black\.hole|\ blackbird|\ blackmask\.net\ search\ engine|\ blackwidow|\ bladder\ fusion|\ blaiz\-bee|\ blexbot|\ blinkx|\ blitzbot|\ blog\ conversation\ project|\ blogmyway|\ blogpulselive|\ blogrefsbot|\ blogscope|\ blogslive|\ bloobybot|\ blowfish|\ blt|\ bnf\.fr\_bot|\ boaconstrictor|\ boardreader|\ boia\-scan\-agent|\ boia\.org|\ boitho|\ boi\_crawl\_00|\ bookmark\ buddy\ bookmark\ checker|\ bookmark\ search\ tool|\ bosug|\ bot\ apoena|\ botalot|\ botrighthere|\ botswana|\ bottybot|\ bpbot|\ braintime\_search|\ brokenlinkcheck\.com|\ browseremulator|\ browsermob|\ bruinbot|\ bsearchr&d|\ bspider|\ btbot|\ btsearch|\ bubing|\ buddy|\ buibui|\ buildcms\ crawler|\ builtbottough|\ bullseye|\ bumblebee|\ bunnyslippers|\ buscadorclarin|\ buscaplus\ robi|\ butterfly|\ buyhawaiibot|\ buzzbot|\ byindia|\ byspider|\ byteserver|\ bzbot|\ c\ r\ a\ w\ l\ 3\ r|\ cacheblaster|\ caddbot|\ cafi|\ camcrawler|\ camelstampede|\ canon\-webrecord|\ careerbot|\ cataguru|\ catchbot|\ cazoodle|\ ccbot|\ ccgcrawl|\ ccubee|\ cd\-preload|\ ce\-preload|\ cegbfeieh|\ cerberian\ drtrs|\ cert\ figleafbot|\ cfetch|\ cfnetwork|\ chameleon|\ charlotte|\ check&get|\ checkbot|\ checklinks|\ cheesebot|\ chemiede\-nodebot|\ cherrypicker|\ chilkat|\ chinaclaw|\ cipinetbot|\ cis455crawler|\ citeseerxbot|\ cizilla|\ clariabot|\ climate\ ark|\ climateark\ spider|\ clshttp|\ clushbot|\ coast\ scan\ engine|\ coast\ webmaster\ pro|\ coccoc|\ collapsarweb|\ collector|\ colocrossing|\ combine|\ connectsearch|\ conpilot|\ contentsmartz|\ contextad\ bot|\ contype|\ cookienet|\ coolbot|\ coolcheck|\ copernic|\ copier|\ copyrightcheck|\ core\-project|\ cosmos|\ covario\-ids|\ cowbot\-|\ cowdog\ bot|\ crabbybot|\ craftbot\@yahoo\.com|\ crawler\.kpricorn\.org|\ crawler43\.ejupiter\.com|\ crawler4j|\ crawler@|\ crawler\_for\_infomine|\ crawly|\ crawl\_application|\ creativecommons|\ crescent|\ cs\-crawler|\ cse\ html\ validator|\ cshttpclient|\ cuasarbot|\ 
culsearch|\ curl|\ custo|\ cvaulev|\ cyberdog|\ cybernavi\_webget|\ cyberpatrol\ sitecat\ webbot|\ cyberspyder|\ cydralspider|\ d1garabicengine|\ datacha0s|\ datafountains|\ dataparksearch|\ dataprovider\.com|\ datascape\ robot|\ dataspearspiderbot|\ dataspider|\ dattatec\.com|\ daumoa|\ dblbot|\ dcpbot|\ declumbot|\ deepindex|\ deepnet\ crawler|\ deeptrawl|\ dejan|\ del\.icio\.us\-thumbnails|\ deltascan|\ delvubot|\ der\ groยงe\ bildersauger|\ der\ groรe\ bildersauger|\ deusu|\ dfs\-fetch|\ diagem|\ diamond|\ dibot|\ didaxusbot|\ digext|\ digger|\ digi\-rssbot|\ digitalarchivesbot|\ digout4u|\ diibot|\ dillo|\ dir\_snatch\.exe|\ disco|\ distilled\-reputation\-monitor|\ djangotraineebot|\ dkimrepbot|\ dmoz\ downloader|\ docomo|\ dof\-verify|\ domaincrawler|\ domainscan|\ domainwatcher\ bot|\ dotbot|\ dotspotsbot|\ dow\ jones\ searchbot|\ download|\ doy|\ dragonfly|\ drip|\ drone|\ dtaagent|\ dtsearchspider|\ dumbot|\ dwaar|\ dxseeker|\ e\-societyrobot|\ eah|\ earth\ platform\ indexer|\ earth\ science\ educator\ \ robot|\ easydl|\ ebingbong|\ ec2linkfinder|\ ecairn\-grabber|\ ecatch|\ echoosebot|\ edisterbot|\ edugovsearch|\ egothor|\ eidetica\.com|\ eirgrabber|\ elblindo\ the\ blind\ bot|\ elisabot|\ ellerdalebot|\ email\ exractor|\ emailcollector|\ emailleach|\ emailsiphon|\ emailwolf|\ emeraldshield|\ empas\_robot|\ enabot|\ endeca|\ enigmabot|\ enswer\ neuro\ bot|\ enter\ user\-agent|\ entitycubebot|\ erocrawler|\ estylesearch|\ esyndicat\ bot|\ eurosoft\-bot|\ evaal|\ eventware|\ everest\-vulcan\ inc\.|\ exabot|\ exactsearch|\ exactseek|\ exooba|\ exploder|\ express\ webpictures|\ extractor|\ eyenetie|\ ez\-robot|\ ezooms|\ f\-bot\ test\ pilot|\ factbot|\ fairad\ client|\ falcon|\ fast\ data\ search\ document\ retriever|\ fast\ esp|\ fast\-search\-engine|\ fastbot\ crawler|\ fastbot\.de\ crawler|\ fatbot|\ favcollector|\ faviconizer|\ favorites\ sweeper|\ fdm|\ fdse\ robot|\ fedcontractorbot|\ fembot|\ fetch\ api\ request|\ fetch\_ici|\ fgcrawler|\ filangy|\ 
filehound|\ findanisp\.com\_isp\_finder|\ findlinks|\ findweb|\ firebat|\ firstgov\.gov\ search|\ flaming\ attackbot|\ flamingo\_searchengine|\ flashcapture|\ flashget|\ flickysearchbot|\ fluffy\ the\ spider|\ flunky|\ focused\_crawler|\ followsite|\ foobot|\ fooooo\_web\_video\_crawl|\ fopper|\ formulafinderbot|\ forschungsportal|\ francis|\ freewebmonitoring\ sitechecker|\ freshcrawler|\ freshdownload|\ freshlinks\.exe|\ friendfeedbot|\ frodo\.at|\ froggle|\ frontpage|\ froola\ bot|\ fr\_crawler|\ fu\-nbi|\ full\_breadth\_crawler|\ funnelback|\ furlbot|\ g10\-bot|\ gaisbot|\ galaxybot|\ gazz|\ gbplugin|\ generate\_infomine\_category\_classifiers|\ genevabot|\ geniebot|\ genieo|\ geomaxenginebot|\ geometabot|\ geonabot|\ geovisu|\ germcrawler\ |\ gethtmlcontents|\ getleft|\ getright|\ getsmart|\ geturl\.rexx|\ getweb!|\ giant|\ gigablastopensource|\ gigabot|\ girafabot|\ gleamebot|\ gnome\-vfs|\ go!zilla|\ go\-ahead\-got\-it|\ go\-http\-client|\ goforit\.com|\ goforitbot|\ gold\ crawler|\ goldfire\ server|\ golem|\ goodjelly|\ gordon\-college\-google\-mini|\ goroam|\ goseebot|\ gotit|\ govbot|\ gpu\ p2p\ crawler|\ grabber|\ grabnet|\ grafula|\ grapefx|\ grapeshot|\ grbot|\ greenyogi|\ gromit|\ grub|\ gsa|\ gslfbot|\ gulliver|\ gulperbot|\ gurujibot|\ gvc\ business\ crawler|\ gvc\ crawler|\ gvc\ search\ bot|\ gvc\ web\ crawler|\ gvc\ weblink\ crawler|\ gvc\ world\ links|\ gvcbot\.com|\ happyfunbot|\ harvest|\ hatena\ antenna|\ hawler|\ hcat|\ hclsreport\-crawler|\ hd\ nutch\ agent|\ header\_test\_client|\ healia\ [NC,OR] #500 new rule RewriteCond %{HTTP_USER_AGENT} \ helix|\ here\ will\ be\ link\ to\ crawler\ site|\ heritrix|\ hiscan|\ hisoftware\ accmonitor\ server|\ hisoftware\ accverify|\ hitcrawler|\ hivabot|\ hloader|\ hmsebot|\ hmview|\ hoge|\ holmes|\ homepagesearch|\ hooblybot\-image|\ hoowwwer|\ hostcrawler|\ hsft\ \\-\ link\ scanner|\ hsft\ \\-\ lvu\ scanner|\ hslide|\ ht://check|\ htdig|\ html\ link\ validator|\ htmlparser|\ httplib|\ httrack|\ 
huaweisymantecspider|\ hul\-wax|\ humanlinks|\ hyperestraier|\ hyperix|\ iaarchiver\-|\ ia\_archiver|\ ibuena|\ icab|\ icds\-ingestion|\ ichiro|\ icopyright\ conductor|\ ieautodiscovery|\ iecheck|\ ihwebchecker|\ iiitbot|\ iim\_405|\ ilsebot|\ iltrovatore|\ image\ stripper|\ image\ sucker|\ image\-fetcher|\ imagebot|\ imagefortress|\ imageshereimagesthereimageseverywhere|\ imagevisu|\ imds\_monitor|\ imo\-google\-robot\-intelink|\ inagist\.com\ url\ crawler|\ indexer|\ industry\ cortex\ webcrawler|\ indy\ library|\ indylabs\_marius|\ inelabot|\ inet32\ ctrl|\ inetbot|\ info\ seeker|\ infolink|\ infomine|\ infonavirobot|\ informant|\ infoseek\ sidewinder|\ infotekies|\ infousabot|\ ingrid|\ inktomi|\ insightscollector|\ insightsworksbot|\ inspirebot|\ insumascout|\ intelix|\ intelliseek|\ interget|\ internet\ ninja|\ internet\ radio\ crawler|\ internetlinkagent|\ interseek|\ ioi|\ ip\-web\-crawler\.com|\ ipadd\ bot|\ ipselonbot|\ ips\-agent|\ iria|\ irlbot|\ iron33|\ isara|\ isearch|\ isilox|\ istellabot|\ its\-learning\ crawler|\ iu\_csci\_b659\_class\_crawler|\ ivia|\ jadynave|\ java|\ jbot|\ jemmathetourist|\ jennybot|\ jetbot|\ jetbrains\ omea\ pro|\ jetcar|\ jim|\ jobo|\ jobspider\_ba|\ joc|\ joedog|\ joyscapebot|\ jspyda|\ junut\ bot|\ justview|\ jyxobot|\ k\.s\.bot|\ kakclebot|\ kalooga|\ katatudo\-spider|\ kbeta1|\ keepni\ web\ site\ monitor|\ kenjin\.spider|\ keybot\ translation\-search\-machine|\ keywenbot|\ keyword\ density|\ keyword\.density|\ kinjabot|\ kitenga\-crawler\-bot|\ kiwistatus|\ kmbot\-|\ kmccrew\ bot\ search|\ knight|\ knowitall|\ knowledge\ engine|\ knowledge\.com|\ koepabot|\ koninklijke|\ korniki|\ krowler|\ ksbot|\ kuloko\-bot|\ kulturarw3|\ kummhttp|\ kurzor|\ kyluka\ crawl|\ l\.webis|\ labhoo|\ labourunions411|\ lachesis|\ lament|\ lamerexterminator|\ lapozzbot|\ larbin|\ lbot|\ leaptag|\ leechftp|\ leechget|\ letscrawl\.com|\ lexibot|\ lexxebot|\ lftp|\ libcrawl|\ libiviacore|\ libw|\ likse|\ linguee\ bot|\ link\ checker|\ link\ 
validator|\ linkalarm|\ linkbot|\ linkcheck\ by\ siteimprove\.com|\ linkcheck\ scanner|\ linkchecker|\ linkdex\.com|\ linkextractorpro|\ linklint|\ linklooker|\ linkman|\ links\ sql|\ linkscan|\ linksmanager\.com\_bot|\ linksweeper|\ linkwalker|\ link\_checker|\ litefinder|\ litlrbot|\ little\ grabber\ at\ skanktale\.com|\ livelapbot|\ lm\ harvester|\ lmqueuebot|\ lnspiderguy|\ loadtimebot|\ localcombot|\ locust|\ lolongbot|\ lookbot|\ lsearch|\ lssbot|\ lt\ scotland\ checklink|\ ltx71.com|\ lwp|\ lycos\_spider|\ lydia\ entity\ spider|\ lynnbot|\ lytranslate|\ mag\-net|\ magnet|\ magpie\-crawler|\ magus\ bot|\ mail\.ru|\ mainseek\_bot|\ mammoth|\ map\ robot|\ markwatch|\ masagool|\ masidani\_bot\_|\ mass\ downloader|\ mata\ hari|\ mata\.hari|\ matentzn\ at\ cs\ dot\ man\ dot\ ac\ dot\ uk|\ maxamine\.com\-\-robot|\ maxamine\.com\-robot|\ maxomobot|\ mcbot|\ medrabbit|\ megite|\ memacbot|\ memo|\ mendeleybot|\ mercator\-|\ mercuryboard\_user\_agent\_sql\_injection\.nasl|\ metacarta|\ metaeuro\ web\ search|\ metager2|\ metagloss|\ metal\ crawler|\ metaquerier|\ metaspider|\ metaspinner|\ metauri|\ mfcrawler|\ mfhttpscan|\ midown\ tool|\ miixpc|\ mini\-robot|\ minibot|\ minirank|\ mirror|\ missigua\ locator|\ mister\ pix|\ mister\.pix|\ miva|\ mj12bot|\ mnogosearch|\ moduna\.com|\ mod\_accessibility|\ moget|\ mojeekbot|\ monkeycrawl|\ moses|\ mowserbot|\ mqbot|\ mse360|\ msindianwebcrawl|\ msmobot|\ msnptc|\ msrbot|\ mt\-soft|\ multitext|\ my\-heritrix\-crawler|\ myapp|\ mycompanybot|\ mycrawler|\ myengines\-us\-bot|\ myfamilybot|\ myra|\ my\_little\_searchengine\_project|\ nabot|\ najdi\.si|\ nambu|\ nameprotect|\ nasa\ search|\ natchcvs|\ natweb\-bad\-link\-mailer|\ naver|\ navroad|\ nearsite|\ nec\-meshexplorer|\ neosciocrawler|\ nerdbynature\.bot|\ nerdybot|\ nerima\-crawl-|\ nessus|\ nestreader|\ net\ vampire|\ net::trackback|\ netants|\ netcarta\ cyberpilot\ pro|\ netcraft|\ netexperts|\ netid\.com\ bot|\ netmechanic|\ netprospector|\ netresearchserver|\ 
netseer|\ netshift=|\ netsongbot|\ netsparker|\ netspider|\ netsrcherp|\ netzip|\ newmedhunt|\ news\ bot|\ newsgatherer|\ newsgroupreporter|\ newstrovebot|\ news\_search\_app|\ nextgensearchbot|\ nextthing\.org|\ nicebot|\ nicerspro|\ niki\-bot|\ nimblecrawler|\ nimbus\-1|\ ninetowns|\ ninja|\ njuicebot|\ nlese|\ nogate|\ norbert\ the\ spider|\ noteworthybot|\ npbot|\ nrcan\ intranet\ crawler|\ nsdl\_search\_bot|\ nuggetize\.com\ bot|\ nusearch\ spider|\ nutch|\ nu\_tch|\ nwspider|\ nymesis|\ nys\-crawler|\ objectssearch|\ obot|\ obvius\ external\ linkcheck|\ ocelli|\ octopus|\ odp\ entries\ t\_st|\ oegp|\ offline\ navigator|\ offline\.explorer|\ ogspider|\ omiexplorer\_bot|\ omniexplorer|\ omnifind|\ omniweb|\ onetszukaj|\ online\ link\ validator|\ oozbot|\ openbot|\ openfind|\ openintelligencedata|\ openisearch|\ openlink\ virtuoso\ rdf\ crawler|\ opensearchserver\_bot|\ opidig|\ optidiscover|\ oracle\ secure\ enterprise\ search|\ oracle\ ultra\ search|\ orangebot|\ orisbot|\ ornl\_crawler|\ ornl\_mercury|\ osis\-project\.jp|\ oso|\ outfoxbot|\ outfoxmelonbot|\ owler\-bot|\ owsbot|\ ozelot|\ p3p\ client|\ pagebiteshyperbot|\ pagebull|\ pagedown|\ pagefetcher|\ pagegrabber|\ pagepeeker|\ pagerank\ monitor|\ page\_verifier|\ pamsnbot\.htm|\ panopy\ bot|\ panscient\.com|\ pansophica|\ papa\ foto|\ paperlibot|\ parasite|\ parsijoo|\ pathtraq|\ pattern|\ patwebbot|\ pavuk|\ paxleframework|\ pbbot|\ pcbrowser|\ pcore\-http|\ pd\-crawler|\ penthesila|\ perform\_crawl|\ perman|\ personal\ ultimate\ crawler|\ php\ version\ tracker|\ phpcrawl|\ phpdig|\ picosearch|\ pieno\ robot|\ pipbot|\ pipeliner|\ pita|\ pixfinder|\ piyushbot|\ planetwork\ bot\ search|\ plucker|\ plukkie|\ plumtree|\ pockey|\ pocohttp|\ pogodak\.ba|\ pogodak\.co\.yu|\ poirot|\ polybot|\ pompos|\ poodle\ predictor|\ popscreenbot|\ postpost|\ privacyfinder|\ projectwf\-java\-test\-crawler|\ propowerbot|\ prowebwalker|\ proxem\ websearch|\ proximic|\ proxy\ crawler|\ psbot|\ pss\-bot|\ psycheclone|\ 
pub\-crawler|\ pucl|\ pulsebot|\ pump|\ pwebot|\ python|\ qeavis\ agent|\ qfkbot|\ qualidade|\ qualidator\.com\ bot|\ quepasacreep|\ queryn\ metasearch|\ queryn\.metasearch|\ quest\.durato|\ quintura\-crw|\ qunarbot|\ qwantify|\ qweerybot|\ qweery\_robot\.txt\_checkbot|\ r2ibot|\ r6\_commentreader|\ r6\_feedfetcher|\ r6\_votereader|\ rabot|\ radian6|\ radiation\ retriever|\ rampybot|\ rankivabot|\ rankur|\ rational\ sitecheck|\ rcstartbot|\ realdownload|\ reaper|\ rebi\-shoveler|\ recorder|\ redbot|\ redcarpet|\ reget|\ repomonkey|\ research\ robot|\ riddler|\ riight|\ risenetbot|\ riverglassscanner\ [NC,OR] #1000 new rule RewriteCond %{HTTP_USER_AGENT} \ robopal|\ robosourcer|\ robotek|\ robozilla|\ roger|\ rome\ client|\ rondello|\ rotondo|\ roverbot|\ rpt\-httpclient|\ rtgibot|\ rufusbot|\ runnk\ online\ rss\ reader|\ runnk\ rss\ aggregator|\ s2bot|\ safaribookmarkchecker|\ safednsbot|\ safetynet\ robot|\ saladspoon|\ sapienti|\ sapphireweb|\ sbider|\ sbl\-bot|\ scfcrawler|\ scich|\ scientificcommons\.org|\ scollspider|\ scooperbot|\ scooter|\ scoutjet|\ scrapebox|\ scrapy|\ scrawltest|\ screaming\ frog|\ scrubby|\ scspider|\ scumbot|\ search\ publisher|\ search\ x\-bot|\ search\-channel|\ search\-engine\-studio|\ search\.kumkie\.com|\ search\.updated\.com|\ search\.usgs\.gov|\ searcharoo\.net|\ searchblox|\ searchbot|\ searchengine|\ searchhippo\.com|\ searchit\-bot|\ searchmarking|\ searchmarks|\ searchmee!|\ searchmee\_v|\ searchmining|\ searchnowbot|\ searchpreview|\ searchspider\.com|\ searqubot|\ seb\ spider|\ seekbot|\ seeker\.lookseek\.com|\ seeqbot|\ seeqpod\-vertical\-crawler|\ selflinkchecker|\ semager|\ semanticdiscovery|\ semantifire|\ semisearch|\ semrushbot|\ seoengworldbot|\ seokicks|\ seznambot|\ shablastbot|\ shadowwebanalyzer|\ shareaza|\ shelob|\ sherlock|\ shim\-crawler|\ shopsalad|\ shopwiki|\ showlinks|\ showyoubot|\ siclab|\ silk|\ simplepie|\ siphon|\ sitebot|\ sitecheck|\ sitefinder|\ siteguardbot|\ siteorbiter|\ sitesnagger|\ 
sitesucker|\ sitesweeper|\ sitexpert|\ skimbot|\ skimwordsbot|\ skreemrbot|\ skywalker|\ sleipnir|\ slow\-crawler|\ slysearch|\ smart\-crawler|\ smartdownload|\ smarte\ bot|\ smartwit\.com|\ snake|\ snap\.com\ beta\ crawler|\ snapbot|\ snappreviewbot|\ snappy|\ snookit|\ snooper|\ snoopy|\ societyrobot|\ socscibot|\ soft411\ directory|\ sogou|\ sohu\ agent|\ sohu\-search|\ sokitomi\ crawl|\ solbot|\ sondeur|\ sootle|\ sosospider|\ space\ bison|\ space\ fung|\ spacebison|\ spankbot|\ spanner|\ spatineo\ monitor\ controller|\ spatineo\ serval\ controller|\ spatineo\ serval\ getmapbot|\ special\_archiver|\ speedy|\ sphere\ scout|\ sphider|\ spider\.terranautic\.net|\ spiderengine|\ spiderku|\ spiderman|\ spinn3r|\ spinne|\ sportcrew\-bot|\ sproose|\ spyder3\.microsys\.com|\ sq\ webscanner|\ sqlmap|\ squid\-prefetch|\ squidclamav\_redirector|\ sqworm|\ srevbot|\ sslbot|\ ssm\ agent|\ stackrambler|\ stardownloader|\ statbot|\ statcrawler|\ statedept\-crawler|\ steeler|\ stegmann\-bot|\ stero|\ stripper|\ stumbler|\ suchclip|\ sucker|\ sumeetbot|\ sumitbot|\ summizebot|\ summizefeedreader|\ sunrise\ xp|\ superbot|\ superhttp|\ superlumin\ downloader|\ superpagesbot|\ supremesearch\.net|\ supybot|\ surdotlybot|\ surf|\ surveybot|\ suzuran|\ swebot|\ swish\-e|\ sygolbot|\ synapticwalker|\ syntryx\ ant\ scout\ chassis\ pheromone|\ systemsearch\-robot|\ szukacz|\ s\~stremor\-crawler|\ t\-h\-u\-n\-d\-e\-r\-s\-t\-o\-n\-e|\ tailrank|\ takeout|\ talkro\ web\-shot|\ tamu\_crawler|\ tapuzbot|\ tarantula|\ targetblaster\.com|\ targetyournews\.com\ bot|\ tausdatabot|\ taxinomiabot|\ teamsoft\ wininet\ component|\ tecomi\ bot|\ teezirbot|\ teleport|\ telesoft|\ teradex\ mapper|\ teragram\_crawler|\ terrawizbot|\ testbot|\ testing\ of\ bot|\ textbot|\ thatrobotsite\.com|\ the\ dyslexalizer|\ the\ intraformant|\ the\.intraformant|\ thenomad|\ theophrastus|\ theusefulbot|\ thumbbot|\ thumbnail\.cz\ robot|\ thumbshots\-de\-bot|\ tigerbot|\ tighttwatbot|\ tineye|\ titan|\ 
to\-dress\_ru\_bot\_|\ to\-night\-bot|\ tocrawl|\ topicalizer|\ topicblogs|\ toplistbot|\ topserver\ php|\ topyx\-crawler|\ touche|\ tourlentascanner|\ tpsystem|\ traazi|\ transgenikbot|\ travel\-search|\ travelbot|\ travellazerbot|\ treezy|\ trendiction|\ trex|\ tridentspider|\ trovator|\ true\_robot|\ tscholarsbot|\ tsm\ translation\-search\-machine|\ tswebbot|\ tulipchain|\ turingos|\ turnitinbot|\ tutorgigbot|\ tweetedtimes\ bot|\ tweetmemebot|\ twengabot|\ twice|\ twikle|\ twinuffbot|\ twisted\ pagegetter|\ twitturls|\ twitturly|\ tygobot|\ tygoprowler|\ typhoeus|\ u\.s\.\ government\ printing\ office|\ uberbot|\ ucb\-nutch|\ udmsearch|\ ufam\-crawler\-|\ ultraseek|\ unchaos|\ unisterbot|\ unidentified|\ unitek\ uniengine|\ universalsearch|\ unwindfetchor|\ uoftdb\_experiment|\ updated|\ url\ control|\ url\-checker|\ urlappendbot|\ urlblaze|\ urlchecker|\ urlck|\ urldispatcher|\ urlspiderpro|\ urly\ warning|\ urly\.warning|\ url\_gather|\ usaf\ afkn\ k2spider|\ usasearch|\ uss\-cosmix|\ usyd\-nlp\-spider|\ vacobot|\ vacuum|\ vadixbot|\ vagabondo|\ validator|\ valkyrie|\ vbseo|\ vci\ webviewer\ vci\ webviewer\ win32|\ verbstarbot|\ vericitecrawler|\ verifactrola|\ verity\-url\-gateway|\ vermut|\ versus\ crawler|\ versus\.integis\.ch|\ viasarchivinginformation\.html|\ vipr|\ virus\-detector|\ virus\_detector|\ visbot|\ vishal\ for\ clia|\ visweb|\ vital\ search'n\ urchin|\ vlad|\ vlsearch|\ voilabot|\ vmbot|\ vocusbot|\ voideye|\ voil|\ vortex|\ voyager|\ vspider|\ w3c\-webcon|\ w3c\_unicorn|\ w3search|\ wacbot|\ wanadoo|\ wastrix|\ water\ conserve\ portal|\ water\ conserve\ spider|\ watzbot|\ wauuu|\ wavefire|\ waypath|\ wazzup|\ wbdbot|\ web\ ceo\ online\ robot|\ web\ crawler|\ web\ downloader|\ web\ image\ collector|\ web\ link\ validator|\ web\ magnet|\ web\ site\ downloader|\ web\ sucker|\ web\-agent|\ web\-sniffer|\ web\.image\.collector|\ webaltbot|\ webauto|\ webbot|\ webbul\-bot|\ webcapture|\ webcheck|\ webclipping\.com|\ webcollage|\ webcopier|\ 
webcopy|\ webcorp|\ webcrawl\.net|\ webcrawler|\ webdatacentrebot|\ webdownloader\ for\ x|\ webdup|\ webemailextrac|\ webenhancer|\ webfetch|\ webgather|\ webgo\ is|\ webgobbler|\ webimages|\ webinator\-search2|\ webinator\-wbi|\ webindex|\ weblayers|\ webleacher|\ weblexbot|\ weblinker|\ weblyzard|\ webmastercoffee|\ webmasterworld\ extractor|\ webmasterworldforumbot|\ webminer|\ webmoose|\ webot|\ webpix|\ webreaper|\ webripper|\ websauger|\ webscan|\ websearchbench|\ website|\ webspear|\ websphinx|\ webspider|\ webster|\ webstripper|\ webtrafficexpress|\ webtrends\ link\ analyzer|\ webvac|\ webwalk|\ webwasher|\ webwatch|\ webwhacker|\ webxm|\ webzip|\ weddings\.info|\ wenbin|\ wep\ search|\ wepa|\ werelatebot|\ wget|\ whacker|\ whirlpool\ web\ engine|\ whowhere\ robot|\ widow|\ wikiabot|\ wikio|\ wikiwix\-bot\-|\ winhttp|\ wire|\ wisebot|\ wisenutbot|\ wish\-la|\ wish\-project|\ wisponbot|\ wmcai\-robot|\ wminer|\ wmsbot|\ woriobot|\ worldshop|\ worqmada|\ wotbox|\ wume\_crawler|\ www\ collector|\ www\-collector\-e|\ www\-mechanize|\ wwwoffle|\ wwwrobot|\ wwwster|\ wwwwanderer|\ wwwxref|\ wysigot|\ x\-clawler|\ x\-crawler|\ xaldon|\ xenu|\ xerka\ metabot|\ xerka\ webbot|\ xget|\ xirq|\ xmarksfetch|\ xqrobot|\ y!j|\ yacy\.net|\ yacybot|\ yanga\ worldsearch\ bot|\ yarienavoir\.net|\ yasaklibot|\ yats\ crawler|\ ybot|\ yebolbot|\ yellowjacket|\ yeti|\ yolinkbot|\ yooglifetchagent|\ yoono|\ yottacars\_bot|\ yourls|\ z\-add\ link\ checker|\ zagrebin|\ zao|\ zedzo\.validate|\ zermelo|\ zeus|\ zibber\-v|\ zimeno|\ zing-bottabot|\ zipppbot|\ zongbot|\ zoomspider|\ zotag\ search|\ zsebot|\ zuibot|\ zyborg|\ zyte\ [NC] RewriteRule .* - [F] #bad bots end
Please like or follow us on Facebook or Twitter for updates for this list!
Nothing on your list stopped a bot that keeps hitting my site. Its IP is 185.*.*.*. I’ve tried other statements in .htaccess and nothing works.
You need to block by IP then. Put this into .htaccess:
Order Allow,Deny
Allow from all
Deny from 185.0.0.0/8
Use with care, it is easy to block half of the Internet this way!
185.0.0.0/8 is 16,777,214 IP addresses!
I really would not advise blocking a full /8 range!
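If you do need an IP-based block, prefer the narrowest range that covers the offender. A sketch (the /24 range below is hypothetical; note that Apache 2.4 replaces the old Order/Allow/Deny syntax with Require):

```apache
# Apache 2.2 style (mod_access): block only a /24, not the whole /8.
# 185.0.100.0/24 is a placeholder range for illustration.
Order Allow,Deny
Allow from all
Deny from 185.0.100.0/24

# Apache 2.4 style (mod_authz_core) equivalent:
<RequireAll>
    Require all granted
    Require not ip 185.0.100.0/24
</RequireAll>
```

Check your server logs first to find the smallest range the bot actually uses.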
Anyway, great list; here are some vital additions.
Most active hacktools
Mozilla/5.0 Jorgee <—- One of the most used Hack/Vulnerability Scan Bots
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1 <— Most active WordPress Hackbot
htaccess rewrite: Mozilla/5\.0\ \(Windows\ NT\ 6\.1;\ WOW64;\ rv\:40\.0\)\ Gecko/20100101\ Firefox/40\.1
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) <—- Another hack bot, not as widely used anymore
Widely used vulnerability/hack scanners:
sysscan/1.0
masscan/
Clone/Scrapers
FeedWordPress <– WordPress blog duping/cloning tool
PHP/.* <— I have only ever seen malicious bots using a PHP user agent; it should be blocked.
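Escaped for .htaccess, these additions might look something like this (a sketch in the same format as the main list; test against your own logs before deploying, since the exact browser strings will also match the rare legitimate client using them):

```apache
# Sketch: extra condition block for the agents reported above.
# Append alongside the existing rules, keeping the [NC,OR]/[NC]
# flags consistent with the surrounding conditions.
RewriteCond %{HTTP_USER_AGENT} \
jorgee|\
mozilla/5\.0\ \(windows\ nt\ 6\.1;\ wow64;\ rv:40\.0\)\ gecko/20100101\ firefox/40\.1|\
mozilla/4\.0\ \(compatible;\ msie\ 6\.0;\ windows\ nt\ 5\.1;\ sv1\)|\
sysscan|\
masscan|\
feedwordpress|\
^php/ [NC]
RewriteRule .* - [F]
```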
Thank you for your contribution. I will add your lines to the list and parse some logs with them. They will be included in the next release.
Hey Thanks guys
It was hard to find this, but I knew someone was doing this.
Hi … Wouldn’t rate limiting be easier? This list looks pretty comprehensive, which is good, but I’m afraid performance goes kaput if all web server traffic is matched against this. You would think humans are easier to spot based on behaviour and clicks per session … just thinking out loud. Compliments for posting this!
Gerard,
These are technically three long .htaccess rules. The load on the server increases next to nothing when the rules are on. mod_security uses far more resources than this list. However, the list is not as robust and cannot be compared to active access policy management systems.
As many new robots have been identified, the list will be updated soon.
Congratulations! Very good information. I am a newbie in computing and I have a problem. My .htaccess file contains a strange list of expressions, and for this reason I do not know where to put the list of bad robots. Can you help me with this, please? Sorry for my English.
Thank you in advance.
Best regards,
Juan
Hi Juan, in your .htaccess, look for “RewriteEngine On” and paste the robots snippet right after it.
Thanks.
My .htaccess is very long. RewriteEngine On appears three times!
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
# BEGIN W3TC Browser Cache
AddType text/css .css
AddType text/x-component .htc
AddType application/x-javascript .js
AddType application/javascript .js2
AddType text/javascript .js3
AddType text/x-js .js4
AddType video/asf .asf .asx .wax .wmv .wmx
AddType video/avi .avi
AddType image/bmp .bmp
Etc. etc. etc. etc.
I can try to put the robots snippet between: RewriteEngine On and RewriteBase /
or after
What do you think?
Thank you, Juan
Place the robots list right after the first RewriteEngine On and it will be fine.
Great list. On my Apache 2.2 server it works great.
But… on my Apache 2.4 server I get this error in the errorlog:
AH00124: Request exceeded the limit of 10 internal redirects due to probable configuration error. Use 'LimitInternalRecursion' to increase the limit if necessary. Use 'LogLevel debug' to get a backtrace.
Do you have a solution?
Thanks.
Could you set LogLevel to debug? What does the Apache log say then?
Hi There
It appears as if the Google app (maybe iOS only) uses “gsa” and MS Outlook (maybe 2016 only) uses “oso”.
My application did not want to load in the Google app on my phone; I got it working by commenting out “gsa”.
My system sends mail containing an image link; the image did not load, so I got it working by commenting out “oso”.
I would appreciate it if you could inspect and advise.
Great list!
Hi Jacques,
We will take a look and report back.
Thanks for your input.
Thanks! Thanks! and Thanks!
Hi, awesome post.
I have a question: I would like to hide my robots.txt file from users and visitors, but not from bots. How can I set this up in the .htaccess file so that only users are denied and search bots are not? Can you guide me with the code?
Azar
Thanks for this fantastic list! Very much appreciated.
We added the list to our .htaccess. We have a 15 cron running on the server, and the list stopped the cron from running.
Any idea why? (Apache server). What did we do wrong? We really want to add it back because we get so many bot attacks, so would appreciate any assistance.
Sorry, meant to say 15 Minute cron running.
Hello Christine,
I had the same problem. Our cron uses the wget command. After removing ‘wget’ from the list, all crons ran fine.
You can give this a try as well.
The wget entry on the list prevents mirroring of the whole site.
You may consider running wget with this parameter (copied from a running script).
wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" REST of WGET CALL
Broke my site. Had to delete it. I’m running my site through Sucuri so perhaps this is a factor. I can actually block user agents in Sucuri, so if you have a list of user agents that are not formatted for the .htaccess file, this would be most appreciated. Cheers
Hello,
first thanks for the list.
I have removed the list from my system again, for the following reason.
On Android my Sleipnir browser was blocked. Of course you can adjust the list manually now, but who knows which other unknown browsers and systems might still be blocked in the future. The risk is too big.
Hello Steffen,
Thank you for pointing this out. By sharing information we can improve the list!
I am a user of a niche browser myself (Avant Browser), and I assure you that when your browser’s popularity is 6%, as with Sleipnir, you should keep a “secondary browser” handy in case your main one does not render a page correctly! This happens too often…
We are almost ready to release a new list with MauiBot and other traffic “suckers” – stay tuned. We will address this in the release.
Hello Tab Studio,
Is your list current as of now? I will follow on Facebook to keep receiving updates, but wanted to know whether I should wait for an updated list before installing this one.
The list is correct. Some “traffic-hungry” bots are still missing from it. The list has some flaws: as per the comments here, some rare browsers from the Japanese market are blocked and should be allowed.
Thank you!
When will you update the list? Based on the comments below, there have been many great contributions, but I do not see them in the list above.