Web crawlers visit websites every day. It is not widely known that more than half of all web traffic comes from robots, not from real users. Browsing through server logs, one encounters a great variety of robots performing activities both friendly and evil: scanning pages for email addresses, searching for vulnerabilities, collecting data about the site, and indexing it in search engines (like Googlebot). Some robots scrape content from pages and republish it at their own address, which is bad for our site. So there is a need for blocking robots…
Blocking robots
Many of these robots can be blocked. This gives you a handful of benefits: less spam, a safer website, less stolen content. The amount of data transferred from the hosting account also decreases, as blocked robots no longer consume the transfer.
Blocking robots – how is it done?
There are several ways to block robots. You can block robots in robots.txt, but aggressive robots bypass this file, so another method is better: blocking robots by their agent name at the web server level. This way, a robot using any banned user agent is simply blocked and receives the HTTP 403 code – access forbidden. Of course, some robots impersonate search engines or browsers and nothing can be done there, but many robots use standard Perl or Java user agent names and are easy to block. Our blocking method is fairly universal and works with the .htaccess files of most web hosting providers. In fact, it is just 3 long lines in .htaccess that do the blocking, which does not burden the server very much.
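A minimal sketch of the technique (with hypothetical bot names, not the full list published below) looks like this:

```apache
# Sketch: block requests whose User-Agent matches any banned name.
# "badbot" and "evilscraper" are placeholder names for illustration;
# the full list later in this article works the same way, just longer.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} badbot|evilscraper [NC]
RewriteRule .* - [F]
```

The [NC] flag makes the match case-insensitive, and [F] returns the 403 Forbidden response.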
How have we compiled the unwanted robots list?
Our experience shows that a viable way to block robots by user agent is to list the unwanted robots explicitly. We have processed the logs of various websites from the last 10 years with regular expressions. Then we wrote a special program that extracted distinctive parts of the robot names so that the list stays as short as possible. The result is a list of over 1800 robots we do not want. This list is constantly in use on various sites and is updated. Internet search engines such as Google, Bing, Yandex and Yahoo, as well as social networking sites such as Twitter and Facebook, have been removed from the blocking list, as we consider these bots useful.
All other robots are on the list. Because web servers do not allow such long lines in .htaccess, the file has been split into fragments containing about 500 robot names each. We have installed this file on many websites and have had no problems with it at any of the hosting companies.
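The splitting step can be sketched in a few lines (a simplified illustration with hypothetical bot names; the real generator we used is more involved):

```python
# Sketch: turn a list of bot-name fragments into .htaccess rule lines,
# escaping regex metacharacters and splitting every 500 names so no
# single RewriteCond line grows too long for the web server.
import re

def build_rules(names, chunk_size=500):
    rules = []
    for i in range(0, len(names), chunk_size):
        chunk = [re.escape(n) for n in names[i:i + chunk_size]]
        pattern = "|".join(chunk)
        rules.append(
            "RewriteCond %{HTTP_USER_AGENT} " + pattern + " [NC,OR]"
        )
    if rules:
        # The last condition must not carry the OR flag.
        rules[-1] = rules[-1].replace(" [NC,OR]", " [NC]")
        rules.append("RewriteRule .* - [F]")
    return rules

rules = build_rules(["badbot", "evil-scraper", "spam.bot"])
print(rules[0])
```

With 1800+ names this produces the three long condition lines you see at the end of the article, plus the final RewriteRule.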
Does blocking robots affect your search engine rankings?
Blocking robots in this form has no effect on SEO. We have a bunch of #1-ranked pages that use this robot-blocking technique. Google uses its own robot and also sometimes masquerades as popular browsers such as Chrome or Firefox to detect cloaking. None of these robots are blocked by our file. As far as we know, Google does not use other robot names that would appear on our list.
How to block robots on your website?
At the end of this article there is a text to paste at the beginning of the .htaccess file of your website. Do not change anything in this text; every space and every line break is important. You cannot add comments either. Just select all the text below and paste it at the top of your .htaccess file, right after the “RewriteEngine On” directive (if you have this directive in .htaccess). Additional note: in the file there must be a space before the entries [NC,OR] and [NC]. Sometimes this space disappears when you paste into a new file; without it, blocking robots will not work, although the server will not report any error.
How to check if we are blocking robots correctly?
The Xenu Link Sleuth program is on the blocking list. After uploading the .htaccess file to the server, you should no longer be able to load your page with Xenu; a “forbidden request” error appears instead. If this is the case, blocking robots works properly.
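You can also simulate the check offline. The server matches the User-Agent string case-insensitively against the pattern, so a quick way to see whether a given agent would be caught is to run the same match yourself (a sketch using only a tiny excerpt of the list, not the complete pattern):

```python
# Sketch: emulate the server-side user-agent match for a few entries.
# "xenu", "httrack" and "wget" are excerpts from the real list; this
# regex is NOT the complete pattern, just enough for a demonstration.
import re

BLOCKED = re.compile(r"xenu|httrack|wget", re.IGNORECASE)  # excerpt only

def would_block(user_agent):
    """Return True if the user agent matches the (excerpted) block list."""
    return BLOCKED.search(user_agent) is not None

print(would_block("Xenu Link Sleuth/1.3"))
print(would_block("Mozilla/5.0 (Windows NT 10.0) Chrome/63.0"))
```

From the command line, `curl -A "Xenu Link Sleuth" -I https://your-site.example/` should likewise return HTTP 403 once the rules are installed.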
Copyright for blocking robots
The following list and the way to block the robots are provided completely free, in the hope that there will be less spam on the Internet. If the list works for you and you are happy with it, do not forget to like or share this page on Facebook or Twitter.
Complaints ?
If the blocking robots list does not work properly, please inform us via the comments. Thank you for any suggestions regarding the list. Periodically, an updated version will appear here; the frequency will depend on the number of newly released spambots. New list releases will be announced on Twitter, so it is worth following us! We also invite you to use our services, and we will block the bad bots for you!
Update Dec 2017
Update – unblocked some Opera browsers. They have “edition” in the user agent string.
#bad bots start #programmed by tab-studio.com public version 2017.12 #1 new rule every 500 entries RewriteCond %{HTTP_USER_AGENT} \ 12soso|\ 192\.comagent|\ 1noonbot|\ 1on1searchbot|\ 3de\_search2|\ 3d\_search|\ 3g\ bot|\ 3gse|\ 50\.nu|\ a1\ sitemap\ generator|\ a1\ website\ download|\ a6\-indexer|\ aasp|\ abachobot|\ abonti|\ abotemailsearch|\ aboundex|\ aboutusbot|\ accmonitor\ compliance\ server|\ accoon|\ achulkov\.net\ page\ walker|\ acme\.spider|\ acoonbot|\ acquia\-crawler|\ activetouristbot|\ ad\ muncher|\ adamm\ bot|\ adbeat\_bot|\ adminshop\.com|\ advanced\ email\ extractor|\ aesop\_com\_spiderman|\ aespider|\ af\ knowledge\ now\ verity\ spider|\ aggregator:vocus|\ ah\-ha\.com\ crawler|\ ahrefs|\ aibot|\ aidu|\ aihitbot|\ aipbot|\ aisiid|\ aitcsrobot/1\.1|\ ajsitemap|\ akamai\-sitesnapshot|\ alexawebsearchplatform|\ alexfdownload|\ alexibot|\ alkalinebot|\ all\ acronyms\ bot|\ alpha\ search\ agent|\ amerla\ search\ bot|\ amfibibot|\ ampmppc\.com|\ amznkassocbot|\ anemone|\ anonymous|\ anotherbot|\ answerbot|\ answerbus|\ answerchase\ prove|\ antbot|\ antibot|\ antisantyworm|\ antro\.net|\ aonde\-spider|\ aport|\ appengine\-google|\ appid\:\ s\~stremor\-crawler\-|\ aqua\_products|\ arabot|\ arachmo|\ arachnophilia|\ archive\.org\_bot|\ aria\ equalizer|\ arianna\.libero\.it|\ arikus\_spider|\ art\-online\.com|\ artavisbot|\ artera|\ asaha\ search\ engine\ turkey|\ ask|\ aspider|\ aspseek|\ asterias|\ astrofind|\ athenusbot|\ atlocalbot|\ atomic\_email\_hunter|\ attach|\ attrakt|\ attributor|\ augurfind|\ auresys|\ autobaron\ crawler|\ autoemailspider|\ autowebdir|\ avsearch\-|\ axfeedsbot|\ axonize\-bot|\ ayna|\ b2w|\ backdoorbot|\ backrub|\ backstreet\ browser|\ backweb|\ baidu|\ bandit|\ batchftp|\ baypup|\ bdfetch|\ becomebot|\ becomejpbot|\ beetlebot|\ bender|\ besserscheitern\-crawl|\ betabot|\ big\ brother|\ big\ data|\ bigado\.com|\ bigcliquebot|\ bigfoot|\ biglotron|\ bilbo|\ bilgibetabot|\ bilgibot|\ bintellibot|\ bitlybot|\ bitvouseragent|\ 
bizbot003|\ bizbot04|\ bizworks\ retriever|\ black\ hole|\ black\.hole|\ blackbird|\ blackmask\.net\ search\ engine|\ blackwidow|\ bladder\ fusion|\ blaiz\-bee|\ blexbot|\ blinkx|\ blitzbot|\ blog\ conversation\ project|\ blogmyway|\ blogpulselive|\ blogrefsbot|\ blogscope|\ blogslive|\ bloobybot|\ blowfish|\ blt|\ bnf\.fr\_bot|\ boaconstrictor|\ boardreader|\ boia\-scan\-agent|\ boia\.org|\ boitho|\ boi\_crawl\_00|\ bookmark\ buddy\ bookmark\ checker|\ bookmark\ search\ tool|\ bosug|\ bot\ apoena|\ botalot|\ botrighthere|\ botswana|\ bottybot|\ bpbot|\ braintime\_search|\ brokenlinkcheck\.com|\ browseremulator|\ browsermob|\ bruinbot|\ bsearchr&d|\ bspider|\ btbot|\ btsearch|\ bubing|\ buddy|\ buibui|\ buildcms\ crawler|\ builtbottough|\ bullseye|\ bumblebee|\ bunnyslippers|\ buscadorclarin|\ buscaplus\ robi|\ butterfly|\ buyhawaiibot|\ buzzbot|\ byindia|\ byspider|\ byteserver|\ bzbot|\ c\ r\ a\ w\ l\ 3\ r|\ cacheblaster|\ caddbot|\ cafi|\ camcrawler|\ camelstampede|\ canon\-webrecord|\ careerbot|\ cataguru|\ catchbot|\ cazoodle|\ ccbot|\ ccgcrawl|\ ccubee|\ cd\-preload|\ ce\-preload|\ cegbfeieh|\ cerberian\ drtrs|\ cert\ figleafbot|\ cfetch|\ cfnetwork|\ chameleon|\ charlotte|\ check&get|\ checkbot|\ checklinks|\ cheesebot|\ chemiede\-nodebot|\ cherrypicker|\ chilkat|\ chinaclaw|\ cipinetbot|\ cis455crawler|\ citeseerxbot|\ cizilla|\ clariabot|\ climate\ ark|\ climateark\ spider|\ clshttp|\ clushbot|\ coast\ scan\ engine|\ coast\ webmaster\ pro|\ coccoc|\ collapsarweb|\ collector|\ colocrossing|\ combine|\ connectsearch|\ conpilot|\ contentsmartz|\ contextad\ bot|\ contype|\ cookienet|\ coolbot|\ coolcheck|\ copernic|\ copier|\ copyrightcheck|\ core\-project|\ cosmos|\ covario\-ids|\ cowbot\-|\ cowdog\ bot|\ crabbybot|\ craftbot\@yahoo\.com|\ crawler\.kpricorn\.org|\ crawler43\.ejupiter\.com|\ crawler4j|\ crawler@|\ crawler\_for\_infomine|\ crawly|\ crawl\_application|\ creativecommons|\ crescent|\ cs\-crawler|\ cse\ html\ validator|\ cshttpclient|\ cuasarbot|\ 
culsearch|\ curl|\ custo|\ cvaulev|\ cyberdog|\ cybernavi\_webget|\ cyberpatrol\ sitecat\ webbot|\ cyberspyder|\ cydralspider|\ d1garabicengine|\ datacha0s|\ datafountains|\ dataparksearch|\ dataprovider\.com|\ datascape\ robot|\ dataspearspiderbot|\ dataspider|\ dattatec\.com|\ daumoa|\ dblbot|\ dcpbot|\ declumbot|\ deepindex|\ deepnet\ crawler|\ deeptrawl|\ dejan|\ del\.icio\.us\-thumbnails|\ deltascan|\ delvubot|\ der\ groยงe\ bildersauger|\ der\ groรe\ bildersauger|\ deusu|\ dfs\-fetch|\ diagem|\ diamond|\ dibot|\ didaxusbot|\ digext|\ digger|\ digi\-rssbot|\ digitalarchivesbot|\ digout4u|\ diibot|\ dillo|\ dir\_snatch\.exe|\ disco|\ distilled\-reputation\-monitor|\ djangotraineebot|\ dkimrepbot|\ dmoz\ downloader|\ docomo|\ dof\-verify|\ domaincrawler|\ domainscan|\ domainwatcher\ bot|\ dotbot|\ dotspotsbot|\ dow\ jones\ searchbot|\ download|\ doy|\ dragonfly|\ drip|\ drone|\ dtaagent|\ dtsearchspider|\ dumbot|\ dwaar|\ dxseeker|\ e\-societyrobot|\ eah|\ earth\ platform\ indexer|\ earth\ science\ educator\ \ robot|\ easydl|\ ebingbong|\ ec2linkfinder|\ ecairn\-grabber|\ ecatch|\ echoosebot|\ edisterbot|\ edugovsearch|\ egothor|\ eidetica\.com|\ eirgrabber|\ elblindo\ the\ blind\ bot|\ elisabot|\ ellerdalebot|\ email\ exractor|\ emailcollector|\ emailleach|\ emailsiphon|\ emailwolf|\ emeraldshield|\ empas\_robot|\ enabot|\ endeca|\ enigmabot|\ enswer\ neuro\ bot|\ enter\ user\-agent|\ entitycubebot|\ erocrawler|\ estylesearch|\ esyndicat\ bot|\ eurosoft\-bot|\ evaal|\ eventware|\ everest\-vulcan\ inc\.|\ exabot|\ exactsearch|\ exactseek|\ exooba|\ exploder|\ express\ webpictures|\ extractor|\ eyenetie|\ ez\-robot|\ ezooms|\ f\-bot\ test\ pilot|\ factbot|\ fairad\ client|\ falcon|\ fast\ data\ search\ document\ retriever|\ fast\ esp|\ fast\-search\-engine|\ fastbot\ crawler|\ fastbot\.de\ crawler|\ fatbot|\ favcollector|\ faviconizer|\ favorites\ sweeper|\ fdm|\ fdse\ robot|\ fedcontractorbot|\ fembot|\ fetch\ api\ request|\ fetch\_ici|\ fgcrawler|\ filangy|\ 
filehound|\ findanisp\.com\_isp\_finder|\ findlinks|\ findweb|\ firebat|\ firstgov\.gov\ search|\ flaming\ attackbot|\ flamingo\_searchengine|\ flashcapture|\ flashget|\ flickysearchbot|\ fluffy\ the\ spider|\ flunky|\ focused\_crawler|\ followsite|\ foobot|\ fooooo\_web\_video\_crawl|\ fopper|\ formulafinderbot|\ forschungsportal|\ francis|\ freewebmonitoring\ sitechecker|\ freshcrawler|\ freshdownload|\ freshlinks\.exe|\ friendfeedbot|\ frodo\.at|\ froggle|\ frontpage|\ froola\ bot|\ fr\_crawler|\ fu\-nbi|\ full\_breadth\_crawler|\ funnelback|\ furlbot|\ g10\-bot|\ gaisbot|\ galaxybot|\ gazz|\ gbplugin|\ generate\_infomine\_category\_classifiers|\ genevabot|\ geniebot|\ genieo|\ geomaxenginebot|\ geometabot|\ geonabot|\ geovisu|\ germcrawler\ |\ gethtmlcontents|\ getleft|\ getright|\ getsmart|\ geturl\.rexx|\ getweb!|\ giant|\ gigablastopensource|\ gigabot|\ girafabot|\ gleamebot|\ gnome\-vfs|\ go!zilla|\ go\-ahead\-got\-it|\ go\-http\-client|\ goforit\.com|\ goforitbot|\ gold\ crawler|\ goldfire\ server|\ golem|\ goodjelly|\ gordon\-college\-google\-mini|\ goroam|\ goseebot|\ gotit|\ govbot|\ gpu\ p2p\ crawler|\ grabber|\ grabnet|\ grafula|\ grapefx|\ grapeshot|\ grbot|\ greenyogi|\ gromit|\ grub|\ gsa|\ gslfbot|\ gulliver|\ gulperbot|\ gurujibot|\ gvc\ business\ crawler|\ gvc\ crawler|\ gvc\ search\ bot|\ gvc\ web\ crawler|\ gvc\ weblink\ crawler|\ gvc\ world\ links|\ gvcbot\.com|\ happyfunbot|\ harvest|\ hatena\ antenna|\ hawler|\ hcat|\ hclsreport\-crawler|\ hd\ nutch\ agent|\ header\_test\_client|\ healia\ [NC,OR] #500 new rule RewriteCond %{HTTP_USER_AGENT} \ helix|\ here\ will\ be\ link\ to\ crawler\ site|\ heritrix|\ hiscan|\ hisoftware\ accmonitor\ server|\ hisoftware\ accverify|\ hitcrawler|\ hivabot|\ hloader|\ hmsebot|\ hmview|\ hoge|\ holmes|\ homepagesearch|\ hooblybot\-image|\ hoowwwer|\ hostcrawler|\ hsft\ \\-\ link\ scanner|\ hsft\ \\-\ lvu\ scanner|\ hslide|\ ht://check|\ htdig|\ html\ link\ validator|\ htmlparser|\ httplib|\ httrack|\ 
huaweisymantecspider|\ hul\-wax|\ humanlinks|\ hyperestraier|\ hyperix|\ iaarchiver\-|\ ia\_archiver|\ ibuena|\ icab|\ icds\-ingestion|\ ichiro|\ icopyright\ conductor|\ ieautodiscovery|\ iecheck|\ ihwebchecker|\ iiitbot|\ iim\_405|\ ilsebot|\ iltrovatore|\ image\ stripper|\ image\ sucker|\ image\-fetcher|\ imagebot|\ imagefortress|\ imageshereimagesthereimageseverywhere|\ imagevisu|\ imds\_monitor|\ imo\-google\-robot\-intelink|\ inagist\.com\ url\ crawler|\ indexer|\ industry\ cortex\ webcrawler|\ indy\ library|\ indylabs\_marius|\ inelabot|\ inet32\ ctrl|\ inetbot|\ info\ seeker|\ infolink|\ infomine|\ infonavirobot|\ informant|\ infoseek\ sidewinder|\ infotekies|\ infousabot|\ ingrid|\ inktomi|\ insightscollector|\ insightsworksbot|\ inspirebot|\ insumascout|\ intelix|\ intelliseek|\ interget|\ internet\ ninja|\ internet\ radio\ crawler|\ internetlinkagent|\ interseek|\ ioi|\ ip\-web\-crawler\.com|\ ipadd\ bot|\ ipselonbot|\ ips\-agent|\ iria|\ irlbot|\ iron33|\ isara|\ isearch|\ isilox|\ istellabot|\ its\-learning\ crawler|\ iu\_csci\_b659\_class\_crawler|\ ivia|\ jadynave|\ java|\ jbot|\ jemmathetourist|\ jennybot|\ jetbot|\ jetbrains\ omea\ pro|\ jetcar|\ jim|\ jobo|\ jobspider\_ba|\ joc|\ joedog|\ joyscapebot|\ jspyda|\ junut\ bot|\ justview|\ jyxobot|\ k\.s\.bot|\ kakclebot|\ kalooga|\ katatudo\-spider|\ kbeta1|\ keepni\ web\ site\ monitor|\ kenjin\.spider|\ keybot\ translation\-search\-machine|\ keywenbot|\ keyword\ density|\ keyword\.density|\ kinjabot|\ kitenga\-crawler\-bot|\ kiwistatus|\ kmbot\-|\ kmccrew\ bot\ search|\ knight|\ knowitall|\ knowledge\ engine|\ knowledge\.com|\ koepabot|\ koninklijke|\ korniki|\ krowler|\ ksbot|\ kuloko\-bot|\ kulturarw3|\ kummhttp|\ kurzor|\ kyluka\ crawl|\ l\.webis|\ labhoo|\ labourunions411|\ lachesis|\ lament|\ lamerexterminator|\ lapozzbot|\ larbin|\ lbot|\ leaptag|\ leechftp|\ leechget|\ letscrawl\.com|\ lexibot|\ lexxebot|\ lftp|\ libcrawl|\ libiviacore|\ libw|\ likse|\ linguee\ bot|\ link\ checker|\ link\ 
validator|\ linkalarm|\ linkbot|\ linkcheck\ by\ siteimprove\.com|\ linkcheck\ scanner|\ linkchecker|\ linkdex\.com|\ linkextractorpro|\ linklint|\ linklooker|\ linkman|\ links\ sql|\ linkscan|\ linksmanager\.com\_bot|\ linksweeper|\ linkwalker|\ link\_checker|\ litefinder|\ litlrbot|\ little\ grabber\ at\ skanktale\.com|\ livelapbot|\ lm\ harvester|\ lmqueuebot|\ lnspiderguy|\ loadtimebot|\ localcombot|\ locust|\ lolongbot|\ lookbot|\ lsearch|\ lssbot|\ lt\ scotland\ checklink|\ ltx71.com|\ lwp|\ lycos\_spider|\ lydia\ entity\ spider|\ lynnbot|\ lytranslate|\ mag\-net|\ magnet|\ magpie\-crawler|\ magus\ bot|\ mail\.ru|\ mainseek\_bot|\ mammoth|\ map\ robot|\ markwatch|\ masagool|\ masidani\_bot\_|\ mass\ downloader|\ mata\ hari|\ mata\.hari|\ matentzn\ at\ cs\ dot\ man\ dot\ ac\ dot\ uk|\ maxamine\.com\-\-robot|\ maxamine\.com\-robot|\ maxomobot|\ mcbot|\ medrabbit|\ megite|\ memacbot|\ memo|\ mendeleybot|\ mercator\-|\ mercuryboard\_user\_agent\_sql\_injection\.nasl|\ metacarta|\ metaeuro\ web\ search|\ metager2|\ metagloss|\ metal\ crawler|\ metaquerier|\ metaspider|\ metaspinner|\ metauri|\ mfcrawler|\ mfhttpscan|\ midown\ tool|\ miixpc|\ mini\-robot|\ minibot|\ minirank|\ mirror|\ missigua\ locator|\ mister\ pix|\ mister\.pix|\ miva|\ mj12bot|\ mnogosearch|\ moduna\.com|\ mod\_accessibility|\ moget|\ mojeekbot|\ monkeycrawl|\ moses|\ mowserbot|\ mqbot|\ mse360|\ msindianwebcrawl|\ msmobot|\ msnptc|\ msrbot|\ mt\-soft|\ multitext|\ my\-heritrix\-crawler|\ myapp|\ mycompanybot|\ mycrawler|\ myengines\-us\-bot|\ myfamilybot|\ myra|\ my\_little\_searchengine\_project|\ nabot|\ najdi\.si|\ nambu|\ nameprotect|\ nasa\ search|\ natchcvs|\ natweb\-bad\-link\-mailer|\ naver|\ navroad|\ nearsite|\ nec\-meshexplorer|\ neosciocrawler|\ nerdbynature\.bot|\ nerdybot|\ nerima\-crawl-|\ nessus|\ nestreader|\ net\ vampire|\ net::trackback|\ netants|\ netcarta\ cyberpilot\ pro|\ netcraft|\ netexperts|\ netid\.com\ bot|\ netmechanic|\ netprospector|\ netresearchserver|\ 
netseer|\ netshift=|\ netsongbot|\ netsparker|\ netspider|\ netsrcherp|\ netzip|\ newmedhunt|\ news\ bot|\ newsgatherer|\ newsgroupreporter|\ newstrovebot|\ news\_search\_app|\ nextgensearchbot|\ nextthing\.org|\ nicebot|\ nicerspro|\ niki\-bot|\ nimblecrawler|\ nimbus\-1|\ ninetowns|\ ninja|\ njuicebot|\ nlese|\ nogate|\ norbert\ the\ spider|\ noteworthybot|\ npbot|\ nrcan\ intranet\ crawler|\ nsdl\_search\_bot|\ nuggetize\.com\ bot|\ nusearch\ spider|\ nutch|\ nu\_tch|\ nwspider|\ nymesis|\ nys\-crawler|\ objectssearch|\ obot|\ obvius\ external\ linkcheck|\ ocelli|\ octopus|\ odp\ entries\ t\_st|\ oegp|\ offline\ navigator|\ offline\.explorer|\ ogspider|\ omiexplorer\_bot|\ omniexplorer|\ omnifind|\ omniweb|\ onetszukaj|\ online\ link\ validator|\ oozbot|\ openbot|\ openfind|\ openintelligencedata|\ openisearch|\ openlink\ virtuoso\ rdf\ crawler|\ opensearchserver\_bot|\ opidig|\ optidiscover|\ oracle\ secure\ enterprise\ search|\ oracle\ ultra\ search|\ orangebot|\ orisbot|\ ornl\_crawler|\ ornl\_mercury|\ osis\-project\.jp|\ oso|\ outfoxbot|\ outfoxmelonbot|\ owler\-bot|\ owsbot|\ ozelot|\ p3p\ client|\ pagebiteshyperbot|\ pagebull|\ pagedown|\ pagefetcher|\ pagegrabber|\ pagepeeker|\ pagerank\ monitor|\ page\_verifier|\ pamsnbot\.htm|\ panopy\ bot|\ panscient\.com|\ pansophica|\ papa\ foto|\ paperlibot|\ parasite|\ parsijoo|\ pathtraq|\ pattern|\ patwebbot|\ pavuk|\ paxleframework|\ pbbot|\ pcbrowser|\ pcore\-http|\ pd\-crawler|\ penthesila|\ perform\_crawl|\ perman|\ personal\ ultimate\ crawler|\ php\ version\ tracker|\ phpcrawl|\ phpdig|\ picosearch|\ pieno\ robot|\ pipbot|\ pipeliner|\ pita|\ pixfinder|\ piyushbot|\ planetwork\ bot\ search|\ plucker|\ plukkie|\ plumtree|\ pockey|\ pocohttp|\ pogodak\.ba|\ pogodak\.co\.yu|\ poirot|\ polybot|\ pompos|\ poodle\ predictor|\ popscreenbot|\ postpost|\ privacyfinder|\ projectwf\-java\-test\-crawler|\ propowerbot|\ prowebwalker|\ proxem\ websearch|\ proximic|\ proxy\ crawler|\ psbot|\ pss\-bot|\ psycheclone|\ 
pub\-crawler|\ pucl|\ pulsebot|\ pump|\ pwebot|\ python|\ qeavis\ agent|\ qfkbot|\ qualidade|\ qualidator\.com\ bot|\ quepasacreep|\ queryn\ metasearch|\ queryn\.metasearch|\ quest\.durato|\ quintura\-crw|\ qunarbot|\ qwantify|\ qweerybot|\ qweery\_robot\.txt\_checkbot|\ r2ibot|\ r6\_commentreader|\ r6\_feedfetcher|\ r6\_votereader|\ rabot|\ radian6|\ radiation\ retriever|\ rampybot|\ rankivabot|\ rankur|\ rational\ sitecheck|\ rcstartbot|\ realdownload|\ reaper|\ rebi\-shoveler|\ recorder|\ redbot|\ redcarpet|\ reget|\ repomonkey|\ research\ robot|\ riddler|\ riight|\ risenetbot|\ riverglassscanner\ [NC,OR] #1000 new rule RewriteCond %{HTTP_USER_AGENT} \ robopal|\ robosourcer|\ robotek|\ robozilla|\ roger|\ rome\ client|\ rondello|\ rotondo|\ roverbot|\ rpt\-httpclient|\ rtgibot|\ rufusbot|\ runnk\ online\ rss\ reader|\ runnk\ rss\ aggregator|\ s2bot|\ safaribookmarkchecker|\ safednsbot|\ safetynet\ robot|\ saladspoon|\ sapienti|\ sapphireweb|\ sbider|\ sbl\-bot|\ scfcrawler|\ scich|\ scientificcommons\.org|\ scollspider|\ scooperbot|\ scooter|\ scoutjet|\ scrapebox|\ scrapy|\ scrawltest|\ screaming\ frog|\ scrubby|\ scspider|\ scumbot|\ search\ publisher|\ search\ x\-bot|\ search\-channel|\ search\-engine\-studio|\ search\.kumkie\.com|\ search\.updated\.com|\ search\.usgs\.gov|\ searcharoo\.net|\ searchblox|\ searchbot|\ searchengine|\ searchhippo\.com|\ searchit\-bot|\ searchmarking|\ searchmarks|\ searchmee!|\ searchmee\_v|\ searchmining|\ searchnowbot|\ searchpreview|\ searchspider\.com|\ searqubot|\ seb\ spider|\ seekbot|\ seeker\.lookseek\.com|\ seeqbot|\ seeqpod\-vertical\-crawler|\ selflinkchecker|\ semager|\ semanticdiscovery|\ semantifire|\ semisearch|\ semrushbot|\ seoengworldbot|\ seokicks|\ seznambot|\ shablastbot|\ shadowwebanalyzer|\ shareaza|\ shelob|\ sherlock|\ shim\-crawler|\ shopsalad|\ shopwiki|\ showlinks|\ showyoubot|\ siclab|\ silk|\ simplepie|\ siphon|\ sitebot|\ sitecheck|\ sitefinder|\ siteguardbot|\ siteorbiter|\ sitesnagger|\ 
sitesucker|\ sitesweeper|\ sitexpert|\ skimbot|\ skimwordsbot|\ skreemrbot|\ skywalker|\ sleipnir|\ slow\-crawler|\ slysearch|\ smart\-crawler|\ smartdownload|\ smarte\ bot|\ smartwit\.com|\ snake|\ snap\.com\ beta\ crawler|\ snapbot|\ snappreviewbot|\ snappy|\ snookit|\ snooper|\ snoopy|\ societyrobot|\ socscibot|\ soft411\ directory|\ sogou|\ sohu\ agent|\ sohu\-search|\ sokitomi\ crawl|\ solbot|\ sondeur|\ sootle|\ sosospider|\ space\ bison|\ space\ fung|\ spacebison|\ spankbot|\ spanner|\ spatineo\ monitor\ controller|\ spatineo\ serval\ controller|\ spatineo\ serval\ getmapbot|\ special\_archiver|\ speedy|\ sphere\ scout|\ sphider|\ spider\.terranautic\.net|\ spiderengine|\ spiderku|\ spiderman|\ spinn3r|\ spinne|\ sportcrew\-bot|\ sproose|\ spyder3\.microsys\.com|\ sq\ webscanner|\ sqlmap|\ squid\-prefetch|\ squidclamav\_redirector|\ sqworm|\ srevbot|\ sslbot|\ ssm\ agent|\ stackrambler|\ stardownloader|\ statbot|\ statcrawler|\ statedept\-crawler|\ steeler|\ stegmann\-bot|\ stero|\ stripper|\ stumbler|\ suchclip|\ sucker|\ sumeetbot|\ sumitbot|\ summizebot|\ summizefeedreader|\ sunrise\ xp|\ superbot|\ superhttp|\ superlumin\ downloader|\ superpagesbot|\ supremesearch\.net|\ supybot|\ surdotlybot|\ surf|\ surveybot|\ suzuran|\ swebot|\ swish\-e|\ sygolbot|\ synapticwalker|\ syntryx\ ant\ scout\ chassis\ pheromone|\ systemsearch\-robot|\ szukacz|\ s\~stremor\-crawler|\ t\-h\-u\-n\-d\-e\-r\-s\-t\-o\-n\-e|\ tailrank|\ takeout|\ talkro\ web\-shot|\ tamu\_crawler|\ tapuzbot|\ tarantula|\ targetblaster\.com|\ targetyournews\.com\ bot|\ tausdatabot|\ taxinomiabot|\ teamsoft\ wininet\ component|\ tecomi\ bot|\ teezirbot|\ teleport|\ telesoft|\ teradex\ mapper|\ teragram\_crawler|\ terrawizbot|\ testbot|\ testing\ of\ bot|\ textbot|\ thatrobotsite\.com|\ the\ dyslexalizer|\ the\ intraformant|\ the\.intraformant|\ thenomad|\ theophrastus|\ theusefulbot|\ thumbbot|\ thumbnail\.cz\ robot|\ thumbshots\-de\-bot|\ tigerbot|\ tighttwatbot|\ tineye|\ titan|\ 
to\-dress\_ru\_bot\_|\ to\-night\-bot|\ tocrawl|\ topicalizer|\ topicblogs|\ toplistbot|\ topserver\ php|\ topyx\-crawler|\ touche|\ tourlentascanner|\ tpsystem|\ traazi|\ transgenikbot|\ travel\-search|\ travelbot|\ travellazerbot|\ treezy|\ trendiction|\ trex|\ tridentspider|\ trovator|\ true\_robot|\ tscholarsbot|\ tsm\ translation\-search\-machine|\ tswebbot|\ tulipchain|\ turingos|\ turnitinbot|\ tutorgigbot|\ tweetedtimes\ bot|\ tweetmemebot|\ twengabot|\ twice|\ twikle|\ twinuffbot|\ twisted\ pagegetter|\ twitturls|\ twitturly|\ tygobot|\ tygoprowler|\ typhoeus|\ u\.s\.\ government\ printing\ office|\ uberbot|\ ucb\-nutch|\ udmsearch|\ ufam\-crawler\-|\ ultraseek|\ unchaos|\ unisterbot|\ unidentified|\ unitek\ uniengine|\ universalsearch|\ unwindfetchor|\ uoftdb\_experiment|\ updated|\ url\ control|\ url\-checker|\ urlappendbot|\ urlblaze|\ urlchecker|\ urlck|\ urldispatcher|\ urlspiderpro|\ urly\ warning|\ urly\.warning|\ url\_gather|\ usaf\ afkn\ k2spider|\ usasearch|\ uss\-cosmix|\ usyd\-nlp\-spider|\ vacobot|\ vacuum|\ vadixbot|\ vagabondo|\ validator|\ valkyrie|\ vbseo|\ vci\ webviewer\ vci\ webviewer\ win32|\ verbstarbot|\ vericitecrawler|\ verifactrola|\ verity\-url\-gateway|\ vermut|\ versus\ crawler|\ versus\.integis\.ch|\ viasarchivinginformation\.html|\ vipr|\ virus\-detector|\ virus\_detector|\ visbot|\ vishal\ for\ clia|\ visweb|\ vital\ search'n\ urchin|\ vlad|\ vlsearch|\ voilabot|\ vmbot|\ vocusbot|\ voideye|\ voil|\ vortex|\ voyager|\ vspider|\ w3c\-webcon|\ w3c\_unicorn|\ w3search|\ wacbot|\ wanadoo|\ wastrix|\ water\ conserve\ portal|\ water\ conserve\ spider|\ watzbot|\ wauuu|\ wavefire|\ waypath|\ wazzup|\ wbdbot|\ web\ ceo\ online\ robot|\ web\ crawler|\ web\ downloader|\ web\ image\ collector|\ web\ link\ validator|\ web\ magnet|\ web\ site\ downloader|\ web\ sucker|\ web\-agent|\ web\-sniffer|\ web\.image\.collector|\ webaltbot|\ webauto|\ webbot|\ webbul\-bot|\ webcapture|\ webcheck|\ webclipping\.com|\ webcollage|\ webcopier|\ 
webcopy|\ webcorp|\ webcrawl\.net|\ webcrawler|\ webdatacentrebot|\ webdownloader\ for\ x|\ webdup|\ webemailextrac|\ webenhancer|\ webfetch|\ webgather|\ webgo\ is|\ webgobbler|\ webimages|\ webinator\-search2|\ webinator\-wbi|\ webindex|\ weblayers|\ webleacher|\ weblexbot|\ weblinker|\ weblyzard|\ webmastercoffee|\ webmasterworld\ extractor|\ webmasterworldforumbot|\ webminer|\ webmoose|\ webot|\ webpix|\ webreaper|\ webripper|\ websauger|\ webscan|\ websearchbench|\ website|\ webspear|\ websphinx|\ webspider|\ webster|\ webstripper|\ webtrafficexpress|\ webtrends\ link\ analyzer|\ webvac|\ webwalk|\ webwasher|\ webwatch|\ webwhacker|\ webxm|\ webzip|\ weddings\.info|\ wenbin|\ wep\ search|\ wepa|\ werelatebot|\ wget|\ whacker|\ whirlpool\ web\ engine|\ whowhere\ robot|\ widow|\ wikiabot|\ wikio|\ wikiwix\-bot\-|\ winhttp|\ wire|\ wisebot|\ wisenutbot|\ wish\-la|\ wish\-project|\ wisponbot|\ wmcai\-robot|\ wminer|\ wmsbot|\ woriobot|\ worldshop|\ worqmada|\ wotbox|\ wume\_crawler|\ www\ collector|\ www\-collector\-e|\ www\-mechanize|\ wwwoffle|\ wwwrobot|\ wwwster|\ wwwwanderer|\ wwwxref|\ wysigot|\ x\-clawler|\ x\-crawler|\ xaldon|\ xenu|\ xerka\ metabot|\ xerka\ webbot|\ xget|\ xirq|\ xmarksfetch|\ xqrobot|\ y!j|\ yacy\.net|\ yacybot|\ yanga\ worldsearch\ bot|\ yarienavoir\.net|\ yasaklibot|\ yats\ crawler|\ ybot|\ yebolbot|\ yellowjacket|\ yeti|\ yolinkbot|\ yooglifetchagent|\ yoono|\ yottacars\_bot|\ yourls|\ z\-add\ link\ checker|\ zagrebin|\ zao|\ zedzo\.validate|\ zermelo|\ zeus|\ zibber\-v|\ zimeno|\ zing-bottabot|\ zipppbot|\ zongbot|\ zoomspider|\ zotag\ search|\ zsebot|\ zuibot|\ zyborg|\ zyte\ [NC] RewriteRule .* - [F] #bad bots end
Please like or follow us on Facebook or Twitter for updates for this list!
Nothing on your list stopped a bot that keeps hitting my site. Its IP is 185.*.*.*. I’ve tried other statements in .htaccess and nothing works.
You need to block by IP then. Put this into .htaccess:
Order Allow,Deny
Allow from all
Deny from 185.0.0.0/8
Use with care, it is easy to block half of the Internet this way!
185.0.0.0/8 is 16,777,214 IP addresses!
I really would not advise blocking a full /8 range!
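If you do need an IP-based block, prefer the narrowest range that covers the offender. A sketch (the /24 range below is hypothetical; note that Apache 2.4 replaces the old Order/Allow/Deny syntax with Require):

```apache
# Apache 2.2 style (mod_access): block only a /24, not the whole /8.
# 185.0.100.0/24 is a placeholder range for illustration.
Order Allow,Deny
Allow from all
Deny from 185.0.100.0/24

# Apache 2.4 style (mod_authz_core) equivalent:
<RequireAll>
    Require all granted
    Require not ip 185.0.100.0/24
</RequireAll>
```

Check your server logs first to find the smallest range the bot actually uses.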
Anyway, great list; here are some vital additions.
Most active hacktools
Mozilla/5.0 Jorgee <—- One of the most used Hack/Vulnerability Scan Bots
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1 <— Most active WordPress Hackbot
htaccess rewrite: Mozilla/5\.0\ \(Windows\ NT\ 6\.1;\ WOW64;\ rv\:40\.0\)\ Gecko/20100101\ Firefox/40\.1
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) <—- Another hack bot, not as widely used anymore
Widely used vulnerability/hack scanners:
sysscan/1.0
masscan/
Clone/Scrapers
FeedWordPress <– WordPress blog duping/cloning tool
PHP/.* <— I have only ever seen malicious bots using a PHP user agent; it should be blocked.
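Escaped for .htaccess, these additions might look something like this (a sketch in the same format as the main list; test against your own logs before deploying, since the exact browser strings will also match the rare legitimate client using them):

```apache
# Sketch: extra condition block for the agents reported above.
# Append alongside the existing rules, keeping the [NC,OR]/[NC]
# flags consistent with the surrounding conditions.
RewriteCond %{HTTP_USER_AGENT} \
jorgee|\
mozilla/5\.0\ \(windows\ nt\ 6\.1;\ wow64;\ rv:40\.0\)\ gecko/20100101\ firefox/40\.1|\
mozilla/4\.0\ \(compatible;\ msie\ 6\.0;\ windows\ nt\ 5\.1;\ sv1\)|\
sysscan|\
masscan|\
feedwordpress|\
^php/ [NC]
RewriteRule .* - [F]
```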
Thank you for your contribution. I will add your lines to the list and parse some logs with them. They will be included in the next release.
Hey Thanks guys
It was hard to find this, but I knew someone was doing this.
Hi … Wouldn’t rate limiting be easier? This list looks pretty comprehensive, which is good, but I’m afraid performance goes kaput if all web server traffic is matched against this. You would think humans are easier to spot based on behaviour and clicks per session … just thinking out loud. Compliments for posting this!
Gerard,
These are technically three long .htaccess rules. The load on the server increases next to nothing when the rules are on. mod_security uses far more resources than this list. However, the list is not as robust and cannot be compared to active access policy management systems.
As many new robots have been identified, the list will be updated soon.
Congratulations! Very good information. I am a newbie in computing and I have a problem. My .htaccess file contains a strange list of expressions, and for this reason I do not know where to put the list of bad robots. Can you help me with this, please? Sorry for my English.
Thank you in advance.
Best regards,
Juan
Hi Juan, in your .htaccess, look for “RewriteEngine On” and paste the robots snippet right after it.
Thanks.
My .htaccess is very long. RewriteEngine On appears three times!
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
# BEGIN W3TC Browser Cache
AddType text/css .css
AddType text/x-component .htc
AddType application/x-javascript .js
AddType application/javascript .js2
AddType text/javascript .js3
AddType text/x-js .js4
AddType video/asf .asf .asx .wax .wmv .wmx
AddType video/avi .avi
AddType image/bmp .bmp
Etc. etc. etc. etc.
I can try to put the robots snippet between: RewriteEngine On and RewriteBase /
or after
What do you think?
Thank you, Juan
Place the robots list right after the first RewriteEngine On and it will be fine.
Great list. On my Apache 2.2 server it works great.
But… on my Apache 2.4 server I get this error in the errorlog:
AH00124: Request exceeded the limit of 10 internal redirects due to probable configuration error. Use 'LimitInternalRecursion' to increase the limit if necessary. Use 'LogLevel debug' to get a backtrace.
Do you have a solution?
Thanks.
Could you set LogLevel to debug? What does the Apache log say then?
Hi There
It appears as if the Google app (maybe iOS only) uses “gsa” and MS Outlook (maybe 2016 only) uses “oso”.
My application did not want to load in the Google app on my phone; I got it working by commenting out “gsa”.
My system sends mail containing an image link; the image did not load, so I got it working by commenting out “oso”.
I would appreciate it if you could inspect and advise.
Great list!
Hi Jacques,
We will take a look and report back.
Thanks for your input.
Thanks! Thanks! and Thanks!
Hi, awesome post.
I have a question: I would like to hide my robots.txt file from users and visitors, but not from bots. How can I set this up in the .htaccess file so that only users are denied and search bots are not? Can you guide me with the code?
Azar
Thanks for this fantastic list! Very much appreciated.
We added the list to our .htaccess. We have a 15 cron running on the server, and the list stopped the cron from running.
Any idea why? (Apache server). What did we do wrong? We really want to add it back because we get so many bot attacks, so would appreciate any assistance.
Sorry, meant to say 15 Minute cron running.
Hello Christine,
I had the same problem. Our cron uses the wget command. After removing ‘wget’ from the list, all crons ran fine.
You can give this a try as well.
The wget entry on the list prevents mirroring of the whole site.
You may consider running wget with this parameter (copied from a running script).
wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" REST of WGET CALL
Broke my site. Had to delete it. I’m running my site through Sucuri so perhaps this is a factor. I can actually block user agents in Sucuri, so if you have a list of user agents that are not formatted for the .htaccess file, this would be most appreciated. Cheers
Hello,
first thanks for the list.
I have removed the list from my system again, for the following reason.
On Android my Sleipnir browser was blocked. Of course you can adjust the list manually now, but who knows which other unknown browsers and systems might still be blocked in the future. The risk is too big.
Hello Steffen,
Thank you for pointing this out. By sharing information we can improve the list!
I am a user of a niche browser myself (Avant Browser), and I assure you that when your browser’s popularity is 6%, as with Sleipnir, you should keep a “secondary browser” handy in case your main one does not render a page correctly! This happens too often…
We are almost ready to release a new list with MauiBot and other traffic “suckers” – stay tuned. We will address this in the release.
Hello Tab Studio,
Is your list current as of now? I will follow on Facebook to keep receiving updates, but wanted to know whether I should wait for an updated list before installing this one.
The list is correct. Some “traffic-hungry” bots are still missing from it. The list has some flaws: as per the comments here, some rare browsers from the Japanese market are blocked and should be allowed.
Thank you!
When will you update the list? Based on the comments below, there have been many great contributions, but I do not see them in the list above.