Blocks from strings

Nevin Pratt nevin at smalltalkpro.com
Thu Jan 1 23:37:03 UTC 2004


Lukas Renggli wrote:

> I suppose you are working either with the referer header or you are
> remembering the IP of the spider accessing robots.txt. Are you sure this
> actually works?
>
> As far as I know from Google they are running their spiders on a cluster
> of linux boxes. A site isn't scanned all at once, but every page is
> scheduled, fetched and indexed from different machines with different
> IPs. Are you using a different trick to keep the 'isSpider' information?
>
> Cheers,
> Lukas

I am using Cees' spider-detection logic from his Janus package.  That 
logic works like this:

The "user-agent" field is retrieved from the header of the request and 
tested via #userAgentIsSpider:, which in turn uses 
#knownSpiderUserAgentPatterns.  The code for those two methods is at the 
end of this message.  If #userAgentIsSpider: returns true, then the 
request is from a spider.
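
For illustration, a call might look like this from inside the class that 
implements those two methods (the user-agent strings below are just 
examples; #match: is Squeak's wildcard string matching, so the first one 
hits the 'Googlebot/*' pattern):

    self userAgentIsSpider: 'Googlebot/2.1 (+http://www.google.com/bot.html)'.  "true"
    self userAgentIsSpider: 'Mozilla/4.0 (compatible; MSIE 6.0)'  "false"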

Additionally (and independently of the #userAgentIsSpider: test), there 
is an "autoSpider" detection feature that works like this:  if any given 
request asks for the "robots.txt" file, then the request is assumed to 
be coming from a spider.  The user agent and the IP of such a request 
are then squirreled away, and all subsequent requests from that same 
user-agent/IP pair are assumed to be from the same spider, until one 
hour has elapsed since the last request from that pair.
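
To make that concrete, here is a minimal sketch of that bookkeeping.  
This is my own illustration, not the actual Janus code: spiderCache is 
assumed to be a Dictionary instance variable, and #headerAt: and 
#remoteAddress are assumed accessors on the request object.

noteRobotsTxtFetch: aRequest
    "Remember the user-agent/IP pair of a robots.txt fetch, with a timestamp."
    spiderCache
        at: {aRequest headerAt: 'user-agent'. aRequest remoteAddress}
        put: Time totalSeconds

isAutoSpider: aRequest
    "Answer whether this user-agent/IP pair was seen within the last hour,
    refreshing the timestamp on a hit so the hour runs from the last request."
    | key stamp |
    key := {aRequest headerAt: 'user-agent'. aRequest remoteAddress}.
    stamp := spiderCache at: key ifAbsent: [^ false].
    Time totalSeconds - stamp > 3600
        ifTrue: [spiderCache removeKey: key.
            ^ false].
    spiderCache at: key put: Time totalSeconds.
    ^ true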

If anybody has any ideas for improving the spider detection, I'm sure 
that both Cees and I would be very interested.

Lukas, based on your response above, I'm guessing that a better 
algorithm for the "autoSpider" feature might be to match on only the 
first three octets of the IP instead of all four.  Thus, if a request 
comes in for a "robots.txt" file with "FooBar" in the user-agent field, 
and the IP of that request is, say, 10.25.50.75, then any subsequent 
request from any IP beginning with 10.25.50 that also has "FooBar" in 
the user-agent field would be deemed to be from the same spider.  A 
sketch of that comparison follows below.
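
For instance, the comparison could look like this (again just a sketch 
of mine, assuming the IPs arrive as dotted-decimal strings):

sameClassCSubnet: ipA as: ipB
    "Answer whether the first three octets of two dotted-decimal IP strings match."
    ^ ((ipA findTokens: '.') first: 3) = ((ipB findTokens: '.') first: 3)

With that, sending #sameClassCSubnet:as: with '10.25.50.75' and 
'10.25.50.2' answers true, while '10.25.51.2' as the second argument 
answers false.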

What does everybody think?

Nevin

-- 
Nevin Pratt
Bountiful Baby
http://www.bountifulbaby.com
(801) 992-3137


******************************************
userAgentIsSpider: ua 
    "Answer whether the user-agent string ua matches any known spider pattern."
    self knownSpiderUserAgentPatterns
        do: [:each | (each match: ua)
                ifTrue: [^ true]].
    ^ false



******************************************
knownSpiderUserAgentPatterns
    "Mail enhancements to cg@tric.nl"
    ^ #('''IndexTheWeb.com Crawler7''' '*Teradex Mapper*' 'ASPseek/*'
    'AcontBot' 'AlkalineBOT/*' 'AmfibiBOT' 'Aruyo/*' 'Aspseek*'
    'AstraSpider++ Ver*' 'AustroNaut DeepBlue' 'BBCi Searchbot*' 'BBot/*'
    'BE-Crawler' 'BaiDuSpider' 'Baiduspider+*' 'BeepSearch'
    'Buscaplus Robi/*' 'CJ Spider/*' 'Caddbot/*' 'CheckUrl' 'Checkbot/*'
    'CipinetBot*' 'Cityreview Robot*' 'Comodo HTTP(S) Crawler*'
    'Convera Internet Spider*' 'CyberSpyder Link Test/*' 'DYPcheck/*'
    'DeadLinkCheck/*' 'Deep Link Tester*' 'DeepIndex*'
    'Der Bot aus Poppelsdorf*' 'EasyWebPromotion*' 'Educate Search*'
    'EgotoBot/*' 'EnriqueElRobotdeMirago' 'ExactSeek Crawler/*'
    'FAST Data Search Crawler*' 'FAST-RealWebCrawler/*'
    'FAST-WebCrawler/*' 'Fast PartnerSite Crawler'
    'Find Link Check Spider' 'Find Link Check Spider Crawler'
    'Find LinkChecker Web Crawler Spider Gatherer' 'Gaisbot/*'
    'GalaxyBot/*' 'Gather' 'Geobot/*' 'GetURL/*' 'Gigabot/*' 'Goblin/*'
    'GoogleBot' 'Googlebot/*' 'GornKer Crawler' 'GrigorBot *'
    'HALO, the magical bot' 'HenriLeRobotMirago' 'HenryTheMiragoRobot*'
    'Hitwise Spider*' 'Html Link Validator*' 'IPiumBot*'
    'IndexTheWeb.com Crawler*' 'InfoNaviRobot*'
    'Infomine Virtual Library Crawler/*' 'Infoseek SideWinder/*'
    'Infosniff Sniffer' 'Inktomi Search'
    'Insitor, Search engine of gods' 'InternetSeer.com' 'Jabot/*'
    'LWP::Simple/*' 'LexiBot/*' 'LiVe Link Verificator' 'LinkAlarm/*'
    'LinkLint-checkonly/*' 'LinkScan Server/*' 'LinkScan/*'
    'LinkSweeper/*' 'LinkWalker' 'Linkbot*' 'ListBidBot*'
    'Lycos-News-Xml-Fetcher' 'Lycos_Spider_(modspider)' 'MOMspider/*'
    'MSNBOT/*' 'Mediapartners-Google/*' 'MetaGer-LinkChecker'
    'MetaTagRobot/*' 'MnogoSearch/*' 'NCSA Beta 1*' 'NPBot' 'NUTOMI_BOT'
    'NationalDirectory-WebSpider/*' 'NetResearchServer*'
    'Novell Web Search Indexer' 'Openbot/3.*' 'Openfind*'
    'Overture-WebCrawler/*' 'Pelusita Spider*' 'QweeryBot*' 'RCcrawler*'
    'RoboCrawl*' 'Robot/*' 'RobotAgent' 'RobotMidareru/*' 'Scrubby/*'
    'SearchSpider*' 'Seeker.lookseek.com' 'SightQuestBot/*' '*(Slurp*'
    'SpiderKU/0.9' 'SpiderS1' 'Spinne/*' 'Sqworm/*' 'StackRambler/*'
    'SurveyBot/*' 'Szukacz/1*' 'Terrar-UK_Search*' 'TopSpots-Autobot/*'
    'TrapScanner/*' 'TurnitinBot/*' 'URLSpiderPro/*' 'URL_Spider_Pro/*'
    'UdmSearch' 'UdmSearch/*' 'Ultraseek' 'Vindex *' 'WMWWebBot'
    'WebRACE/*' 'WebReaper *' 'WebSauger *' 'WebTrends Link Analyzer'
    'WebTrends/*' 'Wget/*' 'Willow Internet Crawler *'
    'Xaldon WebSpider *' 'YellSpider' 'Zoek.nl N/G *' '*ZyBorg/1.0*'
    'amibot' 'fscrawler/*' 'htdig/*' 'ia_archiver' 'ilse/*' 'k2spider'
    'libwww-perl/*' 'minibot(NaverRobot)/*' 'mogimogi/1.0' 'nabot'
    'rnmcrawler' 'search.ch *' 'searchukBot *' 'semanticdiscovery/*'
    'sitecheck.internetseer.com *' 'snarf/*'
    'spider@spider.ilab.sztaki.hu *' 'toverobot/*'
    'tumba!-viuva negra/*' 'uksearchpages.co.uk'
    'verzamelgids.nl - Networking4all Bot/*' 'vspider' 'webrank'
    'woelmuis.nl')








