Blocks from strings
Nevin Pratt
nevin at smalltalkpro.com
Thu Jan 1 23:37:03 UTC 2004
Lukas Renggli wrote:
> I suppose you are working either with the Referer header or you are
> remembering the IP of the spider accessing robots.txt. Are you sure this
> actually works?
>
> As far as I know from Google they are running their spiders on a cluster
> of linux boxes. A site isn't scanned all at once, but every page is
> scheduled, fetched and indexed from different machines with different
> IPs. Are you using a different trick to keep the 'isSpider' information?
>
> Cheers,
> Lukas
I am using Cee's spider detection logic from his Janus package. That
logic works like this:
The "user-agent" field is retrieved from the header of the request and
tested via #userAgentIsSpider:, which in turn uses
#knownSpiderUserAgentPatterns. The code for those two methods is at the
end of this message. If #userAgentIsSpider: returns true, then the
request is from a spider.
Additionally (and independently of the #userAgentIsSpider: test), there
is an "autoSpider" detection feature that works like this: if any given
request asks for the "robots.txt" file, then that request is assumed to
be coming from a spider. The user agent and the IP of such a request are
then squirreled away, and all additional requests from that same
user-agent/IP pair are assumed to be from the same spider, up to a time
limit of one hour since the last request from that pair.
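The autoSpider bookkeeping could be sketched roughly like this. Note
this is only an illustration of the idea, not the actual Janus code:
the selectors, the knownSpiders dictionary, and the request accessors
(userAgent, remoteAddress) are all made up here.

```smalltalk
noteRobotsTxtRequest: aRequest
	"Any client fetching robots.txt is assumed to be a spider;
	remember when we last saw this user-agent/IP pair."
	knownSpiders
		at: aRequest userAgent , '|' , aRequest remoteAddress
		put: Time totalSeconds

isAutoSpider: aRequest
	"Answer whether this user-agent/IP pair has been marked as a
	spider within the last hour (3600 seconds)."
	| lastSeen |
	lastSeen := knownSpiders
		at: aRequest userAgent , '|' , aRequest remoteAddress
		ifAbsent: [^ false].
	^ Time totalSeconds - lastSeen <= 3600
```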
If anybody has any ideas for improving the spider detection, I'm sure
that both Cee and I would be very interested.
Lukas, based on your response above, I'm guessing that a better
algorithm for the "autoSpider" detection feature might be to use only
the first three octets of the IP address instead of all four. Thus, if,
say, a request comes in for a "robots.txt" file with "FooBar" in the
user-agent field, and the IP of that request is, say, 10.25.50.75, then
any subsequent request from any IP beginning with 10.25.50 that also has
"FooBar" as its user-agent would be deemed to be from the same spider.
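That comparison could be implemented by matching only the first three
octets of the dotted-decimal addresses; here is a rough sketch (the
selector is made up for illustration):

```smalltalk
sameSpiderNetwork: ipString as: otherIpString
	"Answer whether two dotted-decimal IP addresses share their
	first three octets, e.g. '10.25.50.75' and '10.25.50.80'."
	| octetsA octetsB |
	octetsA := ipString findTokens: '.'.
	octetsB := otherIpString findTokens: '.'.
	^ (octetsA copyFrom: 1 to: 3) = (octetsB copyFrom: 1 to: 3)
```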
What does everybody think?
Nevin
--
Nevin Pratt
Bountiful Baby
http://www.bountifulbaby.com
(801) 992-3137
******************************************
userAgentIsSpider: ua
	self knownSpiderUserAgentPatterns
		do: [:each | (each match: ua) ifTrue: [^ true]].
	^ false
******************************************
knownSpiderUserAgentPatterns
"Mail enhancements to cg at tric.nl"
^ #('''IndexTheWeb.com Crawler7''' '*Teradex Mapper*' 'ASPseek/*'
'AcontBot' 'AlkalineBOT/*' 'AmfibiBOT' 'Aruyo/*' 'Aspseek*'
'AstraSpider++ Ver*' 'AustroNaut DeepBlue' 'BBCi Searchbot*' 'BBot/*'
'BE-Crawler' 'BaiDuSpider' 'Baiduspider+*' 'BeepSearch' 'Buscaplus
Robi/*' 'CJ Spider/*' 'Caddbot/*' 'CheckUrl' 'Checkbot/*' 'CipinetBot*'
'Cityreview Robot*' 'Comodo HTTP(S) Crawler*' 'Convera Internet
Spider*' 'CyberSpyder Link Test/*' 'DYPcheck/*' 'DeadLinkCheck/*' 'Deep
Link Tester*' 'DeepIndex*' 'Der Bot aus Poppelsdorf*'
'EasyWebPromotion*' 'Educate Search*' 'EgotoBot/*'
'EnriqueElRobotdeMirago' 'ExactSeek Crawler/*' 'FAST Data Search
Crawler*' 'FAST-RealWebCrawler/*' 'FAST-WebCrawler/*' 'Fast PartnerSite
Crawler' 'Find Link Check Spider' 'Find Link Check Spider Crawler'
'Find LinkChecker Web Crawler Spider Gatherer' 'Gaisbot/*'
'GalaxyBot/*' 'Gather' 'Geobot/*' 'GetURL/*' 'Gigabot/*' 'Goblin/*'
'GoogleBot' 'Googlebot/*' 'GornKer Crawler' 'GrigorBot *' 'HALO, the
magical bot' 'HenriLeRobotMirago' 'HenryTheMiragoRobot*' 'Hitwise
Spider*' 'Html Link Validator*' 'IPiumBot*' 'IndexTheWeb.com Crawler*'
'InfoNaviRobot*' 'Infomine Virtual Library Crawler/*' 'Infoseek
SideWinder/*' 'Infosniff Sniffer' 'Inktomi Search' 'Insitor, Search
engine of gods' 'InternetSeer.com' 'Jabot/*' 'LWP::Simple/*'
'LexiBot/*' 'LiVe Link Verificator' 'LinkAlarm/*'
'LinkLint-checkonly/*' 'LinkScan Server/*' 'LinkScan/*' 'LinkSweeper/*'
'LinkWalker' 'Linkbot*' 'ListBidBot*' 'Lycos-News-Xml-Fetcher'
'Lycos_Spider_(modspider)' 'MOMspider/*' 'MSNBOT/*'
'Mediapartners-Google/*' 'MetaGer-LinkChecker' 'MetaTagRobot/*'
'MnogoSearch/*' 'NCSA Beta 1*' 'NPBot' 'NUTOMI_BOT'
'NationalDirectory-WebSpider/*' 'NetResearchServer*' 'Novell Web Search
Indexer' 'Openbot/3.*' 'Openfind*' 'Overture-WebCrawler/*' 'Pelusita
Spider*' 'QweeryBot*' 'RCcrawler*' 'RoboCrawl*' 'Robot/*' 'RobotAgent'
'RobotMidareru/*' 'Scrubby/*' 'SearchSpider*' 'Seeker.lookseek.com'
'SightQuestBot/*' '*(Slurp*' 'SpiderKU/0.9' 'SpiderS1' 'Spinne/*'
'Sqworm/*' 'StackRambler/*' 'SurveyBot/*' 'Szukacz/1*'
'Terrar-UK_Search*' 'TopSpots-Autobot/*' 'TrapScanner/*'
'TurnitinBot/*' 'URLSpiderPro/*' 'URL_Spider_Pro/*' 'UdmSearch'
'UdmSearch/*' 'Ultraseek' 'Vindex *' 'WMWWebBot' 'WebRACE/*' 'WebReaper
*' 'WebSauger *' 'WebTrends Link Analyzer' 'WebTrends/*' 'Wget/*'
'Willow Internet Crawler *' 'Xaldon WebSpider *' 'YellSpider' 'Zoek.nl
N/G *' '*ZyBorg/1.0*' 'amibot' 'fscrawler/*' 'htdig/*' 'ia_archiver'
'ilse/*' 'k2spider' 'libwww-perl/*' 'minibot(NaverRobot)/*'
'mogimogi/1.0' 'nabot' 'rnmcrawler' 'search.ch *' 'searchukBot *'
'semanticdiscovery/*' 'sitecheck.internetseer.com *' 'snarf/*'
'spider at spider.ilab.sztaki.hu *' 'toverobot/*' 'tumba!-viuva negra/*'
'uksearchpages.co.uk' 'verzamelgids.nl - Networking4all Bot/*'
'vspider' 'webrank' 'woelmuis.nl' )
More information about the Squeak-dev
mailing list