[Seaside] Re: spiders and docs

Nevin Pratt nevin at bountifulbaby.com
Sat May 7 08:55:07 CEST 2005


Cees de Groot wrote:

> On Tue, 12 Apr 2005 00:48:21 +0200, Avi Bryant <avi.bryant at gmail.com>  
> wrote:
>
>> Ok, so clearly this is a problem we need to deal with (I've never seen
>> it because all the apps I've deployed have login pages :).  Does
>> anyone have any suggestions?
>
>
> Seaside should probably catch requests for /?.*/robots.txt and
> return a file prohibiting robots from accessing any URL. That would
> stop 99.99% of the bots before they can do any damage.
>
> Then invent some mechanism (Janus-like, maybe - it is quite a
> generic thing) to selectively open up parts of Seaside apps to bots.
> Or use 'my' HV+Seaside suggestion, forbidding bots to enter the
> Seaside part.


Forgive this late response to Cees' post, but I just noticed it.
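
For reference, the blanket-deny file Cees describes is only two lines,
and any well-behaved crawler that fetches it will skip the entire
site:

    User-agent: *
    Disallow: /

It doesn't help against the impolite ones, of course, which is where
detection comes in.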

Anyway, Bountiful Baby uses Janus-like code to detect spider bots and 
feed them cached pages.  Each link on a cached page points, in turn, 
to yet another cached page, so a detected spider has no particular 
effect on the site.  It is when a spider goes undetected that the 
damage can be done.
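
For the curious, the detection side can be as simple as matching the
User-Agent header against known crawler signatures.  The sketch below
is illustrative only-- it is not our actual code, and the
headerAt:ifAbsent: accessor is an assumption, since header access
differs across Seaside versions:

    isSpiderRequest: aRequest
        "Answer whether aRequest looks like it comes from a crawler.
         headerAt:ifAbsent: is an assumed accessor, not a real
         Seaside selector."
        | agent |
        agent := (aRequest headerAt: 'user-agent' ifAbsent: [''])
            asLowercase.
        ^ #('googlebot' 'slurp' 'msnbot' 'spider' 'crawler')
            anySatisfy: [:sig | (agent indexOfSubCollection: sig) > 0]

Naturally, this only catches spiders that announce themselves in
their User-Agent; the quiet ones slip right past it.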

As some of you may remember, I started this thread by commenting that 
my Bountiful Baby image had recently grown to some pretty huge numbers 
(around a gig).  Well, the image is currently hovering at around 40 
MB, and has been hovering there for a couple of weeks now, without an 
image restart or anything.  As anybody can see, 40 MB is a much more 
reasonable size.

I now think my runaway image growth was due either to one or more 
undetected spiders run amok, or else to a deliberate denial-of-service 
attack.  The way it ramped up, though, over a period of more than a 
week (it definitely wasn't sudden), leads me to think it was an 
undetected spider run amok rather than a DoS.  And I think the problem 
suddenly disappeared because the spider's author changed their code to 
be more benign-- and I'd bet that was because Bountiful Baby was not 
the only site the spider "bothered".  But all of that is just a 
guess-- I have no hard data to substantiate it.

But I also think the following suggestion by Avi is genius:

>Yes.  I wonder what strategies we can use to detect and cope with
>that.  One I can think of is to link the expiry time to how much the
>application has been used: if all you do is request the homepage, your
>session will expire very quickly, but if you look around a little more
>you're given more time.  That seem reasonable?

This should be a preferences-tunable parameter.
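
Something along the lines of the sketch below, say, where every name
(baseTimeout, timeoutStep, maxTimeout, timeoutSeconds:) is an assumed
preference rather than Seaside's actual API-- each request a session
serves buys it a little more lifetime, up to a hard ceiling:

    noteRequestServed
        "Called once per request this session serves.  Sessions start
         on a short fuse; each request extends the fuse, capped at
         maxTimeout.  All of these selectors are assumed names."
        hitCount := (hitCount ifNil: [0]) + 1.
        self timeoutSeconds:
            ((self baseTimeout + (hitCount * self timeoutStep))
                min: self maxTimeout)

A bot that only ever requests entry pages would then hold each session
for mere seconds, while a real customer browsing around keeps theirs
alive.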

Nevin


