[Seaside] Re: spiders and docs
Nevin Pratt
nevin at bountifulbaby.com
Sat May 7 08:55:07 CEST 2005
Cees de Groot wrote:
> On Tue, 12 Apr 2005 00:48:21 +0200, Avi Bryant <avi.bryant at gmail.com>
> wrote:
>
>> Ok, so clearly this is a problem we need to deal with (I've never seen
>> it because all the apps I've deployed have login pages :). Does
>> anyone have any suggestions?
>
>
> Seaside should catch requests for /?.*/robots.txt probably and return
> a file prohibiting robots from accessing any URL. That would stop
> 99.99% of the bots before they can do any damage.
>
> Then invent some mechanism (Janus-like, maybe - it is quite a generic
> thing) to selectively open up parts of Seaside apps to bots. Or use
> 'my' HV+Seaside suggestion, forbidding bots to enter the Seaside part.
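Cees' robots.txt interception could be sketched as follows. This is a hedged, language-agnostic illustration in Python (the handler name and response shape are assumptions, not Seaside API): any request ending in robots.txt is answered with a disallow-all policy before the application ever sees it, which is what stops well-behaved bots from walking into session URLs.

```python
# Disallow-all policy: well-behaved bots will not crawl any URL.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

def handle_request(path, serve_app):
    """Return a (status, body) pair; intercept robots.txt requests.

    `serve_app` is a stand-in for dispatching into the web application.
    """
    if path.rstrip("/").endswith("robots.txt"):
        return ("200 OK", ROBOTS_TXT)
    return serve_app(path)
```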
Forgive this late response to Cees' post, but I just noticed it.
Anyway, Bountiful Baby uses Janus-like code to detect spider bots and
feed them cached pages. The cached pages, in turn, have each link on
the page linking to yet another cached page. So, if the spider is
detected, it has no particular effect on the site. It is when the
spider is undetected that the damage can be done.
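The detect-and-serve-cached-pages approach described above can be sketched like this. The user-agent tokens and cache structure are assumptions for illustration, not the actual Janus-like code used on Bountiful Baby:

```python
# Tokens commonly found in crawler User-Agent strings (illustrative list).
KNOWN_BOTS = ("googlebot", "slurp", "msnbot", "crawler", "spider")

def is_spider(user_agent):
    """Crude detection: match known tokens in the User-Agent header."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in KNOWN_BOTS)

def serve(user_agent, page_id, cache, start_session):
    if is_spider(user_agent):
        # Cached pages link only to other cached pages, so a detected
        # spider never creates or touches session state.
        return cache.get(page_id, cache["home"])
    # Normal visitors get a live (session-backed) page.
    return start_session(page_id)
```

The important property is the one Nevin notes: a detected spider has no effect on the image, because it never leaves the static cache. The damage comes only from spiders that slip past detection.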
As some may remember, I started this thread by commenting that my
Bountiful Baby image had recently grown to some pretty huge numbers
(around a gig). Well, the image is currently hovering at around 40 MB,
which is very reasonable, and it has been hovering around there for a
couple of weeks now, without an image restart or anything. And, as
anybody can see, 40 MB is a much more reasonable size.
I now think my runaway-growth image was due either to an undetected
spider (or spiders) run amok, or else to a deliberate denial-of-service
attack. The way it ramped up, though, over a period of more than a week
(it definitely wasn't sudden), leads me to think it was an undetected
spider run amok rather than a DoS. And I think the growth suddenly
stopped because the spider's author changed their spider code to be more
benign-- and I'd bet that was because Bountiful Baby was not the only
site that the spider "bothered". But all of that is just a guess-- I
have no hard data to substantiate it.
But I also think the following suggestion by Avi is ingenious:
>Yes. I wonder what strategies we can use to detect and cope with
>that. One I can think of is to link the expiry time to how much the
>application has been used: if all you do is request the homepage, your
>session will expire very quickly, but if you look around a little more
>you're given more time. That seem reasonable?
>
>
This should be a preferences-tunable parameter.
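Avi's usage-linked expiry could look something like this sketch. The constant names and values are invented here purely as an example of the "preferences-tunable parameters" being suggested; they are not anything Seaside actually defines:

```python
# Tunable preferences (illustrative values, not Seaside defaults):
BASE_SECONDS = 60          # a visitor who only fetches the homepage
BONUS_PER_REQUEST = 120    # extra lifetime earned per additional request
MAX_SECONDS = 1800         # ceiling for long-lived interactive sessions

def session_expiry(request_count):
    """Seconds of idle time allowed before this session may be reaped.

    A one-request session (likely a bot) expires quickly; a visitor who
    looks around earns a longer timeout, capped at MAX_SECONDS.
    """
    return min(BASE_SECONDS + BONUS_PER_REQUEST * (request_count - 1),
               MAX_SECONDS)
```

With these values a homepage-only hit lives for one minute, while a session that has made a dozen requests keeps its full half-hour, so runaway spiders stop accumulating long-lived sessions in the image.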
Nevin