[squeak-dev] SqueakSource indexability (aka should we just ask
crawlers to desist?)
Bert Freudenberg
bert at freudenbergs.de
Wed Apr 28 20:31:01 UTC 2010
On 28.04.2010, at 22:08, Ken Causey wrote:
>
>> -------- Original Message --------
>> Subject: Re: [squeak-dev] SqueakSource indexability (aka should we just
>> ask crawlers to desist?)
>> From: Bert Freudenberg <bert at freudenbergs.de>
>> Date: Wed, April 28, 2010 2:59 pm
>> To: The general-purpose Squeak developers list
>> <squeak-dev at lists.squeakfoundation.org>
>>
>>
>> On 28.04.2010, at 21:07, Ken Causey wrote:
>>>
>>> At times access to source.squeak.org becomes slower, as has been the
>>> case today. I can see in the logs that various web-crawlers are the
>>> likely culprit. Having the information there accessible via search
>>> engines is a wonderful thing but I have to suspect that the Seaside
>>> session IDs eliminate this option. (Of course when URLs like
>>> http://source.squeak.org/trunk.html are found on other sites they then
>>> become indexed.)
>>
>> Which URLs are the bots accessing?
>
> Well, without detailed analysis it seems to be everything. Feel free to
> look at ~squeaksource/apachelogs/.
>
>>
>>> Unless I'm mistaken about this, and I would appreciate any guidance, it
>>> seems like we need to add a robots.txt to the site which guides or
>>> simply asks crawlers to stay away. Thoughts? I'm no SEO expert.
>>
>> We do have a robots.txt:
>> http://source.squeak.org/robots.txt
>
> Aha. Well, I know little about this subject. But if this means what I
> think it means it seems that the crawlers are ignoring it.
I just read up on it. Glob patterns are *not* allowed; the single asterisk in the user-agent line is a special character, not a pattern match. We used

    User-agent: *
    Disallow: /@*

but it should be

    User-agent: *
    Disallow: /@

I'm going to fix that; let's see how it works out.
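A quick way to sanity-check the difference (a sketch, not part of the original thread): Python's standard-library `urllib.robotparser` follows the original robots.txt convention in which `Disallow` paths are literal prefixes with no wildcard support, so it behaves exactly the way described above. The `/@abc123/trunk` URL below is a made-up example path standing in for SqueakSource's session-style URLs.

```python
from urllib import robotparser

# Broken rule: the trailing '*' is matched literally, so the
# /@... URLs are NOT actually blocked.
broken = robotparser.RobotFileParser()
broken.parse(["User-agent: *", "Disallow: /@*"])

# Fixed rule: a bare path is a prefix match, blocking everything
# whose path starts with /@.
fixed = robotparser.RobotFileParser()
fixed.parse(["User-agent: *", "Disallow: /@"])

url = "http://source.squeak.org/@abc123/trunk"
print(broken.can_fetch("*", url))  # True  -- crawler is still allowed in
print(fixed.can_fetch("*", url))   # False -- crawler is asked to stay away
print(fixed.can_fetch("*", "http://source.squeak.org/trunk.html"))  # True
```

Note that some modern crawlers do honor `*` as a wildcard (a later extension), but per the original standard it carries no special meaning in a path, so a plain prefix is the portable form.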
- Bert -
More information about the Squeak-dev mailing list