On Thu, 27 Dec 2007 20:13:35 +0100, Lukas Renggli wrote:
> by the big duplication with all the different versions of the same code.
I'm relatively good at search engine optimization (no, not the buzzword business but the server-technical side) because of customer demand (sites with sometimes > 10,000 pages). Could you/Philippe post some example URLs of content which ought to be indexed? I can give it a try and report what I find.
> http://www.squeaksource.com/robots.txt
> http://www.squeaksource.com/sitemap.xml.gz
These look reasonable indeed, except for the Expires response header on /robots.txt and the .mcz files (how would anybody ever reset that date? It seems to be impossible, and perhaps *this* looks like fraud to them; I personally never go past 12 months, just to be sure that some day it's me who's back in control).
And then there is the absence of a Last-Modified field in the response headers. That shouldn't be hard to add, so that crawlers don't have to work on assumptions and the webmaster has a bit more control over content negotiation. *This* is the hot spot (and not an Expires header past the funeral date) when you don't want them to re-index now but may later have file format/content/organizational/conceptual changes which you are unable to imagine today.
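To make the two header points concrete, here is a minimal sketch (not SqueakSource's actual code, which is Smalltalk; this is just an illustration in Python) of emitting a Last-Modified header from a file's modification time plus an Expires capped at 12 months, rather than an unresettable date far in the future:

```python
# Sketch only: HTTP date headers for a resource whose last change time
# (`mtime`, a Unix timestamp) is known to the server.
from email.utils import formatdate
import time

def caching_headers(mtime, max_age_days=365):
    """Return Last-Modified plus an Expires capped at `max_age_days` from now.

    Last-Modified lets crawlers revalidate with If-Modified-Since instead of
    guessing; the bounded Expires means control comes back within a year.
    """
    now = time.time()
    return {
        "Last-Modified": formatdate(mtime, usegmt=True),
        "Expires": formatdate(now + max_age_days * 86400, usegmt=True),
    }
```

A conditional-GET handler would then compare the If-Modified-Since request header against `mtime` and answer 304 Not Modified when nothing changed.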
Also, the project pages have an Expires value only a minute or so after the Date header; why would anybody bother to follow their links?
Well then, I tried some of the project pages linked from sitemap.xml, but Google is by no means interested: "We're sorry, but there isn't enough text on this webpage; at least a few paragraphs are necessary to provide results. You can try entering a different URL, or check the box labeled 'Include other pages on my site linked from this URL'."
There ya go, another incarnation of the Squeak+Documentation problem (seemingly many (most?) authors don't write anything up in their SqueakSource entries, which could then be put onto the project pages the crawler sees).
This *can* be the reason (but perhaps the content type of the .mcz files plays a role, too).
How about putting at least "Squeak, SqueakSource, <project name>, <tags>" into the HTML keywords meta tag on the project pages?
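Such a tag could be generated mechanically from the project record; a hedged sketch (the project name and tags below are placeholders, not taken from any real SqueakSource page):

```python
# Sketch: build the suggested keywords meta tag for a project page.
from html import escape

def keywords_meta(project_name, tags):
    # Fixed site-wide keywords first, then the per-project data.
    keywords = ["Squeak", "SqueakSource", project_name] + list(tags)
    return '<meta name="keywords" content="%s">' % escape(
        ", ".join(keywords), quote=True)
```

For a hypothetical project this would yield e.g. `<meta name="keywords" content="Squeak, SqueakSource, Regex, parsing">`.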
Note that the directory listing produces a slightly different result when visited by a GoogleBot.
What's in that? I can make more mistakes when attempting to find out than you can imagine. Could you post an example generated by the software from the Regex project (which Google doesn't like to index)? Also, are there differences in the response headers?
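For the header comparison, one could capture the headers twice (e.g. with curl -I, once with a normal and once with a GoogleBot User-Agent) and diff them; a small sketch of the diff step, with the capture itself left out and the header values in the usage note made up for illustration:

```python
# Sketch: report which response headers differ between two captures,
# e.g. a normal-browser request vs. a spoofed-GoogleBot request.
def header_diff(normal, bot):
    """Return {header: (normal_value, bot_value)} for differing headers only;
    a missing header shows up as None on that side."""
    keys = set(normal) | set(bot)
    return {k: (normal.get(k), bot.get(k))
            for k in sorted(keys)
            if normal.get(k) != bot.get(k)}
```

For example, if the bot response lacked an Expires header that the normal response had, the diff would contain just `{"Expires": ("<normal value>", None)}`.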
/Klaus