On Thu, 27 Dec 2007 20:13:35 +0100, Lukas Renggli wrote:
> by the big duplication with all the different versions of the same code.
I'm relatively good at search engine optimization (no, not the buzzword business but the server-technical side) because of customer demand (sites with sometimes > 10,000 pages). Could you/Philippe post some example URLs of content which ought to be indexed? I can give it a try and report what I find.
> http://www.squeaksource.com/robots.txt
> http://www.squeaksource.com/sitemap.xml.gz
These look reasonable indeed, except for the Expires response header on /robots.txt and the .mcz files (how would anybody ever reset that date? It seems to be impossible, and perhaps *this* looks like fraud to them; I personally never go past 12 months, just to be sure that some day it's me who's back in control).
And then there is the absence of a Last-Modified field in the response headers. That shouldn't be hard to add, so that crawlers don't have to work on assumptions and the webmaster has a bit more control over content negotiation. *This* is the hot spot (and not an Expires header past the funeral date) when you don't want them to re-index now but may later have file format/content/organizational/conceptual changes which you are unable to imagine today.
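To make the two header points concrete, here is a minimal sketch (not SqueakSource's actual code, which is Smalltalk; this is just an illustration in Python) of emitting a Last-Modified header from a file's modification time plus an Expires capped at 12 months, rather than an unresettable date far in the future:

```python
# Sketch only: HTTP date headers for a resource whose last change time
# (`mtime`, a Unix timestamp) is known to the server.
from email.utils import formatdate
import time

def caching_headers(mtime, max_age_days=365):
    """Return Last-Modified plus an Expires capped at `max_age_days` from now.

    Last-Modified lets crawlers revalidate with If-Modified-Since instead of
    guessing; the bounded Expires means control comes back within a year.
    """
    now = time.time()
    return {
        "Last-Modified": formatdate(mtime, usegmt=True),
        "Expires": formatdate(now + max_age_days * 86400, usegmt=True),
    }
```

A conditional-GET handler would then compare the If-Modified-Since request header against `mtime` and answer 304 Not Modified when nothing changed.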
Also, the project pages have an Expires value only a minute or so after the Date header; why would anybody bother to follow their links?
Well then, I tried some of the project pages linked from sitemap.xml, but Google is by no means interested: "We're sorry, but there isn't enough text on this webpage; at least a few paragraphs are necessary to provide results. You can try entering a different URL, or check the box labeled 'Include other pages on my site linked from this URL'."
There ya go, another incarnation of the Squeak+Documentation problem (seemingly many (most?) authors don't write anything up in their SqueakSource entries, which could then be put onto the project pages the crawler sees).
This *can* be the reason (but perhaps the content type of the .mcz files plays a role, too).
How about putting at least "Squeak, SqueakSource, <project name>, <tags>" into the HTML keywords meta tag on the project pages?
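Such a tag could be generated mechanically from the project record; a hedged sketch (the project name and tags below are placeholders, not taken from any real SqueakSource page):

```python
# Sketch: build the suggested keywords meta tag for a project page.
from html import escape

def keywords_meta(project_name, tags):
    # Fixed site-wide keywords first, then the per-project data.
    keywords = ["Squeak", "SqueakSource", project_name] + list(tags)
    return '<meta name="keywords" content="%s">' % escape(
        ", ".join(keywords), quote=True)
```

For a hypothetical project this would yield e.g. `<meta name="keywords" content="Squeak, SqueakSource, Regex, parsing">`.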
Note that the directory listing produces a slightly different result when visited by a GoogleBot.
What's in that? I can make more mistakes when attempting to find out than you can imagine. Could you post an example generated by the software from the Regex project (which Google doesn't like to index)? Also, are there differences in the response headers?
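For the header comparison, one could capture the headers twice (e.g. with curl -I, once with a normal and once with a GoogleBot User-Agent) and diff them; a small sketch of the diff step, with the capture itself left out and the header values in the usage note made up for illustration:

```python
# Sketch: report which response headers differ between two captures,
# e.g. a normal-browser request vs. a spoofed-GoogleBot request.
def header_diff(normal, bot):
    """Return {header: (normal_value, bot_value)} for differing headers only;
    a missing header shows up as None on that side."""
    keys = set(normal) | set(bot)
    return {k: (normal.get(k), bot.get(k))
            for k in sorted(keys)
            if normal.get(k) != bot.get(k)}
```

For example, if the bot response lacked an Expires header that the normal response had, the diff would contain just `{"Expires": ("<normal value>", None)}`.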
/Klaus