DVD's, Project Gutenberg, Full Text Search

Cees de Groot cg at home.cdegroot.com
Tue Jan 29 20:46:27 UTC 2002


Les Tyrrell <tyrrell at canis.uiuc.edu> said:
>Documents in PG tend to have an informal structure- just enough to discern it
>for individual documents, or texts written by the same author and entered by
>the same transcriber, but not enough to count on absolutely, and not enough to
>re-use throughout the corpus.  
>
It'd be an effort to get it all right; it's indeed a pity that PG doesn't
pay more attention to producing stuff that's machine-digestible. But
for starters, just opening the .zip files and displaying whatever is in
there would do (initially, we could even assume an on-line connection
and just ship the image around).
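
To make the "open the .zip and display whatever is in there" idea concrete,
here is a rough sketch in Python (not Squeak, purely for illustration). The
function name read_zip_texts is made up, and decoding as Latin-1 is an
assumption on my part; individual PG files may well use other encodings.

```python
import io
import zipfile


def read_zip_texts(zip_bytes, encoding="latin-1"):
    """Return {member name: decoded text} for every file in a zip archive.

    zip_bytes is the raw content of the .zip (e.g. as fetched over the net).
    The default encoding is an assumption, not something PG guarantees.
    """
    texts = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only real files
            texts[info.filename] = archive.read(info).decode(encoding)
    return texts
```

A viewer would then just show each entry of the returned dictionary as-is,
without trying to guess at any structure inside the text.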

>The title/author index was also rather semi-structured... again, enough so
>that large chunks of it could be regularized, leaving only a few cases
>requiring manual attention.
>
I was also thinking of adding value here - pictures of authors,
biographies, links to Library of Congress catalog cards, etcetera (hey,
make SqF an amazon.com affiliate and link to the place where you can
buy a paper edition!). I don't have the author list handy, but it would
certainly be within the scope of a community project. Manually maintaining
such information would not be too much work, especially since everything
that's done usually retains its validity (until they discover, of course,
that Bacon and Shakespeare *were* one and the same guy ;-)).

-- 
Cees de Groot               http://www.cdegroot.com     <cg at cdegroot.com>
GnuPG 1024D/E0989E8B 0016 F679 F38D 5946 4ECD  1986 F303 937F E098 9E8B



More information about the Squeak-dev mailing list