DVD's, Project Gutenberg, Full Text Search

Tyrrell, Les LTyrrell at keyww.com
Tue Jan 29 23:28:36 UTC 2002


Cees de Groot <cg at home.cdegroot.com> wrote in message news:<a371n3$ps1$1 at home.cdegroot.com>...
> Les Tyrrell <tyrrell at canis.uiuc.edu> said:
> >Documents in PG tend to have an informal structure- just enough to discern it
> >for individual documents, or texts written by the same author and entered by
> >the same transcriber, but not enough to count on absolutely, and not enough to
> >re-use throughout the corpus.  
> >
> It'd be an effort to get it all right, it's indeed a pity that PG doesn't
> give more attention at producing stuff that's machine digestible. But
> for starters, just opening the .zip files and displaying whatever is in
> there would do (for starters, we could even assume an on-line connection
> and just ship the image around).

Right, that is an important point.  The texts are quite readable without specialized viewers; that was PG's intent from the very beginning.  I was concerned about structure in my case because I had intended to ingest them into an information processing system.  However, you don't necessarily have to worry about that to gain some benefit from having some form of search capability.
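
To make the "search without structure" point concrete, here is a rough sketch in Python (purely illustrative, not anything PG or Squeak ships). It assumes each .zip holds plain .txt files in a Latin-1-compatible encoding; the names iter_texts, build_index and search are made up for the example. It treats every text as a bag of words and ignores document structure entirely.

```python
import io
import re
import zipfile
from collections import defaultdict

WORD = re.compile(r"[a-z']+")

def iter_texts(archives):
    """Yield (archive_name, member_name, text) for each .txt file found.

    `archives` maps a label to a path or file-like object that
    zipfile.ZipFile can open (an assumption made for this sketch).
    """
    for label, source in archives.items():
        with zipfile.ZipFile(source) as zf:
            for member in zf.namelist():
                if member.lower().endswith(".txt"):
                    # Assumed encoding; latin-1 never fails to decode.
                    yield label, member, zf.read(member).decode("latin-1")

def build_index(archives):
    """Map each lowercased word to the set of (archive, member) containing it."""
    index = defaultdict(set)
    for label, member, text in iter_texts(archives):
        for word in WORD.findall(text.lower()):
            index[word].add((label, member))
    return index

def search(index, query):
    """Return the documents that contain every word of the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits
```

Something like search(build_index({"etext": "etext.zip"}), "to be") would then list the files containing both words (the archive name here is hypothetical).  No knowledge of chapters, headers or transcriber conventions is needed for this much.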

> >The title/author index was also rather semi-structured... again, enough so
> >that large chunks of it could be regularized, leaving only a few cases
> >requiring manual attention.
> >
> I was also thinking of adding value here - pictures of authors,
> biographies, links to Library of Congress catalog cards, etcetera (hey,
> make SqF an amazon.com affiliate and link to the place where you can
> buy a paper edition!). I don't have the author list handy, but it would
> certainly be in the scope of a community project. Manually maintaining
> such information would not be too much, especially since everything that's
> done usually remains its validity (until they discover, of course, that Bacon
> and Shakespeare *were* one and the same guy ;-)).

Those are great ideas.  IMO, PG has been more successful at archiving these works than at promoting them.  But a PG-oriented community could be rather interesting.

- les



More information about the Squeak-dev mailing list