DVD's, Project Gutenberg, Full Text Search

Les Tyrrell tyrrell at canis.uiuc.edu
Tue Jan 29 18:50:12 UTC 2002


> On 29 Jan 2002, Cees de Groot wrote:
>
> > I just made a tally of the current PG size. It's about 2Gb zipped. That fits
> > on a 3CD set or a single DVD, with room for title/author/abstract full text
> > indexing. Now, wouldn't be a worthwhile project to write a PG browser/indexer,
> > so in a year or so we could have ISO/UDF images on-line that anyone could
> > download, burn, and as a result have access to a major part of the world's
> > literature on their system? Whatever that system is?

Documents in PG tend to have an informal structure: enough to discern within
an individual document, or across texts written by the same author and entered
by the same transcriber, but not enough to count on absolutely, and not enough
to re-use throughout the corpus.  I wrote simple parsers for a number of these,
but in each case I had to go back to the text and fudge it a bit to get it
into "structured" form.  Still, if you are interested in a particular author,
say Jules Verne, you will find that it does not take much work to get the
texts into a structured form and to get a parser working for that particular
case.
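
As an illustration only (this is not the code mentioned in this message), a
per-author parser of the kind described above might look like the Python
sketch below.  It assumes a hypothetical convention in which each chapter
starts with a line such as "CHAPTER I" or "CHAPTER 12. Some Title"; real PG
texts vary, and the names HEADING and split_chapters are invented for the
example.

```python
import re

# Assumed (hypothetical) heading convention: "CHAPTER I", "CHAPTER 12. Title".
# Other transcribers will use other conventions and need their own pattern.
HEADING = re.compile(r"^\s*CHAPTER\s+([IVXLC]+|\d+)\b\.?\s*(.*)$")


def split_chapters(text):
    """Split a text into front matter and a list of chapter records."""
    front = []        # lines seen before the first chapter heading
    chapters = []     # one dict per chapter, in document order
    current = None
    for line in text.splitlines():
        m = HEADING.match(line)
        if m:
            current = {"number": m.group(1),
                       "title": m.group(2).strip(),
                       "body": []}
            chapters.append(current)
        elif current is None:
            front.append(line)
        else:
            current["body"].append(line)
    # Join the collected lines and trim surrounding blank lines.
    for chapter in chapters:
        chapter["body"] = "\n".join(chapter["body"]).strip()
    return "\n".join(front).strip(), chapters
```

A text whose headings do not fit the pattern simply ends up as front matter,
which is the point where the hand "fudging" would come in.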

The title/author index was also semi-structured: again, enough so that large
chunks of it could be regularized, leaving only a few cases requiring manual
attention.
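
A minimal sketch of that kind of regularization, again in Python and again
only an illustration: it assumes index entries of the hypothetical form
"Title, by Author" (the actual layout of the PG index may differ), and sets
aside any line it cannot split so it can be handled by hand.  The function
name regularize is invented for the example.

```python
def regularize(lines):
    """Split index lines into (title, author) pairs.

    Returns (parsed, manual): entries that fit the assumed
    "Title, by Author" form, and raw lines needing manual attention.
    """
    parsed, manual = [], []
    for raw in lines:
        line = " ".join(raw.split())   # collapse runs of whitespace
        if not line:
            continue                   # skip blank lines
        # Split on the LAST " by " so titles containing "by" stay intact.
        title, sep, author = line.rpartition(" by ")
        title = title.rstrip(" ,")
        if sep and title and author:
            parsed.append((title, author))
        else:
            manual.append(raw)
    return parsed, manual
```

Everything returned in the second list is a case that did not fit the assumed
pattern and would have to be corrected manually.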

Regularization of document and index structure, along with parsers for each of
the document structures, would be a manageable sub-project.  Somewhere I have
the code that I used when I was interested in doing this.

- les





