Full-text search, performance results.

Wed Jan 23 18:54:42 UTC 2002

Looks like you are off to a good start...  I once wrote an in-memory indexer
that I recall as being rather fast.  If I can get a second or two free, I'll
see if I can do some kind of comparable performance test.  If it still holds
up to what I remember, I'll post it.  It does many of the optimizations you
mention elsewhere ( such as seeking the fastest route to completing a query ).
I recall being able to index fairly large numbers of phrases, each phrase
averaging around 2.5 words, rather quickly.  At its core, it indexes words,
the words then reference the phrases, and the phrases then reference
documents.  The largest collection I indexed with it had nearly 1 million
phrases, the others were often around 1/2 million phrases.  Those indices were
then stored in an object database, which slowed the entire process down
considerably ( by many orders of magnitude ).

If  I were to use it to index e-mail ( which, btw, I do have some code that I
started to do this with... that code later forming the basis for the VW IMAP
code, and bootstrapped VW's MIME parsing  ) I would probably reccomend that
the Document object hang on to the e-mail message id, which could then be used
as a key in a relational database to retrieve the actual document in full raw
text form ( headers and all, unadulterated ).  Message ID's are "guaranteed"
to be unique ( at least, required to be ).

If the index can fit into RAM in its entirety, then I would simply regenerate
it on start up.  I don't think this would take that long, given rapid access
to document phrase vector sets.  I dimly recall being able to do large
collections fairly rapidly.  VisualWorks images doing this indexing rarely
went much beyond 1 Gigabyte in size, but usually when I was committing to the
database rather than doing an in-memory only test.

Nowadays, you can build machines with several Gigabytes of RAM.  But, I
suspect that indexing the entire Squeak mailing list ( for which I have nearly
the whole thing, but not accessible at the moment ), could prove challenging.
So, doing this along the lines of a persistent index would be another thing to
pursue.

- les