Full-text search package?

Scott A Crosby crosby at qwes.math.cmu.edu
Tue Jan 22 22:32:26 UTC 2002


On 22 Jan 2002, Cees de Groot wrote:

> Scott A Crosby <crosby at qwes.math.cmu.edu> said:
> >Hack one up yourself? Reverse indexes aren't all that slow or large.
> >Roughly comparable to the size of the original data.
> >

> I was thinking about just that. The only thing I am mainly concerned
> about was the best way to store the stuff on disk.
>

The complexity is in dealing with the disk. If you can assume
everything is in RAM, things get a *lot* easier. I'm assuming this.

The rough size of a reverse index is no more than:
  4/(avg word length) * (original full database size) +
  k*(number of distinct words) +
    (size of distinct words)

Where the first term is the size of the actual reverse index (the size of
all the lists from words to objects).  The reason it's cheap is that we
represent a word in the original text as a 4-byte pointer to the text itself.
If words are long, or occur several times in that text, then this can
actually be smaller than the original dbase. It is also the only term
whose size is proportional to the dbase. Note that this is an
overestimate: it assumes that each word occurs in a document only once.

The other two terms are overhead.

The second term is the size of the table indexing the list of all words
(12 < k < 24 is implementation overhead; wait until it's implemented for the
exact number), and the third term is the cost of actually listing the
words.

Basically, if words have a fairly long average length (~5) and most words
occur a few times (~4) throughout the dbase, or within a file, the reverse
index will be no larger than the original data.
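The in-RAM structure described above can be sketched roughly like this (in
Python rather than Squeak, purely for illustration; all names here are my
own, not part of any actual package):

```python
# Sketch of an in-RAM reverse (inverted) index: each distinct word
# maps to the set of documents containing it.  One entry per
# (word, document) pair, matching the size estimate above.
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text.  Returns word -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # A word occurring several times in one document still
            # costs only one entry, hence "overestimate" above.
            index[word].add(doc_id)
    return index

docs = {1: "squeak vm maker", 2: "locking the vm", 3: "squeak locking"}
index = build_index(docs)
# index['squeak'] is {1, 3}; index['vm'] is {1, 2}
```

In a real implementation each entry would be a 4-byte pointer back into the
original text, which is where the 4/(avg word length) factor comes from.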

> By all means, step up - I'm busy enough as it is :-). I'll then try to fit on
> some mailman archive indexing code and a Swiki hack so you can actually query
> the squeak-dev archive.

Oh, will I have neat things for that. :)


The primitive query language would be:

 (A or B or C) and (M or N) and (X or Y or Z)

Basically, conjunctive normal form, which is an AND of ORs.

Each term '(A or B or C)' may search in a different index on the
same database.

For example:
   (Title contains 'squeak' or 'vmmaker') AND (Author contains 'crosby' or
'alan')
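Evaluating such a query against a reverse index is just set algebra: union
within a clause, intersection across clauses. A hedged sketch (again in
Python for illustration; `index` and `query_cnf` are hypothetical names):

```python
# Evaluate a CNF query (an AND of ORs) against an inverted index.
# `index` maps word -> set of matching document ids.
# `clauses` is a list of clauses; each clause is a list of words.
def query_cnf(index, clauses):
    result = None
    for clause in clauses:
        # OR within a clause: union of the posting sets.
        matches = set()
        for word in clause:
            matches |= index.get(word, set())
        # AND across clauses: running intersection.
        result = matches if result is None else result & matches
    return result if result is not None else set()

index = {'squeak': {1, 3}, 'vmmaker': {2}, 'crosby': {1, 2}}
# (squeak OR vmmaker) AND (crosby):
#   ({1, 3} | {2}) & {1, 2}  ->  {1, 2}
```

Note that each clause could consult a different index (Title, Author, Body)
on the same database, exactly as in the example above.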

Prefix matching is implemented by expanding 'lock*': it is replaced with
all indexed words that have that prefix. For example:

  (Body contains 'lock*') turns into:
   (Body contains 'lock' or 'locking' or 'locked' or ...)

Substring matching works the same way, but is somewhat more expensive
because you have to look at each element in the entire word list to see if
it contains the substring.
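The asymmetry between the two is easy to see in a sketch: with the distinct
words kept sorted, a prefix maps to a contiguous range found by binary
search, while a substring forces a scan of every word. (Illustrative Python;
the function names are mine.)

```python
# Expand 'lock*'-style patterns against the list of distinct words.
import bisect

def expand_prefix(sorted_words, prefix):
    # Prefix matches form a contiguous run in a sorted word list,
    # so two binary searches bound the whole range.
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_left(sorted_words, prefix + '\uffff')
    return sorted_words[lo:hi]

def expand_substring(words, sub):
    # No shortcut here: every distinct word must be examined.
    return [w for w in words if sub in w]

words = sorted(['lock', 'locked', 'locking', 'unlock', 'vm'])
# expand_prefix(words, 'lock')    -> ['lock', 'locked', 'locking']
# expand_substring(words, 'lock') -> ['lock', 'locked', 'locking', 'unlock']
```

Either way, the expanded word list is then just dropped into an OR clause
of the CNF query.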

The only reason it's not half-done already is that I'm refactoring String.

--
CPU time is roughly proportional to the number of matching documents for
all terms (A, B, C, D, ...), except that if *any* clause
'(A or B or C)' matches more than 3500 documents, you'll hit the
identityHash problem and need to include that patch. I could be clever and
avoid this somewhat, but I'm going for simplicity. You may have similar
problems if the same word occurs in thousands of documents.

This will also require an updated version of my string refactor patch.
(which does touch the VM, so you'll want a new VM afterwards.)
--

Does this sound sufficient for your needs, specifically the restricted
query language and the database-size limitations?

Scott




