New subject: Weighted indexing

8 Dec 2005


      On 12/8/05, Chris Muller afunkyobject@yahoo.com wrote:
...
After a couple of days running the load, I realized that common prepositions,
'THE' 'TO' 'A' 'I' 'AND' 'OF' 'IS' 'IN' 'IT' 'THAT' 'FOR' 'ON' 'THIS' 'BE'
'YOU' 'WITH' 'BUT' 'HAVE' 'NOT' 'IF' 'AT', etc., etc., were really wasting time
and space.  When you have 150,000 occurrences of the same word, it costs a lot
to keep inserting more and, yet, the value of that entry continues to dilute.
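As a rough illustration of the point above, here is a minimal Python sketch of dropping such words before they reach the index. The word list is just the one quoted above, and the function name index_terms is made up for this example; it is not part of Magma or any library mentioned in this thread.

```python
import re

# Hypothetical stop-word list, copied from the words quoted above. A real
# list would be longer and probably derived from corpus frequencies.
STOP_WORDS = {
    "THE", "TO", "A", "I", "AND", "OF", "IS", "IN", "IT", "THAT", "FOR",
    "ON", "THIS", "BE", "YOU", "WITH", "BUT", "HAVE", "NOT", "IF", "AT",
}

def index_terms(text):
    """Return the upper-cased words of text that are worth indexing."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    return [w.upper() for w in words if w.upper() not in STOP_WORDS]
```

Only the surviving terms would then be inserted, so the 150,000-entry buckets never get built in the first place.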
Well, I'm going to tackle this two-step - first a browsable mail
archive (by date, senders, references), then full-text. About the
full-text I'm still not too sure what to do - quality FTX is hard, and
there are lots of good libraries out there (Lucene, mnoGoSearch,
Namazu, ...). So I'll probably reuse what's there rather than
implement something that doesn't work.
...
Separate #subject indexing might be a good idea.
I'd probably skip all
those lines that begin with ">" (in replies to other posts) simply because
there are so darn many of them.
What you want, of course, is weighted indexing. Words in #subject
should have a higher weight than words in the message body, which
should have a higher weight than words in attachments, which should
have a higher weight than words in quoted messages (">" lines).
Probably the thing to do is build an index that holds two integers -
weight and object - instead of a single one, or build an index per
weight (probably easier to do in Magma).
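To make the "two integers" variant concrete, here is a small Python sketch of an index whose entries are (weight, message id) pairs. Everything in it is an assumption for illustration: the class and function names are invented, the weight values are arbitrary (only their ordering comes from the paragraph above), and none of this is Magma's API.

```python
from collections import defaultdict
import re

# Assumed weights: only the ordering (subject > body > attachment > quoted)
# comes from the discussion; the numbers themselves are made up.
WEIGHTS = {"subject": 8, "body": 4, "attachment": 2, "quoted": 1}

def split_body(body):
    """Separate a message body into fresh text and quoted ('>') lines."""
    fresh, quoted = [], []
    for line in body.splitlines():
        (quoted if line.lstrip().startswith(">") else fresh).append(line)
    return "\n".join(fresh), "\n".join(quoted)

def words(text):
    """Split text into upper-cased words."""
    return [w.upper() for w in re.findall(r"[A-Za-z0-9']+", text)]

class WeightedIndex:
    """Maps each word to a list of (weight, message_id) pairs."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, message_id, subject, body, attachments=()):
        fresh, quoted = split_body(body)
        fields = [("subject", subject), ("body", fresh), ("quoted", quoted)]
        fields += [("attachment", text) for text in attachments]
        for field, text in fields:
            for word in words(text):
                self.postings[word].append((WEIGHTS[field], message_id))

    def lookup(self, word):
        """Return the (weight, message_id) pairs recorded for a word."""
        return self.postings.get(word.upper(), [])
```

The index-per-weight alternative would keep one plain word-to-message index for each entry in WEIGHTS and query all of them; the information stored is the same, it is just split differently.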
Then, of course, after retrieval, you need to score every hit, based
on weight, number of occurrences (too many and the score could
presumably go down again), proximity if you're searching on multiple
keywords, etcetera. Finally, you present the whole hitlist in scored
order.
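A very rough Python sketch of that scoring step, assuming the index hands back (weight, message id) pairs per keyword. The function name, the SATURATION threshold and the damping formula are all invented for this example. Proximity is left out because it would need word positions, which this sketch does not store.

```python
from collections import defaultdict

# Assumed threshold: occurrences beyond this start to count against a hit.
SATURATION = 5

def score_hits(postings_by_keyword):
    """Rank message ids given {keyword: [(weight, message_id), ...]}."""
    scores = defaultdict(float)
    for keyword, postings in postings_by_keyword.items():
        total = defaultdict(int)   # summed weight per message
        count = defaultdict(int)   # number of occurrences per message
        for weight, message_id in postings:
            total[message_id] += weight
            count[message_id] += 1
        for message_id in total:
            # Divide by the excess over SATURATION so that a word repeated
            # far too often scores lower again.
            excess = max(0, count[message_id] - SATURATION)
            scores[message_id] += total[message_id] / (1 + excess)
    # Highest score first; ties broken by message id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A message that matches several keywords simply accumulates the contribution of each, so it tends to rank above single-keyword hits.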
As I said, there's lotsacode out there and there's a reason most of it
is ugly and complicated :-)

Re: Indexing attributes that return collections?