On 12/8/05, Chris Muller <afunkyobject@yahoo.com> wrote:
After a couple of days running the load, I realized that indexing common words, 'THE' 'TO' 'A' 'I' 'AND' 'OF' 'IS' 'IN' 'IT' 'THAT' 'FOR' 'ON' 'THIS' 'BE' 'YOU' 'WITH' 'BUT' 'HAVE' 'NOT' 'IF' 'AT', etc., etc., was really wasting time and space. When you have 150,000 occurrences of the same word, it costs a lot to keep inserting more and, yet, the value of that entry continues to dilute.
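A minimal sketch of the stop-word filtering described above, in Python rather than Squeak. The stop list is just the words quoted in the message, and the `index_terms` helper and its tokenizing regex are assumptions for illustration, not anything Magma provides:

```python
import re

# Hypothetical stop list: only the words named in the message above.
STOP_WORDS = frozenset(
    "the to a i and of is in it that for on this be you with but have not if at".split()
)

def index_terms(text, stop_words=STOP_WORDS):
    """Return the lower-cased words of `text` that are worth indexing."""
    # Crude tokenizer (an assumption): runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Drop the very common words so they never reach the index.
    return [word for word in words if word not in stop_words]
```

For example, `index_terms("The value of that entry is diluted")` keeps only `value`, `entry` and `diluted`.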
Well, I'm going to tackle this in two steps - first a browsable mail archive (by date, senders, references), then full-text. About the full-text I'm still not too sure what to do - quality FTX is hard, and there are lots of good libraries out there (Lucene, mnoGoSearch, Namazu, ...). So I'll probably reuse what's there rather than implement something that doesn't work.
Separate #subject indexing might be a good idea.
I'd probably skip all those lines that begin with ">" (in replies to other posts) simply because there are so darn many of them.
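Skipping the quoted lines could look roughly like this. It is a sketch that assumes the message body is already available as one plain-text string; `unquoted_lines` is a made-up helper name:

```python
def unquoted_lines(body):
    """Return the lines of a message body that are not quoted replies."""
    # A line counts as quoted when its first non-blank character is ">".
    return [
        line
        for line in body.splitlines()
        if not line.lstrip().startswith(">")
    ]
```

Only the lines that survive this filter would then be handed to the word indexer.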
What you want, of course, is weighted indexing. Words in #subject should have a higher weight than words in the message body, which should have a higher weight than words in attachments, which should have a higher weight than words in quoted messages (">" lines).
Probably the thing to do is build an index that holds two integers - weight and object - instead of a single one, or build an index per weight (probably easier to do in Magma).
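To make the "two integers per entry" variant concrete, here is a small in-memory sketch. The weight values, the field names and the `WeightedIndex` class are all assumptions; only the ordering subject > body > attachment > quoted comes from the text, and nothing here reflects how Magma stores its indexes:

```python
from collections import defaultdict

# Hypothetical weights; only their relative order is taken from the thread.
WEIGHTS = {"subject": 8, "body": 4, "attachment": 2, "quoted": 1}

class WeightedIndex:
    """Maps each word to a list of (weight, oid) pairs."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, oid, field, words):
        """Index `words` found in `field` of the object identified by `oid`."""
        weight = WEIGHTS[field]
        for word in words:
            self._postings[word.lower()].append((weight, oid))

    def lookup(self, word):
        """Return a copy of the (weight, oid) entries recorded for `word`."""
        return list(self._postings.get(word.lower(), []))
```

The index-per-weight alternative would instead keep one plain word-to-oid index for each of the four fields and query them separately.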
Then, of course, after retrieval, you need to score every hit, based on weight, number of occurrences (too many and the score could presumably go down again), proximity if you're searching on multiple keywords, etcetera. Finally, you present the whole hitlist in scored order.
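One possible shape for that scoring step, as a sketch only. The formula (best weight times a logarithmically dampened occurrence count, summed over the search terms) is an assumption chosen for brevity: it flattens the effect of many occurrences rather than making the score go down again, and it ignores proximity entirely. `rank` is a hypothetical helper:

```python
import math
from collections import defaultdict

def rank(postings_by_term):
    """Score hits and return [(oid, score), ...], best first.

    `postings_by_term` maps each search term to its (weight, oid) entries.
    """
    scores = defaultdict(float)
    for postings in postings_by_term.values():
        # Collect the weights of every occurrence of this term per object.
        weights_by_oid = defaultdict(list)
        for weight, oid in postings:
            weights_by_oid[oid].append(weight)
        for oid, weights in weights_by_oid.items():
            # Best weight counts fully; repeated occurrences add less and less.
            scores[oid] += max(weights) * (1 + math.log(len(weights)))
    # Highest score first; ties broken by oid to keep the order stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Calling `rank({"magma": [(8, 1), (4, 1), (1, 2)], "index": [(4, 2)]})` puts object 1 ahead of object 2, because the subject-line hit outweighs a body hit plus a quoted hit.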
As I said, there's lotsacode out there and there's a reason most of it is ugly and complicated :-)
Yes, an index per weight is how I have designed it so far, but adding a field #weight to MaHashIndexRecord so each entry in the index can have two Integers, one oid and one weight, as you said, would not be hard. Is this something that would facilitate powerful indexing algorithms?
Beats me :-). I'm hardly an expert in this area.
I have toyed with MySQL's FTX in MySQL 4.0 - we loaded the Chamber of Commerce data for the Benelux into it, together with web and email info, and people could update their entries to add a description and keywords. The weighting done by MySQL was surprisingly useful - maybe it's described somewhere.
magma@lists.squeakfoundation.org