Moore's law and why persistence may not be necessary. (fwd)

Wed Jan 23 16:10:17 UTC 2002

Hmm. I realized that the list may be interested, especially Celeste
users/hackers.

Cheers,
Bijan.

---------- Forwarded message ----------
Date: Wed, 23 Jan 2002 11:09:19 -0500 (EST)
From: Bijan Parsia <bparsia at email.unc.edu>
To: Scott A Crosby <crosby at qwes.math.cmu.edu>
Subject: Re: Moore's law and why persistence may not be necessary.

On Wed, 23 Jan 2002, Scott A Crosby wrote:

> On Wed, 23 Jan 2002, Bijan Parsia wrote:
> 
> 
> > For what it's worth, I've started on a MailMessageIndex and
> > IndexAdaptor. Thus far, it's working spiffy. I was able, *easily* to
> 
> Wait! You're not supposed to be futzing with my indexers, ONLY the
> adaptors. Cause the indexers are going to seriously change soon. :)

I subclassed DocumentIndex so I could use the following #add: method:

add: anIndexFileEntry
	"Add a document into the database."
	| words |
	words _ indexAdaptor terms: anIndexFileEntry.
	words do: [ :word |
		| canonword set |
		canonword _ indexAdaptor canonicalize: word.
		set _ dbase at: canonword ifAbsent: [IdentitySet new].
		set add: anIndexFileEntry.
		dbase at: canonword put: set]

So, what you're looking up is an IndexFileEntry rather than the entire
text. I won't mind rewriting :)
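
For anyone who wants to poke at the idea without the indexer loaded, here is a
rough workspace sketch of the same word-to-entries mapping. It is only an
illustration: the sample documents, the integer ids standing in for
IndexFileEntries, the plain Dictionary standing in for dbase, and the
lowercasing standing in for #canonicalize: are all my assumptions here, not
the actual indexer code.

```smalltalk
"Toy inverted index: word -> Set of document ids.
 Everything here is made up for illustration."
| index docs |
index := Dictionary new.
docs := Dictionary new.
docs at: 1 put: 'Bijan wrote about the indexer'.
docs at: 2 put: 'Lex replied about Celeste'.
docs keysAndValuesDo: [:id :text |
	(text findTokens: ' ') do: [:word |
		"asLowercase is a stand-in for indexAdaptor canonicalize:"
		(index at: word asLowercase ifAbsentPut: [Set new]) add: id]].
index at: 'about'    "a Set containing 1 and 2"
```

The point to notice is that the index only ever holds the ids (or, in the real
thing, the IndexFileEntries), never the message text itself.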

> Or are you writing a mail indexer that uses mine as an engine? (by using
> containment, not inheritance.)

No, I'm subclassing. The key bit is to not have to store all the text in
the db. I don't *want* all the text :) I want all the text to stay on
disk.

My test set is 2234 squeak list messages. For that, my MailMessageIndex
does:
	Time millisecondsToRun: [self anyOf: #(bijan lex mark)]
		10

Compared with:
	Time millisecondsToRun: [self filteredMessagesIn: '1Sept-Oct2001']
		421

Where the filter is:
	m participantHas: #(bijan lex mark)

Note the disparity in what they're doing. The MailMessageIndex is doing a
*full text* search (minus attachments) of those messages. So it returns
more messages (344) than the Celeste search, which is limited to the From,
To, sender fields (returns 281).
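
To give a feel for why the indexed lookup is cheap, here is a guess at the
shape of an "any of these terms" query: a union of the per-term sets. This is
not the source of #anyOf:, just a sketch with a hand-built Dictionary and
made-up ids.

```smalltalk
"Sketch of an any-of query as a union of per-term sets.
 The index contents are invented for illustration."
| index hits |
index := Dictionary new.
index at: 'bijan' put: (Set withAll: #(1 3)).
index at: 'lex' put: (Set withAll: #(2 3)).
hits := Set new.
#('bijan' 'lex' 'mark') do: [:term |
	"a term with no entry contributes nothing"
	hits addAll: (index at: term ifAbsent: [#()])].
hits size    "3"
```

That is one dictionary lookup per term, rather than a pass over every message.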

Natch, if you want the more specific search, that won't help :) The point is
that the MailMessageIndex *smokes* the best-case current Celeste search.

Smokes.

Totally.

Indexing takes a *while*, of course. And, on this Win2000 box, Squeak
*engulfs* all available CPU time, which makes it hard to do anything
else. I'm going to try threading the indexing in Squeak, and see if that
helps.
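
Roughly what I have in mind is forking the indexing loop at a low priority.
A minimal sketch, with a placeholder where the real per-message indexing
would go; the message count is just my test set size. This should keep the
Squeak UI responsive, but whether it eases the load on the rest of the
machine is a separate question I haven't tested.

```smalltalk
"Sketch: run a long job below the UI priority.
 The loop body is a placeholder, not the real indexing code."
[1 to: 2234 do: [:i |
	"index message i here (placeholder)"
	Processor yield]]
		forkAt: Processor userBackgroundPriority.
```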

> > substitute in an IndexFileEntry for the fulltext in the Index. Which
> > means that this 1) can be used directly by Celeste with fairly minimal
> > modifications and 2) it's not going to take a *hell* of a lot more memory
> > than the current Celeste situation and 3) you get blazing full text
> > searches for that extra memory.
> 
> You *might* be surprised. I'm curious as to how big the index actually is.
> I'll expect 30-60mb.

Hmm. Y'know, I have little idea of how to measure space in Squeak :)

Here are my current VM statistics for memory:

memory			43,036,664 bytes
	old			38,839,168 bytes (90.2%)
	young		376,812 bytes (0.9%)
	used		39,215,980 bytes (91.1%)
	free		3,820,684 bytes (8.9%)

The dbase has 27933 entries. The contents of the entries are references to
existing IndexFileEntries.
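
One crude way to get a number might be to compare free space before and after
building a structure, since Smalltalk garbageCollect answers the bytes
available after a full GC. A sketch, with a dummy Dictionary of empty Sets
standing in for the real index (that stand-in is my assumption, so the result
says nothing about the actual dbase). The figure is only approximate: if the
VM grows the heap in between, the difference will be off.

```smalltalk
"Rough space estimate: free bytes before minus free bytes after.
 The dummy index below is illustrative only."
| before after index |
before := Smalltalk garbageCollect.
index := Dictionary new.
1 to: 27933 do: [:i | index at: i put: Set new].
after := Smalltalk garbageCollect.
before - after    "approximate bytes held by index"
```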

At some point today, I hope to index a 93 meg message file (lunch break!),
which currently has a 3.1 meg index. Hmm. Maybe I'll first try replacing the
index file with one based on MailMessageIndex...

Cheers,
Bijan Parsia.