Celeste improvements!

Lex Spoon lex at cc.gatech.edu
Wed Oct 3 16:53:50 UTC 2001


danielv at netvision.net.il wrote:
> [Lex wants a full in memory index to make filters run fast]
> Well, sometimes I do too :-)
> Well, if we make the cache size a parameter, you can set it to 1e6, and
> thus cache everything. The only down side is the two files will be
> slightly bigger than having one unified file as we do now. Probably
> still smaller than the current indices because of the efficient binary
> encoding of numbers.

Let's just give it a try before we back off to something that isn't as
good.  Having the whole index there is *nice*, because you can
efficiently search the whole database.  If you've gotten used to the one
folder regime, this can be hard to picture.  But for example, if you
misfile a message, you can dig it up by searching on the participant and
the subject.  I just tried, and I did a participant search on 18k
messages in about 1 second.  That would be infeasible if the majority of
the index entries had to be reparsed from the messages file.
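
To make the point concrete, here is a rough sketch (in Python rather than Squeak, purely for illustration) of a participant/subject filter that runs over index entries held entirely in memory. The `IndexEntry` fields and the `search` function are hypothetical names, not Celeste's actual classes; the only claim is that the scan never touches the messages file.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    # Hypothetical fields; Celeste's real index entries may differ.
    msg_id: int
    participants: str
    subject: str


def search(index, participant=None, subject=None):
    """Linear scan over in-memory entries; the messages file is never read."""
    hits = []
    for entry in index:
        if participant and participant.lower() not in entry.participants.lower():
            continue
        if subject and subject.lower() not in entry.subject.lower():
            continue
        hits.append(entry)
    return hits


# A tiny made-up index standing in for the real one.
index = [
    IndexEntry(1, "lex at cc.gatech.edu", "Celeste improvements!"),
    IndexEntry(2, "danielv at netvision.net.il", "Celeste improvements!"),
    IndexEntry(3, "someone at example.org", "unrelated"),
]

misfiled = search(index, participant="lex", subject="celeste")
```

Because every entry is already parsed, the cost is one pass over the list, whichever folder the message was filed under.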

I grant that it would take more disk space.  But it will still take less
than the message headers themselves, much less the entire messages, and
so it's not a huge blowup.  Also, disk is really cheap nowadays; I don't
even bother running "empty trash" and "compact", unless I'm about to ftp
my database somewhere for some reason.


Of course, it would also take more system memory, which is still
precious.  But for 20k messages, at 1 KB per header (I have no idea
what's typical), that would be about 20MB, which is reasonable IMHO for
something as important as email.

In the long run, we could probably figure out what kind of filters are
the most widely used, and add a bunch of on-disk indices for those
particular filters.  Then we'd have speed, low memory, and fast
startup/shutdown.  Though still at the cost of a lot of disk space.

> About the cache entry format - I like it being self descriptive as you
> propose. I'm pretty sure that making the file binary will get us most of
> the speed improvement. Celeste already knows how to spit out/read in a
> binary number - for the categories file. I'm not sure we need more code
> to do that (or maybe I just didn't understand the part about reading
> multiple numbers).

Hmm, we should try that first, then.  Maybe it's fast enough.


> The important thing to make strings fast is not to use CrLfStream...
> its detection mechanism backfires horribly when you switch between
> binary and text modes. Just keep everything binary, and we're fine. 

Additionally, converting numbers to and from ASCII strings is
problematic.  


> BTW, we need one more field type - the date. It can be written as a
> number but requires proper wrapping in the image.
> 

Instead of dealing with wrapping (if I understand the concern
correctly), couldn't we also allow large integers to be saved?  Most of
the "large" integers would still be fairly small, so we could use a tag
bit to indicate multi-word integers.  That is, if a number is less than
2^31, save it as is; if it's between 2^31 and 2^62, first save the high
bits of the number and add a tag bit, then save the low bits without the
tag.
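
A minimal sketch of that tag-bit scheme, written in Python rather than Squeak for illustration. It assumes 32-bit big-endian words with the top bit as the tag and non-negative values only; the word size, byte order, and the names `encode_int` and `decode_int` are assumptions, not anything Celeste currently does.

```python
import struct

TAG = 0x80000000       # top bit of a 32-bit word: "a second word follows"
LOW_MASK = 0x7FFFFFFF  # the 31 payload bits carried by each word


def encode_int(n):
    """Encode 0 <= n < 2**62 as one or two big-endian 32-bit words."""
    if not 0 <= n < 2**62:
        raise ValueError("value out of range for this sketch")
    if n < 2**31:
        # Small number: saved as is, tag bit clear.
        return struct.pack(">I", n)
    high = n >> 31          # upper 31 bits, saved first with the tag set
    low = n & LOW_MASK      # lower 31 bits, saved second without the tag
    return struct.pack(">II", high | TAG, low)


def decode_int(data, offset=0):
    """Return (value, next_offset) for the integer starting at offset."""
    (word,) = struct.unpack_from(">I", data, offset)
    if not word & TAG:
        return word, offset + 4
    (low,) = struct.unpack_from(">I", data, offset + 4)
    return ((word & LOW_MASK) << 31) | low, offset + 8
```

With this layout, anything below 2^31 still costs a single word, and a reader only needs to look at the tag bit of the first word to know whether a second word follows.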


-Lex



