Celeste improvements!

Lex Spoon lex at cc.gatech.edu
Tue Oct 2 17:32:05 UTC 2001


danielv at netvision.net.il wrote:
> The index file has two parts with different prices for recovery -
> If you lose the id -> message file offsets information, you need to
> rescan the messages file. This is unacceptable.
> 
> If you lose everything else in there, you need to parse a message from a
> known location, instead of parsing a few specific lines. This isn't
> really a big deal.
> 
> So - let's split the file in two. 
> 

Ahh, very nice!  This is all true.


> 
> 2. A cache for preparsed header fields. To make it fast, we'll keep it
> small (1000 last entries). Maybe binary, maybe not. Not logged, it's
> just a cache.

This part is a problem with FilterCeleste -- I frequently run filters
across the whole mail database, now that I can, and I'd hate to give it
up.  Perhaps surprisingly, this kind of filtering is actually reasonably
fast -- 20000 string compares just isn't too bad on a modern desktop
machine.

Thus, it would be very nice to keep parsed headers for everything. 
Surely there is a way to dump just a simple data structure en masse to
disk, and to load it back....  Notice that categories files are already
pretty fast -- and they are based on *binary* dumps, not text ones.

By the way, let's not forget that we only need a subset of the header
fields, not all of them.  This list is probably at least close, to get
started with:

	date
	from,to,cc
	subject
	references
	mailing list
	uidl


Most of these are strings, although for date we would want to have a
numeric value, as well.  

Here's a possible overall file format, that would hopefully be fast:

	1. In the header, there is a list of which fields are there, and what
their type is (int or string, though it's accidentally extensible)

	2. After the header, have a list of message id's,  all in a row so that
loading and saving is a bulk operation.

	3. After that, for each field, dump the data, all in a row.  Again,
these can be bulk operations.
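
A rough sketch of the three-part layout above, in Python rather than
Squeak purely for illustration.  Everything named here is an assumption
and not something Celeste does today: the functions save_index and
load_index, the INT/STRING type tags, big-endian 32-bit integers, UTF-8
strings, and message ids being plain integers.

```python
import io
import struct

INT, STRING = 0, 1  # hypothetical type tags stored in the file header


def _write_str(out, s):
    # Length-prefixed UTF-8 string (an assumed encoding).
    data = s.encode("utf-8")
    out.write(struct.pack(">I", len(data)))
    out.write(data)


def _read_str(inp):
    (n,) = struct.unpack(">I", inp.read(4))
    return inp.read(n).decode("utf-8")


def save_index(out, fields, ids, columns):
    """fields: list of (name, tag); ids: list of ints; columns: name -> list."""
    # 1. Header: how many fields and messages, then each field's name and type.
    out.write(struct.pack(">II", len(fields), len(ids)))
    for name, tag in fields:
        _write_str(out, name)
        out.write(struct.pack(">B", tag))
    # 2. All message ids in a row, written in one bulk call.
    out.write(struct.pack(">%dI" % len(ids), *ids))
    # 3. One column per field, each written all in a row.
    for name, tag in fields:
        col = columns[name]
        if tag == INT:
            out.write(struct.pack(">%dI" % len(col), *col))
        else:
            for s in col:
                _write_str(out, s)


def load_index(inp):
    n_fields, n_msgs = struct.unpack(">II", inp.read(8))
    fields = []
    for _ in range(n_fields):
        name = _read_str(inp)
        (tag,) = struct.unpack(">B", inp.read(1))
        fields.append((name, tag))
    ids = list(struct.unpack(">%dI" % n_msgs, inp.read(4 * n_msgs)))
    columns = {}
    for name, tag in fields:
        if tag == INT:
            raw = inp.read(4 * n_msgs)
            columns[name] = list(struct.unpack(">%dI" % n_msgs, raw))
        else:
            columns[name] = [_read_str(inp) for _ in range(n_msgs)]
    return fields, ids, columns
```

Writing to an io.BytesIO and reading it back with load_index should
return the same fields, ids, and columns.  Because the header lists the
fields, a reader can skip a column it does not recognize only if the
column sizes are also recorded; this sketch does not do that.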


To make the numbers efficient, there is surely a way to dump out arrays
of 32-bit numbers to an external stream.  (And if there's not, let's add
one!)  Ideally, there would be a second layer that allows
variable-length integers, as well -- e.g., if in the raw stream you see
an integer with the high bit set, then clear that bit and combine it
with the following integer.
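
One possible reading of that variable-length layer, sketched in Python
for illustration.  The details are assumptions: each raw word is 32 bits
with 31 payload bits, the high bit means "more follows", and the most
significant chunk comes first.  The function names are hypothetical.

```python
import struct

PAYLOAD_BITS = 31
CONT = 1 << PAYLOAD_BITS   # high bit of a 32-bit word: "combine with next"
MASK = CONT - 1            # low 31 bits carry the payload


def encode_words(values):
    """Turn non-negative integers into a flat list of 32-bit words."""
    words = []
    for v in values:
        if v < 0:
            raise ValueError("only non-negative integers are supported")
        chunks = [v & MASK]
        v >>= PAYLOAD_BITS
        while v:
            chunks.append(v & MASK)
            v >>= PAYLOAD_BITS
        chunks.reverse()                 # most significant chunk first
        for c in chunks[:-1]:
            words.append(c | CONT)       # flag every word except the last
        words.append(chunks[-1])
    return words


def decode_words(words):
    """Inverse of encode_words."""
    values = []
    acc = 0
    for w in words:
        # Clear the high bit and combine with what has been read so far.
        acc = (acc << PAYLOAD_BITS) | (w & MASK)
        if not (w & CONT):
            values.append(acc)
            acc = 0
    return values


def pack_words(words):
    # The bulk dump of raw 32-bit words (big-endian is an assumption).
    return struct.pack(">%dI" % len(words), *words)


def unpack_words(data):
    return list(struct.unpack(">%dI" % (len(data) // 4), data))
```

Values below 2**31 take exactly one word, so an ordinary array of small
integers costs nothing extra; only larger values spill into a second
word.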

To make strings efficient, it might help to first save all the lengths,
and then to save all the character data after the lengths.  Or maybe
not, I dunno.
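
For what it's worth, the lengths-first idea could look something like
this in Python.  The function names, the leading count word, the
big-endian 32-bit lengths, and UTF-8 are all assumptions made for the
sketch; whether it is actually faster would need measuring.

```python
import struct


def pack_strings(strings):
    """Write a count, then all the lengths, then all the character data."""
    encoded = [s.encode("utf-8") for s in strings]
    count = struct.pack(">I", len(encoded))
    lengths = struct.pack(">%dI" % len(encoded), *(len(b) for b in encoded))
    return count + lengths + b"".join(encoded)


def unpack_strings(data):
    """Inverse of pack_strings."""
    (count,) = struct.unpack_from(">I", data, 0)
    # All lengths come back from a single bulk read.
    lengths = struct.unpack_from(">%dI" % count, data, 4)
    pos = 4 + 4 * count
    strings = []
    for n in lengths:
        strings.append(data[pos:pos + n].decode("utf-8"))
        pos += n
    return strings
```

The appeal is that the lengths are one bulk integer read and the
character data is one contiguous block, so the loader only has to slice
it up.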

Well, just an idea.  Let's not give up on having the whole index in
memory just yet -- it's very handy, and it might be salvageable.

-Lex



