Celeste Addresses?

Mon Apr 30 21:19:03 UTC 2001

About the state of the Celeste index -
A thread a couple of months ago reviewed some of the same issues.
JohnMaloney was involved with the original Babar and mentioned reading
the index on demand and storing the index as binary as two ways to make
it faster.

I've recently changed the index to use binary storage. Unfortunately,
this required quite a lot of refactoring, Celeste was being edited at
the time, and I lost the image, etc., so the code is somewhere between
lost and hopelessly out of date and incompatible. However, I did reach
a few conclusions -
1. Binary is indeed faster than ASCII. To be precise - not
having to convert the Integers (msgId, message time, length and offset)
is faster. This is as John remembered. The parts of the index that are
strings (subject, from, to, etc) remain the same speed (at best, see
next point). 
2. Checking for CrLfs can be very expensive, particularly for strings
in a binary file, where CrLfStream tries to figure out the right mode
anew for each string.
3. Binary done right is still no panacea. Significant improvement, but
no big breakthrough.
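
To make points 1 and 2 concrete, here is a minimal sketch in Python
(not Squeak, and not the lost code). The record layout is an assumption
made up for illustration: four unsigned 32-bit integers followed by a
length-prefixed subject. The integers are read back without any text
conversion, and the string is read by its stored length, so nothing
has to scan it for line endings.

```python
import struct

# Hypothetical layout, not Celeste's actual format:
# msgId, message time, length, offset as big-endian unsigned 32-bit ints,
# then a 16-bit byte count followed by the subject bytes.
_INTS = struct.Struct(">IIII")
_STRLEN = struct.Struct(">H")


def pack_entry(msg_id, time, length, offset, subject):
    """Encode one index entry as bytes."""
    raw = subject.encode("utf-8")
    return _INTS.pack(msg_id, time, length, offset) + _STRLEN.pack(len(raw)) + raw


def unpack_entry(buf, pos=0):
    """Decode one entry starting at pos; return (entry, next position)."""
    msg_id, time, length, offset = _INTS.unpack_from(buf, pos)
    pos += _INTS.size
    (n,) = _STRLEN.unpack_from(buf, pos)
    pos += _STRLEN.size
    # The string is sliced by its stored length, so embedded CR/LF
    # bytes are never inspected.
    subject = buf[pos:pos + n].decode("utf-8")
    return (msg_id, time, length, offset, subject), pos + n
```

Only one string field is shown; from, to and the rest would presumably
follow the same length-prefixed pattern.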

A few ideas that came to mind as a result -
1. Organize the index according to categories, so we can read only the
current category, as per Lex.
2. Write the index in such a way that we can read only as much of a
category as is needed (for example, the latest 200 messages). The
conceptually simple way is to write the index backwards (latest first),
probably the practical way is more along the lines of holding for each
category the offsets of its index entries.
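
The "offsets per category" variant of idea 2 could look roughly like
the Python sketch below. All names here (append_entry, read_latest,
category_offsets) are hypothetical, and the in-memory dictionary stands
in for whatever would actually be stored in the categories file. The
point is only that reading the latest 200 entries of one category
touches 200 records, however large the index is.

```python
import io


def append_entry(index_file, category_offsets, category, payload):
    """Append one entry and remember where it starts for its category."""
    index_file.seek(0, io.SEEK_END)
    offset = index_file.tell()
    # Each record is a 4-byte big-endian length followed by the payload.
    index_file.write(len(payload).to_bytes(4, "big") + payload)
    category_offsets.setdefault(category, []).append(offset)


def read_latest(index_file, category_offsets, category, limit=200):
    """Read only the newest `limit` entries of one category, newest first."""
    entries = []
    for offset in reversed(category_offsets.get(category, [])[-limit:]):
        index_file.seek(offset)
        size = int.from_bytes(index_file.read(4), "big")
        entries.append(index_file.read(size))
    return entries
```

Entries stay in the index in arrival order, so nothing needs to be
written backwards; the per-category offset list supplies the ordering.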

On second thought, the index serves two purposes. It keeps the msgID ->
message file offsets mapping, which is pretty expensive to recalculate
(scanning the message file). It also caches some useful header info for
all messages, which could be calculated on demand instead. 
So another solution would be to eliminate the msgID, and keep the
offsets directly in the categories file (so we need to move the logging
mechanism to it, for update performance and safety). This means that
nothing more than the message and category files would be needed.
The header cache function, should we find it necessary, could be done by
a separate file keeping TOC entries for the last ~1000 entries we've had
to calculate. My guess is that for most activations of Celeste, those
1000 entries would cover everything the user wants to see. If not, the
data is reparsed from the message text.
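
A rough Python sketch of that header cache, under stated assumptions:
entries are keyed by message file offset (since the msgID would be
gone), read_message is a hypothetical callable that returns the raw
message text at an offset, and the cache lives in memory only.
Persisting it to a separate file is left out.

```python
from collections import OrderedDict
from email import message_from_string


class HeaderCache:
    """Keeps TOC entries for the most recently used messages only."""

    def __init__(self, read_message, capacity=1000):
        self._read_message = read_message  # hypothetical: offset -> raw text
        self._capacity = capacity
        self._entries = OrderedDict()
        self.misses = 0  # how often headers had to be reparsed

    def toc_entry(self, offset):
        if offset in self._entries:
            # Cache hit: mark the entry as recently used.
            self._entries.move_to_end(offset)
            return self._entries[offset]
        # Cache miss: reparse the headers from the message text.
        self.misses += 1
        msg = message_from_string(self._read_message(offset))
        entry = (msg["Subject"], msg["From"], msg["To"])
        self._entries[offset] = entry
        if len(self._entries) > self._capacity:
            # Drop the least recently used entry.
            self._entries.popitem(last=False)
        return entry
```

The misses counter is there so the guess about 1000 entries being
enough can be checked against real usage.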

Whew. I hadn't planned on all that when I started writing.

Poke holes, anybody?

Daniel

"Lex Spoon" <lex at cc.gatech.edu> wrote:
> Address book handling would be really excellent.  There should at least
> be the *option* of keeping the address book locally, so it shouldn't
> *require* LDAP.  But LDAP is cool, too, if someone wants to put it
> together.  One issue is what to do if there are multiple address
> approaches floating around....
> 
> My favorite feature, by the way, is to pick up addresses and names
> automatically as email is viewed, like the "Big Brother Database (BBDB)"
> does.
> 
> Along with address book handling, come a lot of, err, opportunities to
> improve the composition window.  It's kinda lame sitting in a GUI system
> and being forced to edit messages as raw text files.
> 
> FWIW, though, even nicer than an address book would be a faster mail
> database.  Loading the entire index into memory is pretty slow right now
> -- it should only load index entries for the category I'm looking at. 
> If .unclassified. ceases being a pseudo-category, then so be it. 
> Nowadays, compacting is very fast, and .unclassified. could simply be
> built when a compaction is run.
> 
> -Lex
> 
> 
> "Stephan B. Wessels" <stephan.wessels at sdrc.com> wrote:
> > I wrote some address book code for Celeste, including a Netscape
> > address book importer, but got distracted by paying work and had to suspend
> > it.  If anyone wants to help we could probably get at least that part
> > working in a weekend.
> > 
> >  - Steve
> > 
> > Mike Rutenberg wrote:
> > 
> > > As far as I know, everyone is doing this manually right now.  I
> > > certainly am.  I want to change this though.
> > >
> > > LDAP is something I know nothing about, but might be a good option
> > > especially if you have an existing (corporate?) address book.
> > >
> > > An interface to an external address book is also a possibility.
> > >
> > > There are some very interesting options to use the message index
> > > information as a fast automatically collected database of "important"
> > > email addresses.  This was done previously by JWZ's (?) Big Brother
> > > Database for emacs mail reading.  I do this manually by using the
> > > "Participants Filter" to find the email address of my intended
> > > recipient.  I tried some experiments with this last week but have not
> > > finished it.
> > >
> > > Mike