Celeste "empty trash" speeds things up!

Lex Spoon lex at cc.gatech.edu
Thu Apr 7 13:19:46 UTC 2005


Ron Jeffries <ronjeffries at acm.org> wrote:
> I presently have 137,396 emails stored in my email client, The Bat!,
> basically everything I've received since April 15, 2002. It's sorted
> into about 20 folders, including the most generic one at over 40,000
> mails. The Bat! can switch to the 40,000 mail folder and get the
> list up in about 3 seconds. It can search all 40,000 files for header
> information in 3 seconds, and for content in the mail text in about 2
> minutes.
> 
> If Celeste is slow with substantially fewer mails than that, then I
> would most respectfully suggest that there's something in its design
> that is open to improvement. Sure, The Bat! is probably written in
> C++ or something, but we all know that high-volume speed comes from use of
> intelligent data structures, not low-level language. Don't we?
> 
> Again ... now that we know fast email is possible with large
> volumes, the question becomes ... how might we do that, if we cared?

Hang on -- Celeste actually does open large databases in a matter of
seconds.  However, for the user there is a significant difference
between half a second and 3 seconds.  I'm not sure, from the available
data, whether Celeste is slower or faster than the system you describe,
but it does sound like Celeste gets more functionality by avoiding
separate folders.

Additionally, note that in normal usage Celeste does not need to load
the index file very often.  Instead, you can keep an open Celeste window
in a Squeak image somewhere.  When you restart the image, Celeste will
not reload the index unless it has changed on disk.  Thus, in practice
you only see Celeste *saving* -- which is much faster.  It is faster
still if you avoid saving the entire index file and instead append to it
when possible -- this was done in some versions of Celeste, but I don't
know whether the current one does it.
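
To make the two tricks above concrete (reload only when the file changed
on disk, and append instead of rewriting), here is a minimal sketch in
Python rather than Squeak.  It is not Celeste's actual code: the class
name IndexCache, the one-entry-per-line file layout, and the use of
modification time plus size as the "has it changed?" test are all
assumptions made for illustration.

```python
import os


class IndexCache:
    """Hypothetical cache that re-reads an index file only when it changed on disk."""

    def __init__(self, path):
        self.path = path
        self.entries = []
        self._stamp = None  # (mtime_ns, size) seen at the last load or save

    def _disk_stamp(self):
        st = os.stat(self.path)
        return (st.st_mtime_ns, st.st_size)

    def load_if_changed(self):
        """Re-read the file only if its stamp differs; return True if it was re-read."""
        stamp = self._disk_stamp()
        if stamp == self._stamp:
            return False
        with open(self.path, encoding="utf-8") as f:
            # Assumption: one index entry per line, no embedded newlines.
            self.entries = [line.rstrip("\n") for line in f]
        self._stamp = stamp
        return True

    def append(self, entry):
        """Append one entry to the file instead of rewriting the whole index."""
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(entry + "\n")
        self.entries.append(entry)
        # Remember our own write so it does not trigger a reload later.
        self._stamp = self._disk_stamp()
```

With this shape, a long-lived window calls load_if_changed() on startup
and normally pays nothing, and each new message costs one short append.
A real implementation would still need some way to compact the file
after deletions, which this sketch leaves out.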

These objections aside, Celeste can certainly be faster.  This message
tosses out a few thoughts for anyone interested in trying to speed this
up, and then explains why I think folders are obsolete and why I hope we
can keep performance high enough that we can stay away from them.


The main problem with Celeste's loading speed is that no one cares
enough to redesign the format and risk losing some email.  I have
rewritten the Celeste backend a few times -- and most of those
efforts succeeded in making it faster (the notable exceptions
were the btree-based formats -- go figure!).  Even the successful ones
did not get adopted, however.  No one seemed to care enough about the
speed increase to warrant risking their email.  Maybe a future effort
should make the faster format optional, at least for a few years
while it is tested out.


If someone wants to try, though, the situation is that the index file is
stored in a text format.  Reading it requires parsing a large pile of
decimal numbers.  It should be possible to speed up the reading of the
decimal numbers, but I submit that it would be better to design a new
format that does not require parsing decimals!  To this end, it may be a
good idea to transpose the index file, so that instead of looking like
this:

	message ID #1
	subject #1
	from #1
	date #1
	message ID #2
	subject #2
	from #2

it would look like this:

	message ID #1
	message ID #2
	...
	from #1
	from #2
	...

etc.  This should allow for fast loading techniques, especially for the
integer data, because loading long sequences of homogeneous data seems
likely to be fast.  A secondary advantage of this transposed format is
that you can now extend the format.  Simply add a "table of contents" at
the front of the file, and you are done.  The ability to extend the
index file would open the door to several new features that have thus
far died on the vine.
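
As a rough illustration of the transposed layout with a table of
contents, here is a small sketch in Python rather than Squeak.
Everything specific in it is an assumption made up for the example and
not an existing Celeste format: the "CIDX" signature, the header and
table-of-contents layout, the 16-byte column names, and the choice of
64-bit little-endian integers.  Integer columns are stored as raw
binary, so reading them involves no decimal parsing at all.

```python
import struct
import sys
from array import array

MAGIC = b"CIDX"                        # hypothetical file signature
HEADER = struct.Struct("<II")          # column count, row count
TOC_ENTRY = struct.Struct("<16scQQ")   # name, kind, byte offset, byte length


def _ints_to_bytes(values):
    arr = array("q", values)           # signed 64-bit integers
    if sys.byteorder == "big":
        arr.byteswap()                 # always store little-endian
    return arr.tobytes()


def _bytes_to_ints(raw):
    arr = array("q")
    arr.frombytes(raw)                 # one bulk copy, no per-number parsing
    if sys.byteorder == "big":
        arr.byteswap()
    return arr.tolist()


def write_index(path, columns):
    """columns: dict of column name -> list of ints or list of strs.

    Assumes ASCII names of at most 16 bytes and no newlines inside strings.
    """
    blocks = []
    for name, values in columns.items():
        if values and isinstance(values[0], int):
            blocks.append((name, b"i", _ints_to_bytes(values)))
        else:
            blocks.append((name, b"s", "\n".join(values).encode("utf-8")))
    nrows = len(next(iter(columns.values()), []))
    offset = len(MAGIC) + HEADER.size + TOC_ENTRY.size * len(blocks)
    with open(path, "wb") as f:
        f.write(MAGIC)
        f.write(HEADER.pack(len(blocks), nrows))
        for name, kind, data in blocks:          # the table of contents
            f.write(TOC_ENTRY.pack(name.encode("ascii"), kind, offset, len(data)))
            offset += len(data)
        for _, _, data in blocks:                # the column data itself
            f.write(data)


def read_index(path, wanted=None):
    """Load every column, or only the column names listed in `wanted`."""
    result = {}
    with open(path, "rb") as f:
        if f.read(len(MAGIC)) != MAGIC:
            raise ValueError("not an index file")
        ncols, nrows = HEADER.unpack(f.read(HEADER.size))
        toc = [TOC_ENTRY.unpack(f.read(TOC_ENTRY.size)) for _ in range(ncols)]
        for raw_name, kind, offset, length in toc:
            name = raw_name.rstrip(b"\0").decode("ascii")
            if wanted is not None and name not in wanted:
                continue                         # skip columns we do not need
            f.seek(offset)
            data = f.read(length)
            if kind == b"i":
                result[name] = _bytes_to_ints(data)
            elif nrows == 0:
                result[name] = []
            else:
                result[name] = data.decode("utf-8").split("\n")
    return result
```

The table of contents is what makes the format extensible in this
sketch: a newer writer can add a column, and an older reader that asks
only for the names it knows will skip the new one without noticing.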

Alternatively, we could try using a real database.  I think we should
put this off, though.  A database adds impedance over a simple "array of
index entries".  As long as the index can be loaded and saved reasonably
quickly, it seems best to continue using that approach.

Finally, keep in mind that Celeste does not use folders.  Instead, it
saves most of a user's messages into one database.  So far, the
performance has been high enough that there is no reason to go back to
using individual folders like most email readers do.  Programs like pine
and mutt use separate folders because they use the standard Unix mail
format, which does not include an index.  Celeste has its own format,
which includes an index, and thus it can still load and save reasonably
quickly even when there are tens of thousands of messages in the
database.

I find the change to a single database to be one of the best selling
points of Celeste.  I no longer have to think about questions like "do I
save this email in the 'squeak' folder or the 'guzdial' folder?"  I no
longer have to think "now which folder did I save that email from Mark
into...?"  If I want messages from Mark, I just ask my mail reader to
filter for messages from Mark.  I would hate to go back.  Celeste has
"categories" for diehards who just can't live without folders, but I
find the only categories I really use are essential workhorses like "new",
".tosend.", and ".sent.".  Beyond those, categories just aren't useful,
and on top of that, they require mental work to keep up to
date.  Why should I spend time maintaining a categorization that is not
useful?
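
For anyone who has not used a folderless reader, the "filter for
messages from Mark" idea boils down to a query over the one in-memory
index.  The following is only a sketch in Python, not Celeste's filter
mechanism; the IndexEntry fields, the filter_messages function, and the
sample addresses are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical index record; the real index holds more fields."""
    msg_id: int
    sender: str
    subject: str


def filter_messages(index, sender=None, subject=None):
    """Return entries whose fields contain the given substrings, ignoring case."""
    def matches(value, needle):
        return needle is None or needle.lower() in value.lower()

    return [e for e in index
            if matches(e.sender, sender) and matches(e.subject, subject)]


index = [
    IndexEntry(1, "Mark <mark@example.org>", "Squeak class notes"),
    IndexEntry(2, "Ron <ron@example.org>", "Fast email"),
    IndexEntry(3, "Mark <mark@example.org>", "Lunch on Friday"),
]
from_mark = filter_messages(index, sender="mark")
```

A linear scan like this touches every entry, which is fine as long as
the whole index is already in memory; nothing has to be filed in
advance for the query to work.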


Lex


PS -- there's been discussion of this in the archive in the past.  People
who are really interested should dig some of it up.



More information about the Squeak-dev mailing list