Subclassing Engines (was Re: Moore's law and why persistence may not be necessary. (fwd))

Wed Jan 23 21:43:10 UTC 2002

On Wed, 23 Jan 2002, Scott A Crosby wrote:

> On Wed, 23 Jan 2002, Bijan Parsia wrote:
> 
> > On Wed, 23 Jan 2002, Bijan Parsia wrote:
> >
> > After indexing 93.7 meg messages & gc.
> 
> 26mb to index 93mb of mail messages. :)   Sweet.

The ReferenceStreamed DocumentIndex comes in at 18.6 megs, fwiw.

> > self anyOf: #(bijan scott mark) finds 3641 messages in 131 milliseconds
> > (woohoo!)
> 
> Sweet!

Indeed. Good show.

Note that this is with a fair number of the headers.

> > self anyOf: #(from) finds 19868 messages (all of them, I think) in 13359
> > milliseconds (copying the set?? why copy the set? ah, something with the
> 
> Why did it copy? Because this was a pre-alpha release that didn't have
> that special case added in yet.  :)

Yes :)

> But first, let's fix the problem that that is an indication of. Rebuilding
> a set that size should take no more than hundreds of milliseconds.

I figured.

> You *may* be hitting the identityHash problem, HARD, with building a set
> that size. (That is a problem where if an object is stored in a Set, but
> uses identityHash for the hash code, there's a severe performance
> degradation.)
> 
> To fix, try:
>   1. include my identityHash patches from a few months ago
>   2. implement a *cheap*[*] hash function and copy&paste the engine code
> to use Set instead of IdentitySet.
>   3. implement a *cheap*[*] hash function and wait for the next version
> where you can choose the class it uses for sets. (IdentitySet, Set, or
> maybe WeakSet)

I'll give these a shot at some point, after...

> But most importantly, PROFILE!!

Yes :)
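
A minimal workspace sketch of the identityHash problem described above,
assuming only stock Squeak classes. The 20000 figure is just chosen to be
close to the message count in this thread, and plain Objects stand in for
whatever the engine really stores. If the number of distinct identityHash
values comes out far below the number of objects, then many elements must
share the same hash, and a large IdentitySet will spend its time resolving
collisions.

```smalltalk
| objects distinctHashes |
"Stand-in data: plain Objects, not the real index entries."
objects := (1 to: 20000) collect: [:i | Object new].
"Count how many different identityHash values those objects produce."
distinctHashes := (objects collect: [:each | each identityHash]) asSet size.
Transcript show: 'objects: ', objects size printString; cr.
Transcript show: 'distinct identityHash values: ', distinctHashes printString; cr.
```
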

> [*] Use profiling.. See below. I've had versions that spent 95% of their
> time in String>>hash.
> 
> PS. I suspect that you may hit quite a few issues in Collections.
> Excluding the symbol table, the stock image doesn't have ONE set of over
> 400 elements. You're building many sets with hundreds to thousands to tens
> of thousands of elements. :)

Yep. It's a great test and showcase for your speedups. Celeste is a
perfectly sane app for these kinds of things.

> > I don't really want to run a full participantHas: search :)
> >
> 
> Huh?

Use the normal Celeste filtering code. Hmm. It wasn't nearly as bad as I
thought:

In a MailDB inspector:

Time millisecondsToRun: [(indexFile keys select:
			[:id | (self getTOCentry: id)
			participantHas: #(bijan mark scott)]) inspect]

Takes 1933 milliseconds to find 2249 records.

Still, a factor of 14 speedup. Not bad. Note I'm comparing stuff in the same
VM/Image setup.

[snip]
> Not surprised for two reasons:
>   1. Most of the index is references, not binary literals.
>   2. You may be hurt by the identityHash problem.

Interesting.

> > suspect I should have just put message ids in the document
> > index. IndexFile already compactly serializes the index, and I *really*
> 
> Huh?

I'm sticking references to an IndexFileEntry in the dbase. They could as
easily be message ids, which are SmallIntegers.
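
A rough way to compare the two choices in a workspace, as a sketch only:
plain Objects stand in for the IndexFileEntry references, the integers
1 to 20000 stand in for message ids, and the set classes are the stock
ones rather than whatever the engine ends up using. Timings will vary by
VM and image, so none are given here.

```smalltalk
| entries ids refSet idSet tRefs tIds |
"Stand-ins: Objects play the role of entry references,
 small integers play the role of message ids."
entries := (1 to: 20000) collect: [:i | Object new].
ids := (1 to: 20000) asArray.
"Build a set of object references, hashed by identityHash."
tRefs := Time millisecondsToRun: [
	refSet := IdentitySet new.
	entries do: [:each | refSet add: each]].
"Build a set of integers, hashed by value."
tIds := Time millisecondsToRun: [
	idSet := Set new.
	ids do: [:each | idSet add: each]].
Transcript show: 'references: ', tRefs printString, ' ms'; cr.
Transcript show: 'message ids: ', tIds printString, ' ms'; cr.
```
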

> > only need the ids, I imagine. Might speed up the set copying/merging
> > too. Advice is welcome!
> 
> First advice: PROFILE PROFILE PROFILE

Of course :)

> MessageTally spyOn:
>    [Morph selectorsDo: [ :method | di add: ((Morph sourceCodeAt: method)
>      asString)]].
> 
> See where it's spending its time, then either solve it yourself (and tell
> me what you did). Try my identityHash patches. Or, send me the

Yep.
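
The same tool can be pointed at the slow query instead of at indexing. A
sketch, assuming it is evaluated in the same inspector as the anyOf:
queries above, so that self is the document index:

```smalltalk
"Profile the query that took 13 seconds, to see whether the time
 really goes into copying the set."
MessageTally spyOn: [self anyOf: #(from)]
```
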

Still very nice. I might do up the headers stuff first :)

I'll definitely want a, er, somewhat smaller test set to play with :)

Cheers,
Bijan Parsia.