Subclassing Engines (was Re: Moore's law and why persistence may not be necessary. (fwd))

Scott A Crosby crosby at qwes.math.cmu.edu
Wed Jan 23 21:07:56 UTC 2002


On Wed, 23 Jan 2002, Bijan Parsia wrote:

> On Wed, 23 Jan 2002, Bijan Parsia wrote:
>
> After indexing 93.7 meg messages & gc.
>

26 MB to index 93 MB of mail messages. :)   Sweet.

>
> self anyOf: #(bijan scott mark) finds 3641 messages in 131 milliseconds
> (woohoo!)

Sweet!

>
> self anyOf: #(from) finds 19868 messages (all of them, I think) in 13359
> milliseconds (copying the set?? why copy the set? ah, something with the

Why did it copy? Because this was a pre-alpha release that didn't have
that special case added yet.  :)

But first, let's fix the underlying problem that this points to. Rebuilding
a set that size should take no more than hundreds of milliseconds.

You *may* be hitting the identityHash problem, HARD, with building a set
that size. (That is a problem where if an object is stored in a Set, but
uses identityHash for the hash code, there's a severe performance
degradation.)
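
A rough way to check whether identityHash is the culprit is to time the two
set classes on the same elements. This is only a sketch: the element count
(20000) and the printed integers used as stand-in keys are assumptions, not
anything taken from the index code. As far as I know, identityHash has only
12 bits in this object format (4096 distinct values), which is why a large
IdentitySet would collide heavily.

	| strings |
	"Stand-in keys; substitute the objects the index really stores."
	strings := (1 to: 20000) collect: [:i | i printString].
	Transcript
		show: 'Set: ', (Time millisecondsToRun: [Set new addAll: strings; yourself]) printString, ' ms'; cr;
		show: 'IdentitySet: ', (Time millisecondsToRun: [IdentitySet new addAll: strings; yourself]) printString, ' ms'; cr.

If the IdentitySet figure is far larger than the Set figure, the hash
collisions are the likely cause.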

To fix, try one of:
  1. include my identityHash patches from a few months ago
  2. implement a *cheap*[*] hash function and copy & paste the engine code
     to use Set instead of IdentitySet.
  3. implement a *cheap*[*] hash function and wait for the next version,
     where you can choose the class it uses for sets. (IdentitySet, Set, or
     maybe WeakSet)
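
For options 2 and 3, a cheap hash could look something like the sketch below.
The class name IndexedMessage and the instance variable serialNumber are
hypothetical: the idea is just that each message gets a distinct small integer
once, at indexing time, and hash returns it without touching any strings.

	IndexedMessage>>serialNumber: anInteger
		"Set once, when the message is added to the index (hypothetical)."
		serialNumber := anInteger

	IndexedMessage>>hash
		"Cheap: no string hashing, just a precomputed small integer."
		^ serialNumber

Since = is left as the inherited identity comparison, equal objects still
have equal hashes, so a plain Set keeps working.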

But most importantly, PROFILE!!

[*] Use profiling; see below. I've had versions that spent 95% of their
time in String>>hash.

PS. I suspect that you may hit quite a few issues in Collections.
Excluding the symbol table, the stock image doesn't have ONE set of over
400 elements. You're building many sets with hundreds to thousands to tens
of thousands of elements. :)

>
> I don't really want to run a full particpantHas: search :)
>

Huh?

> I'm trying to write out the index to disk using ReferenceStream along the
> following lines:
>
> 	|rr|
> 	rr _ ReferenceStream fileNamed: 'EMAIL.fulltext'.
> 	rr nextPut: self.
> 	rr close.
>
> At a 13 meg index this is going *painfully* slowly. I'm up to 143030

Not surprised for two reasons:
  1. Most of the index is references, not binary literals.
  2. You may be hurt by the identityHash problem.

> I suspect I should have just put message ids in the document
> index. IndexFile already compactly serializes the index, and I *really*

Huh?

> only need the ids, I imagine. Might speed up the set copying/merging
> too. Advice is welcome!

First advice: PROFILE PROFILE PROFILE

MessageTally spyOn:
   [Morph selectorsDo: [ :method | di add: ((Morph sourceCodeAt: method)
     asString)]].

See where it's spending its time, then either solve it yourself (and tell
me what you did), try my identityHash patches, or send me the
MessageTally output.
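
The same approach can be pointed straight at the slow query. A minimal
sketch, assuming engine is a (hypothetical) workspace variable holding the
index that answered the anyOf: queries above:

	"Profile the 13-second query rather than the indexing step."
	MessageTally spyOn: [engine anyOf: #(from)].

The resulting tally shows which methods the time goes to, for example
whether it lands in set copying or in hashing.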

Scott



