Set versus IdentitySet

Scott A Crosby crosby at qwes.math.cmu.edu
Thu Jan 24 15:53:36 UTC 2002


On Thu, 24 Jan 2002, Bijan Parsia wrote:

> Thu, 24 Jan 2002, Scott A Crosby wrote:
>
>
> Right, but I think you're hitting problems due to popping actual strings
> in the found sets. If you're using some sort of key into a large text
> files or set of text files, it's a bit easier. In this respect, it's
> becoming *really* worth it to generalize Celeste's DB.

Correct. Keys are good, but only if their equivalence is well defined.

di add: 'long text 1' asKey.
di remove: 'long text 1' asKey.

Are the two key arguments #= or #== to each other?  Can you determine #= quickly?
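
To make the distinction concrete: two separately built strings are #= but
not #==, so a plain Set will match them but an IdentitySet will not.

   | a b |
   a := 'long text 1' copy.
   b := 'long text 1' copy.
   a = b.    "true  -- same characters"
   a == b.   "false -- distinct objects"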

> Yep, as will something like:
>
> di add: 34980.
> di remove: 34980.
>
> Which is, after all, the kind of thing Celeste uses for keys.
>

Yep

> Absolutely, which is why one should use something else ;) Either a set for
> the collection or some trick on the key. I've not profiled, but I suspect
> that the Array solution is going to be undesirable from a memory
> standpoint. I.e., if you used a fused class/selector symbol (e.g., (aClass
> name, '>>', aSelector) asSymbol) then you avoid a buncha problems. (in

That'll *hurt*. You'll spend about 98% of your time in
String(Symbol)>>hash or String(Symbol)>>copy:from:to:. You'll also pay a
*lot* of memory for the symbols.

The ',' selector is not cheap when used to concatenate more than a couple
of things; each send builds a whole new String and copies both operands into it.
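
Spelled out with a concrete class and selector, your fused-key expression
does roughly this much work per method (the comments are the point):

   | aClass aSelector key |
   aClass := MethodDictionary.
   aSelector := #scanFor:.
   key := (aClass name , '>>' , aSelector) asSymbol.
       "name     -> the class's name
        , twice  -> two intermediate Strings, copied character by character
        asSymbol -> hashes the whole String and looks it up in (or adds it
                    to) the Symbol table"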

The best bet would be a custom class that stores the selector Symbol, the
class object, and a hash slot that caches the combined hash of the two.
You then have the engine use Sets to index these.

Foo>>= other
  ^classRef == other classRef and: [selector == other selector]

Does such a class exist already?

If not, I'll write it and include it in my next update. (It'll be public,
and out sometime later today or tomorrow.)
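
Roughly what I have in mind, filling in the Foo above (the class name and
the hash mixing are just placeholders):

Object subclass: #ClassSelectorKey
    instanceVariableNames: 'classRef selector cachedHash'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Indexing-Keys'

ClassSelectorKey class>>forClass: aClass selector: aSymbol
    ^self new setClass: aClass selector: aSymbol

ClassSelectorKey>>setClass: aClass selector: aSymbol
    classRef := aClass.
    selector := aSymbol.
    "Classes and Symbols are unique objects, so identityHash is enough
     and costs nothing compared to hashing a String."
    cachedHash := aClass identityHash bitXor: aSymbol identityHash

ClassSelectorKey>>classRef
    ^classRef

ClassSelectorKey>>selector
    ^selector

ClassSelectorKey>>hash
    "Precomputed, so Set lookups never touch String(Symbol)>>hash."
    ^cachedHash

ClassSelectorKey>>= other
    ^self class == other class
        and: [classRef == other classRef
            and: [selector == other selector]]

The engine then indexes these with plain Sets, e.g.
aSet add: (ClassSelectorKey forClass: MethodDictionary selector: #scanFor:).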


> >
> > In theory, (IE, assuming very fast custom hash function), Sets give you
> > the ideal performance for large collections, and nearly ideal for
> > small collections, even without my identityHash patch.
>

> Ok. Do need to watch for needlessly duplicated objects. E.g.,
> MethodDictionary>>scanFor: has a fair number of distinct words (and
> it's hardly a *large* method). If you pop a distinct
> 	Array with: MethodDictionary with: #size
> into each word's set, you're going to have a fair bit of entirely
> pointless bloat. The identity distinct arrays are serving no purpose.

If you're doing:

   di add: (Array with: MethodDictionary with: #size).

No bloat. You're using those 12-byte Arrays as the Documents, and the same
Array is shared by every term contained in that document. Your adaptor
takes one of those in, reads the source for that method, and spits out the
terms.
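
Concretely (the term list is made up, and an IdentitySet per term is just
one way to do it):

   | doc index |
   doc := Array with: MethodDictionary with: #scanFor:.   "one shared Document"
   index := Dictionary new.
   #(start size hash) do: [:term |
       (index at: term ifAbsentPut: [IdentitySet new]) add: doc].
   "All three term sets now hold the very same Array; nothing is copied
    per term."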

Though, you won't want to actually use Arrays, because you'll pay all your
time in hashing or string comparison.

What do you mean by identity distinct arrays?


Scott
