Set versus IdentitySet

Thu Jan 24 14:54:46 UTC 2002

y
n Thu, 24 Jan 2002, Scott A Crosby wrote:

> On Thu, 24 Jan 2002, Bijan Parsia wrote:
> 
> > On Wed, 23 Jan 2002, Scott A Crosby wrote:
> >
> > > On Wed, 23 Jan 2002, Bijan Parsia wrote:

We're writing a lot, n'est pas?

> > > And if a class doesn't have its own #hash or you use IdentitySet, and you
> > > don't have my identityHash patch[*], and you have more than 4000 documents
> > > in any intermediate set, you'll spend 99% of your time in Set>>scanFor:
> >
> > Hmm. That's painful right? Why the search for From took so long?
> 
> I only conjecture so.

Me too :)

> > Not having 4000 documents in any intermediate set is not a *terrible*
> > limitation :)
> 
> The real limit is about 3000. And if you do get that large, the symptom is
> *severe* performance degradation without my patch.  So far, if somethings
> very slow, its either that or its String>>hash that are to blame.

Yes, so it seems. Still, for a lot of Squeak tasks, this isn't so bad.

> A more serious question is whether to use Set or IdentitySet. With
> IdentitySet, If you can't regenerate documents #== the origionals, then
> you can't remove them. With Set, if you can generate documents #hash and
> #= the origionals, you can remove them.

Right, but I think you're hitting problems due to popping actual strings
in the found sets. If you're using some sort of key into a large text
files or set of text files, it's a bit easier. In this respect, it's
becoming *really* worth it to generalize Celeste's DB.

[snip]
> di add: #Scott
> di remove: #Scott
> 
> will work, because symbols are guarenteed to be unique.

Yep, as will something like:

di add: 34980.
di remove: 34980.

Which is, after all, the kind of thing Celeste uses for keys.

> --
> But, like strings,
> 
> di add: (Array with: aClass with: aSelector).
> di remove: (Array with: aClass with: aSelector).
> 
> will not work.

Absolutely, which is why one should use something else ;) Either a set for
the collection or some trick on the key. I've not profiled, but I suspect
that the Array solution is going to be undesirable from a memory
standpoint. I.e., if you used a fused class/selector symbol (e.g., (aClass
name, '>>', aSelector) asSymbol) then you avoid a buncha problems. (in
exchange for having to unfuse the symbol, etc.). I'll be trying some of
these out.

[snip]
> If you use Set, then all of these work as you expect.
> 
> Generally, always use Set, unless you are guarenteed that the documents
> you index are unique. (For example, They're already in memory, or will
> never be remade, or.)
> 
> In theory, (IE, assuming very fast custom hash function), Sets give you
> the ideal performance for large collections, and nearly ideal for
> small collections, even without my identityHash patch.

Ok. Do need to watch for needlessly duplicated objects. E.g.,
MethodDictionary>>scanFor: has a fair number of distinct words (and
it's hardly a *large* method). If you pop a distinct 
	Array with: MethodDictionary with: #size
into each word's set, you're going to have a fair bit of entirely
pointless bloat. The identity distinct arrays are serving no purpose.
> Use IdentitySet only if your documents satisfy uniqueness, and you don't
> expect more than 2k of them in any intermediate set (without my patch),
> or 8-20k of them (with my patch).
[snip]
> > Unlikely that one can improve large collection handling? Or unlikely that
> > standard techniques can do a lot ("standard techniques" == avoiding big
> > nasty collections :))
> 
> Unlikely that there are any other magic bullets that give big gains. :)

Custom hashs and the like? I don't think there are magic bullets per se ;)

> Other than the identityHash patch (which will probably work OK up to
> 40k-100k or so objects), The other latent problems are slow or nonexistant
> #hash functions.

Aha!

> Those will need to be fixed on a case-by-case basis, for
> example, by caching the hash value.

Right, but I'm already into the case-by-case situation :)

Cheers,
Bijan Parsia.