Set versus IdentitySet

Scott A Crosby crosby at qwes.math.cmu.edu
Thu Jan 24 14:27:31 UTC 2002


On Thu, 24 Jan 2002, Bijan Parsia wrote:

> On Wed, 23 Jan 2002, Scott A Crosby wrote:
>
> > On Wed, 23 Jan 2002, Bijan Parsia wrote:
> >
> > And if a class doesn't have its own #hash or you use IdentitySet, and you
> > don't have my identityHash patch[*], and you have more than 4000 documents
> > in any intermediate set, you'll spend 99% of your time in Set>>scanFor:
>
> Hmm. That's painful right? Why the search for From took so long?
>

I only conjecture so.

> Not having 4000 documents in any intermediate set is not a *terrible*
> limitation :)

The real limit is about 3000. And if you do get that large, the symptom is
*severe* performance degradation without my patch.  So far, if somethings
very slow, its either that or its String>>hash that are to blame.

A more serious question is whether to use Set or IdentitySet. With
IdentitySet, If you can't regenerate documents #== the origionals, then
you can't remove them. With Set, if you can generate documents #hash and
#= the origionals, you can remove them.

This distinction is key. IdentitySet is not generally applicable.

--
For example:

di _ ExactDocumentIndex withAdaptor: (SimpleTextIndexAdaptor new)
                        withSet: IdentitySet.

di add: 'Scott'
di remove: 'Scott'

will not remove 'Scott', because   'Scott' ~~ 'Scott'.

--
While.
x := 'Scott'

di add: x.
di remove: x.

will work, because   x == x == 'Scott'

--

di add: #Scott
di remove: #Scott

will work, because symbols are guarenteed to be unique.

--
But, like strings,

di add: (Array with: aClass with: aSelector).
di remove: (Array with: aClass with: aSelector).

will not work.

--

If you use Set, then all of these work as you expect.

Generally, always use Set, unless you are guarenteed that the documents
you index are unique. (For example, They're already in memory, or will
never be remade, or.)

In theory, (IE, assuming very fast custom hash function), Sets give you
the ideal performance for large collections, and nearly ideal for
small collections, even without my identityHash patch.

Use IdentitySet only if your documents satisfy uniqueness, and you don't
expect more than 2k of them in any intermediate set (without my patch),
or 8-20k of them (with my patch).


> > > Scott feel confident that one can improve large collection handling; even
> > > if not, standard techniques can do a lot).
> >
> > Unlikely.
>
> Unlikely that one can improve large collection handling? Or unlikely that
> standard techniques can do a lot ("standard techniques" == avoiding big
> nasty collections :))

Unlikely that there are any other magic bullets that give big gains. :)

Other than the identityHash patch (which will probably work OK up to
40k-100k or so objects), The other latent problems are slow or nonexistant
#hash functions. Those will need to be fixed on a case-by-case basis, for
example, by caching the hash value.

Scott




More information about the Squeak-dev mailing list