identityHash patch discussion.

Scott A Crosby crosby at qwes.math.cmu.edu
Tue Dec 11 04:03:17 UTC 2001


On Tue, 11 Dec 2001, Stephan Rudlof wrote:

> Andres Valloud wrote:
> >
> > Hi Scott,
> >
> > > The problem is that those bits are used as-is, to construct a number
> > > in the *specific* range 0...4095, which is used as the starting index
> > > in Set>>scanFor: This flaw means that *any* set and *any* dictionary
> > > that has 4000 elements in it will devolve into a linear search
> > > through all of the elements, when it could be 4000 times faster.
> >
> > Wouldn't this only happen for IdentitySets and IdentityDictionaries?
>
> No, there it is repaired:
>
> >From a previous mail of mine:
> -----------------------
> This problem in
> Set>>scanFor: anObject
>         "Scan the key array for the first slot containing either a nil
> (indicating
> an empty slot) or an element that matches anObject. Answer the index of that
> slot or zero if no slot is found. This method will be overridden in various
> subclasses that have different interpretations for matching elements."
>         | element start finish |
>         start _ (anObject hash \\ array size) + 1.
>         finish _ array size.
>         ...
>



> is repaired in
> IdentitySet>>scanFor: anObject
>         | finish hash start element |
>         finish _ array size.
>         finish > 4096
>                 ifTrue: [hash _ anObject identityHash * (finish // 4096)]
>                 ifFalse: [hash _ anObject identityHash].
>         start _ (hash \\ array size) + 1.
>         ...
> -----------------------

Actually, this second version has that special case, the multiplication by
'(finish // 4096)' in one side of the branch. this is a workaround. By
having 'anObject identityHash' be a multiple of 'anObject hashBits', I
avoid this special case, and I can use the first version of the code
everywhere.

IE, this second case is a hack to workaround the flaw that I'm fixing.
With my patch, its now no longer necessary, and we can use hte first
version everywhere.

>
> But this repair doesn't work well for very large sets, since the spreading
> of the identity hash leads to big chunks which have to be searched linearly.
>

The second version works for everything with no problem.

> After many mails and some thinking ;-)
> The last suggestion of Scott in
> 	Subject: [FIX] Re: [FIX] identityHash and fixing collections. (repost)
> 	Date: Sun, 9 Dec 2001 19:18:52 -0500 (EST)
> to use a factor of $p=135859$ for multiplying the hashbits solves this
> problem for Sets up to a size of about 100k regarding objects using
> identityHash as hash with a similar effect as the repair (for IdentitySets)
> above.

For $p=ANYTHING$, it solves the problem as long as the array is smaller
than $p*4096$, AND, the array size is not a multiple of p.

IE, p=5 is fine, as long as we don't store more than 20480 things in the
set, and the array size is not a multiple of 5. But, 5 is a small prime,
so this risk of getting a multiple of it by accident is high. Ergo, I
choose a larger prime.

>
> For sets sized above the 100k limit a better hash function is needed in
> every case, since further spreading decreases performance too much.
>

A better hash function should be used, because with 100k, you expect to
have to do about 25 probes to check membership. (without my patch, you'd
expect 50k probes). But, 25, although slower than the 1.5 probes needed by
a better hash function, is still very fast compared to a 100k linear
search. :)

Scott






More information about the Squeak-dev mailing list