identityHash patch discussion.

Andres Valloud sqrmax at prodigy.net
Tue Dec 11 02:24:15 UTC 2001


Hi Scott,

> The problem is that those bits are used as-is, to construct a number
> in the *specific* range 0...4095, which is used as the starting index
> in Set>>scanFor: This flaw means that *any* set and *any* dictionary
> that has 4000 elements in it will devolve into a linear search
> through all of the elements, when it could be 4000 times faster.

Wouldn't this only happen for IdentitySets and IdentityDictionaries? 
Plain Sets and Dictionaries use #hash, which usually provides a
SmallInteger's worth of hash bits.  And of course, should that not be
enough, it's easy to extend #hash into the LargePositiveIntegers.
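
For instance, something along these lines on a hypothetical class with
#x and #y accessors (nothing in particular from the image) would answer
hash values well beyond 12 bits, since integer arithmetic overflows into
LargePositiveIntegers automatically:

    hash
        "Sketch only: combine the field hashes; the multiplication may
        push the result past SmallInteger maxVal, which is fine since
        \\ works on LargePositiveIntegers too."
        ^ (self x hash * 16r10001) bitXor: self y hash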

> This is *not* a problem in there being only 12 bits in the object header,
> but that those 12 bits are interpreted as values in the *specific* range
> 0...4095.
> 
> My patch just multiplies that 12-bit default-hash number by a constant, to
> spread the numbers out to avoid this horrible unscalability. The hard part
> is getting it into the image, because you have to rebuild all existing
> hash tables.

It is not very difficult to change #hash as long as you don't modify
#identityHash.

At any rate, when it comes to the modular arithmetic hashed collections
use to map hash values into an array, it doesn't matter much whether you
multiply the values by constants or apply similar transformations.  After
all, once \\ sizeOfArray (plus 1, for 1-based indexing) is applied, all
those values are constrained to the range 1..sizeOfArray anyway.
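
In workspace terms (the table size here is made up, and this is only a
sketch of the index computation, not the actual Set>>scanFor: code):

    | tableSize |
    tableSize := 521.
    #(3 4099 123456789) collect: [:h | (h \\ tableSize) + 1]
        "whatever their magnitude, the indices all land in 1..tableSize"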

If my memory serves me right, when the 4096 possible original
#identityHash values are well distributed, after being multiplied by a
constant the results will also be well distributed over a larger range. 
Moreover, they will also be well distributed when reduced \\ anInteger. 
This works only if the constant is chosen carefully, such as a prime
that shares no factor with sizeOfArray.  It doesn't even have to be a
big prime: any integer invertible in the corresponding modular
arithmetic will do.  So for 4096, even 3 would be fine.  For 2^32 - 1, 2
would be just as good as the largest known Mersenne prime.
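
A quick workspace check of that claim (the modulus and the constant are
just examples):

    | modulus constant images |
    modulus := 4096.
    constant := 3.   "odd, hence invertible modulo 4096"
    images := ((0 to: modulus - 1)
        collect: [:h | h * constant \\ modulus]) asSet.
    images size = modulus
        "true: the constant merely permutes the residues"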

Regardless of the size of the constant, if everything worked perfectly,
you would achieve at most a permutation of the #identityHash values. 
Thus, as far as distribution goes, nothing is really improved.

I think that in this kind of situation, it's easier to use Sets and
Dictionaries instead of their identity counterparts, so you can take
advantage of #hash.  Then you don't have to do anything special to get
better lookup performance.
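
For example (a minimal sketch; the keys and sizes are arbitrary):

    | d |
    d := Dictionary new.   "keys compared with #=, spread by #hash"
    1 to: 10000 do: [:i | d at: i printString put: i].
    d at: '5000'   "stays fast; String>>hash is not limited to 12 bits"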

I don't have all the previous discussion here.  I'd be interested to
read about something that is inefficient because of the short
#identityHash and that can't be fixed without using a larger
#identityHash.

Andres.



