Hash rehashed (was: [SqF]Report of VI4 Project for Feb '02)

Martin McClure martin at hand2mouse.com
Sun Feb 3 02:49:00 UTC 2002


At 4:36 PM -0500 2/2/02, Scott A Crosby wrote:
[...]
>
>BTW<, using 65537 as the multiplier is a *BAD* idea.
>
>What are:
>    1*65537 \\ 4096
>    2*65537 \\ 4096
>    3*65537 \\ 4096
>    4*65537 \\ 4096
>    5*65537 \\ 4096
>       ....
>
>And why is it bad? :)
>

Well, this example has an ideal distribution for the hash bits range 
of 0..4095, every slot is filled exactly once.

However, if there really had only been 10 or 11 hash bits as I 
misremembered the other day (there are actually 12 hash bits) then 
this example would indeed point out a serious flaw in 65537 as a 
multiplier.

For a collection whose array was of size, say, 16384, all objects 
would hash to the first 4K slots. Yes, very bad. Thanks for pointing 
this out.

The reason is that using 65537 as a factor always yields hash values 
that have bits 12-15 zero.

Scott, last time I looked you were using a factor of 4097. Are you 
still using this factor?

If we stay with 12 bits in the header I don't have a better 
suggestion than a factor of 4097 right now, but it might be worth 
looking to see if we can find a better algorithm for spreading 
identity hash values.

Things that make 4097 less than ideal (though neither of these may be 
a big problem):
* Produces hash values in the range (0 to: (2 raisedTo: 24) - 1); 24 
bits. The ideal range is the entire SmallInteger range, 31 bits 
(though in either case we can only have 4096 values spread sparsely 
throughout the range, unless we add more header bits.)
* All hash values are divisible by 17 and by 241, which makes for bad 
performance for any collection whose array size happens to be a 
multiple of either value. I don't have Scott's patch loaded up to run 
the tests, but I expect that benchmarks done on an 'IdentitySet new: 
1445' and an 'IdentitySet new: 1446' would give quite different 
results.


-Martin



More information about the Squeak-dev mailing list