Hash rehashed (was: [SqF]Report of VI4 Project for Feb '02)

Martin McClure martin at hand2mouse.com
Tue Feb 5 04:01:27 UTC 2002


At 4:18 PM -0800 2/4/02, Tim Rowledge wrote:
>Is there any value in making the putative hash-multiplier be dependant
>upon the size of the collection into which the object is being put?
>The downside that immediately pops up is that this value would then
>change as you grow the collection which would presumably require moving
>all the previously placed objects; except of course when you grow a
>collection you have to copy all the entries to the new array anyway, at
>least in most cases.
>
>This could be implemented by adding an instvar to all collections,
>though I fear that would be rather wasteful since so many collections
>are small enough that the multiplier would be 1. If a single list of
>values would be good (ie all varieties of collection can use the same
>values) then a trivial prim could take the array size, the object hash
>and do the entire lookup/algorithm - multiply - modulo operation.

Hmm. Possible, but I don't find it attractive. The main reasons, at 
least initially, are aesthetic.

Looking at the overall hashed collection design, it feels a lot more 
like what's broken is identityHash, and that the collections are 
fine. Thus, it feels more 'right' to fix identityHash.

It also looks like if we choose a multiplier (or other simple 
algorithm) for producing a hash value from the header hash bits, and 
we choose it *very* carefully, it will work for almost all collection 
array sizes. The 'almost' could lead to a slightly uglier design, 
having the collections avoid certain sizes (as Scott mentioned in his 
reply.)

Another, more technical and less aesthetic, argument against fixing 
it in the collections is that many collections are based on #hash, 
not on #identityHash, but since the default implementation of #hash 
is #identityHash the collection can't tell which objects it contains 
would need to have their hash values multiplied. You don't want to 
multiply other hash values, because if the implementers of that 
object's #hash have been good the hash value spans the entire 
SmallInteger range, and you'd end up pushing the computation into 
large integer math, which would be a slowdown that is unnecessary in 
this case.

-Martin



More information about the Squeak-dev mailing list