Horrible unscalability, how do I change hashBitsOf: ?

Martin McClure martin at hand2mouse.com
Sun Oct 7 03:32:33 UTC 2001


At 3:37 PM -0400 10/5/01, Scott A Crosby wrote:

[...]

>
>The hashing strategy used by dictionaries and sets is non-optimal if the
>hashcode returned is not over a wide range..

This is true. I believe that large hashed collections would be faster 
if the hash values were spread more widely, even if the number of 
values is still only 4096. However your solution...

>
>
>```
>hashBitsOf: oop
>	^ ((self baseHeader: oop)
>		>> 17 bitAnd: 4095) << 10
>'''

...making each hash value a multiple of 1024, is not a good answer. 
It creates a problem for many small collections. Those whose array 
size is a power of two would all hash to the same index. And 
MethodDictionaries are always sized at a power of two, and method 
lookup uses identity hash.

Other power of two array sizes probably aren't very common, but "Set 
new" sets the array size to 6, and Sets grow by doubling, so Sets and 
Dictionaries created without a specific size will always have an 
array size of 6 times a power of two. For small powers of two and 
hash values that are always a multiple of 1024, only three hash 
buckets ever get used.

The result of your fix, then, is a lot more linear probing within 
small hashed collections, and a lot less linear probing within large 
collections. The improvement to large collection lookup would 
probably be overshadowed by the loss of efficiency in method lookup.

For a better answer I'd want to look at the considerable literature 
on hashing, and also look at the previous discussion threads 
previously mentioned:

At 9:55 AM -0700 10/6/01, Tim Rowledge wrote:
>I think a long conversation about hashing can be found in the list
>archive; 'fraid I don't remember any dates to help you. It's a perennial
>topic in Smalltalk.

...but it does occur to me that you might get better results by 
multiplying by, say, 65537 than by 1024. Having only 4096 hash 
values, but having them be (0..4095)*65537 would bring the overall 
range to 28 bits, within shouting distance of SmallInteger limits 
without bumping into them. Multiplication by 65537 is also fast even 
on machines without fast integer multiply, it's a number or'ed with 
itself shifted by 16 bits. 65537 is also prime, which avoids the 
really degenerate cases where the collection's array size is evenly 
divisible by the hash multiple, and many of the other undesirable 
cases.



If you actually want to try it and see how well it works, the rest of 
this message is relevant.



I'm slightly uncomfortable about modifying hashBitsOf:, just because 
it would no longer be doing what its name suggests. Perhaps putting 
the modifications in Interpreter>>primitiveAsOop would be better, 
though you'd also have to change 
Interpreter>>lookupMethodInDictionary. Maybe just changing the name 
hashBitsOf: to something like identityHashOf: would make me happier.

At 3:37 PM -0400 10/5/01, Scott A Crosby wrote:
>
>To artificially increase the range. Now, this change will corrupt every
>Set, Dictionary, and who knows what else in the image.

Rehashing the collections isn't hard, the methods are there in Set to 
find them all and rehash.

You have to make sure method lookup works during the transition. 
Making a special transitional VM with 
Interpreter>>lookupMethodInDictionary modified to always scan the 
entire dictionary before returning false might be all that's 
required. Commenting out the line
	nextSelector=nilObj ifTrue: [^false].
looks like a way to do that, though the speed hit would mean you 
wouldn't want to leave that change in after you'd converted your 
image.

I don't know if anything else would break. Try it and see.

>
>But, as a bonus, it would nicely IdentitySet, IdentityDictionary, and
>make them almost no-ops, except for invoking identityHash instead of
>hash on the keys. (Make a method called 'hashKey' which gets overloaded by
>Identity*, and the two classes have but one method.)
>
>
>Think its doable, or something I should flee like the plague, cause it is
>too hard and won't make too much difference?

Magic 8 ball says... proceed with caution. :-)

-Martin




More information about the Squeak-dev mailing list