Horrible unscalability, how do I change hashBitsOf: ?

Scott A Crosby crosby at qwes.math.cmu.edu
Sat Oct 6 19:45:07 UTC 2001


On Sat, 6 Oct 2001, Stephen Pair wrote:

> With only 5 bits for an object hash, the largest range of hash values is
> going to be 4096, no matter what kind of algorithm you use to derive the

I believe its 12 bits to get 4096 different codes..

Again, what I am dealing with is something more subtle. I'm not
complaining about only being 4096 possible values, but rather that those
4096 values are all distributed in the range 0--4095.

And why *THAT* makes large collections 100x more ineffecient than they
should be. (See my previous message for the gory details.).

That reduced output range makes those hash bits useless for any large
Dictionary/Set.

If you look at the code for IdentitySet/IdentityDictionary, you see that
they already embody this idea:

IdentitySet>>scanFor: anObject
	elements. "
	| finish hash start element |
	finish _ array size.

	finish > 4096
		ifTrue: [hash _ anObject identityHash * (finish // 4096)]
		ifFalse: [hash _ anObject identityHash].
--

Note the multiplication by (finish // 4096) in order to extend the output
range of identityHash to more than 0--4095, thus working around the flaw,
at the cost of having two classes (Set, IdentitySet) where only one is
needed (Set).

Change hashBitsOf:, and you get this for free; the only time you need to
use IdentityDictionary/IdentitySet is if the objects you're hashing have a
#hash selector, but you want to use #identityHash for some particular
reason.

As things are now, if you try to put more than 4000 objects which do NOT
define a custom #hash selector with a large range into any Set or
Dictionary, it will be at least 100x slower.

(As a temporary workaround, IdentitySet/IdentityDictionary have the above
code, so will not suffer this slowdown, but that doesn't mean that
Set/Dictionary should not be fixed.)


Scott


--
No DVD movie will ever enter the public domain, nor will any CD. The last CD
and the last DVD will have moldered away decades before they leave copyright.
This is not encouraging the creation of knowledge in the public domain.





More information about the Squeak-dev mailing list