[RFC] Integer hash
chase at world.std.com
Fri Aug 9 02:22:59 UTC 2002
At 05:56 PM 8/8/2002 -0400, Swan, Dean wrote:
> Please pick another hash function.
> Just using the low order 31 bits of a large integer as a hash value is probably NOT a very good idea. The low order bits of many real data sets tend to be highly correlated and therefore make a bad hash key.
We (= friends, including the people who did the universal hash function
work at IBM) looked into this some in the context of Java (Java has a
collection of horrible hash functions). I don't know if I am allowed to
discuss exactly what we came up with (it is in unpublished we're-too-busy
limbo) but you could do much worse than (just for example) taking the 64-bit
result obtained by multiplying the (unsigned) 32-bit address by some
arbitrarily chosen (unsigned, large) 32-bit prime, modulo 2**31-1 (a
Mersenne prime, I think). Do understand, multiplication is startlingly
fast nowadays, and if you wanted to make your hash function difficult
for an adversary to defeat, you could (randomly) look for big primes
at program startup. Hash function performance isn't everything; there's
also a lot to be said for good statistical properties.
(There's also a trick for computing N mod 2**31-1; I will recreate it
or dredge it up from old email)
The hash functions for strings were arrived by combining a large collection
of such "randomly" chosen primes, arranged so that many of the multiplications
could proceed in parallel (under the assumption that many modern chips can
do that) before summing. The result was "pretty good" by most measures;
almost as fast as Sun's crappy String hash function, but with mathematically
provable goodness, and empirically demonstrated goodness. The BurtleBob
hashes are empirically interesting, but mathematically, I don't know if
there is much to be said about them.
More information about the Squeak-dev