[RFC] Integer hash

Swan, Dean Dean_Swan at Mitel.COM
Thu Aug 8 21:56:56 UTC 2002


Andreas,

	Please pick another hash function.

	Just using the low order 31 bits of a large integer as a hash value is probably NOT a very good idea.  The low order bits of many real data sets tend to be highly correlated and therefore make a bad hash key.

	This is essentially the same as taking the large integer modulus 2^31.  Since 2^31 isn't prime, this is already going to produce a biased distribution, and even if you were using a prime modulus, Mersenne Primes (i.e. primes of form 2^n - 1) often give significant bias as well.  By bias here I mean non-uniform distribution and/or lots of collisions.

	I know the idea is to pick something that works fairly well most of the time, but also has a low execution cost.  These goals are somewhat at cross purposes.  Any solution that doesn't look at ALL the digits is going to have undesirable hash properties.

	Squeak already has the Secure Hash algorithm.  That wouldn't be a bad choice.  It's not the fastest algorithm, but it's not bad.  Even a CRC32 would be better than the proposal, and it can be implemented fairly efficiently using table lookup.

	Here are some URLs for additional references on hashing:

http://burtleburtle.net/bob/hash/hashfaq.html
http://burtleburtle.net/bob/hash/examhash.html
http://burtleburtle.net/bob/hash/evahash.html
http://burtleburtle.net/bob/

http://www.cs.sunysb.edu/~skiena/214/lectures/lect21/lect21.html

I really like the work Bob Jenkins has done in this area, and he's a fairly well respected authority on the topic.

						-Dean

P.S.  All the class comments in the "Digital Signatures" classes reference a book "Applied Cryptography: Protocols, Algorithms, and Source Code in C".  The author of that book is Bruce Schnier, with an 'n', not Bruce Schier.  I think John Maloney wrote this stuff.  He has a nasty ;) habit of commenting his code and citing references.


-----Original Message-----
From: Andreas Raab [mailto:Andreas.Raab at gmx.de]
Sent: Wednesday, August 07, 2002 5:33 PM
To: squeak-dev at lists.squeakfoundation.org
Subject: [RFC] Integer hash


Folks,

You might not have noticed the discussion about Magma's performance but
one issue that came up was that LargeInteger>>hash has a pretty poor
implementation. It currently says

Integer>>hash
	"Hash is reimplemented because = is implemented."
	^(self lastDigit bitShift: 8) + (self digitAt: 1)

and the request has been made to change this into

Integer>>hash
	"Hash is reimplemented because = is implemented."
	^self bitAnd: 16r3FFFFFFF

so as to get better hashes (more values, larger spread). This request is
meant to find out if anyone could see a reason not to make that change.
Personally, I don't see any but since the change means that many
sets/dictionaries will be affected I thought it might be a good idea to
ask if there's any objection to making that change.

If you see a reason not to change #hash in the proposed way, speak up
now or be ready to bear the consequences ;-)

Cheers,
  - Andreas




More information about the Squeak-dev mailing list