Re: [BUG] Symbol>>hash - Squeak-dev

3 Aug 2000


      ...
On Wed, 02 Aug 2000 14:46:51 -0700 Andres Valloud avalloud@exobox.com wrote:
...
...
And this is with a hash that takes longer to calculate, but it sure helps
in the end.
Cheers,
Bob
---- a faster hash -----
biggerHash
self size <= 2 ifTrue: [^self hash].
 ^self inject: self size into: [ :sum :each | (sum + sum + each asciiValue)
bitAnd: 16rFFFFFF]
Note that since this left shifts (sum+sum) and masks the running total, only
the last 24 characters contribute to the hash (and the 24th-last character
only contributes one bit, etc).  I prefer a slightly slower mechanism that
gives *really* good distribution...
Treat each string as a polynomial in M, where each position p in the string
is the character value times M^p.  Evaluate that polynomial with M=1664525,
modulo 2^28 (this should be half the maximum integer so this hash value can
be easily combined with others without risk of overflow via + or -.  This
constant is taken from Knuth -- it's a multiplicative generator of order
2^30 in the group (*,Z[2^32]).  To avoid overflow into large integers, each
multiplication can be broken down into a couple of smaller ones:
--------------------------
!Integer methodsFor: 'hashing helper'!
timesMultiplier
   "Compute (self * 1664525) mod 2^28.  We use radix 16384 to do the
    multiplication because 16384=2^14 and 16384^2 = 2^28.  Using hex,
    1664525 = 16r0019660D = 16r0065*16384 + 16r260D."
| low hi res |
   low := self bitAnd: 16383.
   hi := self bitShift: -14.
   res := ((16r260D * low) + (((16r260D * hi) + (16r0065 * low) bitAnd:
16383) * 16384)) bitAnd: 16r0FFFFFFF.
   "[res = (1664525 * self bitAnd: 2**28-1)] assert."
   ^res!
--------------------------
And the string hashing method is:
--------------------------
!String methodsFor: 'better hashing'!
goodHash
| hash |
   hash := 0.
   self size to: 1 by: -1 do: [:i |
      hash := (hash + (self at: i) asInteger) timesMultiplier].
   ^hash!
--------------------------
I tried these methods under VisualWorks 3.0NC on a 233MHz PowerBook, and
executing "Time millisecondsToRun: [1000000 timesRepeat: ['hello'
goodHash]]" gave me 6233ms, which is 6.2 microseconds to hash 'hello' once.
In case anyone's interested, the fact that I use a good multiplicative
random generator as the "variable" of the polynomial is so that each
character position has a weight independent of all other character
positions.  It also just so happens to support my rope implementation in
interesting ways.  For example, (all equations are modulo 2**28)...
hash('hellothere') =
        hash('hello') + (hash('there') * M^5)
hash('hellop') = hash('hello') + ($p * M^6).
hash('hellotherz') = hash('hellothere') + (($z - $e) * M^10).
etc.  This allows me to maintain cached hash values as strings are
concatenated, modified, etc.
P.S., When I did a distribution check using a large dictionary of words,
there were *no* duplicated hash values (I forget the size of the list, but I
remember doing the math).  This was just slightly *better* than the birthday
paradox predicted for completely *random* numbers (it expected somewhere
between one and two collisions, I believe).