[BUG] Symbol>>hash
Mark van Gulik
ghoul6 at home.net
Thu Aug 3 06:26:27 UTC 2000
> On Wed, 02 Aug 2000 14:46:51 -0700 Andres Valloud <avalloud at exobox.com> wrote:
...
> And this is with a hash that takes longer to calculate, but it sure helps
> in the end.
>
> Cheers,
> Bob
> ---- a faster hash -----
> biggerHash
>
> self size <= 2 ifTrue: [^self hash].
> ^self inject: self size into: [ :sum :each | (sum + sum + each asciiValue)
> bitAnd: 16rFFFFFF]
Note that since this left shifts (sum+sum) and masks the running total, only
the last 24 characters contribute to the hash (and the 24th-last character
only contributes one bit, etc). I prefer a slightly slower mechanism that
gives *really* good distribution...
Treat each string as a polynomial in M, where each position p in the string
is the character value times M^p. Evaluate that polynomial with M=1664525,
modulo 2^28 (this should be half the maximum integer so this hash value can
be easily combined with others without risk of overflow via + or -. This
constant is taken from Knuth -- it's a multiplicative generator of order
2^30 in the group (*,Z[2^32]). To avoid overflow into large integers, each
multiplication can be broken down into a couple of smaller ones:
--------------------------
!Integer methodsFor: 'hashing helper'!
timesMultiplier
"Compute (self * 1664525) mod 2^28. We use radix 16384 to do the
multiplication because 16384=2^14 and 16384^2 = 2^28. Using hex,
1664525 = 16r0019660D = 16r0065*16384 + 16r260D."
| low hi res |
low := self bitAnd: 16383.
hi := self bitShift: -14.
res := ((16r260D * low) + (((16r260D * hi) + (16r0065 * low) bitAnd:
16383) * 16384)) bitAnd: 16r0FFFFFFF.
"[res = (1664525 * self bitAnd: 2**28-1)] assert."
^res!
--------------------------
And the string hashing method is:
--------------------------
!String methodsFor: 'better hashing'!
goodHash
| hash |
hash := 0.
self size to: 1 by: -1 do: [:i |
hash := (hash + (self at: i) asInteger) timesMultiplier].
^hash!
--------------------------
I tried these methods under VisualWorks 3.0NC on a 233MHz PowerBook, and
executing "Time millisecondsToRun: [1000000 timesRepeat: ['hello'
goodHash]]" gave me 6233ms, which is 6.2 microseconds to hash 'hello' once.
In case anyone's interested, the fact that I use a good multiplicative
random generator as the "variable" of the polynomial is so that each
character position has a weight independent of all other character
positions. It also just so happens to support my rope implementation in
interesting ways. For example, (all equations are modulo 2**28)...
hash('hellothere') =
hash('hello') + (hash('there') * M^5)
hash('hellop') = hash('hello') + ($p * M^6).
hash('hellotherz') = hash('hellothere') + (($z - $e) * M^10).
etc. This allows me to maintain cached hash values as strings are
concatenated, modified, etc.
P.S., When I did a distribution check using a large dictionary of words,
there were *no* duplicated hash values (I forget the size of the list, but I
remember doing the math). This was just slightly *better* than the birthday
paradox predicted for completely *random* numbers (it expected somewhere
between one and two collisions, I believe).
More information about the Squeak-dev
mailing list
|