On Wed, 02 Aug 2000 14:46:51 -0700 Andres Valloud avalloud@exobox.com wrote:
...
And this is with a hash that takes longer to calculate, but it sure helps in the end.
Cheers, Bob ---- a faster hash ----- biggerHash
self size <= 2 ifTrue: [^self hash]. ^self inject: self size into: [ :sum :each | (sum + sum + each asciiValue) bitAnd: 16rFFFFFF]
Note that since this left shifts (sum+sum) and masks the running total, only the last 24 characters contribute to the hash (and the 24th-last character only contributes one bit, etc). I prefer a slightly slower mechanism that gives *really* good distribution...
Treat each string as a polynomial in M, where each position p in the string is the character value times M^p. Evaluate that polynomial with M=1664525, modulo 2^28 (this should be half the maximum integer so this hash value can be easily combined with others without risk of overflow via + or -. This constant is taken from Knuth -- it's a multiplicative generator of order 2^30 in the group (*,Z[2^32]). To avoid overflow into large integers, each multiplication can be broken down into a couple of smaller ones:
-------------------------- !Integer methodsFor: 'hashing helper'! timesMultiplier "Compute (self * 1664525) mod 2^28. We use radix 16384 to do the multiplication because 16384=2^14 and 16384^2 = 2^28. Using hex, 1664525 = 16r0019660D = 16r0065*16384 + 16r260D."
| low hi res | low := self bitAnd: 16383. hi := self bitShift: -14. res := ((16r260D * low) + (((16r260D * hi) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF. "[res = (1664525 * self bitAnd: 2**28-1)] assert." ^res! --------------------------
And the string hashing method is:
-------------------------- !String methodsFor: 'better hashing'! goodHash
| hash | hash := 0. self size to: 1 by: -1 do: [:i | hash := (hash + (self at: i) asInteger) timesMultiplier]. ^hash! --------------------------
I tried these methods under VisualWorks 3.0NC on a 233MHz PowerBook, and executing "Time millisecondsToRun: [1000000 timesRepeat: ['hello' goodHash]]" gave me 6233ms, which is 6.2 microseconds to hash 'hello' once.
In case anyone's interested, the fact that I use a good multiplicative random generator as the "variable" of the polynomial is so that each character position has a weight independent of all other character positions. It also just so happens to support my rope implementation in interesting ways. For example, (all equations are modulo 2**28)...
hash('hellothere') = hash('hello') + (hash('there') * M^5)
hash('hellop') = hash('hello') + ($p * M^6).
hash('hellotherz') = hash('hellothere') + (($z - $e) * M^10).
etc. This allows me to maintain cached hash values as strings are concatenated, modified, etc.
P.S., When I did a distribution check using a large dictionary of words, there were *no* duplicated hash values (I forget the size of the list, but I remember doing the math). This was just slightly *better* than the birthday paradox predicted for completely *random* numbers (it expected somewhere between one and two collisions, I believe).
squeak-dev@lists.squeakfoundation.org