Good Hashing paper

Thu Feb 21 04:55:54 UTC 2002

At 10:09 PM -0500 2/20/02, Scott A Crosby wrote:
>On Thu, 21 Feb 2002, Stephan Rudlof wrote:
>
>>  >
>>  > BTW, we might be able to make hashing strings and byte arrays go a bit
>>  > faster, by, say, only hashing the first 100 bytes.
>>
>>  There could be a problem: what about a set of byte arrays equal in just the
>>  first 100 bytes?
>>  I'd prefer the slower, but safe version, as default.
>
>Not a problem. Any hash table *has* to deal with two objects, by chance,
>hashing to the same value. A conforming hash-code generation
>implementaiton can return '42' as the hash code for all objects. (Try it!)
>
>The only requirement is that if 'a=b' then 'a hash == b hash'
>
>By cutting it off at 100, you're assuming that almost all object that are
>the same for the first 100 bytes are the same throughout.. Which is IMHO a
>fairly safe assumption, and it avoids stupid cases of, say, trying to hash
>8kb of email text. :)
>
>Yes, it'll suck if someone has hundreds of strings that differ only on the
>101'st byte, but.... Except for a contrived example, where would you see
>this? :) [*]

Well, you'd think so, but beware. I can't think of a non-contrived 
example, but just because I can't readily see a real-world example 
doesn't mean there isn't a common one. Folks have been tripped up by 
this before.

Java 1.1 only hashed part of Strings (something like the first 15 
chars and the last 15 chars). They obviously didn't expect to see too 
many different Strings that were identical at both the beginning and 
the end. Turned out that they had horrible performance on collections 
of URLs. In Java 1.2, the whole string was hashed, no matter how long.

This change made me do some quick scrambling at GemStone, since I had 
to make sure that 1.1 hashed collections could be migrated to 1.2....

-Martin