Good Hashing paper
Martin McClure
martin at hand2mouse.com
Thu Feb 21 04:55:54 UTC 2002
At 10:09 PM -0500 2/20/02, Scott A Crosby wrote:
>On Thu, 21 Feb 2002, Stephan Rudlof wrote:
>
>> >
>> > BTW, we might be able to make hashing strings and byte arrays go a bit
>> > faster, by, say, only hashing the first 100 bytes.
>>
>> There could be a problem: what about a set of byte arrays equal in just the
>> first 100 bytes?
>> I'd prefer the slower, but safe version, as default.
>
>Not a problem. Any hash table *has* to deal with two objects, by chance,
>hashing to the same value. A conforming hash-code generation
>implementaiton can return '42' as the hash code for all objects. (Try it!)
>
>The only requirement is that if 'a=b' then 'a hash == b hash'
>
>By cutting it off at 100, you're assuming that almost all object that are
>the same for the first 100 bytes are the same throughout.. Which is IMHO a
>fairly safe assumption, and it avoids stupid cases of, say, trying to hash
>8kb of email text. :)
>
>Yes, it'll suck if someone has hundreds of strings that differ only on the
>101'st byte, but.... Except for a contrived example, where would you see
>this? :) [*]
Well, you'd think so, but beware. I can't think of a non-contrived
example, but just because I can't readily see a real-world example
doesn't mean there isn't a common one. Folks have been tripped up by
this before.
Java 1.1 only hashed part of Strings (something like the first 15
chars and the last 15 chars). They obviously didn't expect to see too
many different Strings that were identical at both the beginning and
the end. Turned out that they had horrible performance on collections
of URLs. In Java 1.2, the whole string was hashed, no matter how long.
This change made me do some quick scrambling at GemStone, since I had
to make sure that 1.1 hashed collections could be migrated to 1.2....
-Martin
More information about the Squeak-dev
mailing list
|