Good Hashing paper

Scott A Crosby crosby at qwes.math.cmu.edu
Thu Feb 21 03:09:56 UTC 2002


On Thu, 21 Feb 2002, Stephan Rudlof wrote:

> >
> > BTW, we might be able to make hashing strings and byte arrays go a bit
> > faster, by, say, only hashing the first 100 bytes.
>
> There could be a problem: what about a set of byte arrays equal in just the
> first 100 bytes?
> I'd prefer the slower, but safe version, as default.

Not a problem. Any hash table *has* to deal with two objects, by chance,
hashing to the same value. A conforming hash-code generation
implementaiton can return '42' as the hash code for all objects. (Try it!)

The only requirement is that if 'a=b' then 'a hash == b hash'

By cutting it off at 100, you're assuming that almost all object that are
the same for the first 100 bytes are the same throughout.. Which is IMHO a
fairly safe assumption, and it avoids stupid cases of, say, trying to hash
8kb of email text. :)

Yes, it'll suck if someone has hundreds of strings that differ only on the
101'st byte, but.... Except for a contrived example, where would you see
this? :) [*]

Scott

[*] And even in this case, it may be a win if this is only true for a
small percentage of the strings.  You lose time from slightly more
ineffecient hashing on some strings, but if the strings are 8kb long, you
save a factor of 1.8 on all the other strings because you are spending 80x
less time on computing the hash code. (Normally, you hash, then compare
the entry there for equality, if hashing is 80x cheaper, this cuts the
total time by a lot.)





More information about the Squeak-dev mailing list