Good Hashing paper

Stephan Rudlof sr at evolgo.de
Thu Feb 21 05:39:00 UTC 2002


Scott A Crosby wrote:
> 
> On Thu, 21 Feb 2002, Stephan Rudlof wrote:
> 
> > >
> > > BTW, we might be able to make hashing strings and byte arrays go a bit
> > > faster, by, say, only hashing the first 100 bytes.
> >
> > There could be a problem: what about a set of byte arrays equal in just the
> > first 100 bytes?
> > I'd prefer the slower, but safe, version as the default.
> 

> Not a problem. Any hash table *has* to deal with two objects, by chance,
> hashing to the same value. A conforming hash-code generation
> implementation can return '42' as the hash code for all objects. (Try it!)

I know (without trying ;-) ).
'Safe' is perhaps too strong a word here: what I mean is a good distribution of
hash values in *every* case, pathological cases included.
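Just to spell the rule out with a toy class (a made-up example of mine, nothing
from the image): answering 42 for every instance is conforming, since a = b
still implies a hash = b hash; it only means every element of a Set or
Dictionary lands in the same bucket.

  Object subclass: #ConstantHashItem
    instanceVariableNames: 'key'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Hashing-Sketches'

  ConstantHashItem >> key
    ^key

  ConstantHashItem >> key: anObject
    key := anObject

  ConstantHashItem >> = other
    ^other class == self class and: [key = other key]

  ConstantHashItem >> hash
    ^42  "conforming: equal objects answer equal hashes, but the distribution is as bad as it gets"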

> 
> The only requirement is that if 'a=b' then 'a hash == b hash'
> 
> By cutting it off at 100, you're assuming that almost all objects that are
> the same for the first 100 bytes are the same throughout, which is IMHO a
> fairly safe assumption, and it avoids stupid cases of, say, trying to hash
> 8kb of email text. :)

BTW: if you are hashing, e.g., 8kb mail messages, it takes much more time to
read them from disk than to hash them...

> 
> Yes, it'll suck if someone has hundreds of strings that differ only on the
> 101st byte, but... Except for a contrived example, where would you see
> this? :) [*]

There may be other byte arrays to be hashed, e.g.:
- images whose pixels are equal at the beginning,
- numbers between two different very large LargeIntegers;
you get the idea.
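For instance, a quick workspace doit (made-up example data, just to sketch the
concern):

  | a b |
  a := ByteArray new: 8192 withAll: 0.
  b := a copy.
  b at: 8192 put: 1.
  "Equal in the first 100 bytes, so a hash looking only at those would collide,"
  (a copyFrom: 1 to: 100) = (b copyFrom: 1 to: 100).  "true"
  "but the arrays themselves are not equal."
  a = b  "false"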

> 
> Scott
> 

> [*] And even in this case, it may be a win if this is only true for a
> small percentage of the strings. You lose time from slightly more
> inefficient hashing on some strings, but if the strings are 8kb long, you
> save a factor of 1.8 on all the other strings because you are spending 80x
> less time on computing the hash code. (Normally, you hash, then compare
> the entry there for equality; if hashing is 80x cheaper, this cuts the
> total time by a lot.)

An idea: what about a hash function that takes your limit as an argument (nil
meaning the default, unlimited case)?
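Roughly like the following sketch (the selector hashUpTo: and the mixing
arithmetic are just made up here, not existing Squeak code):

  ByteArray >> hashUpTo: limit
    "Answer a hash over at most the first 'limit' bytes; nil means hash all of them."
    | n hash |
    n := limit isNil
      ifTrue: [self size]
      ifFalse: [limit min: self size].
    hash := self size.  "mix in the length so differently sized arrays tend to differ"
    1 to: n do: [:i |
      hash := (hash * 31 + (self at: i)) bitAnd: 16r3FFFFFFF].
    ^hash

Then sending hashUpTo: 100 gives your fast variant and hashUpTo: nil the safe
default.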


Greetings,

Stephan
-- 
Stephan Rudlof (sr at evolgo.de)
   "Genius doesn't work on an assembly line basis.
    You can't simply say, 'Today I will be brilliant.'"
    -- Kirk, "The Ultimate Computer", stardate 4731.3


