[RFC] Integer hash

David Chase chase at world.std.com
Fri Aug 9 23:55:25 UTC 2002


At 03:59 PM 8/9/2002 +0200, Stephan Rudlof wrote:
>David Chase wrote:
>> but mathematically, I don't know if
>> there is much to be said about them.
>
>What Do you mean here?

Larry Carter and Mark Wegman (and maybe some other people)
wrote a paper on "universal hashing".  The idea is (more or less)
that there is an entire family of hash functions that are proven
to convert "almost any" input into something that has good statistical
properties.  "Almost any" generally includes all the usual inputs,
meaning words, phrases, a1 through a999, that sort of thing.
They're not necessarily cryptographically secure, but they're
still "good enough".

Here's an example of a pretty-good hash function that is
not as fast as a hash function could be, but it has interesting
mathematical properties.  This is for Java character arrays,
that contain 16 unsigned bits per element:

  public static int hashGalois(char[] x) {
    int sum = 0;
    int l = x.length;
    int i;
    for (i = 0; i < l; i++) {
      int ch = x[i];
      int cl = ch & 255;
      ch = ch >> 8;
      int highByte = sum >>> 24;
      sum = ((sum << 8) + ch) ^ table[highByte]; 
      highByte = sum >>> 24;
      sum = ((sum << 8) + cl) ^ table[highByte]; 
    }

    // Push an extra 32 bits of polynomial through

    int highByte = sum >>> 24;
    sum = ((sum << 8)) ^ table[highByte]; 
    highByte = sum >>> 24;
    sum = ((sum << 8)) ^ table[highByte]; 
    highByte = sum >>> 24;
    sum = ((sum << 8)) ^ table[highByte]; 
    highByte = sum >>> 24;
    sum = ((sum << 8)) ^ table[highByte]; 
    return sum;
  }

// Table is initialized carefully.

This computes an actual algebraic quantity, the residue of a
degree 16N+32 polynomial (N = number of characters) modulo
a particular "irreducible polynomial" (conceptually, a prime)
in the field GF(2**32).  These are known to have nice statistical
properties, and are used (for example) in the design of "good"
sound reflectors, and in communications networks to avoid
hotspots.

Code that generates irreducible polynomials in GF(2**N) can
be found here:

http://world.std.com/~chase/src/galois/TwoPolynomials.java

It contains references to useful papers.

It's possible to reverse-engineer one of these guys to
get it to behave badly, but generally, they do not, and
it is unlikely that a "bad" input will resemble anything
that you are likely to feed it.  In contrast, ad hoc
hash functions are only known to work well for the small
set of things on which they have been tested; in the wider
world, they might well fall apart on most inputs.  You have
no way of knowing.

I'll try to get more information on the universal hash functions.
The "good one", by the time I was done making it faster, looked
horrible (100 lines or so of code), but it is fast, and it is
both empirically and mathematically "good".  I think it should be
possible to write a good-enough-fast-enough hash function that
is not so complex.

David Chase





More information about the Squeak-dev mailing list