[RFC] Integer hash

Sat Aug 10 01:37:47 UTC 2002

David,

	What you present here is essentially a variation of the fairly standard table driven CRC algorithm, which is by definition "the residue of a high order polynomial modulo an irreducible polynomial (conceptually prime)".

	Again deferring to Bob Jenkins, both CRC type hashing and "Universal Hashing" both exhibit funneling and consequently undesirable statistical properties, in addition to "Universal Hashing" being resource intensive (i.e. memory and especially CPU time).  Both of these types of hash functions though have been empirically shown to be "pretty good" for many real data sets.

	See the following URL for details:

		http://burtleburtle.net/bob/hash/examhash.html

and inserted below are the relevant sections on CRC and Universal hashing.

	Now I'm not necessarily saying that Squeak should use Bob Jenkins's hash functions (although I wouldn't be too opposed to that solution), but it doesn't seem wise to use a hash function that has know potential for undesired behavior.  This is especially true when the undesired behavior is rarely encountered because it can be terribly difficult to debug when a certain data set evokes pathological behavior from the hash function.

	It's also a fairly widely held belief that the design of hash functions is more art than science, and I tend to agree.  Since most cryptographic hashes are known to be "pretty good" to "excellent" and Squeak already has an implemenation of SHA, it wouldn't hurt to give that a try as a possible candidate for this problem.

	I do agree though that hash performance isn't everything and that there is much to be said for good statistical properties.  The whole idea of hash tables is to attempt to reduce table lookups to O(1) rather than O(n log n) or even O(n) (i.e. linear search), but of course it's really going to be O(1 + C) where C is proportional to the cost of the hash function.  Bad statistical properties lead to collisions, which Squeak resolves with linear searching, so Squeak's dictionaries and sets will be somewhere between O(1) and O(n) depending on how good the hash function is.

	Also, as Ned has demonstrated, sometimes a plain linear search is the fastest thing to do when you account for the cost of the hash function.

			-Dean

CRC hashing Knuth5 

int crc( char *key, int len, int mask, int tab[256])
{
  int hash, i;
  for (hash=len, i=0; i<len; ++i)
    hash = (hash<<8)^tab[(hash>>24)^key[i]];
  return (hash & mask);
}

This takes 9n+3 instructions. tab is initialized so that the low-order byte is a permutation. It produces a full 32-bit result. Bytes are not commutative. This would have no funnels if it stopped there, but it usually doesn't. This hash is usually used to simulate a maximal-length Linear Feedback Shift Register (LFSR) Golumb. An LFSR, rather than shifting a byte at a time, shifts a bit at a time, XORing some shift mask with the hash value depending on the bit being shifted out. The values of tab are are the accumulated XORs of all possible sequences of 8 bits shifting out. Any input bit followed by the bits in the shift mask form a funnel, and any two funnels XORed together form another funnel. (This is an example of a linear mixing step, and how it spontaneously unmixes old inputs.) For the tests, we used a 32-bit state with a shift mask of 0x04c11db7. 

Universal hashing UNI: 

int universal( char *key, int len, int mask, int tab[MAXBITS])
{
  int hash, i;
  for (hash=level, i=0; i<(length<<3); i+=8)
  {
    register char k = key[i>>3];
    if (k&0x01) hash ^= tab[i+0];
    if (k&0x02) hash ^= tab[i+1];
    if (k&0x04) hash ^= tab[i+2];
    if (k&0x08) hash ^= tab[i+3];
    if (k&0x10) hash ^= tab[i+4];
    if (k&0x20) hash ^= tab[i+5];
    if (k&0x40) hash ^= tab[i+6];
    if (k&0x80) hash ^= tab[i+7];
  }
  return (hash &= mask);
}

This hash takes 52n+3 instructions to hash n keys. It produces a full 32-bit result. Bits are not commutative. It requires as many table entries as there are possible input bits, and they are chosen at random. Since every input bit can change only about half the output bits, there are funnels. Changing the table values will change where the funnels are, but it always has funnels. Although this hash is useful for theoretical computer science, it should be avoided in practice because of its funnels and its speed. 

-----Original Message-----
From: David Chase [mailto:chase at world.std.com]
Sent: Friday, August 09, 2002 7:55 PM
To: squeak-dev at lists.squeakfoundation.org
Subject: Re: [RFC] Integer hash

At 03:59 PM 8/9/2002 +0200, Stephan Rudlof wrote:
>David Chase wrote:
>> but mathematically, I don't know if
>> there is much to be said about them.
>
>What Do you mean here?

Larry Carter and Mark Wegman (and maybe some other people)
wrote a paper on "universal hashing".  The idea is (more or less)
that there is an entire family of hash functions that are proven
to convert "almost any" input into something that has good statistical
properties.  "Almost any" generally includes all the usual inputs,
meaning words, phrases, a1 through a999, that sort of thing.
They're not necessarily cryptographically secure, but they're
still "good enough".

Here's an example of a pretty-good hash function that is
not as fast as a hash function could be, but it has interesting
mathematical properties.  This is for Java character arrays,
that contain 16 unsigned bits per element:

  public static int hashGalois(char[] x) {
    int sum = 0;
    int l = x.length;
    int i;
    for (i = 0; i < l; i++) {
      int ch = x[i];
      int cl = ch & 255;
      ch = ch >> 8;
      int highByte = sum >>> 24;
      sum = ((sum << 8) + ch) ^ table[highByte]; 
      highByte = sum >>> 24;
      sum = ((sum << 8) + cl) ^ table[highByte]; 
    }

    // Push an extra 32 bits of polynomial through

    int highByte = sum >>> 24;
    sum = ((sum << 8)) ^ table[highByte]; 
    highByte = sum >>> 24;
    sum = ((sum << 8)) ^ table[highByte]; 
    highByte = sum >>> 24;
    sum = ((sum << 8)) ^ table[highByte]; 
    highByte = sum >>> 24;
    sum = ((sum << 8)) ^ table[highByte]; 
    return sum;
  }

// Table is initialized carefully.

This computes an actual algebraic quantity, the residue of a
degree 16N+32 polynomial (N = number of characters) modulo
a particular "irreducible polynomial" (conceptually, a prime)
in the field GF(2**32).  These are known to have nice statistical
properties, and are used (for example) in the design of "good"
sound reflectors, and in communications networks to avoid
hotspots.

Code that generates irreducible polynomials in GF(2**N) can
be found here:

http://world.std.com/~chase/src/galois/TwoPolynomials.java

It contains references to useful papers.

It's possible to reverse-engineer one of these guys to
get it to behave badly, but generally, they do not, and
it is unlikely that a "bad" input will resemble anything
that you are likely to feed it.  In contrast, ad hoc
hash functions are only known to work well for the small
set of things on which they have been tested; in the wider
world, they might well fall apart on most inputs.  You have
no way of knowing.

I'll try to get more information on the universal hash functions.
The "good one", by the time I was done making it faster, looked
horrible (100 lines or so of code), but it is fast, and it is
both empirically and mathematically "good".  I think it should be
possible to write a good-enough-fast-enough hash function that
is not so complex.

David Chase