Incongruent hash

David N. Smith dnsmith at watson.ibm.com
Mon Feb 9 15:02:42 UTC 1998


At 14:16 -0500 2/8/98, Leandro Caniglia wrote:
Hi!
 
Luciano Notarfancesco told me about this bug:
 
    1 = 1.0              "true"
but
    1 hash = 1.0 hash    "false"
 
 
Saludos,
Leandro


That may be my fault. The hash code for Float was rewritten (by me) so that
it produces a passable hash value. The old one was truly awful. There was
discussion on the issue in the group; hopefully the archive still has it
(May 97). If not, I may have it stuck away somewhere.

If you really want to fix the problem, fix Integer hash to convert the
integer to a float before taking the hash. SmallInteger uses a primitive to
convert to a Float, so:

   hash
      "Answer the hash of the receiver's Float equivalent, so that
       equal Integers and Floats hash alike."
      ^self asFloat hash

should be fairly cheap. Long integers pose a bigger problem: a float can
represent only about 52 bits of value, so you might convert to a float when
the long integer has 52 or fewer bits and do something else when it has
more.
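A minimal sketch of that split, assuming an Integer hash along these lines
(the fold-down of the high bits is purely illustrative, not a shipped
method, and negative receivers are glossed over):

   hash
      "Sketch only. Values that fit exactly in a Float reuse Float's
       hash; wider values fold their high bits into the hash of the
       low 52 bits."
      self abs < (1 bitShift: 52) ifTrue: [^self asFloat hash].
      ^((self bitAnd: (1 bitShift: 52) - 1) asFloat hash)
         bitXor: (self bitShift: -52) hash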


There is a real problem with the rule that objects which compare equal must
have equal hashes because it promotes really bad hash algorithms. For
example, the hash for floats in IBM Smalltalk, and I think in Squeak before
I 'fixed' it, is/was:

   hash
      ^self truncated hash

This means that only the integer part of the float is considered, so fully
half of the range of floating point numbers answers 0, one-quarter answers
1, one-eighth answers 2, and so on.

(Why? Since exponents can be negative, and the negative exponent range is
as large as the positive one, fully half of all floats have a zero 'integer
part'.)
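
For instance, under that old truncating scheme all of these collide (a
small illustration; the rewritten Squeak Float hash no longer behaves this
way):

   0.1 hash = 0 hash        "true: 0.1 truncated is 0"
   0.9 hash = 0 hash        "true: 0.9 truncated is 0"
   123.75 hash = 123 hash   "true: the fraction never matters"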

But, by golly, integers and floats answer the same hash using such an
algorithm!

Hashes are used mainly in the various hashed collections. The rule is
present so that a mixture of objects, like integers and floats, can be
present together. The collection hashes the lookup value, then probes an
indexed collection with that hash. It then starts a sequential (yuck!)
search at that point, using #= to compare the lookup object with the
contents.
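
A simplified sketch of that probe loop (not the actual Set>>scanFor:
source; 'array' stands for the collection's internal slot array):

   findIndexFor: anObject
      "Sketch: hash the lookup value, probe the slot it names, then
       walk forward comparing with #= until the object or an empty
       slot turns up."
      | size start index element |
      size := array size.
      start := anObject hash \\ size + 1.
      index := start.
      [element := array at: index.
       (element isNil or: [element = anObject]) ifTrue: [^index].
       (index := index \\ size + 1) = start] whileFalse.
      ^0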

With (1 hash = 1.0 hash) being false, an entry made in the collection as
1.0 would probably not be found when looked up as 1. I find this less
bothersome than having half of all float hashes answer the same value as (0
hash).
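
Concretely (a hypothetical example; whether the lookup actually fails
depends on where the two hashes land in the table):

   | s |
   s := Set new.
   s add: 1.0.
   s includes: 1    "probably false while 1 hash ~= 1.0 hash"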


Another problem is that some objects change the internal values on which
they hash. For example, an Employee object may hash on the employee name,
but the name can change. The Rule makes this illegal, but what do you do?
One answer is to assign a unique value to each object just for hashing.
(Actually, this is a bad example, since employees usually have several such
'unique' numbers already, but the problem is real.)
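
A minimal sketch of that workaround, assuming a hypothetical Employee with
an extra 'storedHash' instance variable captured once, so later name
changes cannot move the object within a hashed collection:

   hash
      "Sketch: answer a value captured on first use, independent of
       the mutable name."
      storedHash isNil ifTrue: [storedHash := self identityHash].
      ^storedHash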


Most hash algorithms are way too simple. Take, for example, the hash for
strings in Squeak:

hash
   | l m |
   (l _ m _ self size) <= 2
      ifTrue:
         [l = 2
            ifTrue: [m _ 3]
            ifFalse:
               [l = 1
                  ifTrue: [^((self at: 1) asciiValue bitAnd: 127) * 106].
                ^21845]].
   ^(self at: 1) asciiValue * 48 + ((self at: (m - 1)) asciiValue + l)

For a length of zero it answers 21845. For a length of one it forces the
character into the range 0-127, then multiplies it by 106 ($j asciiValue).
For lengths of three or more it executes the last line, which uses only the
first and second-to-last characters. For a length of two, it executes the
last line using the first and last characters.

So, basically, it looks at no more than two characters. This makes for
quick hashes but poor hashing. What one saves by computing a cheap hash one
probably throws away by having many values hash to the same spot in the
collection, which forces the collection to search (and search, and search)
sequentially, looking for an empty spot.
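
A sketch of a fuller string hash, not a shipped method (the multiplier 31
and the mask are just illustrative choices), folding every character into
the result:

   hash
      "Sketch: mix every character into the hash, keeping the result
       within SmallInteger range."
      | h |
      h := self size.
      self do: [:each |
         h := (h * 31 + each asciiValue) bitAnd: 16r3FFFFFFF].
      ^h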

I still vividly remember the comparisons I made between Dictionary and the
then-new LookupTable in IBM Smalltalk V3. LookupTable was supposed to be
(and actually is) much faster, since it doesn't use associations. So I
wrote a test case using random numbers as keys, put 10000 entries into an
instance of each class, and then did a lot of lookups. Both cases took
almost forever, and LookupTable was not faster.

It was then that I discovered that ALL of my hash values were zero, since
all my random numbers were in the range [0.0, 1.0). Multiplying the random
numbers by some suitable constant made a big difference.
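
A small illustration of that failure mode, assuming the old truncating
Float hash (the scaling constant is arbitrary):

   | rand d |
   rand := Random new.
   d := Dictionary new.
   1 to: 10000 do: [:i | d at: rand next put: i].
   "Under 'self truncated hash' every key in [0.0, 1.0) hashes alike,
    so every put and every lookup degrades to a linear search.
    Using 'rand next * 1000000' as the key spreads the hashes out."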


I think this whole problem needs a fresh look. Anyone have any thoughts on
how to have reasonably fast yet reasonably good hashes? Should 'the rule'
be changed? How?

Dave

_______________________________
David N. Smith
IBM T J Watson Research Center
Hawthorne, NY
_______________________________
Any opinions or recommendations
herein are those of the author
and not of his employer.




