Re: Shouldn't 2 hash = 2.0 hash? [LONG] - Squeak-dev

29 Nov 1999

      At 17:33 -0600 11/28/99, R. A. Harmon wrote:
...
I found the following:
    2 = 2.0 -> true
    2 hash = 2.0 hash ->false

I think this is a bug.  Is it?
Good question. The blue book says:
"... Any two objects that are equal must return the same value for 
#hash. Unequal values may or may not return equal values for hash. 
Typically, this integer is used as an index to locate the object in 
an indexed collection..."
[Pg 96].
(I cannot find any reference in the blue book to another hashing 
rule: The hash value of an object must be constant over time. There 
is yet another hashing rule that is more practical: Hash values 
should be small integers.)
So, by the Blue Book definition, what you observe is a bug.
Should it be? Does it really mean that two instances of wildly 
differing classes must answer equal hashes when they compare equal? 
Why does the Blue Book make this restriction?
All of the discussion in the Blue Book about hashing is in the 
context of looking up values in a hashed collection. Did they intend 
that someone might put a float 2.0 into a hashed collection and then 
look it up with an integer 2? Do people do it?
Maybe there is some other reason for this rule, but I think it might 
just be incompletely stated. Consider these cases:
* An employee object compares equal using the employee number. Should 
(anEmployee = 899538) answer true or purposely hash the same? I think 
the hashes can be safely different. The two classes are quite 
separate in intent and use, and the fact they compare equal is an 
accident.
* In another implementation, employee objects are represented as 
instances of classes in a hierarchy. Should aRetiredEmployee compare 
equal to anActiveEmployee if, by accident, they have the same 
employee number? Sure! Should they have the same hash value? Of 
course. The classes here are closely related and should follow the 
rule. In fact, making a new employee and then trying to find it in a 
dictionary of existing employees would find a nasty bug.
So, maybe the rule should state something about closely related 
objects answering the same hash, but I don't see a good and simple 
way to say it.
Are subclasses of Number closely related enough that they should 
follow the rule?
Most Smalltalk systems do what you expect.
In order to keep hash fast, IBM Smalltalk simply truncates all floats 
to integers and use the hash of that. It's an ugly solution, IMHO. 
That means, for example, that a dictionary which uses floats as keys 
may have just a few unique hash values for a large number of entries; 
it can end up doing a sequential search instead of a simple probe 
with the hash.
One could make the hash operation more complex, so that any non-zero 
fractional part is used in the hash, but then the hash operation 
slows down. Many systems do that and have a slower floating point 
hash, harmful or not.
There are similar problems with scaled decimal values.
So, should the rule be followed even if it's more expensive to do so? 
Are subclasses of Number closely related enough that they should 
follow the rule regardless? Are the 'same' values represented as 
different classes ever used as equivalent keys in a hashed collection?
In my view, and assuming I'm not missing something obvious, the answer is no.
I'd argue that having (2.0 = 2) answer false is better than forcing 
the hashes to be equal. Besides, comparing floats is a sin one should 
not encourage. I'd rather see #= answer false and that some other 
method be used for 'has equivalent value'.
While I did write that floating-point hash method, Squeak didn't 
follow the rule before I wrote it, which is why I didn't make the new 
version follow it either.
But you did ask a good question, and I didn't really answer it. 
Anyone have a better answer?
Dave
_______________________________
David N. Smith
IBM T J Watson Research Center
Hawthorne, NY
_______________________________
Any opinions or recommendations
herein are those of the author
and not of his employer.