At 17:33 -0600 11/28/99, R. A. Harmon wrote:
I found the following:
2 = 2.0 -> true 2 hash = 2.0 hash ->false
I think this is a bug. Is it?
Good question. The blue book says:
"... Any two objects that are equal must return the same value for #hash. Unequal values may or may not return equal values for hash. Typically, this integer is used as an index to locate the object in an indexed collection..." [Pg 96].
(I cannot find any reference in the blue book to another hashing rule: The hash value of an object must be constant over time. There is yet another hashing rule that is more practical: Hash values should be small integers.)
So, by the Blue Book definition, what you observe is a bug.
Should it be? Does it really mean that two instances of wildly differing classes must answer equal hashes when they compare equal? Why does the Blue Book make this restriction?
All of the discussion in the Blue Book about hashing is in the context of looking up values in a hashed collection. Did they intend that someone might put a float 2.0 into a hashed collection and then look it up with an integer 2? Do people do it?
Maybe there is some other reason for this rule, but I think it might just be incompletely stated. Consider these cases:
* An employee object compares equal using the employee number. Should (anEmployee = 899538) answer true or purposely hash the same? I think the hashes can be safely different. The two classes are quite separate in intent and use, and the fact they compare equal is an accident.
* In another implementation, employee objects are represented as instances of classes in a hierarchy. Should aRetiredEmployee compare equal to anActiveEmployee if, by accident, they have the same employee number? Sure! Should they have the same hash value? Of course. The classes here are closely related and should follow the rule. In fact, making a new employee and then trying to find it in a dictionary of existing employees would find a nasty bug.
So, maybe the rule should state something about closely related objects answering the same hash, but I don't see a good and simple way to say it.
Are subclasses of Number closely related enough that they should follow the rule?
Most Smalltalk systems do what you expect.
In order to keep hash fast, IBM Smalltalk simply truncates all floats to integers and use the hash of that. It's an ugly solution, IMHO. That means, for example, that a dictionary which uses floats as keys may have just a few unique hash values for a large number of entries; it can end up doing a sequential search instead of a simple probe with the hash.
One could make the hash operation more complex, so that any non-zero fractional part is used in the hash, but then the hash operation slows down. Many systems do that and have a slower floating point hash, harmful or not.
There are similar problems with scaled decimal values.
So, should the rule be followed even if it's more expensive to do so? Are subclasses of Number closely related enough that they should follow the rule regardless? Are the 'same' values represented as different classes ever used as equivalent keys in a hashed collection?
In my view, and assuming I'm not missing something obvious, the answer is no.
I'd argue that having (2.0 = 2) answer false is better than forcing the hashes to be equal. Besides, comparing floats is a sin one should not encourage. I'd rather see #= answer false and that some other method be used for 'has equivalent value'.
While I did write that floating-point hash method, Squeak didn't follow the rule before I wrote it, which is why I didn't make the new version follow it either.
But you did ask a good question, and I didn't really answer it. Anyone have a better answer?
Dave
_______________________________ David N. Smith IBM T J Watson Research Center Hawthorne, NY _______________________________ Any opinions or recommendations herein are those of the author and not of his employer.
squeak-dev@lists.squeakfoundation.org