[BUG] Dictionary equality test

Richard A. O'Keefe ok at cs.otago.ac.nz
Mon Jun 17 06:28:09 UTC 2002


Chris Muller <afunkyobject at yahoo.com> wrote:
	I find it disappointing that = was implemented so liberally in
	Squeak.  = should only be implemented when there is absolutely
	NO ambiguity in the meaning of = with respect to that class.

There is no such animal.

	In most cases, we see that only classes that represent a measurable
	or "magnitudinal" value of some kind, (i.e., Number, Color,
	Date, Pitch, etc.)  are suitable candidates.
	
There are at least two sensible definitions of '=' for numbers.
(Hmm, make that four.)

For example, should 1/2 = 0.5 be true or false?
To me it seems obvious that these things are not just different
objects, they have different *numeric* behaviour, so it should be false.
If you divide (1/2) by (1/2 - 1/2), you should get a division by zero,
but if you divide 0.5 by (0.5 - 0.5) you should get a well-defined
result (according to the IEEE specification).

It so happens that Squeak treats 0.5/0.0 by raising a ZeroDivide exception,
which makes Squeak NOT the language for choice for testing IEEE-arithmetic
algorithms.  (Sigh.)  But we can demonstrate another difference in
behaviour:

    (0.5 raisedTo: 3000) = (1/2 raisedTo: 3000)

comes out false.  Now #raisedTo: when applied to numbers does not change
them in any way, it computes new numbers.  It's a function.  And one of
the fundamental laws about functions is that if x = y, then f(x,z) = f(y,z).

The definition is Squeak violates that fundamental law.

What was that about "absolutely NO ambiguity in the meaning" again?

If we adopt that rule, then equality in Number has to GO.

As for Color, I note that
    r := Color red.
    x := r alpha: 2.0.
    r red = x red and: [r green = x green and: [
    r blue = x blue and: [r alpha = x alpha]]]
comes out true, and
    r = x
comes out true, but
    r hash = x hash
comes out false, so here we have two "equal" objects which will look
different to a Set.

Clearly Color>> = has to GO as well.

Date I'll buy.
Pitch?  There is no Pitch class in my copy of Squeak, so I can't comment.


	Sets, Dictionaries and Customers are very poor candidates.

I have no idea what Customer is; there is no such class in my copy of Squeak.

The notion of equality for Sets is surely unproblematic?
Crack open any set theory textbook, and it'll be somewhere around page
3 of chapter 1.  It's one of the fundamental rules of set theory.
(See 'The Joy of Sets' for a good introduction.)

    s1 = s2 if and only if for all x: (x in s1) if and only if (x in s2).

So when we find in Set the following definition:
    = anotherSet
	(aSet isKindOf: Set) ifFalse: [^false].
	self size = aSet size ifFalse: [^false].
	self do: [:each | (aSet includes: each) ifFalse: [^false]].
	^true
we can tell at once that this is a reasonable interpretation of the
unique mathematical definition of set equality.

As for dictionaries, they also are models of extremely well understood
mathematical objects.  A Dictionary is a finite map.  Ask any mathematician
what it means for two finite maps to be equal, and s/he'll say
    m1 = m2 iff
	dom m1 = dom m2
    and for all x in dom m1: m1(x) = m2(x).

Not surprisingly, we find this definition in other programming languages.
For example, on pp 108-109 of Common Lisp, the Language, 2nd edition:
    Two hash tables are considered the same ... if and only if
    they satisfy a four-part test:
	* They must be of the same kind ...
	* They must have the same number of entries ...
	* For every entry (key1,value1) in one hash table there
	  there must be a corresponding entry (key2,value2) in
	  the other such that [... this kind of table thinks key1 = key2]
	* For every entry (key1,value1) in one hash table and
	  its corresponding entry (key2,value2) in the other, such
	  that key1 and key2 are the same [that is, are =], [equality]
	  must be true of value1 and value2.

In fact, trawling through several languages and more data structures
packages than I ever wish to see again, I can only find THREE approaches
to Dictionary equality at all:

(a) Don't impleement it.
    This is in effect what Squeak has now.  There is a definition, but
    it was inherited, and doesn't actually work.  It doesn't seem that
    anyone thought about Dictionary equality.

(b) Make it the same as ==, on the grounds that two objects are equal
    only if they respond the same to all messages, including changes.
    But we already have the #== selector, so we are free to make #=
    do something more interesting.

(c) The standard mathematical definition.
    This is my proposal.

I cannot find Chris Muller's proposal used anywhere.

	The very fact that we have posted our different interpretations
	of how we think Dictionary>>= should behave proves it.
	
No, it just proves that one of us is wrong.

	Because of this, Dictionary>>= should inherit the identity-check from Object>>=
	and you should use the more intention-revealing method, includesAllOf:.
	
But (1) Dictionary CANNOT inherit the identity check from Object,
because there is another definition in the way, and (2) #includesAllOf:
is about as far from being intention-revealing as you can possibly get.
(2x) it compares only the values, and my whole point is that Dictionary
equality to be correct MUST examine both keys and values, and
(2y) considered as a set operation, it's still wrong, because it expresses
set INCLUSION, not set EQUALITY.

This is like someone saying "don't use #+ for addition, use the
intention-revealing method #raisedTo:."

	Unfortunately, includesAllOf:  "incorrect" for Bags in my view,
	because it doesn't account for the occurrencesOf:  each
	instance, it only requires one.

Right.  The definition for Bags should be the same as the definition
for Dictionaries should be:  two Bags should be equal if and only if
they are equal considered as finite maps from their domains to the
positive integers.

	Yet another example of the importance of thinking about the
	namespace we consume when we extend a base class.
	
Indeed.  But when there is a unique mathematical definition that applies
and it CAN be implemented, there is no excuse for doing something different.

	Incidentally, there are other GOOD reasons to avoid using =
	besides this semantic one.  Consider the following:
	
	1.  slower execution performance for comparisons, adding and
	finding in hashed collections.
	
Slower than WHAT?  If we want == comparison, we have it.  The alternative
to having a correct = in Dictionary is forcing everyone to compare it
themselves.  As for hashed collections, the whole reason for me raising
this point in the first place is that I *wanted* to have a Set of
Dictionary objects AND IT DIDN'T WORK.  The change I propose is necessary
to make this work.  Yes, this is slower than using an IdentitySet of
of Dictionaries, but then anyone who wants an IdentitySet of Dictionaries
can still have one.

	2.  slower development performance:  implementing = also
	requires that you implement #hash.

All I am proposing is that Dictionary should have a definition for #=
that WORKs the way everyone I've asked expects it to work, instead of
the current definition which doesn't do anything useful.  Yes, a
corresponding definition of #hash is required.  Here's one that will do:

    hash := self species hash.
    self size <= 10 ifTrue: [
	self do: [:elem | hash := hash bitXor: elem hash]].
    ^hash bitXor: self size hash

This definition ignores the keys, but that's OK, and it ignores
the order of the elements, but that's ESSENTIAL.  And guess what,
this is the definition of Dictionary(Collection)>>hash already!

So there is not the slightest requirement that I implement #hash;
it already exists in a perfectly usable form, completely compatible
with my proposed definition of #= .

I _have_ developed (and posted) a working definition of #= for
Dictionaries, so the future development time is now zero.

	3.  Nightmare bugs:  any time you change an objects attribute
	that participates in the determination of equality (and, thus,
	the determination of its hash), every hashed collection that
	references that object must be rehashed.

At the moment, we can't HAVE a Dictionary in a hashed collection
because #= doesn't do anything sensible.  So there is no working
code that can be broken by this change.

This is a known limitation of Smalltalk Set and Dictionary objects,
and it does not in the least stop us putting Strings in sets and
using Strings as keys, even though they are in principle mutable.

	This can be a nightmare to debug, even for experienced
	developers, because it isn't always easy to know all of the
	hashed collections that need to be rehashed.
	
Yes, but you are arguing against putting ANY mutable objects in
Sets, Bags, and Dictionaries, including Points, Strings, and Sets.
We are ALREADY allowed to have a Set of Sets and a Dictionary of Sets.
Allowing a Set of Dictionaries introduces no problems that we don't
already have, it just removes the problem that one of the obvious
combinations doesn't work.

	4.  The cost of class schema changes is increased.  When you add
	a variable to a class, you have to remember to evaluate the =
	and hash methods.
	
I have no idea what you are talking about.
Dictionary exists.  It has a broken method.  I want it fixed.
I am not adding a new variable to any class here,
and what I propose does not create ANY problems Squeak doesn't already have.

	5.  Finally, it encourages relational "data" thinking instead of
	object thinking where Flyweights are part of everyday life.

No, it encourages "value" thinking.  The kind of thinking that DOESN'T
run into those problems you mentioned earlier.

Are you seriously arguing that Set and Dictionary should not exist at all?
Given that they DO exist, what is wrong with their combinations working
correctly?

	A system that does not share objects, (but instead creates
	multiple instances with the same "values" populated in its
	variables) will consume a lot more memory than one that can
	properly share.

This is completely irrelevant to what I am proposing.
I am a great fan of shared data structures.
Not one thing I have suggested in this thread would in any way
prohibit sharing, it would just mean that the shared objects
WORKED in more cases.

The plain fact of the matter is that I had a problem for which
a Set of Dictionaries was the most natural representation of the result.
In order to remove duplicate dictionaries some other way, I would
still have needed (guess what?) a working version of equality to
tell whether they were duplicates or not.




More information about the Squeak-dev mailing list