Philosophical questions about collections

Wed Jul 11 00:47:32 UTC 2001

	It's a fine line.  You say that + and = should behave the same for equal
	numbers, but that at:put: doesn't have to for equal collections.

Why is it fine?
We have two notions in Smalltalk:
    #== depends on the identities of objects (not their states)
    #=  depends on the states of objects (not their identities)
A query is something that probes the (abstract) state of an object
without changing what that abstract state is.

Surely it's unproblematic in Smalltalk that
    x := 'abc' copy.
    y := 'abc' copy.
creates two objects which have the same state but different identities.
After these steps,
    x = y
is true, but if you then do
    x at: 1 put: $c.
you'll find that
    x = y
is now false.

Equality is a state-dependent operation.  Since behaviours that alter the
(abstract) state are going to affect equality, it makes sense to exclude
them from the set of probes we consider when defining equality.  (Defining,
not implementing.  I don't care what you do when implementing equality.)

So I exclude #at:put: from consideration when asking whether
    #(1 2 3) = (1 to: 3)

	You need a complicated definition of "query" to make this
	statement hold for both disallowing "5 = 5.0" and allowing
	"#(1 2 3) = (1 to:  3)".  In my view, the way you define "query"
	versus "modification" is going to depend on individual cases.

I don't think you were reading carefully.  I went to some trouble to
show that my criterion *DOESN'T* allow #(1 2 3) = (1 to: 3).
Interval has 4 selectors that Array doesn't, two of which are queries.
Array has 14 selectors that Interval doesn't, at least four of which
are queries.  They could have more of their selectors in common than they
do.

	Change of state isn't a good definition.

Since abstract state is what equality depends on, it's not just a good
definition, it's the ONLY definition.

	For example, toggling a button
	object is a change of its state, but it probably isn't a wholesale
	modification.

Neither is changing one character of a String "a wholesale modification",
but it is quite enough to affect equality.

	It would be irritating if a regular button and an
	OnlyOnButton were treated as #=.

A Button is a kind of Switch.  A Switch is a kind of Model with
'on', 'onAction', 'offAction'. and 'dependents' in its state.

If a Button and an OnlyOnButton have the same state, what precisely is
the problem with regarding them as #= ?  If you cared about whether they
would respond the same to attempts to change the state, you'd use #== .

	On the other hand, changing the block
	that fires when a button is activated, would seem like a wholesale
	modification.

If I change the block from
	[nil]
to	[|x| x := nil]
is that _really_ a wholesale modification?  The distinction between
some change/ no change is pretty sharp.  The distinction between
minor change/ wholesale change is pretty blurry.

	I would consider it okay for two blocks to be #=, but for
	one of them to have its activation block fixed.

In fact defining equality for blocks is difficult to impossible in general.
The Scheme definition dances very very carefully around it; Scheme wants
to guarantee that X=X, but doesn't really want to promise anything more at
all about equality of blocks.  SML, Haskell, Clean, and Mercury both take
care to forbid it entirely in their type system, and O'CAML rejects attempts
to compare functions at run time.

	> 	In general, we both agree that #= needs to be defined case by case, and
	> 	that it's often arbitrary what exactly #= means.
	> 	
	> Well, that's not quite what I meant.  What I meant is that there are
	> many *different* application-dependent notions of equality, and in the
	> absence of a strong mathematical theory to define what #= should mean,
	> it shouldn't mean *anything*.  For example, should a set and an array
	> that happens to have the same elements be regarded as #= ?  No, there's
	> a different relation #hasSameElementsAs: which should be named and used.

	Err, yes.  So what do you think of having #= at all??  If you think it
	is okay, then how do you define it?

#= should mean equality of abstract states.
Whether that is a well defined concept depends on the classes under
consideration.  It _is_ well defined for things like Booleans, characters,
numbers, strings, geometrical figures.  For "interaction" objects (a very
fuzzy concept, but I have in mind things like Morphs, sockets, open files,
in general things that are thought of in terms of streams of events) it's
difficult to see anything weaker than identity making much sense.

In other cases, it's not so clear.

For containers, there are two cases:
=/== containers have same shape and corresponding elements are #==
=/=  containers have same shape and corresponding elements are #=

For things like IdentitySet, it's clear that =/== is the one we want;
for things like Set, it's clear that =/= is the one we want;
it follows that it doesn't make a lot of sense to ask whether a Set
and an IdentitySet are #=, because each of them wants to compare the
elements a different way.

	It seems to me that we want some general notion of "same as".  I don't
	see how to come up with a "same as" notion that covers all possible
	classes, and so it needs to be defined in individual cases.

You are concentrating one one dimension of the problem.  It has two.
The notion of similarity one needs doesn't just depend on what kind of
object one is looking at, but what one intends to DO with it.

Let's just consider Array.  You affirm, and I deny, that there is a "same
as" notion that covers Array.  I say that =/== and =/= are BOTH sensible
notions of "same as" for Array, and that for some applications you want one
and for other applications you want the other.

There are of course infinitely many equivalence relations one might define
on Arrays.  For example, is-a-permutation-of is a perfectly good equivalence
relation.

My position has two sides:

 - For people defining = on a class:
    Consider whether there is some strong mathematical theory that yields
    a preferred notion of equality.
    Look for alternative interpretations.  DOCUMENT VERY CLEARLY which
    interpretation you have chosen.  Explain why you think that is the
    best one.  Leave a pointer to your notes about the alternatives.
    If a class represents "interaction objects", leave #= as #==.

 - For people wanting to use = :
    Consider very carefully what YOU mean by = in this context for this
    purpose.
    Check the documentation of relevant classes to see whether they
    interpretation they use is the one you want.
    If it isn't, craft a name for the equivalence relation you need,
    and implement it in the classes you need it for.  (This is where
    Smalltalk really shines.  Not being able to add methods to existing
    classes really warps code in most OO languages.)

	> 	A fine default definition for #= in Collection, IMHO, would be:
	> 	
	> 		= aCollection
	> 			self class == aCollection class and: [
	> 				self asArray = aCollection asArray ].
	> 	
	> That would be a lousy definition.  Consider this example:
	> 
	>     x := #('a' 'big' 'dog' 'chased' 'a' 'small' 'cat') asSet
	>     y := #('a' 'small' 'cat' 'chased' 'a' 'big' 'dog') asSet

	Yes, it's a lousy definition for Set's.  But is it bad as a *default*
	definition for Collection ?

As in worse than the current definition, which is inherited from Object,
of making collection equality be identity?

Yes, it is worse.  It's like giving a child a loaded gun.  Because there
are so many possible interpretations of equality, unthinking use of
equality in Smalltalk is very dangerous.

The ideal definition of #= in Collection is
    = aCollection
        ^self subclassResponsibility

It would remind people defining new subclasses of Collection that it is
THEIR job to figure out what equality means for THEIR data structure.

	I get the feeling that #species is an overburdened idea, but it does
	have a ring of "same as" floating around with it.  That's why I think #=
	might want to pay attention to it, even if it does open up possible
	problems.

If X>>species returns Y, to me that signifies that an X is (abstractly!,
not implementationally) a kind of Y.

	> 	Granted, as with comparing floats, it is dangerous in many cases
	> 	to use this method.  Is it really useful to ask whether, say, two Heap's
	> 	are #= ?
	> 	
	> Basically, a Heap is a Bag (bag union for heaps is called "merging")
	> plus a total order and an operation for removing an element with extreme
	> value.  Equality of heaps is the same as equality of bags.  

	... I disagree -- it's not so simple.  Heaps can have different sorting
	criteria, for beginners.

A slip.  I did say "plus a total order".  You are right, the total order is
part of the state.  So two heaps are equal if and only if they have the
same bag of elements AND the same total order.

	But even if they don't, two heaps might have multiple #= elements in
	them, and the elements might end up arranged in a different order.  Are
	these two heaps still #=?

Saying that the elements are #= means that we don't *care* what order they
come out in, we are choosing to regard them as indistinguishable.  If we
care what order the elements come out in, then we have to strengthen the
order to distinguish them.

	There's no one right Heap>>=, though maybe we could come up with a
	reasonable one.

Heaps are mathematical objects.  There is a unique definition of heap
equality which respects the heap laws.  I repeat, whether it's USEFUL
is another matter.  Personally, I doubt it.  I wouldn't even be surprised
if some other equivalence on heaps were more useful.

The big point here is that heaps *HAVE* a strong mathematical theory.
Morphs don't.

	> Whether it's useful is another matter;
	> one of my points is that IF #= is defined on heaps it should be defined
	> RIGHT.  This is one of the cases where there is a strong mathematical
	> theory that says uniquely what equality of values should mean.  If that's
	> not easily implementable, then have a definition that signals a
	> MeaningfulButNotImplemented error.

	AFAIK, in math there isn't general #= notion except for #==; if you mean
	something other than #==, you need a term for the particular case. 
	e.g., "these two systems are homomorphic", or "these two expressions
	have identical normal forms".  In math, it is quite informal to say
	things like "these two items are equal", and it would depend on context
	to what is meant.

There are, however, very clear guidelines as to what ISN'T equality.
To be meaningful as a reading of '=', a binary predicate should be an
equivalence predicate and satisfy
    x1 = y1 & ... & xn = yn => f(x1, ..., xn) = f(y1, ..., yn)
for all relevant functions and predicates f.

If you can find x and y such that x = y but not y = x, then what you have
is definitely NOT equality.  (Hint hint.)

	#= is for programmer convenience, it seems to me.
	I don't see a consistent way to define it system-wide, except
	with broad philosophical strokes.

Well yes.  You can say precisely the same about _every_ selector,
including #+.  We can't hope for universality, but we can hope for
consistency.

	If you use #=, it seems you simply have to know how the
	underlying objects will understand it.

We are in complete agreement there.

	For numbers, there are two or maybe three possible
	interpretations.

	For Arrays there is only one realistic one.

Oh no, that's quite wrong.  I've already mentioned =/== -vs- =/=.
To me both of this are realistic candidates for being "the" way to
test for equal arrays.  If you happen to have arrays holding characters
and/or strings, "same number of elements and corresponding elements are
the same ignoring alphabetic case" is a plausible third.  If you happen
to have arrays holding numbers, "same number of elements and corresponding
elements are within a specified tolerance" is one I've seen implemented
(in IBM Prolog), although I wouldn't accept it because it's not transitive.

	For collections in general?  Well, that's not so obvious!

Hey, we agree!  I go a step further, and say it isn't obvious for
_any_ collection.  That's why we have PluggableSet, for example.
I was really impressed when I saw that Squeak had PluggableSet and
PluggableDictionary.	

But they are a perfect illustration of why I don't think that

	> = aCollection
	>     self class == aCollection class and: [
	>         self asArray = aCollection asArray ].

would be a good default definition.  Two PluggableSets could be of the
same class and could report equal arrays, yet be different sets,
responding differently to #includes: .  The only collections for which
that would be a good way to define equality are the SequenceableCollections,
and SequenceableCollection already has a definition that works like that.