[BUG][FIX] WeakKeyDictionary>>keysAndValuesDo:

Thu Jun 17 04:06:03 UTC 2004

Martin Wirblat <sql.mawi at t-link.de> continues the debate:
	My take was that an equality #= which is in fact an identity #== may 
	make sense, in that it supports what I called the programmer's 
	intuition about equality.

Can you *please* stop talking about 'intuition'?  It is really important
that we all understand intellectually and emotionally that different people
have DIFFERENT intuitions and that appeals to "the programmer's intuition"
are entirely devoid of utility until you have specified *which* programmers,
*what* their background is (which causes them to have these intuitions) and
*why* other people can be dismissed from consideration.

Any programmer who has an intuition that #= does or should mean the same
as #== has simply failed to get the point of having such a distinction.
(In Java, read '.equals(_)' for #= and '==' for #==.)

Let me repeat it once more:  if your intuition is that the relevant form
of "equivalence" in some specific context is OBJECT identity, Smalltalk
supports you by offering #== and #identityHash as building blocks and
IdentityDictionary (plus IdentitySet and IdentityBag in Squeak) as suitable
collections.

Nobody is saying that #== should not be available for Sets.  It is, and as
long as Smalltalk lasts, it will be.  If you want to use identity, nobody
is stopping you.

However, there are lots and LOTS of problems for which equivalence of STATE
is the appropriate concept, and no amount of wittering about some
programmer's intuition is going to change that fact.  For example, two
LargePositiveInteger objects may be physically distinct, but if they
represent the same value, we want to know that.  Smalltalk provides #= for
comparing equivalent STATES, where each class gets to define the appropriate
notion of equivalence.
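
To make the distinction concrete, here is a small workspace sketch.  The
choice of 100 factorial is only an illustration (it is big enough to be a
LargePositiveInteger); any sufficiently large value would do.

```smalltalk
| a b |
a := 100 factorial.     "a LargePositiveInteger"
b := 100 factorial.     "computed separately, so a second object"
a == b.                 "false: two physically distinct objects"
a = b.                  "true: they represent the same value"
(Set with: a) includes: b.           "true: Set uses #= and #hash"
(IdentitySet with: a) includes: b.   "false: IdentitySet uses #=="
```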

The default implementation of #= is indeed #==, so if a programmer doesn't
bother to think about what equivalence *should* mean for a class, the
default is not too horribly broken.  But any competent programmer SHOULD
think about equivalence for his new classes and should take seriously BOTH
the possibility that the default implementation might be right AND the
possibility that it might not.  To simply assert that identity is always
what the programmer's "intuition" requires is to avoid the genuine design
work that is a programmer's job.

Let me take a quick tour through a Smalltalk other than Squeak, looking at
the classes that define #=

    Object		The == default.
    IndexedCollection	Basically the <sequencedReadableCollection> protocol
			Compares STATES not IDENTITIES.
    Interval		Fast special case algorithm for IndexedCollection
			definition.  In this Smalltalk, Intervals are mutable
			(they offer #removeFirst and #removeLast) and this
			definition compares STATES not IDENTITIES.
    String		Compares STATES not IDENTITIES.
			Ensures that aString = aSymbol is false.
    Symbol		Reverts to == but state <-> identity for Symbol.
    FileSystemComponent	Compares full names.  File objects actually hold
			file system references.  Full names are determined
			by asking the operating system for the name *each*
			time you want them.  So this compares STATES (as
			mapped through file system state) not IDENTITIES.
    Magnitude		self subclassResponsibility.  This clashes with my
			reading of the ANSI standard.
    Association		Recursively applies #= to keys and values.  This
    			compares (part of) STATES not IDENTITIES.
    Character		Compares states but state <-> identity for Character.
    DateAndTime		Compares STATES not IDENTITIES (as ANSI requires).
    Duration		Compares STATES not IDENTITIES (as ANSI requires).
    Number		Not only compares STATES rather than IDENTITIES, but
			goes out of its way to answer 'true' for objects of
			completely different classes (as ANSI requires).
    Float		Special case of Number>>=.
    SmallInteger	Special case of Number>>=.
    Point		Compares STATES not IDENTITIES.  (Points are mutable.)
    WeakValueAssociation see Association
    MethodReference	Compares STATES not IDENTITIES (objects are mutable).
    NativeAbstractPointer  Compares STATES not IDENTITIES
    MethodNode		Compares STATES not IDENTITIES (objects are mutable).
    Rectangle		Compares STATES not IDENTITIES (objects are mutable).
    FixedOffsetTimeZone Compares STATES not IDENTITIES
    ConstantVariable	Compares STATES not IDENTITIES (objects are mutable).
    InstanceVariable	Compares STATES not IDENTITIES (objects are mutable).
    StaticVariable	Compares STATES not IDENTITIES (objects are mutable).

I wonder if a pattern is beginning to emerge here?
Can we begin to discern what it might be?

In that particular Smalltalk, the rule for collections is
- if it's a sequence, #= is explicitly defined to compare states
- if it's a set, bag, or dictionary, #= is not explicitly defined,
  because the ANSI standardisers forgot to say anything explicit about it.

	Let's consider a Class "Database".  The programmer has a
	database and adds 1 record to an existing myriad records.  Would
	he intuitively think that his database is not the same anymore?

If he is a COMPETENT programmer, he knows that it is the same OBJECT but
it is in a different STATE.

	That it can't be found in a Set anymore?

If he is a COMPETENT programmer, he doesn't put ANY mutable objects in Sets
unless he has made sure that they will NOT be changed while they are there.
If a COMPETENT programmer needs a set of mutable objects which he wants to
distinguish consistently whatever their states, he uses an IdentitySet,
not a Set, precisely because he knows the difference between #= and #==.

	I guess many would say it is the same, it just changed its
	contents.  Otherwise I would get every time I added a record a
	new database.

See?  You agree!  It's the same OBJECT in a different STATE.  And if you
look at what #= is for, it's for comparing STATES.

	However, I understand that your position has its merits, too.

I wish I could be polite and say the same, but I can't.  My position is
that Squeak provides BOTH #= and #==, BOTH #hash and #identityHash,
BOTH #includes: and #identityIncludes:, BOTH Set and IdentitySet,
BOTH Dictionary and IdentityDictionary, and that a competent programmer
can be expected to know when each is appropriate, AND when designing his
own classes, to consider (and ideally, document, but we all know about
Squeak documentation) why he made the choice he did about #=.
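
As a sketch of how one of these pairs behaves, take Dictionary and
IdentityDictionary with String keys.  The variable names are invented for
the illustration; 'apple' copy is used so that each key is certainly a
distinct object.

```smalltalk
| key d id |
key := 'apple' copy.
d := Dictionary new.
d at: key put: 1.
id := IdentityDictionary new.
id at: key put: 1.
d at: 'apple' copy ifAbsent: [nil].    "1: an equal String finds the entry"
id at: 'apple' copy ifAbsent: [nil].   "nil: an equal but distinct String does not"
id at: key ifAbsent: [nil].            "1: the very same object does"
```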

Here's where I am coming from:  several times a year I need a set of sets,
where it is important for me to know whether a particular set *state* has
already been recorded.  In a Smalltalk where Set>>= is the default version
inherited from Object, I have to make up a new name, call it #equals:.
There is then a bug generator constantly at my heels waiting to bite me,
the temptation to write = instead of equals:.  Why should I be forced to
come up with a new name for equality, when we ALREADY have a suitable
notion in the language that has been carefully distinguished from identity?
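
A sketch of that use case, assuming a Smalltalk (such as Squeak) in which
Set>>= and the matching #hash compare contents.  The variable names are
invented, and #includesAllOf: is Squeak's spelling; another dialect may
call it something else.

```smalltalk
| seen candidate |
seen := Set new.
seen add: (Set withAll: #(1 2 3)).
candidate := Set withAll: #(3 2 1).

"With a state-based Set>>= this is an ordinary membership test,
 and it answers true under that assumption."
seen includes: candidate.

"Where Set>>= is identity, one is reduced to a linear scan with a
 home-made comparison, and must never slip and write = by habit."
(seen
    detect: [:s | s size = candidate size and: [s includesAllOf: candidate]]
    ifNone: [nil]) notNil.
```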

Instead of considering imaginary classes like Database, let's consider a
real class: Point.

    |p s|
    p := 1@2.
    s := Set with: p.
    p x: 77.
    s includes: p
===>false

Why is it OK for changing a Point's state to make a Set lose track of it,
but not changing a Set's state?
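
For contrast, here is the same experiment with an IdentitySet, which is the
collection to reach for when Points are to be tracked whatever their state.
It relies on Point>>x: just as the example above does.

```smalltalk
| p s |
p := 1@2.
s := IdentitySet with: p.
p x: 77.          "the same mutation as before"
s includes: p.    "true: #identityHash and #== do not depend on state"
```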

	Here is a little snippet of code, which computes the number of classes 
	which do have state, but are not redefining #= to depend on that state.

(I've moved the code to this point so that we can all see what we are
talking about.)

	| classes |
	classes := Smalltalk allClasses select: [ :cls |		
		(( cls allSuperclasses add: cls; yourself ) select: [ :c |
			c includesSelector: #= ]) size = 1 ].

If I've got this right, this is meant to find all the classes which
inherit their definition of #= from Object.

This could be done somewhat more efficiently by

    meth := Object lookupSelector: #= .
    classes := Smalltalk allClasses
		select: [:c | (c lookupSelector: #=) == meth].

In my copy of Squeak, this finds 1617 classes out of 1792,
so only 175 classes define or inherit a different definition of #=.
Is this a problem?  Does this count against me?

Think about Morphs for a minute.  I don't really understand Morphic at all
well, and I _have_ read the books and tutorials.  But to the limited extent
that I _have_ understood it, it seems to me that no two Morphs ever _can_
have the same state:  they would have to be exactly the same size, shape,
colour, location on the screen, *and* location in z-order.  Having two
Morphs with the same state makes about as much sense as having two physical
objects in exactly the same space at exactly the same time.  So for those
393 classes, "same state" and "same identity" are the SAME question and might
as well have the same implementation.  That is, the fact that these 393
classes inherit #= from Object is, on my view, RIGHT.

MVC is still there.  That accounts for another 150 or so classes which
probably SHOULD use #== to compare state because no two objects in those
classes should have the same state.

Process stuff also typically has unique states.  Semaphore might seem like
an exception, but each Process can wait on at most one Semaphore, so we are
at least some of the time back in unique-state land.

The AbstractSound hierarchy is interesting.  Presumably it makes sense to
ask whether two instances of descendants of AbstractSound represent the same
sound.  These things can be pretty large (FMClarinetSound has a SoundBuffer
with 4000 elements) and some of the state might well change while one is
doing the comparison (msecsSinceStart, for example).  As for comparing
different kinds of sound, I'd rather not, if you don't mind, and it's
arguable that numerical considerations would make equality difficult to
ascertain.  Are two sounds, one of which contains a low frequency component
that people can't hear and the other doesn't, equal?  This is a case where
the feasibility and desirability of state comparison is doubtful.  It's a
case where I would expect a designer to think hard about the question and
then give up, but leave a comment (perhaps in the class comment) saying why.

	classes := classes select: [ :cls |
		cls isVariable or: [ 
			cls instVarNames isEmpty not or: [
			( cls allSuperclasses detect: [ :c | 
				c instVarNames isEmpty not ] ifNone: []) notNil ]]].

If I've got this right, this further restricts attention to classes which
have instance variables or elements or both.  What we really want to filter
out are abstract classes (which don't have direct instances, so =/== is moot)
and singleton classes (where = and == must coincide). Note that
ProportionalLayout is one of the classes that's filtered out this way,
but it's not an abstract or singleton class.

We can do this a bit more efficiently by

    classes := classes select: [:c |
      c isVariable or: [0 < c instSize]].

There are 78 classes in all lacking instance variables and elements.

Boolean, True, False, and UndefinedObject are filtered out.  For these
and other singleton classes, "same identity" and "same state" coincide,
so for singleton classes using #== for #= is a CORRECT way to compare states.

Some abstract classes (Object, LayoutPolicy, ...) are filtered out.  So
are a bunch of classes being used as constant pools; using #== for #= is
also ok for no-instance classes.  Utilities is a no-instance class.

Interestingly, one of the classes that is filtered out this way is
BorderStyle, which DOES define #=, and DOES define it to compare STATES
not IDENTITIES (hey, maybe that's the pattern we're looking for!).

Also interestingly, one of the classes that is filtered out that way
is Collection, which redefines #hash (to depend on state, of course)
but does NOT redefine #= to match.  This is bad news for Bag, which
inherits #= from Object but #hash from Collection, and the two DON'T
play together correctly.  This is in Squeak 3.6#5429.
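
The same #lookupSelector: trick can be turned into a hunt for other classes
in Bag's position: ones that take #= from Object but #hash from somewhere
else.  This is only a sketch.  A hit is a candidate for inspection, not
proof of a bug, and no claim is made about what it reports in any
particular image.

```smalltalk
| eqMeth hashMeth suspects |
eqMeth := Object lookupSelector: #=.
hashMeth := Object lookupSelector: #hash.
"Classes whose #= is the identity default but whose #hash is not."
suspects := Smalltalk allClasses select: [:c |
    (c lookupSelector: #=) == eqMeth
        and: [(c lookupSelector: #hash) ~~ hashMeth]].
suspects size.
```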

Still poking around in the rejected classes, take a look at DisplayMedium.
The only subclass of DisplayMedium is Form.  Now a form is "a rectangular
array of pixels", and it would indeed make sense if #= for Forms meant
"is the same picture".  (I trust no one will dispute that such an operation
is both definable and potentially useful.)  But some subclasses of Form
represent physical objects or external computational objects.  So there
may be some problems in defining #= the way we want, but we have at least
discovered that the question "do these Forms currently hold the same
picture" is a meaningful question which Squeak provides no direct answer to.

Continuing to poke around, we find

    DisplayTransform
      CompositeTransform
      IdentityTransform
      MatrixTransform2x3
      MorphicTransform

These represent affine transformations in 2D space.  As such, they can
all be converted to MatrixTransform2x3.  Interestingly enough,
MatrixTransform2x3 DOES implement #=.  And it compares STATE not IDENTITY.
Now I for one find it very odd that you can have two display transforms
which you *know* to represent the same transformation but you can't ASK
whether they do.  And it's very odd that you CAN ask whether two
MatrixTransform2x3 objects are equal, but you CAN'T ask whether two
IdentityTransform objects are equal.

Would anything very bad happen if
    DisplayTransform>>

    = aDisplayTransform
      ^(aDisplayTransform isKindOf: DisplayTransform) and: [
       self asMatrixTransform2x3 = aDisplayTransform asMatrixTransform2x3]

    hash
      ^self asMatrixTransform2x3 hash

Put it this way, if #= is right for MatrixTransform2x3, it's wrong for the
others, and conversely.  By examining this question, we have indeed found
a place where Squeak gets it, um, inconsistent.

Now let's look at Stream.

    Stream
      AttributedTextStream
      DataStream
        ReferenceStream
          SmartRefStream
      DummyStream
      FlashFileStream
      HtmlTokenizer
      MailAddressTokenizer
      ObjectSocket
        ArbitraryObjectSocket
        StringSocket
      PositionableStream
        ReadStream
          ChessMoveList
          InflateStream
            FastInflateStream
              GZipReadStream
              ZLibReadStream
          JPEGReadStream              
        WriteStream
          DeflateStream
            ZipWriteStream
              GZipWriteStream
              ZLibWriteStream
          LimitedWriteStream
          ReadWriteStream
            FileStream
              StandardFileStream
                CrLfFileStream
                  BDFFontReader
                  HtmlFileStream
            RWBinaryOrTextStream
              RemoteFileStream
              SwikiPseudoFileStream
            Transcripter
          TextStream
            DialectStream
          TranscriptStream
          ZipEncoder
        SocketStream

We have here 41 classes.  They actually provide a beautiful example of why
Martin Wirblat's code snippet provides misleading answers.  ReadWriteStream
*does* define #= (and #hash) so as to compare states (same position and
same contents).  This contributes 10 classes to the "doesn't use #==" set.
However, the definition is (equivalent to)

    = other
        ^(self class == ReadWriteStream and: [other class == ReadWriteStream])
          ifFalse: [super = other]
          ifTrue:  [self position = other position and: [
                    self contents = other contents]]

which means that the 9 descendants of ReadWriteStream should be counted as
using #==, but the snippet counts them as not using it.  We also have an
inconsistency here:  I would expect ReadStream to have a similar definition
of #=, and it doesn't.  I suspect that one day someone needed stream
equality but happened to be using ReadWriteStreams at the time and didn't
think about ReadStreams.  (This method was added in late 2001 by 'tk';
would 'tk' care to discuss it?)
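
A workspace sketch of what that definition implies, assuming an image in
which ReadWriteStream>>= reads as quoted above and ReadStream inherits #=
from Object.  The expected answers in the comments follow from those two
assumptions, not from a particular release.

```smalltalk
| a b |
a := ReadWriteStream on: String new.
b := ReadWriteStream on: String new.
a nextPutAll: 'abc'.
b nextPutAll: 'abc'.
a = b.      "true under the quoted definition: same position, same contents"
(ReadStream on: 'abc') = (ReadStream on: 'abc').
            "false if ReadStream inherits #= from Object"
```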

Some of these streams are connected to file system objects in such a way
that you expect there to be at most one object with any given state, and
in such a case using #== for #= is CORRECT.  Some of these streams contain
other streams, so their state depends on the state of those streams.
TranscriptStream is connected with graphics:  the association is actually
maintained outside the TranscriptStream object itself, but the "physical"
uniqueness of graphical objects applies to transcript streams.  This is a
useful reminder that the state required to decide on equality might not be
inside the object itself; it might be held in some other object.

Anyway, the point here is that on my criteria, most of these classes
(with the exception of ReadStream) have the right definition of equality.

	For 3.6 it evaluates to 1136 out of 1338 classes. The majority of them 
	is ignoring your "platonic" equality.

Please drop this word "platonic".  The notion of equivalence we are
discussing has nothing to do with that evil old proto-totalitarian.
My concerns are strictly pragmatic:  what is a USEFUL way to think about
equivalence of Sets?

	Is Squeak grossly constructed 
	falsely regarding equality, had it just been forgotten to do it right 
	or is this an unsolvable problem?

No, what it really means is that your code snippet doesn't actually tell
us anything useful.  I do *NOT* say that every concrete class should have
a definition for #= that differs from #== .  In particular,

    - abstract classes should have whatever definition is most useful
      for their descendants, but it _could_ correctly be anything.

    - no-instance classes can use anything, but #== is fine.

    - singleton classes should use #==.

    - classes where each state has at most one object holding it
      should use #==.  This includes classes like Character and SmallInteger.
      It also includes objects which are connected to or simulate the
      physical world in such a way that there cannot (or should not) be two
      objects with the same state.

      The XML Document Object Model provides another example where each
      object has a unique state.  Even a piece of text is locked tightly
      into a specific location; it is quite impossible for two DOM objects
      to have the same state, so #= *should* be #== for DOM objects.

    - Objects that represent "mathematical" values (such as Numbers, Points,
      Rectangles, Colors, ColorMaps, Arrays, ...) should implement #= to
      test the state rather than the identity.

    - it is at least inconsistent that Point and Rectangle DO define #=
      "mathematically" but Line does not.  Consistency is better than
      inconsistency.

    - it is inconsistent for some kinds of Collection to define #=
      "mathematically" but others not.  (For example, whatever you think
      Bag and Set should do, it is hard to think of a good argument for
      having them do *different* things.)

    - There are classes where there is a clear notion of "equivalence" but
      it is difficult or impossible to implement.  Determining that two
      blocks represent equal functions is, for example, provably impossible.
      In such a case, it would make sense to first determine whether two
      objects couldn't possibly be equal (they aren't both blocks, or don't
      have the same arity) and return false, calling #shouldNotImplement
      only when the "impossible" test must be carried out.  AbstractSound
      may belong in this group (function from time to sound?).

    - ANSI Smalltalk requires that (x = y) have the same value as (y = x).
      This requires some care in design.

    - system private classes do not have to follow public protocols.
      (Squeak doesn't really have a way to mark system private classes yet,
      but if I don't understand it, I treat it as a system private class,
      and that takes care of most of those thousands of classes.)
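
On the symmetry point in the list above, one common way to keep (x = y) and
(y = x) in agreement is to test the class on both sides before looking at
state, and to define #hash from the same state.  A minimal sketch follows
for a hypothetical class Pair with instance variables first and second;
the class and its accessors are invented for illustration.

```smalltalk
"Pair is hypothetical; assume accessors #first and #second."
Pair>>

= other
  ^self class == other class and: [
   first = other first and: [second = other second]]

hash
  ^first hash bitXor: second hash
```

Because the class test reads the same from either side, a Pair can never be
equal to a non-Pair in one direction only.  The price is that an instance of
a subclass is never equal to a plain Pair; whether that is right is exactly
the kind of design question a class author has to settle.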

	Or does it simply show what the 
	average Smalltalk programmer intuitively thinks about equality?

It certainly doesn't do that.  Squeak wasn't built by average programmers,
not even by average Smalltalk programmers.

	In that case the current mixed view of equality (state or
	identity for #=) in Smalltalk may fit to the language Smalltalk.

I have no idea what this is supposed to mean.
The current state of affairs in Squeak is that

    - many classes *CORRECTLY* test for equivalence of state quickly
      using #==

    - many classes would be system private classes if Squeak had such
      things (too bad about 3.3) so most of us have no business caring
      what they do about #=

    - many classes were obviously thrown together with insufficient care
      (just look at the documentation ...)

    - you don't have to look very far before you find inconsistencies
      (like Bag inheriting incompatible definitions of #= and #hash)

	And that means that Squeak Sets very well could be implemented
	as they are in other Smalltalks.

No, it doesn't mean that.  In fact, if you look at the collection classes
in Squeak and compare them with other Smalltalks, you soon realise that
Squeak has pretty wide coverage (the other Smalltalk I analysed for this
message does not have PluggableSet, for example, nor does it have the
#noneSatisfy: method) AND Squeakers have fixed quite a lot of bugs.

Let me offer you an example from the ANSI Smalltalk standard.
It's an operation that Squeak 3.6#5429 doesn't happen to implement.

<sequencedReadableCollection>
5.7.8.19 Message:    from: start to: stop keysAndValuesDo: operation

Synopsis
    For those elements of the receiver between positions start and stop,
    inclusive, evaluate operation with an element of the receiver as the
    first argument and the element's position (index) as the second.

Definition <sequencedReadableCollection>
    For each index in the range start to stop, the operation is evaluated
    with the index as the first argument and the element of that index as
    the second argument.

The synopsis calls for
    start to: stop do: [:i | operation value: (self at: i) value: i].
The definition calls for                      *********************
    start to: stop do: [:i | operation value: i value: (self at: i)].

You can't do both.  This is inconsistent.  The "Smalltalk other than Squeak"
that I analysed above has rightly ignored the synopsis, ending up with a
definition which is consistent with #keysAndValuesDo:.  You have to use a
bit of common sense and head for consistency in cases of doubt, and it
would be hard to persuade me that having Array>>= depend on the state but
Set>>= ignore it can be described as anything other than inconsistent.
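
Following the Definition rather than the Synopsis, a minimal sketch of the
missing method might look like this.  Placing it in SequenceableCollection
is an assumption about where it would live in Squeak.

```smalltalk
SequenceableCollection>>

from: start to: stop keysAndValuesDo: operation
  "Index first, element second, as in #keysAndValuesDo:."
  start to: stop do: [:i | operation value: i value: (self at: i)]
```

With that in place, a caller sees the same argument order as with
#keysAndValuesDo:, for example:

```smalltalk
#(10 20 30 40) from: 2 to: 3 keysAndValuesDo: [:i :e |
    Transcript show: i printString, ' -> ', e printString; cr].
```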