[BUG]Collection>>removeAll:

Richard A. O'Keefe ok at cs.otago.ac.nz
Fri Aug 23 01:08:39 UTC 2002


    [I wrote that I believe it is safe to make #removeAll: and #addAll:
     sometimes return a copy of their argument because it is implausible
     in the extreme that the identity of the result matters, whereas the
     state being right could plausibly matter very much.  I provided
     examples to demonstrate my point.]

Martin Wirblat <mw.projanynet at SoftHome.net> wrote:
	The assumption that the identity of the collection does not matter,
	and only the elements in it are important, is wrong.
	
I have provided ***EVIDENCE*** that hardly anybody cares about the result
AT ALL.  I checked every use of these methods in the system.  I've checked
some other Smalltalk code.  I have checked Smalltalk textbooks and manuals,
and I checked the 1.9 draft ANSI standard which says FOR THESE METHODS
that the result is completely unspecified, so that portable Smalltalk code
(and some does exist) cannot rely on the result AT ALL.

	I think in this case it is more or less the other way round.
	
	1) the main properties of an object are its identity and behaviour,
	   instance variables are encapsulated, and only in well defined cases
	   they are accessable from the outside. Collections are clearly objects
	   where instances should be accessable, but this does not mean that
	   their identity is secondary.
	
This is a completely general point, so completely misses the mark when we
are talking about methods that have no defined result at all in ANSI Smalltalk.

It also misses the mark concerning these methods.  The point is, why are
THESE methods returning something?  What USE could the result be?  To rebut
my claim about THESE methods, we need an argument that bears on the specifics
of how THESE methods are used/useful.

In particular, Collections are in practice *used* like values;
we almost always care more about the contents of a collection than
its identity.  (For example, these patches are the only serious use I have
ever made of aCollection == anotherCollection since I started Smalltalking.)
(I do not say we don't care at all about identity; if I make a series of
changes to a collection, I care which collection is affected.  But I care
more about what the final result is.)

	2) many Smalltalk methods for basic access in ALL dialects return the
	   argument ( at:put: atAllPut: add: remove: addAll: removeAll: etc )
	
In ANSI Smalltalk, these methods have NO DEFINED RESULT.  So this argument
is completely irrelevant.  I took the trouble to point out that in Gemstone
these methods DO return a defined result, but it is NOT the argument, it is
the receiver.  

But this argument also fails on another ground.
We can express the current state of affairs in two ways:
Batch update methods in Squeak that are given a batch argument
(1) return the batch argument
(2) return a collection of the elements that were added or removed.

It is totally useless to point to (1) and say "this is what we must preserve"
because EXACTLY THE SAME SET OF FACTS is currently covered by (2).  I can
with equal justice point to (2) and say "this is what we must preserve".

3) this way one can either choose to send the next message in a cascade
	   to the receiver object -> a add: b; nextMessage
	   or to the argument -> ( a add: b ) nextMessage.

	   I think this was the main intention for the existing
	definition, and not to answer the removed elements
	
Well, we have people reading this who can tell us for sure what the
intention was, and I hope one of them will.  The plain fact of the matter
is that Smalltalk systems vary enough in what they return that the ANSI
standard had to make the result unspecified, so portable code can't do this
anyway, whatever the original intention was.

The key point, of course, is that in the case
	(x removeAll: x)
it doesn't matter whether you do
	x removeAll: x; nextMessage
or	(x removeAll: x) nextMessage
because if you haven't installed my fix, BOTH calls give you wrong answers.


	Answering a new collection in special cases and the same collection
	in most cases would
	
	1) potentially break tons of existing programs, and introduce hard to 
	find errors 
	
This is an assertion without any grounds whatsoever; it is merely speculation.
I could speculate that there are "tons of existing programs" that rely on
the _state_ of the result and are currently broken because the result state
is wrong.  (I don't speculate this, but I could.)

What's more, it's demonstrably wrong speculation.

a) In every case that previously worked, the result is exactly what it
   always was.  Therefore, any program that is broken AFTER the fix was
   broken BEFORE it.

b) There is ALREADY a hard to find error, and it involves exactly the
   same case.  Any program that could possibly run into the problem
   already contains a hard to find error.

c) Smalltalk textbooks and manuals do not tell you what the result is,
   so any program that was relying on it was already relying on an
   implementation accident which is NOT something that portable code can
   rely on.  Remember, Gemstone returns the receiver, NOT the argument,
   and ANSI Smalltalk makes the result completely unspecified.

Concerning implementation accident.
Start with this:

    removeAll: aCollection
	aCollection do: [:each | self remove: each]

This always returns the receiver.  Notice with a start that #do: always
returns its receiver.  Realise that adding just one character will give
you the collection of things removed, which might be useful.  Add it.

    removeAll: aCollection
	^aCollection do: [:each | self remove: each]

Now you have something which was *MEANT* to return a collection of things
removed, but in one case doesn't because the state is wrong.  We would
have the *same* code whether the result *identity* was intended by the
designer or the result *state*.

As noted in earlier mail by someone else, some Smalltalks have exactly
this code.  Final step:  decide that it's pretty ugly relying on the
result of #do:  One reason is that there are #do: methods in Squeak
that *don't* return their receiver.  (WonderlandActor>>do: is just one
of them, but not the only one.)  Does anyone think it is a good idea to
have _quite_ as many #do: methods that don't follow the "standard" #do:
protocol?  Any rate, fix this ugliness by being explicit at the result:

    removeAll: aCollection
	aCollection do: [:each | self remove: each]
	^aCollection

Presto chango, we have a method that returns a particular result without
anyone ever *intending* it to be one thing or the other.  The fact that
Gemstone returns the receiver suggests that they may have stuck with the
simplest code.

	2) make the protocol of basic access methods much more difficult
	
	The more general rule is that the identity of the returned object
	should in all cases be either a different one or the same as the
	receiver/argument and at its best the message selector should give
	the programmer a clue if the answer is a copy or the orignal.
	
Where is this rule stated?

Let me offer you a counter-example:

	1 + 0		returns a result identical to the receiver.
	1 + 1		does not
	
The problem with general rules is that they sometimes fail to apply sanely
to particular cases.  In the words of the proverb,
    "The exception probes the rule."
In this case, we have these alternatives:

A.  Retain the status quo.  Tolerate methods in core classes that do not
    work as advertised and quietly give incorrect results even though
    correct results could be returned.

B.  Retain the status quo.  Document the restriction clearly, file it
    in the bottom drawer of the assistant undermanager of the
    Sirius Rag-and-Bone Men's Friendly Association's Association,
    and blame any programmer who runs into the problem for not reading
    the documentation.  (More fools them for reading several textbooks
    and not seeing any warnings...)
    Something very similar to this has been proposed, but not by me
    or Martin Wirblat.

C.  Make the code notify you of an error when it is about to violate
    its contract:  (self assert: [aCollection ~~ self].)
    I've suggested that as tolerable.

D.  Correct the methods to have the right effect on the receiver.
    Return the argument, even though the argument's state has now changed
    in a way that is practically certain to break code that tries to use it.
    (Martin Wirblat's suggestion, if I've understood him.)

E.  Correct the methods to have the right effect on the receiver.
    Always return a copy of the argument, and document this.
    (I _think_ Martin Wirblat would tolerate this.)
    This will break code that relied on the identity of the result,
    should any such (non-portable) code exist.

F.  Correct the methods to have the right effect on the receiver.
    Ensure that in cases that previously worked, the same result is returned,
    so that code that relies on the identity of the result continues to work
    in all the cases where it worked and continues to be broken in the
    remaining case.  (Should there _be_ any such code, which there is
    reason to doubt.)
    Ensure that in the previously broken case, a result with the correct
    state is returned, so that code that relies on the state of the
    result continues to work in all the cases where it owrked and now
    works in ALL meaningful cases.  (Should there _be_ any such code,
    which there is reason to doubt.)

G.  Correct the methods to have the right effect on the receiver.
    Change the result to be the receiver.

As a maintenance step, A is clearly intolerable, B is clearly intolerable,
C would do, and F is the change that does the least harm (no code could
be broken unless it was already broken in the same place).

Maintenance should lead to refactoring.  What these (and other similar)
methods need is not so much fixing as redesign.

Martin Wirblat is RIGHT that a method that sometimes returns the same
object it was passed and sometimes returns a copy is harder to understand
that always returns one or always returns the other.

I do not think he has sufficiently considered the fact that a method
which sometimes returns the object it was passed IN THE SAME STATE and
sometimes returns it in a radically different state is also hard to
understand.

We are left with three simple designs:

    E: always return a copy.  This seems a pity, because the result can
       never be used by portable code, and isn't used by any code I've
       had my hands on.

    G: always return the receiver.  It worked for Gemstone.  It's the
       result I've always wanted.  It's about as simple and cheap as
       you can get.  Code that formerly depended on the result can easily
       be changed:
	(x removeAll: y) nextMessage.
       ==>
	temp := y. x removeAll: temp.  temp nextMessage.

    H: always return some other result, such as nil, so that people are
       immediately alerted if they try to use the result.

The excess cost of E makes it a poor choice.

G or H would both be compatible with ANSI Smalltalk and would both be
sensible designs.  G would be more typical of Smalltalk in general,
but H would make it easier for people to adjust their code (should there
_be_ any code which looks at the result, other than code contrived for
this discussion).




More information about the Squeak-dev mailing list