[BUG]Collection>>removeAll:

Richard A. O'Keefe ok at cs.otago.ac.nz
Wed Aug 28 06:35:38 UTC 2002


"David Griswold" <David.Griswold at acm.org> wrote:
	I suspect that some (Richard at least), are feeling like myself and
	other nay-sayers are simply being obstructionist about a simple issue.

There are two quite distinct "negative" reactions:
(A) Don't do anything *yet* until we understand what to do.
(B) Don't do anything, leave it the way it is.

I've been seeing things that I can only understand as (B).

	But there is a very good reason for the kind of critical discussion that
	has been going on, and it is actually a good thing.

Rational/critical debate is always a good thing.

"If it was good enough for granpaw, it's good enough for me"
actually makes sense quite often, but ultimately it won't do.

	We are not talking about JoeBlowCornerOfTheSystem,
	but the *Collection* class and protocol, arguably as close as you can
	get to the heart of Smalltalk.

Yes, that's why it is so scary that the Collection classes are so amazingly
fragile when you poke them.

	This class and protocol have withstood an enormous amount of use
	in the real world, 

(1) That is, "it was good enough for granpaw".
(2) About Collection and its protocols, this isn't quite as true as it
    seems, since Squeak has a large number of collections that are
    not in the coloured books and don't seem to be in the commercial
    Smalltalks whose manuals I've been able to check.  Many of them
    seem to break assumptions in the older classes.  Squeak also has
    a large number of additional methods.

	and even a very slight
	change needs to be really, really thought about.

Er, why should *THIS* change be subject to that stricture when so many
*other* changes have not been?  I mean, seriously?

	I jumped into the conversation only because it seemed to me that
	the necessary "due diligence" wasn't being done.
	
But why for *THIS* change, when so many changes have been made to the
Collection classes without anywhere near this much debate?  Why not this
level of resistance to all the other things that have gone into the
Collection classes?

Let no-one mistake me.
I am in full agreement with careful design of core classes.
I am in full agreement that _every_ change requires design.

HOWEVER, I do not think that a correction has to be held up to
extremely stringent standards (that no other part of Squeak could meet)
in order to be useful or acceptable.

	I've made my points, but to be constructive and try to move forward,
	here is how I would reason about this problem and what to do about it.
	I'm laying it out in painful detail to show the kind of restraint I
	think is needed when dealing with important code like this.
	
	- it is clear that people do try to write (x removeAll: x), and that it
	doesn't do what they expect.  This seems to happen frequently enough
	that it would be desirable to at least let people know when this
	happens, so they can write their code a different way.  The currently
	accepted valid ways to empty a collection are to either create a new
	collection (usually the fastest way) or say (x removeAll: x copy).
	
Well, no.  There isn't any *polymorphic* way to create a new empty
collection, because for some very strange reason, there is no *general*
#copyEmpty method in Collection.  (I have suitable implementations for
#copyEmpty sitting in a change set waiting for resolution of the Bag issue.
Maybe I should split that out.)

And it really is very much an exaggeration to say that
"x removeAll: x copy" is an "accepted way" to empty a collection.
It is a way.  It is a good way.  But practically nobody knows about it.
It's what people do when they burn their fingers on "x removeAll: x".

	- If we come up with a way to detect this case and generate an error,
	then the only remaining rationale for extending removeAll: to support
	the special case of the receiver as an argument would be if it was
	required by ANSI, or made the library or its specification simpler and
	cleaner, or fixed an actual bug.  I believe that none of these has been
	demonstrated to be true.
	
Oh COME ON.  I have quoted the ANSI standard at length to demonstrate
that it *does* require the argument==receiver case to be supported.

I have also explained repeatedly not only _that_ the change makes the
specification simpler and cleaner, but _how_ it makes the specification
simpler and cleaner.

Rational/critical debate at least requires a rebuttal of these points,
not a flat denial that no demonstration was ever made.

	So how would we detect the error?  Richard has argued for just putting
	the check in.  I say it belongs in an assertion, even though it is a
	very fast check, and here is why:  every time a programmer thinks of
	some invariant like this that they want to check, they make a little
	'cost' calculation in their head, and if the check is going to have much
	of a performance cost, they tend to not do the check.

But how does this argue against an extremely fast check?
And how does it argue against a check THAT IS FASTER THAN AN ASSERT?

	Using assertions automatically for all of these kinds of checks
	eases that difficult cost decision, which encourages people to
	sprinkle these checks all over their code (a Good Thing).

Except that programmers who look up what #assert: does discover that it
is slow (block creation + message send).  (The 3.2 browser makes it
very easy to check this.  Type an assert: into some method, and use the
rightmost button to select "bytecodes" instead of "source".)

Lest I be misunderstood, I would be *delighted* if the compiler made
a special case of "self assert: -explicit block-" like it does for
ifTrue:.  Then David Griswold's argument *would* be valid and I *would*
be reasonably happy with "self assert: [aCollection ~~ self]."

	Just because this particular check happens to be very fast does
	not make it a special case.

The logic of this is "just because this is a special case doesn't mean
it is a special case."  He has just argued that programmers will put in
assert: because it is faster than a test, but is now arguing that
programmers will NOT put in a test that is (obviously) faster than an
assert:.  What kind of mindless animals are these programmers?

	So I would recommend the following fixes:
	
	1) Get the assertion mechanism working effectively and efficiently in
	the standard Squeak release.  Make sure its cost in development is
	minimal, and its cost in production, zero.
	
YES.  YES.  YES.

	2) Just add "self assert: [ otherCollection ~~ self ]." to the beginning
	of removeAll:.
	
Except that as someone else has argued, it doesn't actually solve the problem.

	Very, very simple. (a Good Thing).  It prevents this whole issue from
	being a problem, with no changing of the basic Collection interface or
	implementation required, and with no performance impact in production
	(Good Things).

No, it does not solve the problem.
The problem is that
 - #removeAll: fails to conform to the ANSI specification
 - #removeAll: is hard to *think* about because of an arbitrary restriction
 - the restriction has nothing to do with the semantics of the operation,
   it is solely a consequence of a naive implementation.

All it does is ensure that in likely cases involving existing collection
classes the bug will not go unreported.

Make no mistake, this suggestion would be a very definite improvement
over the present situation.  Bugs that report themselves are better than
bugs that quietly give wrong answers.

But it is nothing more than a bandaid.  The RIGHT answer is to design the
interface so that it has as few exceptions as reasonably possible and to
write the code so that it implements a clean specification.

This applies throughout the collection classes, which are badly in need
of an overhaul, not least some good documentation.

	Now, to the second, separate matter: is a new #removeAll method needed?
	
If (x removeAll: x) is not allowed to work, then yes.
There isn't any other polymorphic way to achieve what it does.

	I can see only one good reason for a #removeAll message.  That is that a
	large, tenured collection might need it as a hook to implement a more
	efficient way of being emptied.

More efficient than _what_?  What has "large" or "tenured" got to do with it?

The main reason that I can think of is Smalltalk's dependency mechanism.
If I have a changing collection which is being "watched" by some receiver
(quite a reasonable thing to do) then creating a new empty collection
instead of emptying the old one would be the wrong thing to do.
	
	Another argument that might be made is
	that people need a generally easy way of removing all of a collection's
	elements, but I don't buy that because most of the time, they should
	just create a new empty collection, which is usually faster, so we don't
	want to particularly encourage doing #removeAll anyway, since it is
	usually the slowest way to get the job done.
	
Speed isn't my issue.  Preserving identity so that the Observer pattern
continues to work is my issue.

	However, the large collection argument is a good one, I think.

	The question then becomes, where does this method belong?  One
	could argue that code dealing with such large collections almost always
	knows a lot more about the collection than just that it is a collection
	that supports remove:, so adding efficient removeAll methods to specific
	collection classes could be done without adding it to the Collection
	protocol itself.

I believe Goran's suggestion was that it should be implemented in each
class that introduces new structure.  So it belongs in Set and Bag and
OrderedCollection and ...  

I have suggested that #removeAll: itself might profitably be removed from
Collection on the grounds that the majority of descendants of Collection
cannot support it.

	On the other hand, since removeAll is conceptually built using remove:,
	adding a default implementation to Collection doesn't break any subclass
	of collection, since it doesn't use any message other than do: and
	remove:.

And most descendants of Collection do not and cannot support #remove:.
If that isn't a "code smell", what is?

	So basically, I think this argument comes down to weighing whether the
	"large collection" argument is good enough to overcome the drawback of
	breaking existing classes that implement Collection without inheriting
	from it.

I'm sorry, just what is being proposed that would break existing classes?

*LOTS* of methods have been added to Collection that are not supported by
classes that used to be reasonably good Collection mimics.

I myself tried to write a class (EnumerationWrapper) that looked a lot like
Collection without inheriting from it, and it was a really painful job
because Collection has a LOT of methods and many of them are difficult to
impossible to put in other classes.

	I'm not drawing a conclusion here, just trying to delineate
	the crux of the issue.  I personally think removeAll would be a nice
	addition, but it would break some code out there, and it's not used that
	often (and shouldn't be), so I'm on the fence about it.
	
Again, with the possible exception of code that deliberately causes a
DoesNotUnderstand, how would _adding_ a method to Collection "break" any
other class?  Can you give an actual example?

	If such a method was added, in Collection it needs to be something
	really simple.  You can do this safely and generally without copying if
	you write
		removeAll
			[ self isEmpty ] whileFalse: [ self remove: self anElement ]
	
which I have already proposed, except that instead of inventing a new method,
I used the #anyOne method which is _already_ there in Collection.
In fact, it's the very first method in the 'accessing' category.

	Unfortunately, without an inlining compiler, anElement is not very
	efficient in Squeak right now, although with inlining it could be more
	efficient than copying.

Returning one element could be more efficient than copying the whole
collection?  Of course it could, but what does that tell us?  It already
is.  And no, removing _all_ the elements this way could take quadratic
time (if, say, anyOne returned the last element of a sequence) and it
certainly wouldn't be a good way to remove elements from a Heap.

This is not to knock inlining, or native code generation either.

	So if there is any overall message here, it is: be very careful with
	code in classes like this, and show great restraint by making changes as
	minimal as possible.  The code in Collection is simple, but not
	simple-minded.  Don't underestimate the subtlety of why these
	simple-seeming methods are the way they are.
	
Well no, the problem is precisely that #removeAll: and #addAll: *are*
simple-minded.  The "set" operations #union:, #intersection:, and
#difference: appear to be a little more confused than that.

By and large, MOST of the code in the Collection classes is
"The Simplest Thing That Could Possibly Work",
except that sometimes it doesn't.




More information about the Squeak-dev mailing list