[BUG]Collection>>removeAll:

Thu Aug 29 02:16:33 UTC 2002

"Richard A. O'Keefe" <ok at cs.otago.ac.nz> wrote:
> "David Griswold" <David.Griswold at acm.org> wrote:
> 	I suspect that some (Richard at least), are feeling
> like myself and
> 	other nay-sayers are simply being obstructionist about
> a simple issue.
>
> There are two quite distinct "negative" reactions:
> (A) Don't do anything *yet* until we understand what to do.
> (B) Don't do anything, leave it the way it is.
>
> I've been seeing things that I can only understand as (B).

What you are seeing is (C) Don't change Collection semantics unless a
very good reason is shown.  No one is just flatly saying "Don't do
anything, leave it the way it is".  I just don't think suggestions that
we change the semantics of #removeAll: have met the "good reason" test.

> 	But there is a very good reason for the kind of
> critical discussion that
> 	has been going on, and it is actually a good thing.
>
> Rational/critical debate is always a good thing.
>
> "If it was good enough for granpaw, it's good enough for me"
> actually makes sense quite often, but ultimately it won't do.
>
> 	We are not talking about JoeBlowCornerOfTheSystem,
> 	but the *Collection* class and protocol, arguably as
> close as you can
> 	get to the heart of Smalltalk.
>
> Yes, that's why it is so scary that the Collection classes
> are so amazingly
> fragile when you poke them.
>
> 	This class and protocol have withstood an enormous amount of use
> 	in the real world,
>
> (1) That is, "it was good enough for granpaw".
> (2) About Collection and its protocols, this isn't quite as true as it
>     seems, since Squeak has a large number of collections that are
>     not in the coloured books and don't seem to be in the commercial
>     Smalltalks whose manuals I've been able to check.  Many of them
>     seem to break assumptions in the older classes.  Squeak also has
>     a large number of additional methods.
>
> 	and even a very slight
> 	change needs to be really, really thought about.
>
> Er, why should *THIS* change be subject to that stricture when so many
> *other* changes have not been?  I mean, seriously?
> 	I jumped into the conversation only because it seemed to me that
> 	the necessary "due diligence" wasn't being done.
>
> But why for *THIS* change, when so many changes have been made to the
> Collection classes without anywhere near this much debate?
> Why not this
> level of resistance to all the other things that have gone into the
> Collection classes?

The other changes should have been subjected to this kind of discussion
too.  Just because slapdash programming exists does not justify more of
it.

> Let no-one mistake me.
> I am in full agreement with careful design of core classes.
> I am in full agreement that _every_ change requires design.
>
> HOWEVER, I do not think that a correction has to be held up to
> extremely stringent standards (that no other part of Squeak
> could meet)
> in order to be useful or acceptable.
>
> 	I've made my points, but to be constructive and try to
> move forward,
> 	here is how I would reason about this problem and what
> to do about it.
> 	I'm laying it out in painful detail to show the kind of
> restraint I
> 	think is needed when dealing with important code like this.
>
> 	- it is clear that people do try to write (x removeAll:
> x), and that it
> 	doesn't do what they expect.  This seems to happen
> frequently enough
> 	that it would be desirable to at least let people know when this
> 	happens, so they can write their code a different way.
> The currently
> 	accepted valid ways to empty a collection are to either
> create a new
> 	collection (usually the fastest way) or say (x
> removeAll: x copy).
>
> Well, no.  There isn't any *polymorphic* way to create a new empty
> collection, because for some very strange reason, there is no
> *general*
> #copyEmpty method in Collection.  (I have suitable implementations for
> #copyEmpty sitting in a change set waiting for resolution of
> the Bag issue.
> Maybe I should split that out.)
>
> And it really is very much an exaggeration to say that
> "x removeAll: x copy" is an "accepted way" to empty a collection.
> It is a way.  It is a good way.  But practically nobody knows
> about it.
> It's what people do when they burn their fingers on "x removeAll: x".
>
> 	- If we come up with a way to detect this case and
> generate an error,
> 	then the only remaining rationale for extending
> removeAll: to support
> 	the special case of the receiver as an argument would
> be if it was
> 	required by ANSI, or made the library or its
> specification simpler and
> 	cleaner, or fixed an actual bug.  I believe that none
> of these has been
> 	demonstrated to be true.
>
> Oh COME ON.  I have quoted the ANSI standard at length to demonstrate
> that it *does* require the argument==receiver case to be supported.
> I have also explained repeatedly not only _that_ the change makes the
> specification simpler and cleaner, but _how_ it makes the
> specification
> simpler and cleaner.
>
> Rational/critical debate at least requires a rebuttal of these points,
> not a flat denial that no demonstration was ever made.

I've been trying to have a rational debate with you, but you seem to
have a knack for treating anyone who doesn't agree with you as a dummy
(e.g. "Oh COME ON"), which makes it pretty trying.  I'll try to spell it
out for you, but this is the last time I will try to explain to you why
some people who aren't dummies don't agree with you about the ANSI
standard:

I don't think the ANSI standard says what you think it says.  Just
because it doesn't explicitly say that the receiver isn't allowed
doesn't mean that the receiver *must* be allowed.  I made a clear
argument that all programs in Smalltalk (or in any imperative language
for that matter) have a universal, *implied* constraint that you
shouldn't change objects that are being used by something that assumes
that they remain the same.  This language-wide constraint clearly
excludes (x removeAll: x), and the specification for removeAll: doesn't
override that.  Note that my argument about this implied constraint is
not that the libraries have been designed that way, or that that's the
way I think thing were meant etc., I'm saying this implied constraint is
required in Smalltalk; you just can't get around it.  It isn't stated
explicitly because it isn't a design choice- there is simply no other
reasonable general alternative in an imperative language like this, and
programmers are expected to know it.  Maybe it should be stated in the
spec, but regardless of whether it is stated or not, this constraint is
in force for the ANSI spec, because it *must* be.  Programmers should
read this additional constraint into all ANSI specifications, so that
removeAll: has the implied language "the argument can be any collection
that is invariant during the execution of this method".  I don't *like*
that, but it is necessary.

Now, given the above view, changing removeAll: to work with the receiver
as the argument is not a simplification, it is can be viewed as added
complexity, because it adds a special exception to removeAll: (that is
not there for the many other methods like this), that effectively says
"for this message, you have the additional capability of passing in
collections that are not invariant under this operation".  That is not a
simplification of the specification, because it requires additional
generalization language, and makes removeAll: work differently than
addAll: etc.

Now, it appears you don't agree with this, but you can't say you haven't
been rebutted.

> 	So how would we detect the error?  Richard has argued
> for just putting
> 	the check in.  I say it belongs in an assertion, even
> though it is a
> 	very fast check, and here is why:  every time a
> programmer thinks of
> 	some invariant like this that they want to check, they
> make a little
> 	'cost' calculation in their head, and if the check is
> going to have much
> 	of a performance cost, they tend to not do the check.
>
> But how does this argue against an extremely fast check?
> And how does it argue against a check THAT IS FASTER THAN AN ASSERT?

I think assertions are the right way to go, but an assertion vs. a fast
check in this case is no big deal to me in this case (as long as the
check is not *required* by the message semantics), since the check is
trivial and removeAll: is much slower anyway.  The big deal is changing
the semantics of removeAll:.  If your fast check tries to handle this
case and doesn't cause the programmer to be notified, that is what I
think hasn't been justified.

> 	Using assertions automatically for all of these kinds of checks
> 	eases that difficult cost decision, which encourages people to
> 	sprinkle these checks all over their code (a Good Thing).
>
> Except that programmers who look up what #assert: does
> discover that it
> is slow (block creation + message send).  (The 3.2 browser makes it
> very easy to check this.  Type an assert: into some method,
> and use the
> rightmost button to select "bytecodes" instead of "source".)
>
> Lest I be misunderstood, I would be *delighted* if the compiler made
> a special case of "self assert: -explicit block-" like it does for
> ifTrue:.  Then David Griswold's argument *would* be valid and
> I *would*
> be reasonably happy with "self assert: [aCollection ~~ self]."
>
> 	Just because this particular check happens to be very fast does
> 	not make it a special case.
>
> The logic of this is "just because this is a special case doesn't mean
> it is a special case."  He has just argued that programmers
> will put in
> assert: because it is faster than a test, but is now arguing that
> programmers will NOT put in a test that is (obviously) faster than an
> assert:.  What kind of mindless animals are these programmers?
>
> 	So I would recommend the following fixes:
>
> 	1) Get the assertion mechanism working effectively and
> efficiently in
> 	the standard Squeak release.  Make sure its cost in
> development is
> 	minimal, and its cost in production, zero.
>
> YES.  YES.  YES.
>
> 	2) Just add "self assert: [ otherCollection ~~ self ]."
> to the beginning
> 	of removeAll:.
>
> Except that as someone else has argued, it doesn't actually
> solve the problem.
>
> 	Very, very simple. (a Good Thing).  It prevents this
> whole issue from
> 	being a problem, with no changing of the basic
> Collection interface or
> 	implementation required, and with no performance impact
> in production
> 	(Good Things).
>
> No, it does not solve the problem.
> The problem is that
>  - #removeAll: fails to conform to the ANSI specification
>  - #removeAll: is hard to *think* about because of an
> arbitrary restriction
>  - the restriction has nothing to do with the semantics of
> the operation,
>    it is solely a consequence of a naive implementation.

As I have said repeatedly, I don't think any of your alleged three
problems are actually real problems.  I think #removeAll: does conform
to the ANSI spec, I think it is not an arbitrary restriction, but a
common and universal one, I think the restriction has a lot to do with
the semantics of the operation, and I don't think it is a consequence of
a naive implementation- I think the implementation does the correct
thing.

> All it does is ensure that in likely cases involving existing
> collection
> classes the bug will not go unreported.
>
> Make no mistake, this suggestion would be a very definite improvement
> over the present situation.  Bugs that report themselves are
> better than
> bugs that quietly give wrong answers.
>
> But it is nothing more than a bandaid.  The RIGHT answer is
> to design the
> interface so that it has as few exceptions as reasonably
> possible and to
> write the code so that it implements a clean specification.
>
> This applies throughout the collection classes, which are
> badly in need
> of an overhaul, not least some good documentation.
>
> 	Now, to the second, separate matter: is a new
> #removeAll method needed?
>
> If (x removeAll: x) is not allowed to work, then yes.
> There isn't any other polymorphic way to achieve what it does.
>
> 	I can see only one good reason for a #removeAll
> message.  That is that a
> 	large, tenured collection might need it as a hook to
> implement a more
> 	efficient way of being emptied.
>
> More efficient than _what_?  What has "large" or "tenured"
> got to do with it?
>
> The main reason that I can think of is Smalltalk's dependency
> mechanism.
> If I have a changing collection which is being "watched" by
> some receiver
> (quite a reasonable thing to do) then creating a new empty collection
> instead of emptying the old one would be the wrong thing to do.
>
> 	Another argument that might be made is
> 	that people need a generally easy way of removing all
> of a collection's
> 	elements, but I don't buy that because most of the
> time, they should
> 	just create a new empty collection, which is usually
> faster, so we don't
> 	want to particularly encourage doing #removeAll anyway,
> since it is
> 	usually the slowest way to get the job done.
>
> Speed isn't my issue.  Preserving identity so that the
> Observer pattern
> continues to work is my issue.
>
> 	However, the large collection argument is a good one, I think.
>
> 	The question then becomes, where does this method belong?  One
> 	could argue that code dealing with such large
> collections almost always
> 	knows a lot more about the collection than just that it
> is a collection
> 	that supports remove:, so adding efficient removeAll
> methods to specific
> 	collection classes could be done without adding it to
> the Collection
> 	protocol itself.
>
> I believe Goran's suggestion was that it should be implemented in each
> class that introduces new structure.  So it belongs in Set and Bag and
> OrderedCollection and ...
>
> I have suggested that #removeAll: itself might profitably be
> removed from
> Collection on the grounds that the majority of descendants of
> Collection
> cannot support it.
>
> 	On the other hand, since removeAll is conceptually
> built using remove:,
> 	adding a default implementation to Collection doesn't
> break any subclass
> 	of collection, since it doesn't use any message other
> than do: and
> 	remove:.
>
> And most descendants of Collection do not and cannot support #remove:.
> If that isn't a "code smell", what is?

I personally agree that #remove:, add:, etc. don't belong in Collection
itself.  In Strongtalk I factored them out into the separate protocol
Extensible, which OrderedCollections, Bags, etc. support.  In
Strongtalk, the Collection protocol is the true minimal protocol that
all collections support.  But in all other Smalltalks I've seen, they
are in Collection, so that is a separate discussion.

> 	So basically, I think this argument comes down to
> weighing whether the
> 	"large collection" argument is good enough to overcome
> the drawback of
> 	breaking existing classes that implement Collection
> without inheriting
> 	from it.
>
> I'm sorry, just what is being proposed that would break
> existing classes?

Changing the meaning of removeAll:.  It breaks all existing classes that
implement the Collection protocol or reimplement #removeAll:.

> *LOTS* of methods have been added to Collection that are not
> supported by
> classes that used to be reasonably good Collection mimics.
>
> I myself tried to write a class (EnumerationWrapper) that
> looked a lot like
> Collection without inheriting from it, and it was a really painful job
> because Collection has a LOT of methods and many of them are
> difficult to
> impossible to put in other classes.

That is exactly why I think the Collection protocol should be minimal,
especially in a system without mixins (which would allow people to never
have to reimplement the whole protocol).  I think what you are saying
actually supports my point that Collection should only be changed rarely
and reluctantly.  If all those other methods weren't thrown in there,
you would have had a lot easier time of it, and your code wouldn't break
as often (as it will every time the Collection class semantics are
changed).

You asked me for examples of classes that would break.  If the classes
you wrote support remove, *your* code is a perfect example of the kind
of code that will break.

> 	I'm not drawing a conclusion here, just trying to delineate
> 	the crux of the issue.  I personally think removeAll
> would be a nice
> 	addition, but it would break some code out there, and
> it's not used that
> 	often (and shouldn't be), so I'm on the fence about it.
>
> Again, with the possible exception of code that deliberately causes a
> DoesNotUnderstand, how would _adding_ a method to Collection
> "break" any
> other class?  Can you give an actual example?

I just gave the example above: any other class that implements
Collection without inheriting from it, including your own class you
mentioned above.

> 	If such a method was added, in Collection it needs to
> be something
> 	really simple.  You can do this safely and generally
> without copying if
> 	you write
> 		removeAll
> 			[ self isEmpty ] whileFalse: [ self
> remove: self anElement ]
>
> which I have already proposed, except that instead of
> inventing a new method,
> I used the #anyOne method which is _already_ there in Collection.
> In fact, it's the very first method in the 'accessing' category.

Cool!  It is obviously convergent evolution; I hadn't noticed that
method- it does the same thing as my anElement.  Which makes the code
simpler.  If you already proposed this algorithm, good.

> 	Unfortunately, without an inlining compiler, anElement
> is not very
> 	efficient in Squeak right now, although with inlining
> it could be more
> 	efficient than copying.
>
> Returning one element could be more efficient than copying the whole
> collection?  Of course it could, but what does that tell us?
> It already
> is.  And no, removing _all_ the elements this way could take quadratic
> time (if, say, anyOne returned the last element of a sequence) and it
> certainly wouldn't be a good way to remove elements from a Heap.
> This is not to knock inlining, or native code generation either.
>
> 	So if there is any overall message here, it is: be very
> careful with
> 	code in classes like this, and show great restraint by
> making changes as
> 	minimal as possible.  The code in Collection is simple, but not
> 	simple-minded.  Don't underestimate the subtlety of why these
> 	simple-seeming methods are the way they are.
>
> Well no, the problem is precisely that #removeAll: and #addAll: *are*
> simple-minded.  The "set" operations #union:, #intersection:, and
> #difference: appear to be a little more confused than that.
>
> By and large, MOST of the code in the Collection classes is
> "The Simplest Thing That Could Possibly Work",
> except that sometimes it doesn't.

You may think it is simple-minded; I don't.  You may think the code in
Collection shouldn't place such a premium on simplicity; I do.  That
doesn't make *me* simple-minded.

Cheers,
Dave