[BUG]Collection>>removeAll:

Wed Aug 28 10:12:18 UTC 2002

Hi all!

(Darn, just can't keep my fingers off this thread... :-)

Just to make sure though - not anything I say below is meant to be
offensive in any way. If something sounds annoying then I apologize and
blame the fact that english is not my native language.

"David Griswold" <David.Griswold at acm.org> wrote:
> To try to move this issue forward, and keep it in the realm of
> constructive discussion, I'm going to try to clarify what is going on,
> and put forward a proposal.
> 
> I suspect that some (Richard at least), are feeling like myself and
> other nay-sayers are simply being obstructionist about a simple issue.
> But there is a very good reason for the kind of critical discussion that
> has been going on, and it is actually a good thing.  The closer a change
> comes to the core of a system, the more it should be challenged, to make
> sure it is worthy.  We are not talking about JoeBlowCornerOfTheSystem,
> but the *Collection* class and protocol, arguably as close as you can
> get to the heart of Smalltalk.  This class and protocol have withstood
> an enormous amount of use in the real world, and even a very slight
> change needs to be really, really thought about.  I jumped into the

Yep, but sometimes you really wonder... Some of the bugs you can find in
there (mostly stuff being inherited but not really working well for
subclasses) feels quite elementary so I am not sure these classes "have
stood the test of time", I rather think it has to do with people being
"afraid" to fix them. I wonder what the original authors would say, who
wrote these classes in the beginning, was it Dan?

> conversation only because it seemed to me that the necessary "due
> diligence" wasn't being done.

Good.

> I've made my points, but to be constructive and try to move forward,
> here is how I would reason about this problem and what to do about it.
> I'm laying it out in painful detail to show the kind of restraint I
> think is needed when dealing with important code like this.
> 
> - it is clear that people do try to write (x removeAll: x), and that it
> doesn't do what they expect.  This seems to happen frequently enough
> that it would be desirable to at least let people know when this
> happens, so they can write their code a different way.  The currently
> accepted valid ways to empty a collection are to either create a new
> collection (usually the fastest way) or say (x removeAll: x copy).

A few remarks:
- The situation can occur without it being written "x removeAll: x"
intentionally by the developer. Rather "x removeAll:
(some-funky-expression-that-just-happens-to-evaluate-to-x)".
- Creating a new collection does not retain identity.
- "x removeAll: x copy" is really IMHO much more silly than "x
removeAll". It demands that you are aware that "x removeAll: x" doesn't
work and also that you realize that just sending in a copy would do the
trick. Not at all obvious, but finding a method called "removeAll" is
very obvious.

> - If we come up with a way to detect this case and generate an error,
> then the only remaining rationale for extending removeAll: to support
> the special case of the receiver as an argument would be if it was
> required by ANSI, or made the library or its specification simpler and
> cleaner, or fixed an actual bug.  I believe that none of these has been
> demonstrated to be true.

Why would that be "the only remaining rationale"? Honestly, I am not
joking?
A good reason IMHO is to improve the class. Squeak is not bound by ANSI.
Sure, being ANSI compliant (which it isn't) is a good thing, but NOT if
it stops us from improvement.

If we see a "developer trap" (=a good possibility of making mistakes),
which I surely would call this, then we should try to fix it.

(It's quite exactly the same problem as using removeAllSuchThat: on an
ArrayedCollection.

'abc' removeAllSuchThat: [:c | c = $d ]

The code runs just fine and thus pretends to remove all $d's from the
receiver, but if there would be a $d in there it will blow. It's a
developer trap. (Richard posted a simple FIX for this one))

And a fix can be anything from simply adding a comment, signalling or
simply making the darn code work.
In this case it seems to me that making it work would be nice. :-)

> So how would we detect the error?  Richard has argued for just putting
> the check in.  I say it belongs in an assertion, even though it is a
> very fast check, and here is why:  every time a programmer thinks of
> some invariant like this that they want to check, they make a little
> 'cost' calculation in their head, and if the check is going to have much
> of a performance cost, they tend to not do the check.  Using assertions
> automatically for all of these kinds of checks eases that difficult cost
> decision, which encourages people to sprinkle these checks all over
> their code (a Good Thing).  Just because this particular check happens
> to be very fast does not make it a special case.
> 
> So I would recommend the following fixes:
> 
> 1) Get the assertion mechanism working effectively and efficiently in
> the standard Squeak release.  Make sure its cost in development is
> minimal, and its cost in production, zero.
> 
> 2) Just add "self assert: [ otherCollection ~~ self ]." to the beginning
> of removeAll:.
> 
> Very, very simple. (a Good Thing).  It prevents this whole issue from
> being a problem, with no changing of the basic Collection interface or
> implementation required, and with no performance impact in production
> (Good Things).  It also gives everyone notice that they can now sprinkle
> their code with assertions (a *really* Good Thing).

I agree with almost everything you said here about asserts except for
the fact that I would not do an assert in this case and instead call
removeAll on self and thus make the code work. :-) What is your argument
for not doing that? (an honest question)

> Now, to the second, separate matter: is a new #removeAll method needed?
> 
> I can see only one good reason for a #removeAll message.  That is that a
> large, tenured collection might need it as a hook to implement a more
> efficient way of being emptied.  Another argument that might be made is
> that people need a generally easy way of removing all of a collection's
> elements, but I don't buy that because most of the time, they should
> just create a new empty collection, which is usually faster, so we don't

Identity is not retained. Sure, it would work most of the time but not
*all* the time.

> want to particularly encourage doing #removeAll anyway, since it is
> usually the slowest way to get the job done.

Nope. I would say that my proposed implementation of #removeAll in
OrderedCollection is faster than creating new OCs. And we all know OCs
are the workhorses. Hey, I can back that up with numbers! :-)

Time millisecondsToRun: [100000 timesRepeat: [OrderedCollection newW]]
-> 562 561 536

| oc |
oc _ OrderedCollection new.
Time millisecondsToRun: [100000 timesRepeat: [oc removeAll]] -> 365 375
368

using implementation as:
removeAll
	"Remove all elements in the receiver."
     self setCollection: ( Array new: 10 )

And when optimizing further as:
removeAll
	"Remove all elements in the receiver."
     array _ Array new: 10.
	firstIndex _ 3.
	lastIndex _ 2

I got: 191 179 177 187

But then I couldn't resist tweaking "OrderedCollection>>new" in the same
manner and then got:

Time millisecondsToRun: [100000 timesRepeat: [OrderedCollection newW]]
-> 321 327 332 317

But then I checked with "Smalltalk macroBenchmarks" and it only gave 4%
on benchmark2, nothing much on the others:
Before: #(19025 114834 42563 20877 0 11114 8115)
After: #(19305 110163 42722 20594 0 11051 8124)
(I haven't checked really if any of these benchmarks creates many OCs.
Perhaps it was just noise.)

Anyway it is WAY faster and creates less garbage! And I mean, of course
- "OrderedCollection new" does the same thing but more stuff too so
obviously #removeAll should be faster. By the way, "root table overflow"
- what is that? I got those when doing some other benchmarks.

> However, the large collection argument is a good one, I think.  The
> question then becomes, where does this method belong?  One could argue
> that code dealing with such large collections almost always knows a lot
> more about the collection than just that it is a collection that
> supports remove:, so adding efficient removeAll methods to specific
> collection classes could be done without adding it to the Collection
> protocol itself.

But why? It is such a basic message that I was actually perplexed to see
it wasn't in there already.
But instead I find rather exotic messages like #removeAllSuchThat: ...

> On the other hand, since removeAll is conceptually built using remove:,
> adding a default implementation to Collection doesn't break any subclass
> of collection, since it doesn't use any message other than do: and
> remove:.  It would, however, break classes that implement the Collection
> protocol but which don't inherit from it (ah, for mixins).
>
> So basically, I think this argument comes down to weighing whether the
> "large collection" argument is good enough to overcome the drawback of
> breaking existing classes that implement Collection without inheriting
> from it.  I'm not drawing a conclusion here, just trying to delineate
> the crux of the issue.  I personally think removeAll would be a nice
> addition, but it would break some code out there, and it's not used that
> often (and shouldn't be), so I'm on the fence about it.

And again, why shouldn't it be used? And to be so overly defensive of
adding (note, we are not changing an existing method, we are simply
*adding* one) a method seems very counter productive to me. It would
more or less totally hinder any evolution at all in the Collection
hierarchy...

> If such a method was added, in Collection it needs to be something
> really simple.  You can do this safely and generally without copying if
> you write
> 	removeAll
> 		[ self isEmpty ] whileFalse: [ self remove: self anElement ]
> 
> where anElement can be the following private method (it can be made
> public, and is in Strongtalk, but that's another issue):
> 	anElement
> 		self do: [ :el | ^el ].
> 		self error: 'Collection empty; no element to return'.
> 
> Unfortunately, without an inlining compiler, anElement is not very
> efficient in Squeak right now, although with inlining it could be more
> efficient than copying.  On a 1000 element Set of consecutive integers,
> the above implementation is a bit slower than (self removeAll: self
> copy), so maybe the copying version should be used for now instead, if
> removeAll is to be added (although anElement is a handy method to have
> around, in its own right).  On the other hand, much more efficient
> implementations of anElement can easily be written for important
> subclasses, and then removeAll can be done efficiently without copying.

Sure, this base implementation looks good to me. Or very simply as you
advised developers to do it "manually":

removeAll
	"Remove all my elements. We use a copy so that #removeAll:
	will work without stepping on itself. Subclasses typically reimplement
	this method more efficiently like for example in OrderedCollection."
	self removeAll: self copy

If you want simplicity, there you go! (Note: The comment could surely be
improved)

> So if there is any overall message here, it is: be very careful with
> code in classes like this, and show great restraint by making changes as
> minimal as possible.  The code in Collection is simple, but not
> simple-minded.  Don't underestimate the subtlety of why these
> simple-seeming methods are the way they are.

I agree but would like to add: "...but don't think they can't be
improved. We are all just humans, even Dan Ingalls." :-)

> Cheers,
> Dave

Cheers, Göran