[BUG]Collection>>removeAll:

Tue Aug 27 01:31:57 UTC 2002

"Andrew C. Greenberg" <werdna at mucow.com> wrote:
	Richard makes the untrue assumption that the only parameter
	aCollection that can give rise to this difficulty occurs when
	aCollection is an identity of self.  This is not the case.  The
	problems will occur, or similarly buggy problems will occur,
	WHENEVER a remove:  of an element of the receiver will change
	the parameter.

I didn't, as it happens, assume this.  I was aware that this was an
issue, and I checked it.

	There are many possible objects, some non-identical to self that
	could give rise to this problem.  Imagine a receiver collection
	corresponding to a sequentially indexable list of objects, and a
	parameter collection corresponding to a slice taken therefrom,
	such as the array and slice objects in Python.

Can you make this a bit clearer?
Are you saying that there is some class *in Squeak 3.2 right now*
where this happens,
or are you making the uninformative, and therefore unhelpful, observation
that someone could conceivably make a subclass of Collection which could
be made to break in this fashion?

Again, it so happens that I considered this particular case.
But there _isn't_ any 'slice' collection class in Squeak 3.2.
All this tells us is that IF a new class were added it COULD break
pretty much anything, which we already knew.

By the way, Lisp displaced arrays are an older (and richer) example
than Python slices.

The nearest equivalent to slices in Squeak is MappedCollection, which
I explicitly considered.

I must admit, however, that I only considered the case of a MappedCollection
as a receiver, and not as an argument.

	There is no doubt that Richard handled a single case efficiently and 
	elegantly.  But it is a single case.  And an undefined one.

But it is a case that actually arose.  (Not in my code, BTW.)

And it is most certainly not an undefined case.

This is such an elementary point that I am surprised that I have to keep
on hammering it.  The text of the 1.9 draft of ANSI Smalltalk makes this
case just as much defined as any other case.  The comments in Squeak and
other Smalltalks make this case just as much defined as any other case.

I have copied what the ANSI 1.9 draft has to say.
Here's what the VW Application Developers' Guide has to say:

	Removing a Subcollection

	The removeAll:  message allows you to remove all memebers of one
	collection from a target collection.   Send removeAll: to the
	collection from which you want elements removed.  The argument
	is a collection containing the elements to be removed.

	    | list |
	    list := List withAll:  ColorValue constantNames.
	    list removeAll:  #(#red #green #blue).
	    ^list

	If an element is not found, an error is reported.  Because
	removeAll: is defined in Collection, it can be used with any
	collections as receiver and argument.

Does this leave the case where the receiver and argument are the same
undefined?  Like h**l it does!  It specifies the effect fully and
completely:  all the elements in the argument (= the receiver) are to
be removed from the receiver (thus leaving it empty).  It _does_ leave
the result completely undefined

It is the naive implementation which has a hole in its domain.
NOT the definition.

There are a very great many ways to implement #removeAll: so that it
does exactly what the definition says.  Many of them require different
implementations for different subclasses of Collection, but then, so
does #+ require different code for different subclasses of Number.

Here are two ways to implement #removeAll: which are free of aliasing
problems (ignoring the possibility of arbitrarily perverse #remove:
implementations):

    removeAll: aCollection
        aCollection copy do: [:each | self remove: each].
        ^aCollection

    removeAll: aCollection
	|temp|
	temp := self copy.
	aCollection do: [:each | temp remove: each].
	self become: temp.
	^aCollection

Whether the second of these is reasonable depends on how #become: is
implemented.  In some systems, it could be.  How about the first?
Suppose we change it ever so slightly.

    removeAwl: aCollection	"different name for benchmarking"
        ^aCollection copy do: [:each |
	    self remove: each ifAbsent: [self errorNotFound: each]]

It's benchmark time.

    a := (1 to: 30) asArray.
    b := (2 to: 30 by: 2) asArray.
    [10000 timesRepeat: [a asOrderedCollection]] timeToRun
==> 1.982 seconds
    [10000 timesRepeat: [a asOrderedCollection removeAll: b]] timeToRun
==> 4.682 seconds (2.7 seconds for removeAll:)
    [10000 timesRepeat: [a asOrderedCollection removeAwl: b]] timeToRun
==> 4.447 seconds (2.46 seconds for removeAwl:)

Come to think of it, the error message is not specified.
Collection>>errorNotFound: doesn't actually use its parameter, so add
    errorNotFound
	self error: 'Some object is not in the collection.'
and change removeAwl: to

    removeAwl: aCollection
	|aBlock|
	aBlock := [self errorNotFound].
	^aCollection copy do: [:each |
	    self remove: each ifAbsent: aBlock]]

    [10000 timesRepeat: [a asOrderedCollection removeAwl: b]] timeToRun
==> 4.239 seconds (2.26 seconds for removeAwl:)

So now we have a version of #removeAll: which is immune to any plausible
kind of aliasing, yet it takes only 84% of the time of the original
version.

That's right, the message send and block creation involved in calling
#remove: repeatedly are so expensive that getting rid of them more than
compensates for introducing a copy.  (In this case.)

Try again, pushing the upper bound to 90 instead of 30.

    overhead               5.241 seconds
    overhead + removeAll: 18.584 seconds => removeAll: 13.34 seconds
    overhead + removeAwl: 16.182 seconds => removeAwl: 10.94 seconds

Yup, a safe version that does a copy can be FASTER than the buggy version.
With these numbers, it is vain to pretend that no reasonable programmer
could have expected any implementation other than the buggy one.  It is
now clear that a reasonable user could reasonably expect a #removeAll:
implementation which is BOTH immune to aliasing problems (even aliasing
that involves slices, which currently don't exist) AND faster than the
implementation we now have.

One more test.

    a := (1 to: 90) asArray.
    b := (2 to: 90) asArray.
    [10000 timesRepeat: [a asSet]] timeToRun
==>  14.46 seconds
    [10000 timesRepeat: [a asSet removeAll: b]] timeToRun
==> 183.22 seconds (so removeAll: 168.76 seconds)
    [10000 timesRepeat: [a asSet removeAwl: b]] timeToRun
==> 180.45 seconds (so removeAwl: 165.99 seconds)

This time the safe version takes 98% of the time of the buggy version,
but the speedup is still above noise level.

Gosh, Sets and Dictionaries are slow.  To think I used to imagine that they
used such a non-obvious implementation for efficiency...

	No need to do this, because coders are supposed to understand
	that parameters to a collection iteration must be invariant.

Yes, BUT (and it is not just an extremely big BUT, it is a BUT that I
have addressed at some length beforehand)
    CODERS HAVE NO REASON TO EXPECT THAT THERE IS SUCH AN ITERATION.

I have now demonstrated that
 - it is possible to implement #removeAll: for OrderedCollections so that
   it costs O(receiver size + aCollection size) instead of
   O(receiver size * aCollection size) as at present, and the implementation
   that does this **does not iterate over aCollection**.
 - it is possible to implement #removeAll: in much its present way yet
   detectably faster **without iterating over aCollection**.

How often, and in how many ways, does it need to be said that this
particular iteration is an implementation accident, having neither
a necessary nor a desirable connection with achieving the stated purpose
of the method?

	Best that this be done without exception -- simply announcing that the 
	result is undefined when the parameter is not invariant -- rather than 
	simply adding a few "easy to evaluate" ad hoc cases, or to compromise 
	system performance unnecessarily to satisfy Richard's sense of 
	aesthetics.

Why make completely unfounded accusations that any fix I have suggested
would "compromise system performance"?  I have done the benchmarks.
The simple fix (which is adequate for the existing classes) DOES NOT
COMPROMISE SYSTEM PERFORMACE.  Such slowdown as there may be is not
measurable.  The more sophisticated fix not only doesn't slow anything
down, it goes FASTER.

	After deep and due consideration, I am even more compelled after 
	reading Richard's response than I was beforehand: Ralph and David are 
	right and Richard is wrong on this point.

Yes, but you had to make a number of untrue statements to come to that
conclusion.

Why am I the only person in this thread who feels compelled to back up
assertions about performance with actual benchmarks?

In short, Andrew Greenberg dismisses my concerns about how hard it is
for people to understand #removeAll: and use it correctly on the basis of

 - speculations about possible subclasses that might break the simple fix
   (whereas possible subclasses could easily break practically anything   
   in Squeak)

 - unfounded accusations that it would compromise system performance
   (whereas a more sophisticated fix demonstrably IMPROVES performance)

 - failure to distinguish between the specification of #removeAll:
   (x removeAll: y removes all the elements of y from x provided x can
   change size) and a rather inefficient implementation, apparently
   expecting every Smalltalk programmer to be intimately familiar with
   the implementation of every method (otherwise, why expect them to
   be familiar with the implementation of _this_ method?)

Not, in my view, adequate grounds for coming to any conclusion.

I am, however, grateful to him for forcing me to benchmark a version
of #removeAll: that copies the argument.  It's a clear warning to us
all, I had _feared_ that doing copies would be too slow, so I never
bothered to try it before.