[BUG]Collection>>removeAll:

Richard A. O'Keefe ok at cs.otago.ac.nz
Wed Aug 28 03:48:26 UTC 2002


Stephan Rudlof <sr at evolgo.de> wrote concerning his
proposal to check for empty receivers in #removeAll:
	Collection>>removeAll: aCollection
	    self isEmpty
	    ifFalse: [self assert: [self ~~ aCollection].
	             aCollection do: [:each | self remove: each]].
	  ^aCollection

I have previously pointed out that this blocks the error notifier
for the #assert:.  

	| colls emptyColl |
	colls _ Collection allSubInstances.
	emptyColl _ OrderedCollection new.
	Smalltalk garbageCollect.
	{
		[colls do: [:coll | emptyColl removeAll: coll]] timeToRun.
		[colls do: [:coll | emptyColl removeAllSr: coll]] timeToRun.
	}
	
	gives an error (for >>removeAll:)! We both have overlooked, that
	in the current implementation it is *not* allowed to try to remove
	something from an empty collection.  So my speed up idea is nonsense
	with the current semantics, which automatically gives an error notifier.
	
Having pointed out that it blocked one error notifier,
I didn't see any need to point out the (surely obvious)
implication that it also blocked the other.
	
Try #removeAllFoundIn: instead.

Even so, _this_ timing test is about as uninteresting as it can be.
The really important figure is p, and in a Squeak 3.2 which I haven't
changed much
	(Collection allSubInstances select: [:each | each isEmpty]) size
	----------------------------------------------------------------
	(Collection allSubInstances) size
is approximately 3%.  That's not p, of course, but it is a suggestion
that empty collections are comparatively rare.

The number we need is WHAT PROPORTION OF #removeAllFoundIn: MESSAGES
ARE SENT TO EMPTY COLLECTIONS.

	What remains?
	
Well, it remains for you to demonstrate that sticking a (self isEmpty)
test into #removeAllFoundIn: provides a useful speedup in practice.

	| colls emptyColl |
	colls _ Collection allSubInstances.
	emptyColl _ OrderedCollection new.
	Smalltalk garbageCollect.
	{
		[colls do: [:coll | emptyColl copyWithoutAll: coll]] timeToRun.
		[colls do: [:coll | emptyColl copyWithoutAllSr: coll]] timeToRun.
	}
	
Looking at copyWithoutAll: we find

    copyWithoutAll: aCollection
	^self reject: [:each | aCollection includes: each]

which is already O(1) when the receiver is empty (because it doesn't
as much as touch aCollection when the receiver is empty), so there isn't
much of a performance problem to solve there.  Let me repeat that:
there is no performance problem with removing a large collection from
an empty one using #copyWithoutAll:, so there is nothing that really
needs fixing there.

So we find that

 - x copyWithoutAll: y is already O(1) when x isEmpty
   without adding any special test;

 - x removeAll: y must not make a special case when x isEmpty
   (unless y isEmpty also), in fact, if x is empty and y isn't
   an error will be reported after O(1) work so there isn't
   even a performance problem finding the error;

 - x removeAllFoundIn: y is the only remaining method where
   removing a large y from an empty x could be expensive,
   but then *in context* that's likely to be dominated by the
   cost of building y in the first place.

Personally, I'm all in favour of special cases in implementations
when they solve real problems.

    a := (1 to: 200*1000) asArray.
    e := OrderedCollection new.
    [e removeAllFoundIn: a] timeToRun
==> 4 seconds
    [e removeAllFoundInSR: a] timeToRun "really had 200000 timesRepeat:"
==> 4 milliseconds

So yes, sticking a (self isEmpty) check in #removeAllFoundIn: can save
as much time as you like (just make the argument as big as you want);
the question is, does that happen often enough to pay off.

Several of the sends to #removeAllFoundIn: in the system have arguments 
that are obviously very short or are obviously shorter than the receiver
or else are called infrequently.  I have no idea how other people use
#removeAllFoundIn:.



More information about the Squeak-dev mailing list