[BUG]Collection>>removeAll:

Wed Aug 28 03:19:08 UTC 2002

goran.hultgren at bluefish.se wrote:
	I now have a suggestion - why don't we move over into the "do it" phase?
	Could Richard, which IMHO probably is the one who has this thread mostly
	covered in his head, summarize a concrete proposal of a change?

Er, I sent a [FIX] to the list several days ago.
Admittedly it didn't address #addAll:, but I've been waiting for the
dust in this discussion to die down before addressing that.

	Collection>>removeAll: aCollection
		"Remove each element of aCollection from the receiver. If successful
	for 
		each, answer aCollection. Otherwise create an error notification.
		ArrayedCollections cannot respond to this message."

		|aBlock|
		self isEmpty ifTrue:[^aCollection].
		self == aCollection ifTrue: [^self removeAll].
		aBlock := [self errorNotFound].
		^aCollection do: [:each |
		    self remove: each ifAbsent: aBlock]]

	- I check for empty as Stephan suggested. Richard may have other
	opinions but it would at least guard against rather slow removing of
	large collections from empty ones.

I am concerned about the #removeAll: problem because it is a correctness
issure (and if you think the implementation is right as it stands, I'm sure
you will at least agree that if the user expects it to do work in more
cases than it does there is a correctness issue even if we disagree about
where it is located).  I am not so concerned about performance issues,
and I am quite unconcerned about purely hypothetical performance issues.

	- I don't copy always, instead I call #removeAll and thus handles the
	case of "x removeAll: x".

Which doesn't need to do any copies.  We agree that #removeAll is a good idea.

	- I still return aCollection so the message still does what it should (I
	think).

This seems to be the result that will satisfy if not most people than
the most vocal.  I've tacitly accepted this in some recent messages.

	Then I added Collection>>removeAll (Gasp! How *dares* he change the
	protocol of Collection! :-) because I can't see why we shouldn't have
	such a method:

	Collection>>removeAll
		"Remove all elements in the receiver by simply removing them
		one by one. This base implementation uses a copy of the receiver
		in order to avoid the problem of iterating over a changing collection.
		Subclasses should typically reimplement this method to be more
	efficient."

		|aBlock|
		aBlock := [self errorNotFound].
		self copy do: [:each |
		    self remove: each ifAbsent: aBlock]].
		^self

More simply:
	self removeAll: self copy.
	^self

	The idea being that this implementation is rarely used since
	OrderedCollection etc would use special implementations [...]

	PS. I really think we should not be so *damn afraid* of
	changing/improving the base Smalltalk classes.  Sure, they have
	stood the test of time, but everyone makes mistakes and
	everything can be improved.  IMHO.

Squeak has _added_ to the Collection classes considerably.
Most Smalltalks don't let you do arithmetic on collections.
Um.
    Why are the arithmetic methods (like #+ and #adaptToCollection:andSend:)
    defined in Collection when they are only actually allowed for
    SequenceableCollection instances?

    Why are #union:, #intersection:, and #difference: classified as
    enumeration methods?  If they are enumeration methods, where is
    the block argument?  Is anyone else bothered by the fact that
    they give incorrect results for bags?

	a := #(3 1 4 1) asBag.
	b := #(2 7 1 8) asBag.
	a difference: b
    ==> a Bag(3 4)
        but the BAG difference of these two bags is a Bag(1 3 4)
        because a has two copies of 1 and b has only 1.
	(b1 difference: b2) occurrencesOf: x
	=
	((b1 occurrencesOf: x) - (b2 occurrencesOf: x)) max: 0

	NB: the comment in Collection>>difference: is
	"Answer the set theoretic difference of two collection"
	but (i) the result is not a set, it's a bag, and
	(ii) set theory tells us that the difference of two
	bags as a bag is what I said above.  At best the comment
	is ambiguous.

	Note that in out-of-the-box Squeak 3.2,
	#(1.0) asSet difference: {2.0-1.0} asSet => a Set()
	#(1.0) asSet difference: {2.0-1.0} asIdentitySet => a Set(1.0)
	#(1.0) asIdentitySet difference: {2.0-1.0} asSet => an IdentitySet()
	I am not sure that it makes sense for the behaviour to depend
	on the _argument's_ notion of equality rather than the
	_receiver's_ notion of equality like this.  To put it bluntly,
	"the set theoretic difference" is not a well defined concept
	when there are multiple notions of equality and the receiver
	does not control which is used.

	a intersection: b
    ==> a Bag(1 1)
	but the BAG intersection of these two bags is a Bag(1);
	(b1 intersection: b2) occurrencesOf: x
	=
	(b1 occurrencesOf: x) min: (b2 occurrencesOf: x)

	NB: the comment in Collection>>intersection: says
	"Answer the set theoretic intersection of two collections",
	but this isn't the set theoretic answer either,
	which would be a Set(1) or at least a Bag(1)

	The argument's-equality-definition-rules quirk is also present
	in Collection>>intersection:.

	a union: b
    ==>	a Set(1 2 3 4 7 8)
	but the BAG union of these two bags is a Bag(1 1 1 2 3 4 7 8)

	NB: the comment in Collection>>union: says
	"Answer the set theoretic union of two collections",
	and very much unlike the other two methods it does actually
	always answer a Set.  Or rather, it doesn't.  If the receiver
	is a specialised set (an IdentitySet or PluggableSet) the answer
	is the same kind of set as the receiver, and if the receiver is
	an IdentityBag you get an IdentitySet.

	In this case, it is *NOT* the argument's definition of equality
	which rules, but the receiver, making Collection>>union:
	quite incompatible with Collection>>intersection: and
	Collection>>union:.  Oops.

	Bag>>
	difference: aCollection
	    |result b|

	    result := self copy.
	    b := [].
	    aCollection do: [:each | result remove: each ifAbsent: b].
	    ^result

	Bag>>
	intersection: aCollection
	    |result t|

	    result := self copyEmpty.
	    contents keysAndValuesDo: [:key :value |
		t := value min: (aCollection occurrencesOf: key).
		t > 0 ifTrue: [result add: key withOccurrences: t]].
	    ^result

	Bag>>
	union: aCollection
	    ^(self copy) addAll: aCollection; yourself

    Assuming that single-element Bag additions and removals are O(1)
    and that traversing a collection with n elements is O(n), and
    that the receiver has m elements,
    this Bag>>difference: is O(m+n),
    this Bag>>union: is O(m+n),
    this Bag>>intersection: is O(m+n) if aCollection is a Set or a Bag,
        but O(mn) if aCollection is some kind of sequence;
        but then, the inherited method it replaces is _also_ O(mn)
        in that case.

    I'm not proposing this as a [FIX] yet, although I have one that seems
    to be working, because while I _am_ sure that the absence of bag
    union, bag difference, and bag intersection in Bag should be fixed,
    I am _not_ sure that it is safe to use the natural names for those
    operations, largely because the harder I look, the less I know what
    #union:, #intersection:, and #difference: are *supposed* to do.
    This time I can't appeal to the ANSI 1.9 draft, because these methods
    are not in that draft, not even in Set.  I can't find any mention of
    them in my copy of the VisualWorks Application Developers Guide either.

The more I poke into the Collection classes, the more I begin to wonder
whether C++ is as bad as I thought...   (Just joking, of course it is.)