[BUG]Collection>>removeAll:

Fri Aug 23 03:20:03 UTC 2002

Chris Norton <chrisn at Kronos.com> wrote:
	Furthermore, I'd like to re-iterate what several people have now
	said.  It's not practical to slow the whole system down to guard
	against this rare error condition.

This is a bare assertion without any evidence for it.

Remember, the extra code we are talking about could be as low as
	(aCollection == self) ifTrue: [
	    ^self removeAll: aCollection copy].

What _is_ this overhead?  With a functioning JIT, which I hope we get
soon on the PowerMac and SPARC that I use, the code should be

    compare register X with register Y
    branch if not identical to L1
	"code for ^self removeAll: aCollection copy"
L1: "code that would have been here anyway"

Even with bytecodes, it is

    push aCollection		-- single special-case byte
    push self			-- single byte
    send ==			-- single special-case byte
    branch if false to L1	-- short conditional branch

or about a 6th of a microsecond.  Tell you what, let's _measure_ this
overwhich which is "not practical".  Let's find out what it is.

Start with an image with the original #removeAll:.  Change it to

    removeAllSafely: aCollection
	aCollection == self
	    ifTrue: [^self removeAllSafely: aCollection copy].
	^aCollection do: [:each | self remove: each]

Save that.  Now open a workspace and type

    a := (1 to: 20) asArray.
    b := (2 to: 20 by: 2) asArray.
    [100000 timesRepeat: [a asOrderedCollection]] timeToRun

I get 14764.  Change the last line to

    [100000 timesRepeat: [(a asOrderedCollection) removeAll: b]] timeToRun

I get 30955.  Difference: 16.2 seconds all told, or 162 microseconds each.

Change the last line to

  [100000 timesRepeat: [(a asOrderedCollection) removeAllSafely: b]] timeToRun

Hey, I got 30955!  There was literally no measurable difference!
Of course these numbers vary a bit.  The highest measurement I got was
31306 milliseconds, or 165 microseconds per #removeAllSafely:.

So the very highest measured overhead in this comparatively tiny example
(remember, #removeAll: is O(m.n) where m is the size of the receiver and
n is the size of its argument) was under 2%.  But that's noise; the amount
of variation in measured times is enough that the two methods CANNOT BE
TOLD APART in terms of performance.

An overhead which is literally too small to detect reliably
is so much that it is "not practical"?

"It's not practical to slow the whole system down" so little that
you can't tell you're doing it?

This is a very VERY strange view of performance.

(Note that I have tweaked the definition slightly.  This version has
 no extra local variable, no extra assignment, and it manages to save
 one byte-code in every case, so the actual overhead is just 3 very
 simple byte-codes, not even 4. The special case is slower, but at least
 it works.)

Change the upper bound from 20 to 50, change repeat count to 10000.
base time = 		3037 msec.
removeAll: time =	8607 msec.
removeAllSafely: time =	8593 msec.
No evidence of a slow-down here.

Try the other way.  Change the upper bound from 20 to 10,
leave repeat count at 100000.
base time = 		 9367 msec.
removeAll: time =	17102 msec
removeAllSafely: time =	17049 msec.
Nope, still no evidence of a slow-down here.

That's right, sometimes #removeAllSafely: ran faster than #removeAll:.
Sometimes it didn't, by about the same amount.

These are all actual measured times on a 250MHz PowerMac G3,
running Squeak 3.0 #3552 under MacOS 8.6.

I do not mean to assert that there is NO overhead, only that the AMOUNT
of overhead is so small that it is not detectable within Squeak
#timeToRun measurement error.  (With a sufficiently large repeat count,
no doubt it could be measured.  Anyone who wants to may try that.  Several
days might be long enough.)

	Proper bounds checking in this case is the responsibility of the
	sender.

But there is no bounds checking issue here.

The point is that
 - (x removeAll: x) and (x addAll: x) MAKE SENSE.
   All there is dispute about what the result should be in such cases,
   it doesn't matter in ANSI Smalltalk because there is no defined result.

 - There is an ANSI standard for Smalltalk,
   document number ANSI INCITS 319-1998,
   formerly        ANSI  NCITS 319-1998.
   The price for an electronic copy is apparently USD 18;
   I must ask whether the department can buy a copy.
   Can some reader who has a copy of the ANSI Smalltalk final version
   please correct me if I get anything wrong here?

   The 1.9 draft has been freely available in Word .DOC format,
   and if you check that, you find these descriptions:

   5.7.2 <abstractDictionary>
      5.7.2.1  addAll: dictionary    
	...
	This message is equivalent to repeatedly sending the
	#at:put: message to the receiver with each of the keys and
	elements in dictionary in turn.  If a key in dictionary is key
	equivalent to a key in the receiver, the associated element in
	dictionary replaces the element in the receiver.
	...
	Return values UNSPECIFIED.

	[This definition makes sense when dictionary==receiver.
	 There is no permission given for special cases that don't work.]

   5.7.5 <extensibleCollection>
     5.7.5.2 addAll: newElements
	...
	This message adds each element of newElements to the receiver.
	The operation is equivalent to adding each element of newElements
	to the reciever using the #add: message with the element as the
	parameter.  The newElements are traversed in the order
	specified by the #do: message for newElements.
	...
	Return values UNSPECIFIED.

	[This definition makes sense when newElements==receiver.
	 newElements copy do: [:each | self add: each]
	 will accomplish _exactly_ what the definition here calls
	 for.  The definitions calls for the elements to be added
	 in a particular order, it does not actually call for #do:
	 to be used, and it does not call for the additions to be
	 done _during_ the call to #do: if there is one.
	 There is no permission given for special cases that don't work.]

   5.7.6 <Bag>
     5.7.6.3 addAll: newElements
	...
	This message adds each element of newElements to the receiver.
	The operation is equivalent to adding each element of newElement
	to the receiver using the #add: message with the element as the
	parameter.  The newElements are traversed in the order
	specified by the #do: message for newElements.
	...
	Return values UNSPECIFIED

	[This looks like an incomplete edit from 5.7.5.2, presumably
	 fixed in the final version?  A Bag has no reason to care what
	 order the new elements are added, does it? 
	 My comments on 5.7.5.2 apply here also in any case.]

   5.7.7 <Set>
      5.7.7.2 addAll: newElements
	[same as Bag]

   5.7.18 <OrderedCollection>
      5.7.18.6 addAll: newElements after: target
	Synopsis: ... answer newElements.
	Return Values UNSPECIFIED.
      5.7.18.7 addAll: newElements afterIndex: index
	Synopsis: ... answer newElements.
	Return Values UNSPECIFIED.
      5.7.18.8 addAll: newElements before: target
	Synopsis: ... answer newElements.
	Return Values UNSPECIFIED
      5.7.18.9 addAll: newElements beforeIndex: index
	Synopsis: ... answer newElements
	Return Values UNSPECIFIED
      5.7.18.10 addAllFirst: newElements
	Synopsis: ... Answer newElements
	...
	This message is used to iteratively add each element of a given
	collection to the beginning ot the receiver's elements.
	The operation is equivalent to adding each successive element
	of newElements to the receiver using the #addFirst: message
	with the element as the parameter, where the newElements are
	traversed in the order specified by the #reverseDo: message
	to newElements.
	Return Values UNSPECIFIED

	[There is some serious incomplete editing here.  I hope this was
	 fixed in the final version of the standard.  It looks as though
	 they took the synopses from old text and then changed the
	 Return Values when they realised that the result wasn't as
	 portable as they thought and the original specification didn't
	 always make sense.

	 All these operations CAN be implemented in full agreement with
	 the definitions without any special cases, and the definition
	 does not allow any special cases not to work.]

      5.7.18.11 addAllLast: newElements
	Synopsis: ... Answer newElements
	...
	This message is used to iteratively add each element of a given
	collection to the end of the receiver's elements.
	The operation is equivalent to adding each successive element
	of newElements to the receiver using the #addLast: message with
	the element as the parameter, where thew newElements are
	traversed in the order specified by the #do: message for
	newElements.
	...
	Parameters UNSPECIFIED
	Return Values <sequencedCollection> parameter

	[There is no #addAll: method for OrderedCollection in ANSI Smalltalk.
	 In Squeak, it is a synonym for #addAllLast:, so 5.7.18.11 is the
	 one that applies when we wonder what OrderedCollection>>addAll:
	 should do in Squeak.  It is unfortunate that this is the least
	 polished of the definitions I've quoted; it looks as though the
	 Parameters and Return Values lines were swapped somehow.

	 There is NO permission given for ANY special cases that don't work.]

Now here's #removeAll:.  Thankfully, there's only one of it.

   5.7.5 <extensibleCollection>
      5.7.5.5 removeAll: oldElements
	...	
	This message is used to remove each element of a given collection
	from the receiver's elements.  The operation is defined to be
	equivalent to removing each element of oldElements from the
	receiver using the #remove message with the element as the
	parameter.
	The behavior is undefined if any element of oldElements
	is not found.
	...
	Return Values UNSPECIFIED

	[Note that an implementation of the form
	 oldElements copy do: [:each | self remove: each]
	 would fully conform to this specification.
	 No permission is given for any special cases other than the
	 cases where some element is not found when it is to be removed.
	 There is NO requirement that #do: actually be used, or that the
	 elements be processed in any particular order, or that the removals
	 be done inside the #do: if there is one.]

Chris Norton wrote:
	Furthermore, I'd like to re-iterate what several people have now said.  It's
	not practical to slow the whole system down to guard against this rare error
	condition.  Proper bounds checking in this case is the responsibility of the
	sender.  I recommend updating the comment in the method to document the
	error condition and to leave the code unchanged.

I'm sorry, but this is NOT an option if we want Squeak's collection classes
to approximate ANSI compatibility.  The standard does NOT allow
receiver == aCollection as a special case.  Any "language lawyer" will
tell you that the standard clearly requires ALL of these add*All: and
the removeAll: method to allow receiver == aCollection and clearly define
the effect to be produced.

There is no "bounds checking" here, and it is blatantly unreasonable to
require programmers to put hacks in _their_ code to work around a failure
to conform to the ANSI definitions when it is trivially easy to make the
code conform without measurable overhead.

There IS an "official" specification.
The comments in Squeak are compatible with it.
The code is not.
The code is broken.
Fixing it makes life simpler for programmers who have one special case
 fewer to worry about.
The cost of the fix in previously working cases is too small to measure.

Why is this so hard to accept?