FIX: Making Set/Dictionary etc. loops more robust

Richard A. O'Keefe ok at cs.otago.ac.nz
Tue Nov 30 04:07:39 UTC 2004


"Joshua Scholar" <jscholar at access4less.net> wrote:
	Changes Set>>do:
	from
	
	do: aBlock
	 tally = 0 ifTrue: [^ self].
	 1 to: array size do:
	  [:index |
	  | each |
	  (each _ array at: index) ifNotNil: [aBlock value: each]]
	
	to
	
	do: aBlock
	 tally = 0 ifTrue: [^ self].
	 array do: [:each | each ifNotNil: [aBlock value: each]].
	
There's one fairly major problem with Set>>do: which this change
does nothing about:  the fact that iterating over a Set of n elements
is not O(n).  This seems rather like polishing the brass on the Titanic.

	The difference is that adding to a Set or Dictionary can put a
	new rehashed hashed hash table into "array"
	
Yes, but it has always been strictly forbidden to make any kind of
change to a Set or Dictionary while iterating over it.  There are a number
of ways to deal with this:

    - Java actually maintains a change count and Set/Dictionary iterators
      watch out for changes, so If You Do Bad Things You Get Told

    - AWK (at least the 'mawk' implementation with which I'm familiar)
      actually goes to the trouble of making a copy of the key set of an
      array when you do 'for (key in array) ...' so If You Do Bad Things
      It Works Like You Should Have Expected'.

    - Smalltalk has traditionally gone a little bit crazy; If You Do
      Bad Things You Get What's Coming To You And Why Didn't You Read
      The Sources?

    - Verification tools like ESC/M3, ESC/Java, SPARK, and so on verify
      at 'compile' time that You Didn't Do Anything Bad You Good Little
      Programmer You (it's just not what the customer wanted).

	In the original version, if any block inside of a "do:" or any
	of the many loops that depend on "do:" called code that added to
	that Set then there was a chance the set would grow and rehash -
	and some element would be visited more than once and some not at all.
	
Indeed.  Break the contract and you lose your rights.

	The way it is with this change, adding to a Set in a loop adds
	an element that might not be visited, and deleting an element
	doesn't necessarily mean the element won't be visited but the
	meltdown scenario can't happen.
	
Add an element to a Set, and the following things can happen:
(1) The element is already present; nothing bad happens.
(2) The element is not already present and there is room in the array;
    (2a) the new element is inserted after the current iteration point=>
         it will be seen
    (2b) the new element is inserted before the current iteration point=>
         it will not be seen.
    We are already in Very Bad News territory.  It's darned near impossible
    to give this a coherent abstract semantics.  A "live view" should
    ensure that when you finish the loop you have seen _everything_ that
    is still in the set.  The proposed change does not ensure that.
    A "snapshot" should ensure that when you finish the loop you have
    seen all and only the things that _were_ in the set at the start;
    the proposed change does not ensure that either.
(3) The element is not already present and there is not enough room left
    (as defined by #fullCheck)
    The old code will switch over to the new array, but not scan all of it.
    The new code will continue scanning the old array, so will not see the
    addition or any other changes.  This is at least as much of a "meltdown"
    as what the old code did, and is arguably worse, in that it will hold
    onto a dead data structure that ought to be GC-meat.

Remove an element from a Set, and the following things can happen:
(1) The element is already absent; nothing bad happens.
(2) The element is present.  It's replaced by nil, then #fixCollisionsFrom:
    is called.  This can make fairly large scale rearrangements of the
    array.  As far as I can see, this can results in elements being missed
    or seen twice, WHICHEVER of the #do: versions is used.

In short, the only effective change is that the new Set>>do: will, in
some rare circumstances, visit more elements because it is scanning the
wrong array.  I don't really understand why this is useful, when we STILL
have the problem that additions may or may not be seen and removals can
STILL scramble a #do:.

	And of course the new version is at least as fast as the old one.
	
I benchmarked it, and the new code is slower.

	s := (1 to: 60000) asSet.
	[s do:  [:each |]] timeToRun	=> 42 msec (old version)
	[s do2: [:each |]] timeToRun	=> 63 msec (new version)

These times with Squeak 3.6-5429-full, MacOS 10.3.3 on a 1GHz G4 PPC.




More information about the Squeak-dev mailing list