FIX: Making Set/Dictionary etc. loops more robust

Joshua Scholar jscholar at access4less.net
Tue Nov 30 16:30:52 UTC 2004


> And of course the new version is at least as fast as the old one.
>
> I benchmarked it, and the new code is slower.
>
> s := (1 to: 60000) asSet.
> [s do:  [:each |]] timeToRun => 42 msec (old version)
> [s do2: [:each |]] timeToRun => 63 msec (new version)

You're right, I made a mistake.

I had originally written the method:

do: aBlock
    | arrayHolder |
    tally = 0 ifTrue: [^ self].
    arrayHolder := array.
    1 to: arrayHolder size do: [:index |
        | each |
        (each := arrayHolder at: index) ifNotNil: [aBlock value: each]]

Which shouldn't be a slowdown - it just adds a local variable to the
original - but then I noticed that the code in SequenceableCollection do:
looked identical and thought using SequenceableCollection do: looked
neater... but I forgot that it has to jump through an extra block to get to
the nil test.
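To make the cost concrete, here is a sketch (in Python, not Squeak, and with made-up names) of why delegating to a generic iteration method adds overhead: the nil test ends up inside a block handed to the generic loop, so a full extra block activation happens for every slot, occupied or not.

```python
# Illustrative model only: direct_do inlines the loop; delegating_do routes
# through a generic per-index loop, paying one extra closure call per slot.

calls = {"wrapper": 0}

def generic_do(array, block):
    # stand-in for SequenceableCollection-style do:
    for index in range(len(array)):
        block(array[index])

def direct_do(array, block):
    # inlined loop: one nil test per slot, block call only for elements
    for each in array:
        if each is not None:
            block(each)

def delegating_do(array, block):
    # the nil test now lives inside a wrapper block that must be
    # activated for EVERY slot before the test can even run
    def wrapper(each):
        calls["wrapper"] += 1
        if each is not None:
            block(each)
    generic_do(array, wrapper)

array = [1, None, 2, None, None, 3]
out1, out2 = [], []
direct_do(array, out1.append)
delegating_do(array, out2.append)
assert out1 == out2 == [1, 2, 3]
# wrapper was activated once per slot (6), not once per element (3)
assert calls["wrapper"] == len(array)
```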

Anyway since I already included a change set that some might consider a
malicious slowdown, I've included a change set this time that has my
arguably dubious improvement without slowing things down.

You know, it occurs to me to wonder why common looping constructs near the
bottom of the hierarchies haven't been turned into primitives.  Faster looping
constructs (SmallInteger to:do:, Float to:by:do:, Interval to:by:do:,
Array do:, ByteArray do:, etc.) would significantly speed up some code.

> Remove an element from a Set, and the following things can happen:
> (1) The element is already absent; nothing bad happens.
> (2) The element is present.  It's replaced by nil, then #fixCollisionsFrom:
>     is called.  This can make fairly large scale rearrangements of the
>     array.  As far as I can see, this can result in elements being missed
>     or seen twice, WHICHEVER of the #do: versions is used.

You're right.  I forgot about #fixCollisionsFrom:.  The new code is not much
saner than the old code was with respect to deletes.
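For anyone who hasn't traced through it, here is a toy open-addressing table (in Python, as an illustrative model, not Squeak's actual code) showing how a fixCollisionsFrom:-style rearrangement on delete can move an element backward past the iteration index, so a concurrent #do: silently skips it:

```python
# Toy linear-probing set: remove nils a slot, then re-places every element
# in the following probe chain, which can move elements to earlier slots.

class ToySet:
    def __init__(self, capacity=8):
        self.array = [None] * capacity

    def _scan(self, key):
        # probe forward from the hash slot to the key or the first nil
        i = hash(key) % len(self.array)
        while self.array[i] is not None and self.array[i] != key:
            i = (i + 1) % len(self.array)
        return i

    def add(self, key):
        self.array[self._scan(key)] = key

    def remove(self, key):
        i = self._scan(key)
        if self.array[i] is None:
            return
        self.array[i] = None
        # fix collisions: re-place everything in the probe chain after i
        j = (i + 1) % len(self.array)
        while self.array[j] is not None:
            moved, self.array[j] = self.array[j], None
            self.array[self._scan(moved)] = moved
            j = (j + 1) % len(self.array)

s = ToySet()
s.add(0); s.add(8)          # both hash to slot 0; 8 overflows into slot 1
seen = []
for i in range(len(s.array)):   # a naive #do: over the slots
    if s.array[i] is not None:
        seen.append(s.array[i])
        if s.array[i] == 0:
            s.remove(0)     # moves 8 back into slot 0, already passed
assert seen == [0]          # 8 is still in the set, but was never seen
```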

I've always preferred hash tables that use overflow lists instead of scanning
forward, because it's easier to tell when an object isn't in the table.  And
that looping code would be saner on deletes if we used overflow lists.
Never mind.
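The overflow-list (separate chaining) alternative can be sketched like this, again in Python as a conceptual model: each bucket owns its own list, so removing an element touches only that one list and never relocates anything else, and a membership miss is decided after scanning one short list.

```python
# Minimal chained hash set: deletion never rearranges other elements,
# so an in-progress iteration over other buckets is undisturbed.

class ChainedSet:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key):
        b = self._bucket(key)
        if key not in b:
            b.append(key)

    def remove(self, key):
        b = self._bucket(key)
        if key in b:
            b.remove(key)       # only this bucket's list changes

    def __contains__(self, key):
        # a miss is known as soon as one bucket's list is exhausted
        return key in self._bucket(key)

s = ChainedSet()
s.add(0); s.add(8)              # 0 and 8 collide in bucket 0
s.remove(0)
assert 8 in s and 0 not in s    # 8 stayed exactly where it was
```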

> There's one fairly major problem with Set>>do: which this change
> does nothing about:  the fact that iterating over a Set of n elements
> is not O(n).  This seems rather like polishing the brass on the Titanic.

Yes.  Talking about the memory leak caused by the fact that Sets grow but
never shrink was a way of broaching a related subject.

Clearly, if there were a limit on the ratio of unused slots to used ones,
then iterating would also have guaranteed time behavior...  Not that I was
thinking about this.  I was worried about behavior, not time.  You wouldn't
use Smalltalk if you were generally more focused on execution speed than on
having elegant programming tools.

Since Sets and Dictionaries and the like are usually added to and rarely
removed from, it's hard to be sure that allowing these hash tables to shrink
is really a win on average.

And the question I was mostly wondering about was whether putting a shrink
in could break any of the many classes derived from Set.  There ARE methods
that depend on the internal structure of the Set, and #grow can be
overridden, etc., so other classes would probably have to be changed if Sets
were allowed to shrink as well as grow.
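For what it's worth, the shrink policy itself is simple; here is a sketch in Python (the thresholds below, grow above 1/2 full and halve below 1/4 full, are one common choice with hysteresis, not anything Squeak does). Bounding the slots-to-elements ratio is exactly what would make #do: linear in the element count:

```python
# Capacity bookkeeping only; `items` stands in for the real slot array.
# The grow and shrink thresholds are deliberately far apart so that
# alternating add/remove near a boundary doesn't thrash the table.

class ShrinkingSet:
    MIN_CAPACITY = 8

    def __init__(self):
        self.capacity = self.MIN_CAPACITY
        self.items = set()

    def add(self, key):
        self.items.add(key)
        if len(self.items) * 2 > self.capacity:        # over 1/2 full
            self.capacity *= 2

    def remove(self, key):
        self.items.discard(key)
        if (self.capacity > self.MIN_CAPACITY
                and len(self.items) * 4 < self.capacity):  # under 1/4 full
            self.capacity //= 2

s = ShrinkingSet()
for i in range(10_000):
    s.add(i)
grown = s.capacity
for i in range(10_000):
    s.remove(i)
assert s.capacity == ShrinkingSet.MIN_CAPACITY  # storage was given back
# invariant: capacity <= max(MIN_CAPACITY, 4 * len(items)), so a #do:
# that scans `capacity` slots is O(number of elements)
```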

Sigh.


> (3) The element is not already present and there is not enough room left
>     (as defined by #fullCheck).
>     The old code will switch over to the new array, but not scan all of it.
>     The new code will continue scanning the old array, so will not see the
>     addition or any other changes.  This is at least as much of a "meltdown"
>     as what the old code did, and is arguably worse, in that it will hold
>     onto a dead data structure that ought to be GC-meat.

I disagree that holding on to the old version is a meltdown at all.
The only non-messy semantics for do: (though too expensive) would be to make

aCollection do: [ :each|...

act like

aCollection copy do: [ :each |...

There's nothing messy about iterating over a snapshot of a table from a
database.  From the standpoint of functional programming the second is much
cleaner as well.
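The snapshot semantics can be shown in a few lines of Python (an analogy, not Smalltalk): iterating over a copy makes any mutation during the loop safe, at the cost of copying up front.

```python
# Iterating a copy, like `aCollection copy do: [:each | ...]`:
# the loop sees exactly the elements present at loop entry.

s = {1, 2, 3}
seen = []
for each in sorted(s.copy()):   # sorted for a deterministic order
    seen.append(each)
    s.add(each + 10)            # safe: the snapshot never sees additions
    s.discard(each)             # safe: removal can't disturb the snapshot

assert seen == [1, 2, 3]
assert s == {11, 12, 13}
```

(In Python, mutating the set while iterating it directly would raise a RuntimeError; the copy is what makes this legal, just as it is what makes the semantics clean.)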

If you've worked on database engines (I have), you know that there are lots
of database modes, each with different compromises about what data you'll see
in a transaction and about exactly when "now" is.  Databases are often run
in less than theoretically perfect modes because perfectly atomic modes are
too expensive.

So, for instance, a promise that you can add to a collection while iterating
over it, and see each of its original members exactly once, though you might
not see all of your new additions, is one that's usable, if only barely.
Whereas "it blows up and shows you some elements twice or more and some
elements never" is completely unworkable.
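That weaker-but-usable promise falls out of holding on to the array captured at loop entry. A sketch of the mechanism in Python (a deliberately simplified model, with grow reduced to "allocate a fresh array"):

```python
# The iteration walks the array it captured at loop entry; a grow during
# the loop installs a fresh array, so every original element is seen
# exactly once and additions are simply not iterated.

class GrowSet:
    def __init__(self, items=()):
        self.array = list(items)        # stand-in for the hash array

    def add(self, key):
        # model a grow: a new, replacement array is installed
        self.array = self.array + [key]

    def do(self, block):
        snapshot = self.array           # capture the array at loop entry
        for each in snapshot:           # grows replace self.array,
            block(each)                 # never this snapshot

s = GrowSet([1, 2, 3])
seen = []
s.do(lambda each: (seen.append(each), s.add(each + 10)))
assert seen == [1, 2, 3]                        # originals, exactly once
assert set(s.array) == {1, 2, 3, 11, 12, 13}    # additions landed anyway
```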

As for whether it's wasting memory by holding on to the GC-meat until the
end of the loop (I know you didn't say that), I would point out that Set
collect: works exactly the way this code did, by invoking "array do:".  As I
said, I don't think holding on to an object while it's being looped over is
onerous.

Ultimately, while this new code is saner for adding inside of a loop, it's
still not easy to work with...  So yes, it's not great.

But since (the updated) fix doesn't slow things down, and it does act much
saner in a certain situation, I think it's a win.


Joshua Scholar
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SetDo.2.cs
Type: application/octet-stream
Size: 357 bytes
Desc: not available
Url : http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20041130/90f08eaa/SetDo.2.obj

