[FIX] SortedCollectionFix-sr

Richard A. O'Keefe ok at cs.otago.ac.nz
Fri Sep 27 01:39:59 UTC 2002


I asked:
	> Why should addLast: or addFirst: be given a complete veto?

and gave implementations of #addFirst: and #addLast: which
complain only when the operation would violate the class invariant.

Stephen Rudlof replied:
	The short answer
	----------------
	The current situation for SortedCollections
	- is inconsistent: >>addFirst: is forbidden, >>addLast: not;

That was understood and assumed in my question.

	- semantically incorrect: using >>addLast: easily leads to a not sorted
	SortedCollection.
	
That also was understood and assumed.

My question is not "what is broken about the present situation",
but "why not allow these operations to proceed whenever they can
do so without breaking the class invariant?"

To expand on this a little:  why not give programmers a way to build
a sorted collection in linear time when they are able to provide data
in the right order?
	
	The longer answer
	-----------------
	
	Prenote: your suggestion is *semantically* correct. But there are
	psychological/practical reasons against introducing it.
	
	Why does a programmer use #addLast:? He wants to build a
	Collection in an explicit order.
	Why does a programmer use a SortedCollection? He wants to have a Collection,
	which automatically ensures an order of added elements.
	
There are several attitudes a class designer might take.
The attitude I prefer is
 - operations should never be allowed to break class invariants
 - we should avoid introducing operations that often don't work
 - but when we are stuck with an operation that CAN be allowed to
   proceed some of the time we should not be strict bondage and
   discipline nannies and stop sensible programmers doing sensible
   things.
The attitude here is
 - I am like God; I know exactly how programmers think and they mustn't.
   I can foresee every possible use and know which ones make sense.
   (OK, this is a caricature, but it is a recognisable one.)

In this particular instance, it is possible for one and the same
programmer, at a single instant in time,
*BOTH* to want to build a collection in a particular order
*AND* to want that collection to be a sorted collection.

For example, support I have a sorted collection of numbers,
call it "fred":
    fred := #(3 1 4 1 5 9 2 6 5) asSortedCollection.
Now suppose I want to add 1 to every element.
THIS IS AN ORDER-PRESERVING TRANSFORMATION.
So

    jim := SortedCollection new.
    fred do: [:each | jim addLast: each + 1].

will do the job, IF I am allowed to use addLast: when it is safe.
(Yes, I know I can do fred+1, but the result belongs to the wrong class.)

	Using >>addLast:  for a SortedCollection means to try to give an
	explicit order to something, which has an automatically 'built
	in' one.  This mostly is a programmer error.

I am not sure how to interpret "mostly" here.
If it is an empirical statement about how often it does happen,
then I would be intrigued to know where the evidence came from.

In the case where the two orders AGREE, which is the only case my
versions of #addFirst: and #addLast: allow, it is not an error.
In the case (of as far as I am aware completely unknown frequency)
where the two orders do not agree, the error will still be reported.

	And it violates encapsulation:  I have to know, which built in
	order the SortedCollection has, to use >>addLast:  correctly
	along your suggestion.

No, the sorting order of a sorted collection is part of its ***PUBLIC***
interface.  Note that SortedCollection>>sortBlock is in the 'accessing'
category, not the 'private' category.

	By using >>add:  instead, there would be no problem.
	
Yes there would.  There would be an *EFFICIENCY* problem.

Note that I too can claim a psychological reason why addLast: should
be used in this case rather than add:.  The reason is that in Set
#add: means "add only if absent", whereas whenever addLast: is allowed
it will make the collection one element bigger.

	So there remains the performance thing:  using
	SortedCollection>>add:  multiple times means, that each object
	has to be sorted in immediately.  It is probably more performant
	to have a private add method, which just adds an element, while
	leaving the SortedCollection in an inconsistent state, and
	>>reSort it afterwards, when all elements have been added.

"performant" is, to the best of my knowledge, not an English word.
I certainly haven't been able to find it in any of the paper or
electronic dictionaries I've just checked.  Try "faster".

	That has been my idea of introducing the 'bypass' method for
	clients which really know about the internals.
	
But the ordering used by a sorted collection is most definitely
NOT a fact about its internals.  It is in some respects the most
public thing there is about such an object.

Part of my point is that you don't *need* a bypass method when you
have addLast: and could make it respect the class invariant.

	BTW: In the ANSI standard (revision 1.9) *protocol* <SortedCollection>
	doesn't conform to protocol <OrderedCollection>, which is the only one
	defining >>addFirst: and  >>addLast:. This makes sense to me.
	
The ANSI argument was shot down in smoke and billowing flames when I tried
to apply it to #removeAll:, I don't see why it should have any greater force
here.  Nothing in the ANSI standard prevents you *adding* methods to
standard classes; if it did, almost all of the methods in Object would
have to find another home.
	
	Implementation notes
	--------------------
	
	To the private SortedCollection>>addLastWithoutSorting: there could also be
	added
	- >>addWithoutSorting: and
	- >>addFirstWithoutSorting:;
	for - very few! - clients using the *internals*.

	After using one of these methods >>reSort should be called to
	ensure consistency.  Note:  I haven't proved the implicit
	assumption here, that >>reSort'ing an already sorted
	SortedCollection is faster than adding elements in arbitrary
	order.

I suggest allowing methods to proceed that DON'T break the class
invariant, and this is spurned as breaking encapsulation, even though
the methods are concerned solely with matters in the public interface.

Instead, as a better way, Rudlof proposes adding methods that DO break
the class invariant, and DON'T offer any efficiency advantage.
	
I think it's time to talk about lazy data structures.

We could add elements fast to SortedCollections like this.

1.  The internal representation is an array partitioned thus:
	+----+------------+--------+----+
	|1   |a          z|       x|   n|
	+----+------------+--------+----+

    elements 1..a-1 : all nil
    elements a..z   : sorted according to sortBlock
    elements z+1..x : added elements not sorted yet
    elements x+1..n : all nil

    Where I have written "a" and "x" you should understand the existing
    instance variables of OrderedCollection (inherited by SortedCollection)
    being used instead; the "z" instance variable is new.

2.  add: anItem
	^super addLast: anItem
    addFirst: anItem
	self shouldNotImplement
    addLast: anItem
	self shouldNotImplement
    addAll: aCollection
	^super addAllLast: aCollection
    addAllFirst: aCollection
	self shouldNotImplement
    addAllLast: aCollection
        self shouldNotImplement

    No, I'm not being inconsistent here.  I'm saying "let someone build
    a sorted collection in linear time if they can provide the elements
    in sorted order."  With this kind of implementation, #add: _is_
    adequate for the job.

3.  A strictly private method would exist:

    pvtEnsureAllSorted		   "rather like current addAll:"
	x = z ifTrue: [^array].
	self pvtSortAddedElements. "sorts z+1..x"
	self pvtMerge.             "merges a..z with z+1..x"
	z := x.
	^array

    Ideally pvtSortAddedElements would use a method which is linear when
    the elements are sorted already; Dijkstra's smooth-sort is
    just one of many algorithms which are good at that, bottom-up
    natural merge is easier to understand and works very well in practice.

    Using bottom-up natural merge, if you add M elements to a Sorted
    Collection having N elements and then look at the result,
    the adds would cost O(M) time,
    pvtEnsureAllSorted would cost O(M) for pvtSortAddedElements
    and O(M+N) for pvtMerge, for a total O(N+M).
    Building an entire collection from N=0 would thus be O(M).

4.  Other methods would call pvtEnsureAllSorted, e.g.

    at: index
	self assert: [index between: 1 and: x-a+1].
	^self pvtEnsureAllSorted at: index-a+1

    do: aBlock
	self pvtEnsureAllSorted.
	^super do: aBlock

5.  Setting the sortblock would simply enlarge the unsorted area:

    sortBlock: aBlock
        sortBlock := aBlock ifNil: [nil] ifNotNil: [aBlock fixTemps].
        z := a-1.


This is the clean way to do it.  The class invariant is weakened from
"the elements are always sorted" to "the elements are always sorted
whenever anyone else looks; when noone else is looking they are in a
convenient configuration for sorting."  From the outside, it looks as
though the elements are always sorted.


While anyone's looking at SortedCollection, consider

    #(3 1 4 1 5 9 2 6 5 3 5) asSortedCollection reversed
The result prints as
    a SortedCollection (9 6 5 5 5 4 3 3 2 1 1)
but has sortBlock nil, so the order disagrees with the sortBlock.
#reversed does make sense for SortedCollections; it ought to return
a collection with the opposite comparison.



More information about the Squeak-dev mailing list