I asked: > Why should addLast: or addFirst: be given a complete veto?
and gave implementations of #addFirst: and #addLast: which complain only when the operation would violate the class invariant.
Stephen Rudlof replied: The short answer ---------------- The current situation for SortedCollections - is inconsistent: >>addFirst: is forbidden, >>addLast: not;
That was understood and assumed in my question.
- semantically incorrect: using >>addLast: easily leads to a not sorted SortedCollection. That also was understood and assumed.
My question is not "what is broken about the present situation", but "why not allow these operations to proceed whenever they can do so without breaking the class invariant?"
To expand on this a little: why not give programmers a way to build a sorted collection in linear time when they are able to provide data in the right order? The longer answer ----------------- Prenote: your suggestion is *semantically* correct. But there are psychological/practical reasons against introducing it. Why does a programmer use #addLast:? He wants to build a Collection in an explicit order. Why does a programmer use a SortedCollection? He wants to have a Collection, which automatically ensures an order of added elements. There are several attitudes a class designer might take. The attitude I prefer is - operations should never be allowed to break class invariants - we should avoid introducing operations that often don't work - but when we are stuck with an operation that CAN be allowed to proceed some of the time we should not be strict bondage and discipline nannies and stop sensible programmers doing sensible things. The attitude here is - I am like God; I know exactly how programmers think and they mustn't. I can foresee every possible use and know which ones make sense. (OK, this is a caricature, but it is a recognisable one.)
In this particular instance, it is possible for one and the same programmer, at a single instant in time, *BOTH* to want to build a collection in a particular order *AND* to want that collection to be a sorted collection.
For example, support I have a sorted collection of numbers, call it "fred": fred := #(3 1 4 1 5 9 2 6 5) asSortedCollection. Now suppose I want to add 1 to every element. THIS IS AN ORDER-PRESERVING TRANSFORMATION. So
jim := SortedCollection new. fred do: [:each | jim addLast: each + 1].
will do the job, IF I am allowed to use addLast: when it is safe. (Yes, I know I can do fred+1, but the result belongs to the wrong class.)
Using >>addLast: for a SortedCollection means to try to give an explicit order to something, which has an automatically 'built in' one. This mostly is a programmer error.
I am not sure how to interpret "mostly" here. If it is an empirical statement about how often it does happen, then I would be intrigued to know where the evidence came from.
In the case where the two orders AGREE, which is the only case my versions of #addFirst: and #addLast: allow, it is not an error. In the case (of as far as I am aware completely unknown frequency) where the two orders do not agree, the error will still be reported.
And it violates encapsulation: I have to know, which built in order the SortedCollection has, to use >>addLast: correctly along your suggestion.
No, the sorting order of a sorted collection is part of its ***PUBLIC*** interface. Note that SortedCollection>>sortBlock is in the 'accessing' category, not the 'private' category.
By using >>add: instead, there would be no problem. Yes there would. There would be an *EFFICIENCY* problem.
Note that I too can claim a psychological reason why addLast: should be used in this case rather than add:. The reason is that in Set #add: means "add only if absent", whereas whenever addLast: is allowed it will make the collection one element bigger.
So there remains the performance thing: using SortedCollection>>add: multiple times means, that each object has to be sorted in immediately. It is probably more performant to have a private add method, which just adds an element, while leaving the SortedCollection in an inconsistent state, and >>reSort it afterwards, when all elements have been added.
"performant" is, to the best of my knowledge, not an English word. I certainly haven't been able to find it in any of the paper or electronic dictionaries I've just checked. Try "faster".
That has been my idea of introducing the 'bypass' method for clients which really know about the internals. But the ordering used by a sorted collection is most definitely NOT a fact about its internals. It is in some respects the most public thing there is about such an object.
Part of my point is that you don't *need* a bypass method when you have addLast: and could make it respect the class invariant.
BTW: In the ANSI standard (revision 1.9) *protocol* <SortedCollection> doesn't conform to protocol <OrderedCollection>, which is the only one defining >>addFirst: and >>addLast:. This makes sense to me. The ANSI argument was shot down in smoke and billowing flames when I tried to apply it to #removeAll:, I don't see why it should have any greater force here. Nothing in the ANSI standard prevents you *adding* methods to standard classes; if it did, almost all of the methods in Object would have to find another home. Implementation notes -------------------- To the private SortedCollection>>addLastWithoutSorting: there could also be added - >>addWithoutSorting: and - >>addFirstWithoutSorting:; for - very few! - clients using the *internals*.
After using one of these methods >>reSort should be called to ensure consistency. Note: I haven't proved the implicit assumption here, that >>reSort'ing an already sorted SortedCollection is faster than adding elements in arbitrary order.
I suggest allowing methods to proceed that DON'T break the class invariant, and this is spurned as breaking encapsulation, even though the methods are concerned solely with matters in the public interface.
Instead, as a better way, Rudlof proposes adding methods that DO break the class invariant, and DON'T offer any efficiency advantage. I think it's time to talk about lazy data structures.
We could add elements fast to SortedCollections like this.
1. The internal representation is an array partitioned thus: +----+------------+--------+----+ |1 |a z| x| n| +----+------------+--------+----+
elements 1..a-1 : all nil elements a..z : sorted according to sortBlock elements z+1..x : added elements not sorted yet elements x+1..n : all nil
Where I have written "a" and "x" you should understand the existing instance variables of OrderedCollection (inherited by SortedCollection) being used instead; the "z" instance variable is new.
2. add: anItem ^super addLast: anItem addFirst: anItem self shouldNotImplement addLast: anItem self shouldNotImplement addAll: aCollection ^super addAllLast: aCollection addAllFirst: aCollection self shouldNotImplement addAllLast: aCollection self shouldNotImplement
No, I'm not being inconsistent here. I'm saying "let someone build a sorted collection in linear time if they can provide the elements in sorted order." With this kind of implementation, #add: _is_ adequate for the job.
3. A strictly private method would exist:
pvtEnsureAllSorted "rather like current addAll:" x = z ifTrue: [^array]. self pvtSortAddedElements. "sorts z+1..x" self pvtMerge. "merges a..z with z+1..x" z := x. ^array
Ideally pvtSortAddedElements would use a method which is linear when the elements are sorted already; Dijkstra's smooth-sort is just one of many algorithms which are good at that, bottom-up natural merge is easier to understand and works very well in practice.
Using bottom-up natural merge, if you add M elements to a Sorted Collection having N elements and then look at the result, the adds would cost O(M) time, pvtEnsureAllSorted would cost O(M) for pvtSortAddedElements and O(M+N) for pvtMerge, for a total O(N+M). Building an entire collection from N=0 would thus be O(M).
4. Other methods would call pvtEnsureAllSorted, e.g.
at: index self assert: [index between: 1 and: x-a+1]. ^self pvtEnsureAllSorted at: index-a+1
do: aBlock self pvtEnsureAllSorted. ^super do: aBlock
5. Setting the sortblock would simply enlarge the unsorted area:
sortBlock: aBlock sortBlock := aBlock ifNil: [nil] ifNotNil: [aBlock fixTemps]. z := a-1.
This is the clean way to do it. The class invariant is weakened from "the elements are always sorted" to "the elements are always sorted whenever anyone else looks; when noone else is looking they are in a convenient configuration for sorting." From the outside, it looks as though the elements are always sorted.
While anyone's looking at SortedCollection, consider
#(3 1 4 1 5 9 2 6 5 3 5) asSortedCollection reversed The result prints as a SortedCollection (9 6 5 5 5 4 3 3 2 1 1) but has sortBlock nil, so the order disagrees with the sortBlock. #reversed does make sense for SortedCollections; it ought to return a collection with the opposite comparison.
Dear Richard,
Richard A. O'Keefe wrote:
I asked:
Why should addLast: or addFirst: be given a complete veto?
and gave implementations of #addFirst: and #addLast: which complain only when the operation would violate the class invariant.
Stephen Rudlof replied: The short answer
The current situation for SortedCollections
- is inconsistent: >>addFirst: is forbidden, >>addLast: not;
That was understood and assumed in my question.
- semantically incorrect: using >>addLast: easily leads to a not sorted
SortedCollection.
That also was understood and assumed.
My question is not "what is broken about the present situation", but "why not allow these operations to proceed whenever they can do so without breaking the class invariant?"
To expand on this a little: why not give programmers a way to build a sorted collection in linear time when they are able to provide data in the right order?
I'm not against this. But please don't use addLast: for this. Use something like >>addLastAlreadyInOrder: instead. And something like
addAllAlreadySorted:, >>addAllProbablyAlreadySorted: or whatever for
adding whole collections.
The longer answer
Prenote: your suggestion is *semantically* correct. But there are psychological/practical reasons against introducing it.
Why does a programmer use #addLast:? He wants to build a Collection in an explicit order. Why does a programmer use a SortedCollection? He wants to have a Collection, which automatically ensures an order of added elements.
There are several attitudes a class designer might take.
The attitude I prefer is
- operations should never be allowed to break class invariants
Agreed. (see also my responses to Andreas Raab)
- we should avoid introducing operations that often don't work
Agreed: this is my point regarding >>addLast: introduced for SortedCollections: it always works for OrderedCollections, but only in very special cases, if applied to SortedCollections.
- but when we are stuck with an operation that CAN be allowed to proceed some of the time we should not be strict bondage and discipline nannies and stop sensible programmers doing sensible things.
This is a contradiction to the previous point: 'proceed *some* of the time' ('*' by me) violates the goal to 'avoid introducing operations that often don't work'.
The attitude here is
- I am like God; I know exactly how programmers think and they mustn't.
I know what I expect from calling >>addLast:.
I can foresee every possible use and know which ones make sense. (OK, this is a caricature, but it is a recognisable one.)
In this particular instance, it is possible for one and the same programmer, at a single instant in time, *BOTH* to want to build a collection in a particular order *AND* to want that collection to be a sorted collection.
For example, support I have a sorted collection of numbers, call it "fred": fred := #(3 1 4 1 5 9 2 6 5) asSortedCollection. Now suppose I want to add 1 to every element. THIS IS AN ORDER-PRESERVING TRANSFORMATION. So
jim := SortedCollection new. fred do: [:each | jim addLast: each + 1].
will do the job, IF I am allowed to use addLast: when it is safe. (Yes, I know I can do fred+1, but the result belongs to the wrong class.)
Good example! Introduce and use something like >>addLastAlreadyInOrder: instead.
Using >>addLast: for a SortedCollection means to try to give an explicit order to something, which has an automatically 'built in' one. This mostly is a programmer error.
I am not sure how to interpret "mostly" here. If it is an empirical statement about how often it does happen, then I would be intrigued to know where the evidence came from.
It's not an empirical statement. For me it is a convention to expect an OrderedCollection if using >>addLast: (also see below).
In the case where the two orders AGREE, which is the only case my versions of #addFirst: and #addLast: allow, it is not an error. In the case (of as far as I am aware completely unknown frequency) where the two orders do not agree, the error will still be reported.
And it violates encapsulation: I have to know, which built in order the SortedCollection has, to use >>addLast: correctly along your suggestion.
No, the sorting order of a sorted collection is part of its ***PUBLIC*** interface. Note that SortedCollection>>sortBlock is in the 'accessing' category, not the 'private' category.
Agreed: 'encapsulation' is the wrong term here.
My point is: if I'd use >>addLast: I'd expect an OrderedCollection, *not* a SortedCollection. And ANSI supports this kind of view.
By using >>add: instead, there would be no problem.
Yes there would. There would be an *EFFICIENCY* problem.
Note that I too can claim a psychological reason why addLast: should be used in this case rather than add:. The reason is that in Set #add: means "add only if absent", whereas whenever addLast: is allowed it will make the collection one element bigger.
So there remains the performance thing: using SortedCollection>>add: multiple times means, that each object has to be sorted in immediately. It is probably more performant to have a private add method, which just adds an element, while leaving the SortedCollection in an inconsistent state, and
reSort it afterwards, when all elements have been added.
"performant" is, to the best of my knowledge, not an English word. I certainly haven't been able to find it in any of the paper or electronic dictionaries I've just checked. Try "faster".
I'm wondering, because Leo knows it: http://dict.leo.org/?search=performant&searchLoc=0&relink=on&spe... But it may be wrong: you are the native speaker!
That has been my idea of introducing the 'bypass' method for clients which really know about the internals.
I don't like this (not exactly mine, see also my replies to Andreas Raab) idea anymore.
But the ordering used by a sorted collection is most definitely NOT a fact about its internals. It is in some respects the most public thing there is about such an object.
Agreed.
Part of my point is that you don't *need* a bypass method when you have addLast: and could make it respect the class invariant.
BTW: In the ANSI standard (revision 1.9) *protocol* <SortedCollection> doesn't conform to protocol <OrderedCollection>, which is the only one defining >>addFirst: and >>addLast:. This makes sense to me.
The ANSI argument was shot down in smoke and billowing flames when I tried to apply it to #removeAll:, I don't see why it should have any greater force here. Nothing in the ANSI standard prevents you *adding* methods to standard classes; if it did, almost all of the methods in Object would have to find another home.
Agreed. But I stay at my assumption to expect an OrderedCollection if using
addLast:. And if you introduce methods *** with the same name *** in other
protocols you break conventions (and ANSI is also a convention).
Implementation notes
To the private SortedCollection>>addLastWithoutSorting: there could also be added
addWithoutSorting: and addFirstWithoutSorting:;for - very few! - clients using the *internals*.
After using one of these methods >>reSort should be called to ensure consistency. Note: I haven't proved the implicit assumption here, that >>reSort'ing an already sorted SortedCollection is faster than adding elements in arbitrary order.
I suggest allowing methods to proceed that DON'T break the class invariant, and this is spurned as breaking encapsulation, even though the methods are concerned solely with matters in the public interface.
Instead, as a better way, Rudlof proposes adding methods that DO break the class invariant,
I've corrected this: but there *is* a client in danger breaking it (see reply to Andreas Raab), since it uses internals. And I've just continued to allow this instead of fixing this, too.
and DON'T offer any efficiency advantage.
Haven't measured this (regarding these clients using internal knowledge about SortedCollection).
<snipped a very interesting proposal to be handled in another reply>
Greetings,
Stephan
"performant" is, to the best of my knowledge, not an English word. I certainly haven't been able to find it in any of the paper or electronic dictionaries I've just checked. Try "faster".
I'm wondering, because Leo knows it: http://dict.leo.org/?search=performant&searchLoc=0&relink=on&s
pellToler=standard§Hdr=off&tableBorder=1> &cmpType=relaxed&lang=en
But it may be wrong: you are the native speaker!
I think "performant" is Denglish - e.g., an english word put into german context (same for "geupdated" or similar expressions). In German it means "[something] performs well, has good performance" and is therefore "performant".
Cheers, - Andreas
On Fri, 27 Sep 2002, Andreas Raab wrote:
"performant" is, to the best of my knowledge, not an English word.
I think "performant" is Denglish - e.g., an english word put into german
It's a perfectly good french adjective, and means precisely...
"[something] performs well, has good performance" and is therefore "performant".
...that. If it isn't part of English, it should be. (Who do we petition? L'Académie anglaise ?? ;-)
Tchao,
Ian
Richard A. O'Keefe wrote: <...>
While anyone's looking at SortedCollection, consider
#(3 1 4 1 5 9 2 6 5 3 5) asSortedCollection reversed
The result prints as a SortedCollection (9 6 5 5 5 4 3 3 2 1 1) but has sortBlock nil, so the order disagrees with the sortBlock. #reversed does make sense for SortedCollections; it ought to return a collection with the opposite comparison.
There may be a problem: if you apply >>reversed multiple times the opposite comparison sortBlock becomes bigger and bigger, if the implementation just inverts it by constructing a new one with sending #not to the old. There are other implementations possible, though.
Greetings,
Stephan
Richard,
your proposal implementing SortedCollection with 'lazy data structures' is very interesting. It removes the burden of all thinking about adding to a SortedCollection performantly ( ;-) ).
Any volunteer out there to give it a try?
One question is: How is the relation between changes of the SortedCollection and looking
at: it, since some time is spend for the checks there. If accessing it
outweights the changes by magnitudes there could be a *small* performance penalty.
But I guess 'normally' the proposal gives better performance and it is always possible to convert it into a plain old OrderedCollection for accessing it very often, if someone worries about this.
Greetings,
Stephan
Richard A. O'Keefe wrote: <...>
I think it's time to talk about lazy data structures.
We could add elements fast to SortedCollections like this.
The internal representation is an array partitioned thus: +----+------------+--------+----+ |1 |a z| x| n| +----+------------+--------+----+
elements 1..a-1 : all nil elements a..z : sorted according to sortBlock elements z+1..x : added elements not sorted yet elements x+1..n : all nil
Where I have written "a" and "x" you should understand the existing instance variables of OrderedCollection (inherited by SortedCollection) being used instead; the "z" instance variable is new.
add: anItem ^super addLast: anItem addFirst: anItem self shouldNotImplement addLast: anItem self shouldNotImplement addAll: aCollection ^super addAllLast: aCollection addAllFirst: aCollection self shouldNotImplement addAllLast: aCollection self shouldNotImplement
No, I'm not being inconsistent here. I'm saying "let someone build a sorted collection in linear time if they can provide the elements in sorted order." With this kind of implementation, #add: _is_ adequate for the job.
A strictly private method would exist:
pvtEnsureAllSorted "rather like current addAll:" x = z ifTrue: [^array]. self pvtSortAddedElements. "sorts z+1..x" self pvtMerge. "merges a..z with z+1..x" z := x. ^array
Ideally pvtSortAddedElements would use a method which is linear when the elements are sorted already; Dijkstra's smooth-sort is just one of many algorithms which are good at that, bottom-up natural merge is easier to understand and works very well in practice.
Using bottom-up natural merge, if you add M elements to a Sorted Collection having N elements and then look at the result, the adds would cost O(M) time, pvtEnsureAllSorted would cost O(M) for pvtSortAddedElements and O(M+N) for pvtMerge, for a total O(N+M). Building an entire collection from N=0 would thus be O(M).
Other methods would call pvtEnsureAllSorted, e.g.
at: index self assert: [index between: 1 and: x-a+1]. ^self pvtEnsureAllSorted at: index-a+1
do: aBlock self pvtEnsureAllSorted. ^super do: aBlock
Setting the sortblock would simply enlarge the unsorted area:
sortBlock: aBlock sortBlock := aBlock ifNil: [nil] ifNotNil: [aBlock fixTemps]. z := a-1.
This is the clean way to do it. The class invariant is weakened from "the elements are always sorted" to "the elements are always sorted whenever anyone else looks; when noone else is looking they are in a convenient configuration for sorting." From the outside, it looks as though the elements are always sorted.
<...>
squeak-dev@lists.squeakfoundation.org