[BUG] Set>>collect:

Richard A. O'Keefe ok at cs.otago.ac.nz
Sun Feb 16 22:58:31 UTC 2003


I wrote:

	> The essential point about #collect: is that the transformation block
	> does NOT in general return objects of the same type that it is passed.

The context here is
    aCollection collect: whateverBlockTheProgrammerWants

Bill Spight <bspight at pacbell.net> wrote:

	Well, in Squeak it seems to, in general. There are a few exceptions.
	
I find this response puzzling.  How #collect: happens to be used _in Squeak_
itself is not the issue.  #collect: is for other programmers to use.

Consider an actual example.  The collection is an OrderedCollection
of strings.  The strings have the form
    <location><space><word><space><stem>@<part-of-speech>
It so happens that I want the stem and part-of-speech.
Here goes:

    morphologyFile collect: [:eachLine |
	|tokens|
	tokens := eachLine findTokens: ' @'.
	Array with: tokens third with: tokens fourth]

Block input: a String
Block result: an Array
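
For a concrete instance, take one made-up line in that format (the sample
text is an assumption for illustration, not taken from the real file):

    | line tokens |
    line := 'p12 running run@V'.
    tokens := line findTokens: ' @'.    "four tokens: location, word, stem, part-of-speech"
    Array with: tokens third with: tokens fourth    "an Array holding 'run' and 'V'"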

Consider almost any example where you are extracting information from
an XML document (as opposed to converting it to another XML document).
For example, given

    <table>
      <tr><td>Passage1</td><td>Count11</td>...<td>Count1c</td></tr>
      ...
      <tr><td>Passager</td><td>Countr1</td>...<td>Countrc</td></tr>
    </table>

We want to extract the counts as an array of arrays to do a chi-squared.

    table children collect: [:eachRow |
        eachRow children allButFirst collect: [:eachCell |
            Integer readFrom: eachCell text ]]

outer block input: an XML node
outer block result: an Array
inner block input: an XML node
inner block result: an Integer

(This is with respect to an XML implementation where XMLNode>>collect:
returns an Array, not another XMLNode.  And the reason it does that is
precisely that the block might return results that are forbidden as
children of XMLNodes.)

	As for the case of the code in Set, Set in Squeak has 4 children:
	Dictionary, Identity Set, Pluggable Set, and Weak Set. Dictionary and
	Weak Set override collect: So the question comes down to Pluggable Set
	vs. Identity Set. If it is desirable to have PluggableSet>>collect:
	create a Set rather than a Pluggable Set, can't you take care of it in
	PluggableSet, and leave the general behavior the same in both Set and
	IdentitySet?
	
Yes, but the general behaviour we *WANT* for IdentitySet is precisely
the general behaviour we have *NOW*.  The result of #select: or #reject:
should be the same kind of set that we started with, because it has precisely
the same kind of elements that we started with.  The Set implementations of
#reject: and #select: (inherited from Collection) *do* use self species,
and it *does* make sense for every kind of set.

The Set implementation of #collect: does *not* use self species.
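
A quick workspace check of that contrast might look like this (a sketch;
the classes named in the comments are what the description above predicts
for the Squeak version under discussion, not captured output):

    | ids |
    ids := IdentitySet withAll: #(1 2 3 4).
    (ids select: [:each | each even]) class.     "expected: IdentitySet, via self species"
    (ids collect: [:each | each * 2]) class.     "expected: Set, hard-wired in Set>>collect:"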

Now, let's look at Set>>collect: and at what it *would* have inherited
had it not been overridden:

    Set>>
    collect: aBlock
	|newSet|
	newSet := Set new: self size.
	array do: [:each | each ifNotNil: [newSet add: (aBlock value: each)]].
	^newSet

    Collection>>
    collect: aBlock
	|newCollection|
	newCollection := self species new.
	self do: [:each | newCollection add: (aBlock value: each)].
	^newCollection

What exactly is the difference between them?
(1) The iteration is differently expressed, but exactly the same elements
    would be visited in the same order, and exactly the same results would
    be added to the result in the same order, so that's not it.

(2) The new collection is created with a default size in Collection and
    just the right size in Set, but Sets (including IdentitySets and
    PluggableSets) expand, and the expansion isn't _that_ costly.
    Had this been the main intended effect, the code could have read
	newSet := self species new: self size.
    But it doesn't.

(3) The principal difference is the use of Set new rather than self species
    new.  This is the only difference which has any externally visible
    consequences for the result.

So we see that the reason that there _is_ a Set>>collect: in the first place
appears to be precisely in order to AVOID using 'self species new'.  It's a
bug avoidance measure.  I have explained what the bug is.

By the way, PluggableSet appears to be a Johnny-come-lately.  It's not in
ANSI Smalltalk, and it's not mentioned in some commercial Smalltalks.

The point is that having IdentitySet>>collect: return an IdentitySet
would be simply WRONG in many, if not most, cases.
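
A sketch of the kind of case at issue (the words are invented, and
Set>>collect: is assumed to be as quoted above; the second half builds by
hand the IdentitySet that a 'self species' version would have produced):

    | words viaSet viaIdentity |
    words := IdentitySet withAll: #('running' 'runner' 'runs').
    viaSet := words collect: [:each | each copyFrom: 1 to: 3].
    viaIdentity := IdentitySet new.
    words do: [:each | viaIdentity add: (each copyFrom: 1 to: 3)].
    viaSet size.         "1: the three 'run' strings are equal"
    viaIdentity size.    "3: each 'run' is a distinct object"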

	Isn't this also to some extent a question of documentation?

Everything is to _some_ extent a question of documentation.
The question is "who bears the burden"?

	The general rule is set out in the comment in Collection.

Set out rather unclearly, yes.  But you must on no account press that
little non-technical word "like" too hard.  Anything more than "ordered
if the receiver is ordered, unordered if the receiver is unordered" would
be more than it could bear.

	It would be reasonable to
	add to that comment that 3 collections are exceptions: Dictionary,
	SortedCollection, and PluggableSet.

They are by no means the only "exceptions".  IdentitySet clearly has to be
one of the so-called "exceptions" too, and there are others.  I repeat,
you must not press that little word "like" too hard.  A Set is arguably
LIKE an IdentitySet.

	The natural place to look for the
	code covering those exceptions is in those classes, right?
	
Wrong.  That's not how Smalltalk works.  You would expect the code
to be in an affected class OR in some superclass.  And it is.  If you want
to look for such code, you don't look just at the class, you look at
the "Protocol".
	
IdentitySet is best regarded as a special case of PluggableSet;
we have in general NO reason to be confident that the result of the
transformation block will be such that identity is the appropriate version
of equality or #identityHash the appropriate hashing function.

Unlike PluggableSet, IdentitySet has a very nice property:
if IdentitySet *would* have been the appropriate result type,
if #== and #identityHash *would* have been the appropriate methods
to call, then since #= defaults to #== (in Object) and #hash defaults
to #identityHash (in Object), it is quite likely that Set will make
precisely the same "equality" judgements that IdentitySet would have.
So Set makes sense as the result for IdentitySet>>collect:.
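
To see the two agree when identity really is the right test, here is a
sketch using plain Objects, which inherit #= and #hash from Object
unchanged:

    | a b plain identity |
    a := Object new.
    b := Object new.
    plain := Set new.
    identity := IdentitySet new.
    plain add: a; add: b; add: a.
    identity add: a; add: b; add: a.
    plain size = identity size    "true: both hold exactly a and b"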


