[squeak-dev] Re: #withoutDuplicates on Collection?

15 Jun 2023

Hi all,
I originally brought up this question while revising a paragraph in Squeak by Example:
...
Collection>>asSet offers us a convenient way to eliminate duplicates from a collection:
  {Color black . Color white . (Color red + Color blue + Color green)} asSet size --> 2
The result of the example is 2 because the color white appears twice in the collection: once as the result of Color white, and once as the combination of red, blue, and green.

However, if you are working with a sequenceable collection and want to preserve its type or order, you can use SequenceableCollection>>withoutDuplicates instead.
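For illustration, the difference between the two messages can be tried in a workspace. This is a minimal sketch; the literal array is made up for the example, and the commented results assume the stock SequenceableCollection>>withoutDuplicates, which keeps the first occurrence of each element.

```smalltalk
"Workspace sketch; the sample array is an assumption for illustration."
#(3 1 3 2 1) asSet size.           "3; the result is a Set, so element order is not defined"
#(3 1 3 2 1) withoutDuplicates.    "#(3 1 2); still an Array, first occurrences kept in order"
```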

It struck me that I had to write "if you are working with a sequenceable collection" here, and it still strikes me today. #asSet should not be the only efficient way to deduplicate a bag, heap, or whatever. If I want deduplication, I should not have to know the type of my collection if it's a variant type.
The property of having duplicates is completely independent of the property of being sequenceable, unless you think about what an implementation of duplicate elimination might look like.
I think that this discussion is being overshadowed by another issue, the iteration semantics of dictionaries. Because Dictionary>>#do: ignores the keys, there is always a sharp break when applying iteration logic from another collection to dictionaries. Thus, I think we should not discuss Collection>>#withoutDuplicates through the example of dictionaries. But given their existing iteration semantics, I'm with Vanessa on the behavior of Dictionary>>#withoutDuplicates: Keys are only metadata not relevant for iteration, so they should also not influence deduplication.
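The iteration semantics mentioned here can be seen in a small workspace sketch. The dictionary contents are invented for the example; the point is only that #do: visits values, so duplicates can exist among the values while the keys are unique by construction.

```smalltalk
"Workspace sketch; keys and values are made up for illustration."
| d visited |
d := Dictionary new.
d at: #a put: 1; at: #b put: 1; at: #c put: 2.
visited := Bag new.
d do: [:each | visited add: each].   "do: passes values only: 1, 1 and 2, in unspecified order"
visited size.                        "3"
d values asSet size.                 "2; the values contain one duplicate"
d keys asSet size.                   "3; keys never contain duplicates"
```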
...
Yes but to remove duplicates you then need to keep only one key among the existing ones - how do you choose?
That's as undefined as the iteration order of #do:, or the answer of #anyOne for non-sequenceable collections. If you want to iterate or deduplicate your bag, order it first. :-)
Best,
Christoph
________________________________
From: Tobias Pape Das.Linux@gmx.de
Sent: Wednesday, 14 June 2023 11:58:54
To: The general-purpose Squeak developers list
Subject: [squeak-dev] Re: #withoutDuplicates on Collection?
Hi
...
On 13. Jun 2023, at 19:18, Vanessa Freudenberg vanessa@codefrau.net wrote:
I like Chris' argument and example.
And even for Dictionaries the semantics are pretty clear – the keys already have no duplicates after all, so obviously it would apply to the values.
I'm not on board with this one.
I'd expect #withoutDuplicates to be a '^self'.
That said, most messages without explicit *Key are on values in dict, so there's that
-t
...
No harm in moving it up.
Vanessa
On Tue, Jun 13, 2023 at 2:39 AM Marcel Taeumel via Squeak-dev squeak-dev@lists.squeakfoundation.org wrote:
That said, I am not "super against" moving #withoutDuplicates up to Collection. In contrast to #joinSeparatedBy:, I do not see the issue of "immediate surprise" here when having a non-sequenceable collection at hand. :-) Only maybe "probable surprise" or "eventual surprise" ^^
Best,
Marcel
...
On 13.06.2023 09:20:35, Marcel Taeumel marcel.taeumel@hpi.de wrote:
Hi all --
Chris (cmm) is arguing for a kind of "polymorphic convenience", focusing on a single operation where programmers do not care about specific properties of the collection at hand. The definition of "being a duplicate" is probably something like "a = b" and thus relying on the implementation of #=. I think that Christoph (ct) has a similar perspective here.
Vanessa (codefrau) is arguing for considering a broader context of what programmers are trying to achieve and how they think about their collections at hand. Here, "being sequenceable" seems to be somehow connected to figuring out what "being a duplicate in a container" means. Thus, it would be strange if #withoutDuplicates worked but a following "uniqueStuff first" would raise an error in certain cases. Additionally, "being a duplicate" seems to be tricky for more complex structures such as a Dictionary. There, "duplicate keys" are typically forbidden while "duplicate values" are okay. This property makes #withoutDuplicates a little bit more challenging to understand and use, depending on the kind of collection.
At the moment, programmers have to convert their collection to a sequenceable one to then be able to use #withoutDuplicates. This requires extra knowledge about the collection at hand. We have checks such as #isSequenceable for that to avoid unnecessary conversions. However, following Vanessa's train of thought, that extra knowledge would also be required when #withoutDuplicates was moved to the top. Specifically, programmers would have to think about the next operations they want to perform. Then, a conversion might be necessary anyway.
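The status quo described above might look like this in a workspace. It is only a sketch: aCollection stands for any collection whose class the caller does not know, and the Bag is just a sample value.

```smalltalk
"Status-quo sketch; aCollection and its contents are hypothetical."
| aCollection sequenceable unique |
aCollection := Bag withAll: #(1 2 2 3).
"Convert only when needed, using the #isSequenceable check."
sequenceable := aCollection isSequenceable
	ifTrue: [aCollection]
	ifFalse: [aCollection asOrderedCollection].
unique := sequenceable withoutDuplicates.
unique size.   "3"
```

The conversion step is exactly the extra knowledge under discussion: the caller has to decide which sequenceable class to convert to before deduplicating.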
As I cannot see any improvement of changing the status quo in this regard, I would argue not to move #withoutDuplicates to the top but keep it where it is.
Best,
Marcel
...
On 13.06.2023 05:22:53, Chris Muller asqueaker@gmail.com wrote:
Hi Christoph,
I was just wondering why we only define #withoutDuplicates on SequenceableCollection, since it does not depend on the order of the receiver and could be applied to other types of collections as well. For instance, Set could override it with ^self copy, Bags would be naturally covered as well, etc.
What do you think?
+1.  IMO, the implementation of methods as semantically abstract as "collections" should drive the decision of where they reside.  #size and #do: are the core methods of Collection.  Any operation that can rely solely on those should reside in Collection.
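As a rough sketch of what moving the method up could look like: the two methods below are hypothetical and not taken from the image. The first assumes that Collection>>select: is itself built on #do: and answers a collection of the receiver's species; the second is the Set override suggested in the original question.

```smalltalk
"Hypothetical method on Collection; a sketch, not the existing implementation."
withoutDuplicates
	"Answer a copy of the receiver in which each element occurs only once."
	| seen |
	seen := Set new.
	^ self select: [:each |
		(seen includes: each)
			ifTrue: [false]
			ifFalse: [seen add: each. true]]

"Hypothetical override on Set; a Set cannot contain duplicates."
withoutDuplicates
	^ self copy
```

Under these assumptions a Bag would answer a Bag with one occurrence per element, and a Dictionary would be filtered by its values, since Dictionary>>select: passes values to the block. Which key survives for a duplicated value would depend on the iteration order.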
When useful abstract operations are needlessly stuck in subclasses like SequenceableCollection, that, itself, becomes a question of semantics.  And of design and usability, too.  Especially when one discovers they have a use for it in a non-Sequenceable, "Why," becomes the always-distracting first question.  Often, people will simply re-implement something else rather than take a detour to lobby to move it up or modify the core library, thus continuing to reinforce a false notion of, "See?  It's not needed except for Sequenceables."
I lost a similar argument several years ago with #joinSeparatedBy:.  I was generating and processing hundreds of thousands of query objects, each having a collection of named arguments (key / value pairs), whose names (keys) simply had to be unique.  And because it absolutely did not matter what order the arguments were printed on the output stream, a Dictionary was the obvious choice for that collection.  Nevertheless, the naysayers argued that their own personal lack of context to such a use-case meant, "random result order makes such a feature questionable".
It is, until it isn't.  The only reason #do: on Dictionary makes sense (e.g., "should it enumerate the 'values', or the 'associations'?") is due to our own experience as Smalltalkers of using it.  Every operation on Dictionaries has non-obvious semantics to anyone not familiar with Smalltalk.  This is why the implementation should weigh most heavily in such decisions, with semantics and personal experiences being secondary weights.
Best,
  Chris