[squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Dale Henrichs dale.henrichs at gemtalksystems.com
Thu Dec 10 20:27:36 UTC 2015



On 12/09/2015 04:31 PM, Levente Uzonyi wrote:
> On Wed, 9 Dec 2015, Dale Henrichs wrote:
>
>>
>>
>> On 12/09/2015 12:44 AM, Stephan Eggermont wrote:
>>> On 08-12-15 22:35, Dale Henrichs wrote:
>>>> What I meant is that you can't _always_ use the code point for
>>>> collation, i.e., sorting based on the value of code points is not
>>>> always correct[1].
>>>
>>> I gave up on universal sorting when I learned that Dutch
>>> libraries' sorting of author names depends on the country of origin
>>> of the author. So if Jan van Beek is Dutch he will be sorted under
>>> B, while if he's Belgian under V. I haven't checked what happens if
>>> the author emigrates, or changes nationality...
>>>
>>> Stephan
>>>
>> Well, with ICU (and GemStone's implementation) you can choose which
>> collator to use (country-specific) at the image level or on a
>> comparison-by-comparison basis ... for example for an indexed
>> collection of (Unicode) Strings, you can choose the collator to use
>> for that particular index ... so while it's true that a universal
>> sorter is not possible, it is possible to choose a collator that will
>> satisfy a particular customer ....
>
> I expect my image to compare strings using the codepoint-based 
> (+language tags) lexicographical method, because it's simple, 
> deterministic and fast.
> Imagine having failing tests just because your image uses different 
> default comparison methods based on some (external) parameter...
> It's also a nightmare to find out why your program is slow on some 
> machine, while it's fast on another.

When we implemented the Unicode support in GemStone we preserved the
legacy string classes and their legacy behavior ... We added new
Unicode* classes with the new collator-based behavior for sorting and
comparison ... That way legacy applications (and legacy tests) were not
impacted by the choice of collator ... And folks could choose whether
or not their application would benefit from the use of the new Unicode*
classes....
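
To make the separation concrete, here is a rough Python sketch (not
GemStone code, and not the ICU API). Plain `str` stands in for the
legacy string classes and keeps code point ordering. `CollatedString`
and `toy_collation_key` are hypothetical names made up for this
illustration; the key function only strips accents and folds case, so
it is a toy stand-in for a real collator's sort key.

```python
import unicodedata
from functools import total_ordering


def toy_collation_key(text):
    """Hypothetical stand-in for a collator sort key (NOT ICU).

    Decomposes the text, drops combining marks, and folds case. The
    original text is kept as a tie-breaker so the order stays
    deterministic.
    """
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), text)


@total_ordering
class CollatedString:
    """Hypothetical analogue of a Unicode* class: compares via a key function."""

    def __init__(self, text, key=toy_collation_key):
        self.text = text
        self.key = key  # the "collator" can be chosen per instance

    def __eq__(self, other):
        return self.key(self.text) == other.key(other.text)

    def __lt__(self, other):
        return self.key(self.text) < other.key(other.text)


# "\u00c4" is the precomposed letter A with diaeresis.
words = ["Zebra", "apple", "\u00c4pfel", "zoo"]

# Legacy behavior: plain str still sorts by code point value.
legacy_order = sorted(words)

# New behavior: opt in by wrapping the strings in the collated class.
collated_order = [c.text for c in sorted(CollatedString(w) for w in words)]

print(legacy_order)
print(collated_order)
```

Code that never wraps its strings in `CollatedString` keeps the old
ordering, which is the property that protects legacy tests.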

The ICU library performance is actually comparable to our original
implementations, so there isn't a noticeable performance difference. We
built the support into our VM, and if folks are interested in some of
the gory technical details, we'd be willing to share our experience, as
there are several things that we did to minimize potential performance
impacts.

Dale



