[Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Tue Dec 8 21:35:46 UTC 2015

Euan,

What I meant is that you can't _always_ use the code point for 
collation, i.e., sorting based on the value of code points is not always 
correct[1].
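
As a rough illustration (a Python sketch, not something taken from the 
references below), a plain sort compares code points, so "Ä" (U+00C4) 
lands after every ASCII letter and an uppercase "Z" lands before a 
lowercase "a":

```python
# Python's default string comparison orders strings by code point value.
words = ["Art", "Wasa", "Älg", "Ved"]
by_code_point = sorted(words)

# "Ä" is U+00C4 (196), which is greater than "W" (87), so "Älg" sorts last.
print(by_code_point)

# Case shows the same problem: "Z" (90) sorts before "a" (97).
mixed_case = sorted(["apple", "Zebra"])
print(mixed_case)
```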

If I'm not mistaken, the fully-composed and fully-decomposed forms can 
only be used for testing the equivalence of two strings[2] ...
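
A small Python sketch of that point, using the standard unicodedata 
module (the sample words are just the ones from the quoted Ruby example; 
the variable names are made up for illustration). Normalizing makes the 
two spellings of "Älg" compare equal, but the code point order you get 
depends on which form you picked, and neither form puts "Älg" before 
"Art":

```python
import unicodedata

composed = "\u00c4lg"     # "Älg" using precomposed U+00C4
decomposed = "A\u0308lg"  # "Älg" as "A" followed by U+0308 COMBINING DIAERESIS

# Without normalization the two spellings are different code point sequences.
raw_equal = composed == decomposed

# After normalizing both to the same form they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

def sorted_in_form(form, words):
    """Normalize every word to the given form, then sort by code point."""
    return sorted(unicodedata.normalize(form, w) for w in words)

words = ["Art", "Wasa", decomposed, "Ved"]
nfc_order = sorted_in_form("NFC", words)  # "Älg" ends up last
nfd_order = sorted_in_form("NFD", words)  # "Älg" ends up between "Art" and "Ved"

print(raw_equal, nfc_equal)
print(nfc_order)
print(nfd_order)
```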

The Main Algorithm[3] starts by producing a normalized form of the 
string, but the subsequent steps (produce the collation element array, 
form the sort key and compare) involve table lookups among other things ....

Once you've produced a sort key for each string, the sort keys are 
collated using "binary comparison", which is a byte-by-byte numeric 
comparison ...
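
To make the four steps concrete, here is a toy Python sketch. Everything 
about the weights is an assumption for illustration: the tiny table in 
collation_element is hypothetical, it covers only ASCII letters and 
U+0308, and the numbers are NOT the real weights from the UTS #10 data 
files. It only shows the shape of the process: normalize, look up 
weights, build a byte string, compare bytes.

```python
import string
import unicodedata

COMBINING_DIAERESIS = "\u0308"

def collation_element(ch):
    """Toy table lookup: return (primary, secondary, tertiary) weights.

    Hypothetical values for illustration only, not real UCA weights.
    """
    if ch == COMBINING_DIAERESIS:
        # Primary weight 0: the accent is ignored at the first level.
        return (0, 0x2B, 0x02)
    if ch in string.ascii_letters:
        primary = 0x10 + string.ascii_lowercase.index(ch.lower())
        tertiary = 0x08 if ch.isupper() else 0x02
        return (primary, 0x20, tertiary)
    raise KeyError(f"no toy weight for {ch!r}")

def sort_key(text):
    """Normalize, produce the element array, then form a byte sort key."""
    elements = [collation_element(ch)
                for ch in unicodedata.normalize("NFD", text)]
    key = bytearray()
    for level in range(3):
        if level > 0:
            key += b"\x00\x00"  # level separator, lower than any weight
        for element in elements:
            weight = element[level]
            if weight:  # zero weights are left out of the key
                key += weight.to_bytes(2, "big")
    return bytes(key)

# Python compares bytes objects byte by byte, i.e. a binary comparison.
words = ["Art", "Wasa", "\u00c4lg", "Ved"]
collated = sorted(words, key=sort_key)
print(collated)
```

With this toy table "Älg" sorts before "Art", because the accent only 
contributes at the second level and "l" has a lower primary weight than 
"r". A real implementation would take its weights from the Unicode 
tables plus any locale tailoring.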

Dale

[1] http://www.unicode.org/reports/tr10/#Common_Misperceptions
[2] http://unicode.org/reports/tr15/pdtr15.html
[3] http://www.unicode.org/reports/tr10/#Main_Algorithm

On 12/08/2015 12:22 PM, EuanM wrote:
> Dale - is that you can't depend on the value of a codepoint
> *unless the string is either in fully-composed form
> (or has just been fully-decomposed from a fully-composed form) *
>
> OR are there circumstances where even those two cases cannot be relied upon?
>
> On 8 December 2015 at 19:20, Dale Henrichs
> <dale.henrichs at gemtalksystems.com> wrote:
>>
>> On 12/07/2015 11:31 PM, H. Hirzel wrote:
>>> Dale
>>>
>>> Thank you for your answer with links to the ICU library and the notes
>>> about classes in Gemstone. Noteworthy that you have a class Utf8 as a
>>> subclass of ByteArray.
>>>
>>> I understand that Gemstone uses the ICU library and thus does not
>>> implement the algorithms in Smalltalk.
>>>
>>> I am currently looking into what the ICU library provides.
>>>
>>> I found as well a Ruby library [2] which implements CLDR [3]
>>>
>>> It has methods like this
>>>
>>> "Alphabetize a list using regular Ruby sort:"
>>>
>>> $> ["Art", "Wasa", "Älg", "Ved"].sort
>>> $> ["Art", "Ved", "Wasa", "Älg"]
>>>
>>> Alphabetize a list using TwitterCLDR’s locale-aware sort:
>>>
>>> $> ["Art", "Wasa", "Älg", "Ved"].localize(:de).sort.to_a
>>> $> ["Älg", "Art", "Ved", "Wasa"]
>>>
>>> I hope that given such an example it would not be too difficult to
>>> reimplement a similar sort algorithm in Squeak/Cuis/Pharo. Currently
>>> the interest is in getting sorting done in a cross-dialect-way.
>>>
>> I think that the issue (from a performance perspective) is that you can't
>> depend upon the value of the code point when doing collation --- the main
>> algorithm[5] is pretty much table based --- In addition to the different
>> sort orders based on characters there are even more arcane sort rules where
>> characters at the end of a word can affect the sort order of the word (for
>> more info see[4]).
>>
>> It is worth looking at the Conformance section of the Unicode spec[1] as
>> there are different levels of collation conformance .....
>>
>> ICU conforms[2] to UTS #10[3], the highest level of conformance ...
>>
>> It looks like TwitterCLDR[6] uses the Main Algorithm[5] with tailoring[7].
>> They don't claim to be conformant to the Unicode Collation Algorithm[3], but
>> they are covering a big chunk of the standard use cases ....
>>
>> Dale
>>
>> [1] http://unicode.org/reports/tr10/#Conformance
>> [2] http://userguide.icu-project.org/collation
>> [3] http://www.unicode.org/reports/tr10/
>> [4] http://www.unicode.org/reports/tr10/#Introduction
>> [5] http://www.unicode.org/reports/tr10/#Main_Algorithm
>> [6]
>> https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
>> [7] http://unicode.org/reports/tr10/#Tailoring