[Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Tue Dec 8 20:22:12 UTC 2015

Dale - is it that you can't depend on the value of a code point
*unless the string is either in fully-composed form
(or has just been fully-decomposed from a fully-composed form)*

OR are there circumstances where even those two cases cannot be relied upon?
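
To make the question concrete, here is a small Ruby sketch (Ruby because
that is what the TwitterCLDR example quoted below uses). It only shows that
the same visible text has different code points in composed and decomposed
form, and that a plain code-point sort therefore gives different results
depending on the form. It does not settle whether normalizing is enough for
correct collation. The helper name same_text? is made up for illustration,
and String#unicode_normalize assumes Ruby 2.2 or later.

```ruby
# Two ways to encode the same visible text "Älg".
COMPOSED   = "\u00C4lg"   # precomposed U+00C4 (the NFC form)
DECOMPOSED = "A\u0308lg"  # "A" followed by U+0308 COMBINING DIAERESIS (the NFD form)

# Hypothetical helper: compare two strings after bringing both to NFC.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

p COMPOSED.codepoints               # code points of the composed form
p DECOMPOSED.codepoints             # code points of the decomposed form
p COMPOSED == DECOMPOSED            # raw comparison sees two different strings
p same_text?(COMPOSED, DECOMPOSED)  # equal once both are normalized

# Plain sort compares encoded bytes, so the form changes the order:
p ["Zebra", COMPOSED].sort    # composed Ä (U+00C4) sorts after "Z"
p ["Zebra", DECOMPOSED].sort  # decomposed form starts with "A", sorts before "Z"
```

So at the very least both strings have to be in the same normalization form
before their code points can be compared at all.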

On 8 December 2015 at 19:20, Dale Henrichs
<dale.henrichs at gemtalksystems.com> wrote:
>
>
> On 12/07/2015 11:31 PM, H. Hirzel wrote:
>>
>> Dale
>>
>> Thank you for your answer with links to the ICU library and the notes
>> about classes in Gemstone. Noteworthy that you have a class Utf8 as a
>> subclass of ByteArray.
>>
>> I understand that Gemstone uses the ICU library and thus does not
>> implement the algorithms in Smalltalk.
>>
>> I am currently looking into what the ICU library provides.
>>
>> I found as well a Ruby library [2] which implements CLDR [3]
>>
>> It has methods like this
>>
>> "Alphabetize a list using regular Ruby sort:"
>>
>> $> ["Art", "Wasa", "Älg", "Ved"].sort
>> $> ["Art", "Ved", "Wasa", "Älg"]
>>
>> Alphabetize a list using TwitterCLDR’s locale-aware sort:
>>
>> $> ["Art", "Wasa", "Älg", "Ved"].localize(:de).sort.to_a
>> $> ["Älg", "Art", "Ved", "Wasa"]
>>
>> I hope that given such an example it would not be too difficult to
>> reimplement a similar sort algorithm in Squeak/Cuis/Pharo. Currently
>> the interest is in getting sorting done in a cross-dialect-way.
>>
>
> I think that the issue (from a performance perspective) is that you can't
> depend upon the value of the code point when doing collation --- the main
> algorithm[5] is pretty much table based --- In addition to the different
> sort orders based on characters there are even more arcane sort rules where
> characters at the end of a word can affect the sort order of the word (for
> more info see[4]).
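
A rough sketch of what "table based" means here, in plain Ruby. This is a
toy and not the Unicode Collation Algorithm: the weights in TOY_TABLE are
invented for illustration (they are not taken from DUCET or CLDR), only a-z
plus the German umlauts are covered, and the names toy_sort_key and
toy_collate are hypothetical. The idea it illustrates is that each character
is looked up in a table of (primary, secondary, tertiary) weights, and the
strings are compared by sort keys built from those weights rather than by
code point.

```ruby
# Toy collation table: character => [primary, secondary, tertiary].
# Invented weights: primary = base letter, secondary = accent, tertiary = case.
TOY_TABLE = {}
("a".."z").each_with_index do |ch, i|
  TOY_TABLE[ch]        = [i + 1, 1, 1]
  TOY_TABLE[ch.upcase] = [i + 1, 1, 2]
end

# Umlauts share the primary weight of their base letter and differ
# only at the secondary level.
{ "a" => ["\u00E4", "\u00C4"],
  "o" => ["\u00F6", "\u00D6"],
  "u" => ["\u00FC", "\u00DC"] }.each do |base, (lower, upper)|
  primary = TOY_TABLE[base][0]
  TOY_TABLE[lower] = [primary, 2, 1]
  TOY_TABLE[upper] = [primary, 2, 2]
end

# Build a sort key: all primary weights, a 0 separator, all secondary
# weights, a 0 separator, then all tertiary weights.
# Raises KeyError for characters missing from the toy table.
def toy_sort_key(str)
  weights = str.unicode_normalize(:nfc).each_char.map { |ch| TOY_TABLE.fetch(ch) }
  (0..2).flat_map { |level| weights.map { |w| w[level] } + [0] }
end

# Sort words by comparing their sort keys (arrays compare element by element).
def toy_collate(words)
  words.sort_by { |word| toy_sort_key(word) }
end

words = ["Art", "Wasa", "\u00C4lg", "Ved"]
p words.sort          # code-point order
p toy_collate(words)  # table-based order
```

With this table "Älg" and "Art" tie on the first primary weight and are
decided by "l" versus "r", which is why "Älg" moves to the front. A real
implementation also needs contractions, expansions and locale tailoring,
which this sketch leaves out entirely.
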
>
> It is worth looking at the Conformance section of the Unicode spec[1] as
> there are different levels of collation conformance .....
>
> ICU conforms[2] to UTS #10[3], the highest level of conformance ...
>
> It looks like TwitterCLDR[6] uses the Main Algorithm[5] with tailoring[7].
> They don't claim to be conformant to the Unicode Collation Algorithm[3], but
> they are covering a big chunk of the standard use cases ....
>
> Dale
>
> [1] http://unicode.org/reports/tr10/#Conformance
> [2] http://userguide.icu-project.org/collation
> [3] http://www.unicode.org/reports/tr10/
> [4] http://www.unicode.org/reports/tr10/#Introduction
> [5] http://www.unicode.org/reports/tr10/#Main_Algorithm
> [6]
> https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
> [7] http://unicode.org/reports/tr10/#Tailoring