[Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Dale Henrichs dale.henrichs at gemtalksystems.com
Tue Dec 8 19:20:48 UTC 2015



On 12/07/2015 11:31 PM, H. Hirzel wrote:
> Dale
>
> Thank you for your answer with links to the ICU library and the notes
> about classes in Gemstone. Noteworthy that you have a class Utf8 as a
> subclass of ByteArray.
>
> I understand that Gemstone uses the ICU library and thus does not
> implement the algorithms in Smalltalk.
>
> I am currently looking into what the ICU library provides.
>
> I found as well a Ruby library [2] which implements CLDR [3]
>
> It has methods like this
>
> Alphabetize a list using regular Ruby sort:
>
> $> ["Art", "Wasa", "Älg", "Ved"].sort
> => ["Art", "Ved", "Wasa", "Älg"]
>
> Alphabetize a list using TwitterCLDR’s locale-aware sort:
>
> $> ["Art", "Wasa", "Älg", "Ved"].localize(:de).sort.to_a
> => ["Älg", "Art", "Ved", "Wasa"]
>
> I hope that given such an example it would not be too difficult to
> reimplement a similar sort algorithm in Squeak/Cuis/Pharo. Currently
> the interest is in getting sorting done in a cross-dialect-way.
>

I think that the issue (from a performance perspective) is that you
can't depend upon the value of the code point when doing collation: the
main algorithm[5] is pretty much table based. In addition to the
different sort orders based on characters, there are even more arcane
sort rules where characters at the end of a word can affect the sort
order of the word (for more info see[4]).
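
To make "table based" a bit more concrete, here is a minimal sketch in
plain Ruby. It assumes a made-up weight table: TOY_WEIGHTS,
toy_sort_key and toy_sort are hypothetical names for illustration, not
part of ICU or TwitterCLDR, and the weights are invented rather than
taken from the real Unicode data. It only shows the general idea of
comparing per-level weights looked up in a table instead of comparing
code points.

```ruby
# Hypothetical toy table: character => [primary, secondary, tertiary].
# The weights are invented for this sketch, not real Unicode data.
TOY_WEIGHTS = {}
("a".."z").each_with_index do |ch, i|
  TOY_WEIGHTS[ch]        = [i + 1, 0, 0]  # base letter
  TOY_WEIGHTS[ch.upcase] = [i + 1, 0, 1]  # case differs only at level 3
end
TOY_WEIGHTS["ä"] = [1, 1, 0]  # same primary as "a", differs at level 2
TOY_WEIGHTS["Ä"] = [1, 1, 1]

# Sort key: all primary weights first, then all secondary weights,
# then all tertiary weights. Raises KeyError for characters that are
# missing from the toy table.
def toy_sort_key(string)
  weights = string.each_char.map { |ch| TOY_WEIGHTS.fetch(ch) }
  (0..2).map { |level| weights.map { |w| w[level] } }
end

def toy_sort(words)
  words.sort_by { |word| toy_sort_key(word) }
end

words = ["Art", "Wasa", "Älg", "Ved"]
p words.sort        # code point order: ["Art", "Ved", "Wasa", "Älg"]
p toy_sort(words)   # table order:      ["Älg", "Art", "Ved", "Wasa"]
```

Because the accent only shows up at the second level, "Älg" and "Art"
are first compared as if both started with a plain "A". That is why the
code point value alone can't give you this order.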

It is worth looking at the Conformance section of the Unicode spec[1],
as there are different levels of collation conformance.

ICU conforms[2] to UTS #10[3], the highest level of conformance.

It looks like TwitterCLDR[6] uses the Main Algorithm[5] with
tailoring[7]. They don't claim to be conformant to the Unicode Collation
Algorithm[3], but they are covering a big chunk of the standard use
cases.
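
As a rough illustration of what tailoring means, the sketch below
layers per-locale overrides on top of a default table. Everything in it
is an assumption made for the example: DEFAULT_WEIGHTS, TAILORINGS and
tailored_sort are hypothetical names, the weights are invented, and
this is not how TwitterCLDR or ICU actually store their data. The one
real-world fact it leans on is that Swedish treats "ä" as a separate
letter sorted after "z", while the toy default treats it as a variant
of "a".

```ruby
# Hypothetical default table: character => [primary, secondary].
# Case is ignored in this sketch: upper and lower case share weights.
DEFAULT_WEIGHTS = {}
("a".."z").each_with_index do |ch, i|
  DEFAULT_WEIGHTS[ch]        = [i + 1, 0]
  DEFAULT_WEIGHTS[ch.upcase] = [i + 1, 0]
end
DEFAULT_WEIGHTS["ä"] = [1, 1]  # default: "ä" is a variant of "a"
DEFAULT_WEIGHTS["Ä"] = [1, 1]

# Hypothetical tailorings: per-locale overrides of the default table.
TAILORINGS = {
  de: {},                                      # toy German: no overrides
  sv: { "ä" => [27, 0], "Ä" => [27, 0] }       # toy Swedish: after "z"
}

# Sort words with the default table plus the overrides for one locale.
# Raises KeyError for an unknown locale or an unknown character.
def tailored_sort(words, locale)
  table = DEFAULT_WEIGHTS.merge(TAILORINGS.fetch(locale))
  words.sort_by do |word|
    weights = word.each_char.map { |ch| table.fetch(ch) }
    [weights.map(&:first), weights.map(&:last)]  # primaries, then secondaries
  end
end

words = ["Art", "Wasa", "Älg", "Ved"]
p tailored_sort(words, :de)  # ["Älg", "Art", "Ved", "Wasa"]
p tailored_sort(words, :sv)  # ["Art", "Ved", "Wasa", "Älg"]
```

The algorithm stays the same for both locales; only the table changes.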

Dale

[1] http://unicode.org/reports/tr10/#Conformance
[2] http://userguide.icu-project.org/collation
[3] http://www.unicode.org/reports/tr10/
[4] http://www.unicode.org/reports/tr10/#Introduction
[5] http://www.unicode.org/reports/tr10/#Main_Algorithm
[6] https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
[7] http://unicode.org/reports/tr10/#Tailoring

