[Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

EuanM euanmee at gmail.com
Wed Dec 9 00:33:51 UTC 2015


Reading up on http://www.unicode.org/reports/tr15/#Examples

The Unicode standard seems to require you never to make
aGermanStrasse
equivalent in a sort order to the ligatured version,
aGermanStraße

This seems counter-intuitive to me.

Is there a reason for this?  Have I just simply picked this up wrongly?
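
A quick check of the premise, as a minimal sketch using only Ruby's standard String#unicode_normalize: TR15 covers normalization (string equivalence) rather than sort order, and none of its four forms appears to turn "ß" into "ss". Whether the two spellings sort together would then be a question for the collation algorithm (UTS #10), not for TR15. The variable names below are made up for illustration.

```ruby
# Compare the two spellings under each Unicode normalization form.
# String#unicode_normalize ships with Ruby (2.2 and later).
ligatured   = "Stra\u00DFe"   # "Straße", written with LATIN SMALL LETTER SHARP S
spelled_out = "Strasse"

forms = [:nfc, :nfd, :nfkc, :nfkd]

# For each form, record whether the two normalized strings are identical.
same_after = forms.map do |form|
  [form, ligatured.unicode_normalize(form) == spelled_out.unicode_normalize(form)]
end.to_h

same_after.each { |form, same| puts "#{form}: #{same}" }

# By contrast, a precomposed and a decomposed "Ä" do normalize to the same string.
precomposed = "\u00C4lg"    # Ä as a single code point
decomposed  = "A\u0308lg"   # A followed by COMBINING DIAERESIS
a_umlaut_same = precomposed.unicode_normalize(:nfd) == decomposed.unicode_normalize(:nfd)
puts "A-umlaut forms equal after NFD: #{a_umlaut_same}"
```

So normalization only makes different encodings of the same character compare equal; it does not fold "ß" into "ss".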

On 8 December 2015 at 21:35, Dale Henrichs
<dale.henrichs at gemtalksystems.com> wrote:
> Euan,
>
> What I meant is that you can't _always_ use the code point for collation,
> i.e., sorting based on the value of code points is not always correct[1].
>
> If I'm not mistaken the fully-composed and fully-decomposed forms can only
> be used for testing the equivalence of two strings[2] ...
>
> The Main Algorithm[3] starts by producing a normalized form of the string,
> but the subsequent steps (produce array, form sort key and compare) involve
> table lookups among other things ....
>
> Once you've produced a sort key for a string, the sort key does use "binary
> comparison" for collating, which is a byte by byte numeric comparison ...
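
The four steps described above (normalize, produce array, form sort key, compare) can be sketched roughly as follows. This is a toy illustration, not the Unicode Collation Algorithm itself: TOY_TABLE and its weights are invented, toy_sort_key is a hypothetical helper, and contractions, expansions, variable weighting and tailoring are all left out.

```ruby
# Toy collation element table: character => [primary, secondary, tertiary].
# The weights are invented for illustration; a real table is far larger.
TOY_TABLE = {
  "a" => [1, 1, 1], "A" => [1, 1, 2],
  "b" => [2, 1, 1], "B" => [2, 1, 2],
  "\u0308" => [0, 2, 1] # COMBINING DIAERESIS: ignorable at the primary level
}.freeze

def toy_sort_key(string)
  # Step 1: normalize, so precomposed and decomposed input look the same.
  normalized = string.unicode_normalize(:nfd)
  # Step 2: produce the array of collation elements via table lookup.
  elements = normalized.each_char.map { |ch| TOY_TABLE.fetch(ch) }
  # Step 3: form the key level by level, dropping zero weights and
  # using 0 as the separator between levels.
  levels = (0..2).map { |i| elements.map { |e| e[i] }.reject(&:zero?) }
  weights = levels[0] + [0] + levels[1] + [0] + levels[2]
  # Pack as 16-bit big-endian so byte order matches numeric order.
  weights.pack("n*")
end

words = ["b", "\u00E4", "a", "B", "A"]

# Step 4: compare the keys; Ruby compares these binary strings byte by byte.
by_sort_key   = words.sort_by { |w| toy_sort_key(w) }
by_code_point = words.sort

puts by_sort_key.inspect
puts by_code_point.inspect
```

With these made-up weights the sort-key order comes out as a, A, ä, b, B, whereas plain code point order gives A, B, a, b, ä. The table lookup in step 2 is the part that cannot be derived from code point values alone.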
>
> Dale
>
> [1] http://www.unicode.org/reports/tr10/#Common_Misperceptions
> [2] http://unicode.org/reports/tr15/pdtr15.html
> [3] http://www.unicode.org/reports/tr10/#Main_Algorithm
>
>
> On 12/08/2015 12:22 PM, EuanM wrote:
>>
>> Dale - is that you can't depend on the value of a codepoint
>> *unless the string is either in fully-composed form
>> (or has just been fully-decomposed from a fully-composed form) *
>>
>> OR are there circumstances where even those two cases cannot be relied
>> upon?
>>
>> On 8 December 2015 at 19:20, Dale Henrichs
>> <dale.henrichs at gemtalksystems.com> wrote:
>>>
>>>
>>> On 12/07/2015 11:31 PM, H. Hirzel wrote:
>>>>
>>>> Dale
>>>>
>>>> Thank you for your answer with links to the ICU library and the notes
>>>> about classes in Gemstone. Noteworthy that you have a class Utf8 as a
>>>> subclass of ByteArray.
>>>>
>>>> I understand that Gemstone uses the ICU library and thus does not
>>>> implement the algorithms in Smalltalk.
>>>>
>>>> I am currently looking into what the ICU library provides.
>>>>
>>>> I found as well a Ruby library [2] which implements CLDR [3]
>>>>
>>>> It has methods like this
>>>>
>>>> "Alphabetize a list using regular Ruby sort:"
>>>>
>>>> $> ["Art", "Wasa", "Älg", "Ved"].sort
>>>> $> ["Art", "Ved", "Wasa", "Älg"]
>>>>
>>>> Alphabetize a list using TwitterCLDR’s locale-aware sort:
>>>>
>>>> $> ["Art", "Wasa", "Älg", "Ved"].localize(:de).sort.to_a
>>>> $> ["Älg", "Art", "Ved", "Wasa"]
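
To see why the plain sort puts "Älg" last, here is a minimal sketch using only the Ruby standard library, without TwitterCLDR. The helper crude_primary_key is hypothetical: it decomposes each word, drops combining marks and lowercases the result, which happens to reproduce the German ordering above for this list but is not locale-aware and is no substitute for real collation.

```ruby
words = ["Art", "Wasa", "\u00C4lg", "Ved"]   # "\u00C4lg" is "Älg"

# Plain sort compares bytes, which for UTF-8 follows code point order,
# and "Ä" (U+00C4) lies above every ASCII letter.
words.each { |w| puts format("%s starts at U+%04X", w, w.ord) }

# Crude stand-in for a primary-level key (an assumption, not the UCA):
# decompose, strip nonspacing marks, ignore case.
def crude_primary_key(word)
  word.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
end

by_code_point = words.sort
by_crude_key  = words.sort_by { |w| crude_primary_key(w) }

puts by_code_point.inspect
puts by_crude_key.inspect
```

A locale where "Ä" is a separate letter sorted after "Z" would need a different ordering again, which is where table-driven tailoring comes in.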
>>>>
>>>> I hope that given such an example it would not be too difficult to
>>>> reimplement a similar sort algorithm in Squeak/Cuis/Pharo. Currently
>>>> the interest is in getting sorting done in a cross-dialect-way.
>>>>
>>> I think that the issue (from a performance perspective) is that you can't
>>> depend upon the value of the code point when doing collation; the main
>>> algorithm[5] is pretty much table based. In addition to the different
>>> sort orders based on characters there are even more arcane sort rules
>>> where characters at the end of a word can affect the sort order of the
>>> word (for more info see[4]).
>>>
>>> It is worth looking at the Conformance section of the Unicode spec[1] as
>>> there are different levels of collation conformance .....
>>>
>>> ICU conforms[2] to UTS #10[3], the highest level of conformance ...
>>>
>>> It looks like TwitterCLDR[6] uses the Main Algorithm[5] with tailoring[7].
>>> They don't claim to be conformant to the Unicode Collation Algorithm[3],
>>> but they are covering a big chunk of the standard use cases ....
>>>
>>> Dale
>>>
>>> [1] http://unicode.org/reports/tr10/#Conformance
>>> [2] http://userguide.icu-project.org/collation
>>> [3] http://www.unicode.org/reports/tr10/
>>> [4] http://www.unicode.org/reports/tr10/#Introduction
>>> [5] http://www.unicode.org/reports/tr10/#Main_Algorithm
>>> [6] https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
>>> [7] http://unicode.org/reports/tr10/#Tailoring
>
>
