[Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

H. Hirzel hannes.hirzel at gmail.com
Tue Dec 8 07:31:21 UTC 2015


Dale

Thank you for your answer with links to the ICU library and the notes
about classes in GemStone. It is noteworthy that you have a class Utf8
as a subclass of ByteArray.

I understand that GemStone uses the ICU library and thus does not
implement the algorithms in Smalltalk.

I am currently looking into what the ICU library provides.

I also found a Ruby library, TwitterCLDR [2], which implements CLDR [3].

It has methods like this:

Alphabetize a list using regular Ruby sort:

$> ["Art", "Wasa", "Älg", "Ved"].sort
=> ["Art", "Ved", "Wasa", "Älg"]

Alphabetize a list using TwitterCLDR’s locale-aware sort:

$> ["Art", "Wasa", "Älg", "Ved"].localize(:de).sort.to_a
=> ["Älg", "Art", "Ved", "Wasa"]

I hope that, given such an example, it would not be too difficult to
reimplement a similar sort algorithm in Squeak/Cuis/Pharo. Currently
the interest is in getting sorting done in a cross-dialect way.
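
To make the idea concrete, here is a minimal sketch in plain Ruby of a
multi-level sort key. It is only an illustration under strong assumptions:
it is not the Unicode Collation Algorithm, it has no collation element
table, no contractions and no locale tailoring, and it is not how
TwitterCLDR or ICU work internally. The names simple_collation_key and
simple_sort are made up for this example. It relies only on Ruby's built-in
String#unicode_normalize.

```ruby
# Simplified three-level sort key (hypothetical helper, not a real UCA
# implementation): letters first, then accents, then case.
def simple_collation_key(string)
  decomposed = string.unicode_normalize(:nfd)   # "Ä" becomes "A" + U+0308
  base       = decomposed.gsub(/\p{Mn}/, "")    # drop combining marks
  primary    = base.downcase                    # compare base letters, ignoring case
  secondary  = decomposed.scan(/\p{Mn}/).join   # accents only break primary ties
  tertiary   = base.swapcase                    # makes lowercase sort before uppercase
  [primary, secondary, tertiary]
end

# Sort a list of strings by the simplified key above (hypothetical helper).
def simple_sort(strings)
  strings.sort_by { |s| simple_collation_key(s) }
end

p simple_sort(["Art", "Wasa", "Älg", "Ved"])
# => ["Älg", "Art", "Ved", "Wasa"]
```

The same three steps (decompose, separate base letters from marks, compare
level by level) should be expressible in Smalltalk, provided a
decomposition table is available in the image.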

--Hannes

[2] https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
[3] Unicode Common Locale Data Repository http://cldr.unicode.org/index

On 12/7/15, Dale Henrichs <dale.henrichs at gemtalksystems.com> wrote:
> Hannes,
>
> For GemStone, we are using the ICU library[1]. We have Unicode7,
> Unicode16 and Unicode32 classes (subclasses of CharacterCollection) for
> internal Strings and the class Utf8 (a subclass of ByteArray) for Utf8
> encoded strings ...
>
> The ICU library provides the primitive implementations for working with
> the Unicode* and Utf8 classes
>
> When we started considering Unicode support, we looked at what it would
> take to support collation (our main reason for looking at Unicode in
> the first place), and when we saw just how complicated the collation
> rules can be[2], we were glad to see that someone had already done the
> hard work[1]...
>
> Reconciling our legacy String implementations (String, DoubleByteString,
> and QuadByteString) with the Unicode* classes was also interesting,
> because the rules for Unicode equality and our legacy equality
> implementation were not quite compatible.
>
> If you are interested in more information, I can share additional
> details ...
>
> Dale
>
> [1] http://site.icu-project.org/
> [2] http://unicode.org/reports/tr10/
>
> On 12/07/2015 11:54 AM, H. Hirzel wrote:
>> Hello
>>
>> According to http://www.unicode.org/cldr/charts/27/collation/de.html the
>> German phonebook sort order is
>>
>> a A ä Ä ą̈ Ą̈ ǟ Ǟ ạ̈ Ạ̈ ḁ̈ Ḁ̈ b B c C d D e E f F g G h H i I j J k K
>> l L m M n N o O ö Ö ǫ̈ Ǫ̈ ȫ Ȫ ơ̈ Ơ̈ ợ̈ Ợ̈ ọ̈ Ọ̈ p P q Q r R s S ss ß t
>> T u U ü Ü ǘ Ǘ ǜ Ǜ ǚ Ǚ ų̈ Ų̈ ǖ Ǖ ư̈ Ư̈ ự̈ Ự̈ ụ̈ Ụ̈ ṳ̈ Ṳ̈ ṷ̈ Ṷ̈ ṵ̈ Ṵ̈ v
>> V w W x X y Y z Z
>>
>> I wonder why it looks like this. It contains a lot of characters which
>> never appear in a German text.
>>
>>
>> For Spanish there are two orders, 'traditional' and 'standard':
>>
>> http://www.unicode.org/cldr/charts/27/collation/es.html
>>
>> standard	a A á Á b B c C d D e E é É f F g G h H i I í Í j J k K l L m
>> M n N ñ Ñ ņ̃ Ņ̃ ṇ̃ Ṇ̃ ṋ̃ Ṋ̃ ṉ̃ Ṉ̃ o O ó Ó p P q Q r R s S t T u U ú Ú
>> ü Ü v V w W x X y Y z Z
>>
>> traditional	a A á Á b B c C ch Ch CH cĥ Cĥ CĤ cȟ Cȟ CȞ cḧ Cḧ CḦ cḣ Cḣ
>> CḢ cḩ Cḩ CḨ cḥ Cḥ CḤ cḫ Cḫ CḪ cẖ Cẖ d D e E é É f F g G h H i I í Í j
>> J k K l L ll Ll LL lĺ Lĺ LĹ lľ Lľ LĽ lļ Lļ LĻ lḷ Lḷ LḶ lḹ Lḹ LḸ lḽ Lḽ
>> LḼ lḻ Lḻ LḺ m M n N ñ Ñ ņ̃ Ņ̃ ṇ̃ Ṇ̃ ṋ̃ Ṋ̃ ṉ̃ Ṉ̃ o O ó Ó p P q Q r R s
>> S t T u U ú Ú ü Ü v V w W x X y Y z Z
>>
>> French is not easily found in the chart index
>> http://www.unicode.org/cldr/charts/27/collation/index.html
>> and seems to be defined elsewhere:
>>
>> http://unicode.org/repos/cldr/tags/release-27/common/collation/fr.xml
>>
>> Suggestions and hints are welcome
>>
>> --Hannes
>> _______________________________________________
>> Cuis mailing list
>> Cuis at jvuletich.org
>> http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org
>
>

