[Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

EuanM euanmee at gmail.com
Tue Dec 8 23:50:38 UTC 2015


Dale,

yes - sorting by codepoint value is almost always wrong.  Sorting is
an application-specific issue, not a technical Unicode issue: there is
often more than one canonical sort order per culture, and often more
than one culture per writing system.

e.g. ISO Latin 1 / Latin 9 covers these cultures (amongst others):
English (2 sort orders); Spanish; French (2 sort orders); German (2
sort orders); Swedish; etc

German sort order differs from Swedish for the same characters, etc
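
To make the German / Swedish point concrete, here is a minimal sketch in
Python (chosen only for brevity; nothing here is Smalltalk or ICU).  The
names german_key, swedish_key, GERMAN_FOLDS and SWEDISH_ALPHABET are
hypothetical, and the tables are heavily simplified assumptions: real
collation has several comparison levels and many more rules.

```python
# Simplified, hypothetical collation keys for illustration only.
# Real collation compares on several levels (base letter, accent, case).

# Assumption: roughly the German dictionary convention, where umlauts
# sort like their base vowel and sharp s sorts like "ss".
GERMAN_FOLDS = (("ä", "a"), ("ö", "o"), ("ü", "u"), ("å", "a"), ("ß", "ss"))

# Swedish treats å, ä, ö as separate letters placed after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def german_key(word):
    """Sort key that folds umlauts onto their base letters."""
    folded = word.lower()
    for src, dst in GERMAN_FOLDS:
        folded = folded.replace(src, dst)
    # The original word breaks ties between words that fold identically.
    return (folded, word)


def swedish_key(word):
    """Sort key that ranks each letter by its position in the Swedish alphabet."""
    # Characters outside the table are pushed after all known letters.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_ALPHABET) + ord(ch))
            for ch in word.lower()]


words = ["zebra", "ärger", "apfel", "öl", "ofen", "år"]

by_codepoint = sorted(words)
by_german = sorted(words, key=german_key)
by_swedish = sorted(words, key=swedish_key)

print(by_codepoint)  # ['apfel', 'ofen', 'zebra', 'ärger', 'år', 'öl']
print(by_german)     # ['apfel', 'år', 'ärger', 'ofen', 'öl', 'zebra']
print(by_swedish)    # ['apfel', 'ofen', 'zebra', 'år', 'ärger', 'öl']
```

The same six words come out in three different orders, and the codepoint
order matches neither culture (it puts ä before å, which Swedish does not).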

Todd,

My thinking is that if we implement fully-composed strings as
heterogeneous arrays, we sidestep a lot of the complexity of the ICU.

If it turns out that the performance is terrible, we can then seek to
incorporate the ICU.
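
A rough sketch of one way to read "fully-composed strings as
heterogeneous arrays", again in Python for brevity.  This is an assumed
interpretation, not an agreed design: to_composed_elements is a
hypothetical name, and attaching combining marks to the preceding
element only approximates real grapheme-cluster segmentation.

```python
import unicodedata


def to_composed_elements(text):
    """Return a list with one element per (approximate) user-perceived character.

    Each element is either a single-character string or, when no
    precomposed form exists, a tuple of base character plus combining marks.
    """
    elements = []
    # NFC composes base + combining mark pairs wherever Unicode defines
    # a precomposed character.
    for ch in unicodedata.normalize("NFC", text):
        if unicodedata.combining(ch) and elements:
            # A mark left over after NFC: attach it to the previous element.
            prev = elements[-1]
            prev_parts = prev if isinstance(prev, tuple) else (prev,)
            elements[-1] = prev_parts + (ch,)
        else:
            elements.append(ch)
    return elements


# "e" + combining acute composes to a single "é";
# "q" + combining tilde has no precomposed form and stays a pair.
sample = "cafe\u0301 q\u0303"
elements = to_composed_elements(sample)
print(len(sample), len(elements))  # 8 6
```

Indexing and length then work per element rather than per codepoint, at
the cost of mixed element types in one array.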


On 8 December 2015 at 22:36, Todd Blanchard <tblanchard at mac.com> wrote:
> I just want to second Dale's endorsement of the ICU library.  It has been
> around a long time (originally developed by Taligent) and it provides the
> base unicode capabilities for an awful lot of software.
>
> I think it would make more sense to bring icu into Smalltalk as a
> NativeBoost library than to spend resources reimplementing and maintaining
> it.
>
> -Todd Blanchard
>
> On Dec 8, 2015, at 11:20, Dale Henrichs <dale.henrichs at gemtalksystems.com>
> wrote:
>
> On 12/07/2015 11:31 PM, H. Hirzel wrote:
>
> Dale
>
> Thank you for your answer with links to the ICU library and the notes
> about classes in Gemstone. Noteworthy that you have a class Utf8 as a
> subclass of ByteArray.
>
> I understand that Gemstone uses the ICU library and thus does not
> implement the algorithms in Smalltalk.
>
> I am currently looking into what the  ICU  library provides.
>
>

