[Unicode] collation sequences (Re: [squeak-dev] Unicode Support)

Mon Dec 7 11:18:36 UTC 2015

Hannes,

The Unicode standard provides compatibility codepoints for
compatibility purposes, and prefers all characters to be represented
in composed form, as that way they are comparable and sortable.

(Some composed characters have *more than one* compatibility codepoint.

The canonical example is the composed character #(0041 030a) which can
be represented by EITHER the compatibility codepoint #(00c5) "Latin
Capital Letter A with Ring Above" OR by #(212b) "Angstrom Sign".)
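
To make that concrete, here is a minimal workspace sketch. The
two-entry table is hand-written for illustration only (a real one
would be generated from the Unicode data files), and the names
decompositions and expand are made up for this example; they are not
existing Squeak API.

```smalltalk
| decompositions expand |
"Hand-written illustrative table: codepoint -> replacement sequence."
decompositions := Dictionary new.
decompositions at: 16r00C5 put: #(16r0041 16r030A).   "A with ring above"
decompositions at: 16r212B put: #(16r00C5).           "Angstrom sign maps to U+00C5 first"

"Expand recursively until no codepoint in the table remains."
expand := nil.
expand := [:codePoints |
    codePoints
        inject: #()
        into: [:result :cp |
            result , ((decompositions includesKey: cp)
                ifTrue: [expand value: (decompositions at: cp)]
                ifFalse: [Array with: cp])]].

"Both single codepoints expand to the same sequence, so this answers true."
(expand value: #(16r00C5)) = (expand value: #(16r212B))
```

Once both strings have been expanded this way, equality testing is
plain element-by-element comparison.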

On 7 December 2015 at 08:17, H. Hirzel <hannes.hirzel at gmail.com> wrote:
> On 12/7/15, EuanM <euanmee at gmail.com> wrote:
>> My current thinking for collation sequences:
>>
>> All strings being collated have had all compatibility codepoints
>> expanded into composed sequences.
>
> What does the Unicode manual suggest?  (www.unicode.org reference?)
>
>
>
>>
>> Strings containing composed sequences, and UTF-8 strings containing
>> multi-byte characters, have these represented by a very short
>> ordered collection in place of the single byte of a ByteString.
>>
>> When we sort characters, words or phrases of strings that contain
>> zero compatibility codepoints, we simply pull a pre-defined sortBlock
>> out of a Dictionary of pre-defined sortBlocks:
>>
>>
>> aDictionaryOfSorts at: #ukPhoneBook put: aSortBlock
>> or
>> aDictionaryOfSorts at: #ukPhoneBook put: '[aString representing the
>> code of a sortBlock]' .
>>
>> aSortedCollectionOfUtf8Strings sortBlock: (aDictionaryOfSorts at:
>> #ukPhoneBook)
>>
>>  - or some actual working code!  :-)
>>
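
As a rough working version of the lookup idea, using only Dictionary
and SortedCollection: the keys #defaultOrder and #shortestFirst and
both blocks are placeholders invented for this sketch. Neither is a
real UK phone-book collation; they only show the mechanism.

```smalltalk
| sortBlocks words |
sortBlocks := Dictionary new.
"Placeholder orderings, keyed by Symbol."
sortBlocks at: #defaultOrder put: [:a :b | a <= b].            "whatever String>>#<= does"
sortBlocks at: #shortestFirst put: [:a :b | a size <= b size].

"Parentheses are needed, otherwise this parses as one sortBlock:at: message."
words := SortedCollection sortBlock: (sortBlocks at: #shortestFirst).
words addAll: #('pear' 'fig' 'banana').
words asArray   "fig, pear, banana"
```

Storing the blocks themselves, rather than strings of source code,
avoids having to compile anything at lookup time.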
>
> Yes, focusing on this is a real need.
>
>
>>
>> On 6 December 2015 at 15:14, H. Hirzel <hannes.hirzel at gmail.com> wrote:
>>> P.S. The 30-bit value for each character in Squeak/Pharo (if necessary
>>> together with an additional language tag) is a potentially very
>>> capable infrastructure. Not really used much at the moment.
>>>
>>> The challenge is to make _existing_ Unicode know-how defined
>>> elsewhere (e.g. www.unicode.org) available in Squeak/Pharo/Cuis.
>>>
>>> The simplest cases to start with would be collation sequences in
>>> Italian, French, German, Spanish, Portuguese. Later on, more complex
>>> cases like Arabic.
>>>
>>> --HH
>>>
>>> On 12/6/15, H. Hirzel <hannes.hirzel at gmail.com> wrote:
>>>> Hi Euan,
>>>>
>>>> On 12/4/15, EuanM <euanmee at gmail.com> wrote:
>>>>> I'm currently groping my way to seeing how feature-complete our
>>>>> Unicode support is.  I am doing this to establish what still needs to
>>>>> be done to provide full Unicode support.
>>>>>
>>>>> This seems to me to be an area where it would be best to write it
>>>>> once, and then have the same codebase incorporated into the Smalltalks
>>>>> that most share a common ancestry.
>>>>>
>>>>> I am keen to get: equality-testing for strings; sortability for
>>>>> strings which have ligatures and diacritic characters; and correct
>>>>> round-tripping of data.
>>>>
>>>> These goals call for a package with SUnit tests which you can then
>>>> run on all platforms. This will be a tool to evaluate platforms for
>>>> their level of Unicode support.
>>>> As mentioned in the thread, I would focus on UTF8 only as far as
>>>> external files are concerned.
>>>> I.e. the test package writes a sample UTF8 file and then reads it to
>>>> do the various tests.
>>>> I have started doing this for Squeak and Cuis some time ago with a few
>>>> tests.
>>>>
>>>> I am interested in sortability. Round-tripping is fine if you go for
>>>> UTF8.
>>>> Important of course is which languages you think the package should
>>>> work for. Some of them are easy, some not.
>>>>
>>>> This afternoon I did some updates on the Squeak wiki
>>>> http://wiki.squeak.org/squeak/recent
>>>>
>>>> --Hannes
>>>>
>>>>>
>>>>> Call to action:
>>>>> ==========
>>>>>
>>>>> If you have comments on these proposals - such as "but we already have
>>>>> that facility" or "the reason we do not have these facilities is
>>>>> because they are dog-slow" - please let me know them.
>>>>>
>>>>> If you would like to help out, please let me know.
>>>>>
>>>>> If you have Unicode experience and expertise, and would like to be, or
>>>>> would be willing to be, in the 'council of experts' for this project,
>>>>> please let me know.
>>>>>
>>>>> If you have comments or ideas on anything mentioned in this email,
>>>>> please let me know.
>>>>>
>>>>> In the first instance, the initiative's website will be:
>>>>> http://smalltalk.uk.to/unicode.html
>>>>>
>>>>> I have created a SqueakSource.com project called UnicodeSupport
>>>>>
>>>>> I want to avoid re-inventing any facilities which already exist.
>>>>> Except where they prevent us reaching the goals of:
>>>>>   - sortable UTF8 strings
>>>>>   - sortable UTF16 strings
>>>>>   - equivalence testing of 2 UTF8 strings
>>>>>   - equivalence testing of 2 UTF16 strings
>>>>>   - round-tripping UTF8 strings through Smalltalk
>>>>>   - round-tripping UTF16 strings through Smalltalk.
>>>>> As I understand it, we have limited Unicode support atm.
>>>>>
>>>>> Current state of play
>>>>> ===============
>>>>> ByteString gets converted to WideString when need is automagically
>>>>> detected.
>>>>>
>>>>> Is there anything else that currently exists?
>>>>>
>>>>> Definition of Terms
>>>>> ==============
>>>>> A quick definition of terms before I go any further:
>>>>>
>>>>> Standard terms from the Unicode standard
>>>>> ===============================
>>>>> a compatibility character : an additional encoding of a *normal*
>>>>> character, for compatibility and round-trip conversion purposes.  For
>>>>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>>>>
>>>>> Made-up terms
>>>>> ============
>>>>> a convenience codepoint :  a single codepoint which represents an item
>>>>> that is also encoded as a string of codepoints.
>>>>>
>>>>> (I tend to use the terms compatibility character and compatibility
>>>>> codepoint interchangeably.  The standard only refers to them as
>>>>> compatibility characters.  However, the standard is determined to
>>>>> emphasise that characters are abstract and that codepoints are
>>>>> concrete.  So I think it is often more useful and productive to think
>>>>> of compatibility or convenience codepoints).
>>>>>
>>>>> a composed character :  a character made up of several codepoints
>>>>>
>>>>> Unicode encoding explained
>>>>> =====================
>>>>> A convenience codepoint can therefore be thought of as a code point
>>>>> used for a character which also has a composed form.
>>>>>
>>>>> The way Unicode works is that sometimes you can encode a character in
>>>>> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>>>>> sometimes not.
>>>>>
>>>>> You can therefore have a long stream of ASCII which is single-byte
>>>>> Unicode.  If there is an occasional Cyrillic or Greek character in the
>>>>> stream, it would be represented either by a compatibility character or
>>>>> by a multi-byte combination.
>>>>>
>>>>> Using compatibility characters can prevent proper sorting and
>>>>> equivalence testing.
>>>>>
>>>>> Using "pure" Unicode, i.e. "normal encodings", can cause compatibility
>>>>> and round-tripping problems.  Although avoiding them can *also* cause
>>>>> compatibility issues and round-tripping problems.
>>>>>
>>>>> Currently my thinking is:
>>>>>
>>>>> a Utf8String class
>>>>> an Ordered collection, with 1 byte characters as the modal element,
>>>>> but short arrays of wider strings where necessary
>>>>> a Utf16String class
>>>>> an Ordered collection, with 2 byte characters as the modal element,
>>>>> but short arrays of wider strings
>>>>> beginning with a 2-byte endianness indicator.
>>>>>
>>>>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>>>>> compatible.
>>>>>
>>>>> So my thinking is that Utf8String will contain convenience codepoints,
>>>>> for round-tripping.  And where there are multiple convenience
>>>>> codepoints for a character, that it standardises on one.
>>>>>
>>>>> And that there is a Utf8SortableString which uses *only* normal
>>>>> characters.
>>>>>
>>>>> We then need methods to convert between the two.
>>>>>
>>>>> aUtf8String asUtf8SortableString
>>>>>
>>>>> and
>>>>>
>>>>> aUtf8SortableString asUtf8String
>>>>>
>>>>>
>>>>> Sort orders are culture and context dependent - Sweden and Germany
>>>>> have different sort orders for the same diacritic-ed characters.  Some
>>>>> countries have one order in general usage, and another for specific
>>>>> usages, such as phone directories (e.g. UK and France).
>>>>>
>>>>> Similarly for Utf16 :  Utf16String and Utf16SortableString and
>>>>> conversion methods
>>>>>
>>>>> A list of sorted words would be a SortedCollection, and there could be
>>>>> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>>>>> seOrder, ukOrder, etc
>>>>>
>>>>> along the lines of
>>>>> aListOfWords := SortedCollection sortBlock: deOrder
>>>>>
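
A small sketch of what two such sortBlocks might look like. The names
svAlphabet, svRank, deRank and makeOrder are invented here. Only the
treatment of a-diaeresis is modelled (a separate letter after z in
Swedish, filed as plain a in German dictionary order); case, the other
umlauts and everything else a real collation needs are left out.

```smalltalk
| aUml svAlphabet svRank deRank makeOrder svOrder deOrder words |
aUml := Character value: 16rE4.   "a with diaeresis"

"Swedish: a-ring, a-diaeresis, o-diaeresis are letters after z."
svAlphabet := 'abcdefghijklmnopqrstuvwxyz' ,
    (String with: (Character value: 16rE5) with: aUml with: (Character value: 16rF6)).
svRank := [:ch | svAlphabet indexOf: ch].

"German dictionary order: a-diaeresis is ranked as plain a."
deRank := [:ch | svRank value: (ch = aUml ifTrue: [$a] ifFalse: [ch])].

"Answer a sortBlock that compares two strings rank by rank."
makeOrder := [:rank |
    [:a :b | | i result |
        i := 1.
        result := nil.
        [result isNil and: [i <= (a size min: b size)]] whileTrue: [
            | ra rb |
            ra := rank value: (a at: i).
            rb := rank value: (b at: i).
            ra = rb ifFalse: [result := ra < rb].
            i := i + 1].
        "All shared positions equal: the shorter string goes first."
        result ifNil: [a size <= b size]]].
svOrder := makeOrder value: svRank.
deOrder := makeOrder value: deRank.

words := Array with: 'zebra' with: (String with: aUml) with: 'apple'.
(words asSortedCollection: svOrder) asArray.   "apple, zebra, a-diaeresis"
(words asSortedCollection: deOrder) asArray    "a-diaeresis, apple, zebra"
```

The same three words come out in a different order depending only on
which sortBlock is handed to the collection.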
>>>>> If a word is either a Utf8SortableString, or a well-formed Utf8String,
>>>>> then we can perform equivalence testing on them trivially.
>>>>>
>>>>> To make sure a Utf8String is well formed, we would need to have a way
>>>>> of cleaning up any convenience codepoints which were valid, but which
>>>>> were for a character which has multiple equally-valid alternative
>>>>> convenience codepoints, and for which the string currently had the
>>>>> "wrong" convenience codepoint.  (I.e. for any character with valid
>>>>> alternative convenience codepoints, we would choose one to be in the
>>>>> well-formed Utf8String, and we would need a method for cleaning the
>>>>> alternative convenience codepoints out of the string, and replacing
>>>>> them with the chosen approved convenience codepoint.)
>>>>>
>>>>> aUtf8String cleanUtf8String
>>>>>
>>>>> With WideString, a lot of the issues disappear - except
>>>>> round-tripping (although I'm sure I have seen something recently about
>>>>> 4-byte strings that also have an additional bit.  Which would make
>>>>> some Unicode characters 5 bytes long.)
>>>>>
>>>>>
>>>>> (I'm starting to zone out now - if I've overlooked anything - obvious,
>>>>> subtle, or somewhere in between, please let me know)
>>>>>
>>>>> Cheers,
>>>>>     Euan
>>>>>
>>>>>
>>>>
>>