[squeak-dev] Unicode Support

Sun Dec 6 15:23:58 UTC 2015

On 12/6/15, Levente Uzonyi <leves at caesar.elte.hu> wrote:
> On Sat, 5 Dec 2015, Colin Putney wrote:
>
>> First, what's UTF-32? Second, we have the whole language tag thing that
>> nobody else uses.
>
> In Squeak, Strings use UTF-32 encoding[1]. It's straightforward
> to see for WideString, but ByteString is just a subset of WideString, so
> it uses the same encoding. We also use language tags, but that's a
> different story.
> Language tags make it possible to work around the problems introduced by
> the Han unification[2]. We shouldn't really use them for non-CJKV
> languages.
>
>>
>> Finally, UTF-8 is a great encoding that certain kinds of applications
>> really ought to use. Web apps, in particular, benefit from using UTF-8 so
>> the don't have to decode and then re-encode strings coming in from the
>> network. In DabbleDB we used UTF-8 encoded string in the image, and just
>> ignored the fact that they were displayed incorrectly by inspectors.
>> Having a proper UTF-8 string class would be useful.
>
> We do the same thing, but that doesn't mean it's a good idea to create a
> new String-like class having its content encoded in UTF-8, because
> UTF-8-encoded strings can't be modified like regular strings. While it
> would be possible to implement all operations, such implementation would
> become the next SortedCollection (bad performance due to misuse).


This is not the case if you go for ropes

https://github.com/KenDickey/Cuis-Smalltalk-Ropes

>
> Levente
>
> [1] https://en.wikipedia.org/wiki/UTF-32
> [2] https://en.wikipedia.org/wiki/Han_unification
>
>>
>> - Colin
>>
>>
>>> On Dec 4, 2015, at 6:46 AM, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>>>
>>> Why would you want to have strings with UTF-8 or UTF-16 encoding in the
>>> image?
>>> What's wrong with the current UTF-32 representation?
>>>
>>> Levente
>>>
>>>> On Fri, 4 Dec 2015, EuanM wrote:
>>>>
>>>> I'm currently groping my way to seeing how feature-complete our
>>>> Unicode support is.  I am doing this to establish what still needs to
>>>> be done to provide full Unicode support.
>>>>
>>>> This seems to me to be an area where it would be best to write it
>>>> once, and then have the same codebase incorporated into the Smalltalks
>>>> that most share a common ancestry.
>>>>
>>>> I am keen to get: equality-testing for strings; sortability for
>>>> strings which have ligatures and diacritic characters; and correct
>>>> round-tripping of data.
>>>>
>>>> Call to action:
>>>> ==========
>>>>
>>>> If you have comments on these proposals - such as "but we already have
>>>> that facility" or "the reason we do not have these facilities is
>>>> because they are dog-slow" - please let me know them.
>>>>
>>>> If you would like to help out, please let me know.
>>>>
>>>> If you have Unicode experience and expertise, and would like to be, or
>>>> would be willing to be, in the  'council of experts' for this project,
>>>> please let me know.
>>>>
>>>> If you have comments or ideas on anything mentioned in this email
>>>>
>>>> In the first instance, the initiative's website will be:
>>>> http://smalltalk.uk.to/unicode.html
>>>>
>>>> I have created a SqueakSource.com project called UnicodeSupport
>>>>
>>>> I want to avoid re-inventing any facilities which already exist.
>>>> Except where they prevent us reaching the goals of:
>>>> - sortable UTF8 strings
>>>> - sortable UTF16 strings
>>>> - equivalence testing of 2 UTF8 strings
>>>> - equivalence testing of 2 UTF16 strings
>>>> - round-tripping UTF8 strings through Smalltalk
>>>> - roundtripping UTF16 strings through Smalltalk.
>>>> As I understand it, we have limited Unicode support atm.
>>>>
>>>> Current state of play
>>>> ===============
>>>> ByteString gets converted to WideString when need is automagically
>>>> detected.
>>>>
>>>> Is there anything else that currently exists?
>>>>
>>>> Definition of Terms
>>>> ==============
>>>> A quick definition of terms before I go any further:
>>>>
>>>> Standard terms from the Unicode standard
>>>> ===============================
>>>> a compatibility character : an additional encoding of a *normal*
>>>> character, for compatibility and round-trip conversion purposes.  For
>>>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>>>
>>>> Made-up terms
>>>> ============
>>>> a convenience codepoint :  a single codepoint which represents an item
>>>> that is also encoded as a string of codepoints.
>>>>
>>>> (I tend to use the terms compatibility character and compatibility
>>>> codepoint interchangably.  The standard only refers to them as
>>>> compatibility characters.  However, the standard is determined to
>>>> emphasise that characters are abstract and that codepoints are
>>>> concrete.  So I think it is often more useful and productive to think
>>>> of compatibility or convenience codepoints).
>>>>
>>>> a composed character :  a character made up of several codepoints
>>>>
>>>> Unicode encoding explained
>>>> =====================
>>>> A convenience codepoint can therefore be thought of as a code point
>>>> used for a character which also has a composed form.
>>>>
>>>> The way Unicode works is that sometimes you can encode a character in
>>>> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>>>> sometimes not.
>>>>
>>>> You can therefore have a long stream of ASCII which is single-byte
>>>> Unicode.  If there is an occasional Cyrillic or Greek character in the
>>>> stream, it would be represented either by a compatibility character or
>>>> by a multi-byte combination.
>>>>
>>>> Using compatibility characters can prevent proper sorting and
>>>> equivalence testing.
>>>>
>>>> Using "pure" Unicode, ie. "normal encodings", can cause compatibility
>>>> and round-tripping probelms.  Although avoiding them can *also* cause
>>>> compatibility issues and round-tripping problems.
>>>>
>>>> Currently my thinking is:
>>>>
>>>> a Utf8String class
>>>> an Ordered collection, with 1 byte characters as the modal element,
>>>> but short arrays of wider strings where necessary
>>>> a Utf16String class
>>>> an Ordered collection, with 2 byte characters as the modal element,
>>>> but short arrays of wider strings
>>>> beginning with a 2-byte endianness indicator.
>>>>
>>>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>>>> compatible.
>>>>
>>>> So my thinking is that Utf8String will contain convenience codepoints,
>>>> for round-tripping.  And where there are multiple convenience
>>>> codepoints for a character, that it standardises on one.
>>>>
>>>> And that there is a Utf8SortableString which uses *only* normal
>>>> characters.
>>>>
>>>> We then need methods to convert between the two.
>>>>
>>>> aUtf8String asUtf8SortableString
>>>>
>>>> and
>>>>
>>>> aUtf8SortableString asUtf8String
>>>>
>>>>
>>>> Sort orders are culture and context dependent - Sweden and Germany
>>>> have different sort orders for the same diacritic-ed characters.  Some
>>>> countries have one order in general usage, and another for specific
>>>> usages, such as phone directories (e.g. UK and France)
>>>>
>>>> Similarly for Utf16 :  Utf16String and Utf16SortableString and
>>>> conversion methods
>>>>
>>>> A list of sorted words would be a SortedCollection, and there could be
>>>> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>>>> seOrder, ukOrder, etc
>>>>
>>>> along the lines of
>>>> aListOfWords := SortedCollection sortBlock: deOrder
>>>>
>>>> If a word is either a Utf8SortableString, or a well-formed Utf8String,
>>>> then we can perform equivalence testing on them trivially.
>>>>
>>>> To make sure a Utf8String is well formed, we would need to have a way
>>>> of cleaning up any convenience codepoints which were valid, but which
>>>> were for a character which has multiple equally-valid alternative
>>>> convenience codepoints, and for which the string currently had the
>>>> "wrong" convenience codepoint.  (i.e for any character with valid
>>>> alternative convenience codepoints, we would choose one to be in the
>>>> well-formed Utf8String, and we would need a method for cleaning the
>>>> alternative convenience codepoints out of the string, and replacing
>>>> them with the chosen approved convenience codepoint.
>>>>
>>>> aUtf8String cleanUtf8String
>>>>
>>>> With WideString, a lot of the issues disappear - except
>>>> round-tripping(although I'm sure I have seen something recently about
>>>> 4-byte strings that also have an additional bit.  Which would make
>>>> some Unicode characters 5-bytes long.)
>>>>
>>>>
>>>> (I'm starting to zone out now - if I've overlooked anything - obvious,
>>>> subtle, or somewhere in between, please let me know)
>>>>
>>>> Cheers,
>>>>   Euan
>>>>
>>>>
>>>
>>
>>
>
>