[squeak-dev] Unicode Support

Levente Uzonyi leves at caesar.elte.hu
Sun Dec 6 04:41:22 UTC 2015


On Sat, 5 Dec 2015, Colin Putney wrote:

> First, what's UTF-32? Second, we have the whole language tag thing that nobody else uses.

In Squeak, Strings use UTF-32 encoding[1]. It's straightforward 
to see for WideString, but ByteString is just a subset of WideString, so 
it uses the same encoding. We also use language tags, but that's a 
different story.
Language tags make it possible to work around the problems introduced by 
the Han unification[2]. We shouldn't really use them for non-CJKV 
languages.
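
As a rough illustration of the "ByteString is just a subset of WideString" point, here is a minimal sketch in Python rather than Smalltalk. The names `wide` and `narrow` are stand-ins invented for this example, not Squeak classes or API; the sketch only assumes that both forms store raw code point values and differ in element width.

```python
# Illustration only: Python stand-ins for Squeak's ByteString and WideString.
text = "na\u00efve"  # every character here has a code point below 256

# WideString-like view: one 32-bit code point per character (UTF-32).
wide = [ord(ch) for ch in text]

# ByteString-like view: the same code points, each stored in a single byte.
narrow = bytes(wide)

# Both hold identical code point values; only the element width differs.
assert list(narrow) == wide

# Python's UTF-32 codec yields the same numbers, four bytes per character.
utf32 = text.encode("utf-32-be")
decoded = [int.from_bytes(utf32[i:i + 4], "big") for i in range(0, len(utf32), 4)]
assert decoded == wide

# A character outside 0..255 does not fit the narrow form and needs the wide one.
fits_in_bytes = all(ord(ch) < 256 for ch in "na\u00efve \u4e2d")
```

Under that assumption, widening a narrow string is just a change of storage width, not a re-encoding.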

>
> Finally, UTF-8 is a great encoding that certain kinds of applications really ought to use. Web apps, in particular, benefit from using UTF-8 so they don't have to decode and then re-encode strings coming in from the network. In DabbleDB we used UTF-8 encoded strings in the image, and just ignored the fact that they were displayed incorrectly by inspectors. Having a proper UTF-8 string class would be useful.

We do the same thing, but that doesn't mean it's a good idea to create a 
new String-like class having its content encoded in UTF-8, because 
UTF-8-encoded strings can't be modified like regular strings. While it 
would be possible to implement all operations, such an implementation would 
become the next SortedCollection (bad performance due to misuse).
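
To make the cost concrete, here is a small sketch, again in Python rather than Smalltalk, of what character-level access and replacement look like on UTF-8 bytes. The helper names `utf8_offset` and `utf8_replace_char` are hypothetical and exist only for this example; it illustrates the general problem and is not a proposal for how such a class would be written.

```python
# Illustration only: why in-place edits are awkward on UTF-8 bytes.

def utf8_offset(data: bytes, char_index: int) -> int:
    """Return the byte offset of the char_index-th character by scanning.

    UTF-8 has no fixed element width, so locating a character is O(n).
    """
    seen = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a character starts here
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError(char_index)


def utf8_replace_char(data: bytes, char_index: int, new_char: str) -> bytes:
    """Replace one character; the result may differ in byte length."""
    start = utf8_offset(data, char_index)
    end = start + 1
    # Skip the continuation bytes that belong to the character being replaced.
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[:start] + new_char.encode("utf-8") + data[end:]


original = "cafe".encode("utf-8")                  # 4 bytes
edited = utf8_replace_char(original, 3, "\u00e9")  # 5 bytes once "e" becomes "é"
```

A fixed-width string can overwrite one slot in constant time; here a single replacement scans from the start and may shift every byte that follows.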

Levente

[1] https://en.wikipedia.org/wiki/UTF-32
[2] https://en.wikipedia.org/wiki/Han_unification

>
> - Colin
>
>
>> On Dec 4, 2015, at 6:46 AM, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>>
>> Why would you want to have strings with UTF-8 or UTF-16 encoding in the image?
>> What's wrong with the current UTF-32 representation?
>>
>> Levente
>>
>>> On Fri, 4 Dec 2015, EuanM wrote:
>>>
>>> I'm currently groping my way to seeing how feature-complete our
>>> Unicode support is.  I am doing this to establish what still needs to
>>> be done to provide full Unicode support.
>>>
>>> This seems to me to be an area where it would be best to write it
>>> once, and then have the same codebase incorporated into the Smalltalks
>>> that most share a common ancestry.
>>>
>>> I am keen to get: equality-testing for strings; sortability for
>>> strings which have ligatures and diacritic characters; and correct
>>> round-tripping of data.
>>>
>>> Call to action:
>>> ==========
>>>
>>> If you have comments on these proposals - such as "but we already have
>>> that facility" or "the reason we do not have these facilities is
>>> because they are dog-slow" - please let me know them.
>>>
>>> If you would like to help out, please let me know.
>>>
>>> If you have Unicode experience and expertise, and would like to be, or
>>> would be willing to be, in the  'council of experts' for this project,
>>> please let me know.
>>>
>>> If you have comments or ideas on anything mentioned in this email,
>>> please let me know.
>>>
>>> In the first instance, the initiative's website will be:
>>> http://smalltalk.uk.to/unicode.html
>>>
>>> I have created a SqueakSource.com project called UnicodeSupport
>>>
>>> I want to avoid re-inventing any facilities which already exist,
>>> except where they prevent us from reaching the goals of:
>>> - sortable UTF8 strings
>>> - sortable UTF16 strings
>>> - equivalence testing of 2 UTF8 strings
>>> - equivalence testing of 2 UTF16 strings
>>> - round-tripping UTF8 strings through Smalltalk
>>> - round-tripping UTF16 strings through Smalltalk.
>>> As I understand it, we have limited Unicode support atm.
>>>
>>> Current state of play
>>> ===============
>>> ByteString gets converted to WideString when need is automagically detected.
>>>
>>> Is there anything else that currently exists?
>>>
>>> Definition of Terms
>>> ==============
>>> A quick definition of terms before I go any further:
>>>
>>> Standard terms from the Unicode standard
>>> ===============================
>>> a compatibility character : an additional encoding of a *normal*
>>> character, for compatibility and round-trip conversion purposes.  For
>>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>>
>>> Made-up terms
>>> ============
>>> a convenience codepoint :  a single codepoint which represents an item
>>> that is also encoded as a string of codepoints.
>>>
>>> (I tend to use the terms compatibility character and compatibility
>>> codepoint interchangeably.  The standard only refers to them as
>>> compatibility characters.  However, the standard is determined to
>>> emphasise that characters are abstract and that codepoints are
>>> concrete.  So I think it is often more useful and productive to think
>>> of compatibility or convenience codepoints).
>>>
>>> a composed character :  a character made up of several codepoints
>>>
>>> Unicode encoding explained
>>> =====================
>>> A convenience codepoint can therefore be thought of as a code point
>>> used for a character which also has a composed form.
>>>
>>> The way Unicode works is that sometimes you can encode a character in
>>> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>>> sometimes not.
>>>
>>> You can therefore have a long stream of ASCII which is single-byte
>>> Unicode.  If there is an occasional Cyrillic or Greek character in the
>>> stream, it would be represented either by a compatibility character or
>>> by a multi-byte combination.
>>>
>>> Using compatibility characters can prevent proper sorting and
>>> equivalence testing.
>>>
>>> Using "pure" Unicode, ie. "normal encodings", can cause compatibility
>>> and round-tripping problems.  Although avoiding them can *also* cause
>>> compatibility issues and round-tripping problems.
>>>
>>> Currently my thinking is:
>>>
>>> a Utf8String class:
>>>   an Ordered collection, with 1 byte characters as the modal element,
>>>   but short arrays of wider strings where necessary
>>> a Utf16String class:
>>>   an Ordered collection, with 2 byte characters as the modal element,
>>>   but short arrays of wider strings,
>>>   beginning with a 2-byte endianness indicator.
>>>
>>> Utf8Strings sometimes need to be sortable, and sometimes need to be compatible.
>>>
>>> So my thinking is that Utf8String will contain convenience codepoints,
>>> for round-tripping.  And where there are multiple convenience
>>> codepoints for a character, that it standardises on one.
>>>
>>> And that there is a Utf8SortableString which uses *only* normal characters.
>>>
>>> We then need methods to convert between the two.
>>>
>>> aUtf8String asUtf8SortableString
>>>
>>> and
>>>
>>> aUtf8SortableString asUtf8String
>>>
>>>
>>> Sort orders are culture and context dependent - Sweden and Germany
>>> have different sort orders for the same diacritic-ed characters.  Some
>>> countries have one order in general usage, and another for specific
>>> usages, such as phone directories (e.g. UK and France)
>>>
>>> Similarly for Utf16 :  Utf16String and Utf16SortableString and
>>> conversion methods
>>>
>>> A list of sorted words would be a SortedCollection, and there could be
>>> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>>> seOrder, ukOrder, etc
>>>
>>> along the lines of
>>> aListOfWords := SortedCollection sortBlock: deOrder
>>>
>>> If a word is either a Utf8SortableString, or a well-formed Utf8String,
>>> then we can perform equivalence testing on them trivially.
>>>
>>> To make sure a Utf8String is well formed, we would need to have a way
>>> of cleaning up any convenience codepoints which were valid, but which
>>> were for a character which has multiple equally-valid alternative
>>> convenience codepoints, and for which the string currently had the
>>> "wrong" convenience codepoint.  (i.e. for any character with valid
>>> alternative convenience codepoints, we would choose one to be in the
>>> well-formed Utf8String, and we would need a method for cleaning the
>>> alternative convenience codepoints out of the string, and replacing
>>> them with the chosen approved convenience codepoint.)
>>>
>>> aUtf8String cleanUtf8String
>>>
>>> With WideString, a lot of the issues disappear - except
>>> round-tripping (although I'm sure I have seen something recently about
>>> 4-byte strings that also have an additional bit.  Which would make
>>> some Unicode characters 5-bytes long.)
>>>
>>>
>>> (I'm starting to zone out now - if I've overlooked anything - obvious,
>>> subtle, or somewhere in between, please let me know)
>>>
>>> Cheers,
>>>   Euan
>>>
>>>
>>
>
>

