[squeak-dev] Unicode Support

Fri Dec 4 11:42:11 UTC 2015

I'm currently groping my way to seeing how feature-complete our
Unicode support is.  I am doing this to establish what still needs to
be done to provide full Unicode support.

This seems to me to be an area where it would be best to write it
once, and then have the same codebase incorporated into the Smalltalks
that share a common ancestry.

I am keen to get: equality-testing for strings; sortability for
strings which have ligatures and diacritic characters; and correct
round-tripping of data.

Call to action:
==========

If you have comments on these proposals - such as "but we already have
that facility" or "the reason we do not have these facilities is
because they are dog-slow" - please let me know them.

If you would like to help out, please let me know.

If you have Unicode experience and expertise, and would like to be, or
would be willing to be, in the  'council of experts' for this project,
please let me know.

If you have comments or ideas on anything mentioned in this email,
please let me know.

In the first instance, the initiative's website will be:
http://smalltalk.uk.to/unicode.html

I have created a SqueakSource.com project called UnicodeSupport.

I want to avoid re-inventing any facilities which already exist,
except where they prevent us reaching the goals of:
  - sortable UTF8 strings
  - sortable UTF16 strings
  - equivalence testing of 2 UTF8 strings
  - equivalence testing of 2 UTF16 strings
  - round-tripping UTF8 strings through Smalltalk
  - round-tripping UTF16 strings through Smalltalk.

As I understand it, we have limited Unicode support at the moment.
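
To make the round-tripping goal concrete, here is a minimal sketch in Python (not Smalltalk, since the Smalltalk facility is what is being proposed). It assumes "round-tripping" means that decoding bytes and re-encoding them gives back exactly the same bytes; the helper name round_trips is made up for illustration.

```python
# Round-trip check: decode then encode should reproduce the input bytes.
# round_trips is a hypothetical helper, not part of any library.

def round_trips(original_bytes: bytes, encoding: str) -> bool:
    """Return True if decoding and re-encoding reproduces the input bytes."""
    return original_bytes.decode(encoding).encode(encoding) == original_bytes

# Sample text: Latin with a diacritic, Greek, and one codepoint outside the BMP.
text = "na\u00efve \u0394\u03b9\u03ac \U0001F600"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # explicit byte order, no BOM

print(round_trips(utf8_bytes, "utf-8"))
print(round_trips(utf16_bytes, "utf-16-le"))
```

Note that this only holds as long as nothing normalises the string in between; any clean-up step that swaps one codepoint for another would change the bytes.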

Current state of play
===============
ByteString gets converted to WideString automagically when the need is
detected.

Is there anything else that currently exists?

Definition of Terms
==============
A quick definition of terms before I go any further:

Standard terms from the Unicode standard
===============================
a compatibility character : an additional encoding of a *normal*
character, for compatibility and round-trip conversion purposes.  For
instance, a single-codepoint encoding of a Latin character with a
diacritic.

Made-up terms
============
a convenience codepoint :  a single codepoint which represents an item
that is also encoded as a string of codepoints.

(I tend to use the terms compatibility character and compatibility
codepoint interchangeably.  The standard only refers to them as
compatibility characters.  However, the standard is determined to
emphasise that characters are abstract and that codepoints are
concrete.  So I think it is often more useful and productive to think
of compatibility or convenience codepoints).

a composed character :  a character made up of several codepoints

Unicode encoding explained
=====================
A convenience codepoint can therefore be thought of as a code point
used for a character which also has a composed form.
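
As an illustration of a single codepoint versus its composed form, the following Python sketch inspects the decomposition data that ships with Python's unicodedata module. The choice of U+00E9 and U+FB01 as examples is mine, not taken from any existing Smalltalk code.

```python
import unicodedata

precomposed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one codepoint
composed_seq = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT, two codepoints

# The two spellings look the same on screen but are different strings.
print(len(precomposed), len(composed_seq))
print(precomposed == composed_seq)

# The character database records how the single codepoint decomposes.
print(unicodedata.decomposition(precomposed))  # canonical: "0065 0301"
print(unicodedata.decomposition("\ufb01"))     # ligature fi: "<compat> 0066 0069"
```

The "<compat>" tag is how the character database marks a compatibility decomposition, as opposed to the untagged canonical one.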

The way UTF-8 works is that sometimes a codepoint is encoded in one
byte, sometimes not.  Sometimes it is encoded in two bytes, sometimes
not; the longest sequences are four bytes.

You can therefore have a long stream of ASCII which is also valid
single-byte UTF-8.  If there is an occasional Cyrillic or Greek
character in the stream, it is represented by a multi-byte sequence.
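
In UTF-8 specifically, the number of bytes per codepoint can be checked directly. A small Python sketch, with sample characters chosen by me for illustration:

```python
# How many bytes does UTF-8 need for each codepoint?
def utf8_length(character: str) -> int:
    """Number of bytes in the UTF-8 encoding of a one-codepoint string."""
    return len(character.encode("utf-8"))

samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin e with acute",
    "\u03b1": "Greek alpha",
    "\u0416": "Cyrillic zhe",
    "\u20ac": "euro sign",
    "\U0001F600": "emoji outside the BMP",
}

for character, description in samples.items():
    print(f"U+{ord(character):04X} {description}: {utf8_length(character)} byte(s)")
```

So a stream that is mostly ASCII stays one byte per character, and the occasional Greek or Cyrillic letter costs two bytes.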

Using compatibility characters can prevent proper sorting and
equivalence testing.
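
A short Python sketch of the problem, using the "fi" ligature U+FB01 as the compatibility character. The helper name fold is made up; NFKC is the standard normalisation form that replaces compatibility characters with their ordinary equivalents.

```python
import unicodedata

def fold(text: str) -> str:
    """Replace compatibility characters with their ordinary equivalents (NFKC)."""
    return unicodedata.normalize("NFKC", text)

ligature_word = "\ufb01t"   # "fit" spelled with the single-codepoint fi ligature
plain_word = "fit"

# Equivalence testing fails on the raw strings and succeeds after folding.
print(ligature_word == plain_word)
print(fold(ligature_word) == plain_word)

# Sorting by raw codepoint puts the ligature word after every ASCII word.
words = [ligature_word, "fog"]
print(sorted(words))
print(sorted(fold(word) for word in words))
```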

Using "pure" Unicode, i.e. "normal encodings", can cause compatibility
and round-tripping problems.  Although avoiding them can *also* cause
compatibility issues and round-tripping problems.

Currently my thinking is:

a Utf8String class
    an ordered collection, with 1-byte characters as the modal element,
    but short arrays of wider strings where necessary

a Utf16String class
    an ordered collection, with 2-byte characters as the modal element,
    but short arrays of wider strings where necessary,
    beginning with a 2-byte endianness indicator.
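
For reference, this is roughly how the 2-byte endianness indicator (the byte order mark) and the "wider" elements behave in UTF-16, sketched in Python. It says nothing about how a Smalltalk Utf16String would store them.

```python
import codecs

text = "A\U0001F600"   # one BMP character and one character outside the BMP

little_endian = text.encode("utf-16-le")
big_endian = text.encode("utf-16-be")

# 'A' takes one 2-byte unit; the emoji takes a surrogate pair (two units).
print(len(little_endian))

# Prefix the big-endian bytes with the byte order mark, then let the
# generic "utf-16" decoder work out the byte order from it.
with_bom = codecs.BOM_UTF16_BE + big_endian
decoded = with_bom.decode("utf-16")
print(decoded == text)
```

The generic decoder consumes the byte order mark, so it does not appear in the decoded string.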

Utf8Strings sometimes need to be sortable, and sometimes need to be compatible.

So my thinking is that Utf8String will contain convenience codepoints,
for round-tripping.  And where there are multiple convenience
codepoints for a character, that it standardises on one.

And that there is a Utf8SortableString which uses *only* normal characters.

We then need methods to convert between the two.

aUtf8String asUtf8SortableString

and

aUtf8SortableString asUtf8String
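
One possible reading of these two conversions, sketched in Python: treat the sortable form as fully decomposed (NFD) and the compatible form as recomposed (NFC). That mapping is my assumption, and the function names are hypothetical stand-ins for the Smalltalk selectors.

```python
import unicodedata

def as_sortable(text: str) -> str:
    """Hypothetical asUtf8SortableString: decompose into base letters plus marks."""
    return unicodedata.normalize("NFD", text)

def as_compatible(text: str) -> str:
    """Hypothetical asUtf8String: recompose into single codepoints where possible."""
    return unicodedata.normalize("NFC", text)

single = "\u00e9"      # e with acute as one codepoint
sequence = "e\u0301"   # e plus combining acute

print(as_sortable(single) == sequence)
print(as_compatible(sequence) == single)
print(as_sortable(single) == as_sortable(sequence))
```

Converting to the sortable form and back gives the NFC form of the input, which is the original string only if the original was already in NFC.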

Sort orders are culture and context dependent - Sweden and Germany
have different sort orders for the same diacritic-ed characters.  Some
countries have one order in general usage, and another for specific
usages, such as phone directories (e.g. UK and France).

Similarly for Utf16 :  Utf16String and Utf16SortableString and
conversion methods

A list of sorted words would be a SortedCollection, and there could be
pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
seOrder, ukOrder, etc

along the lines of
aListOfWords := SortedCollection sortBlock: deOrder
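
To show what such pre-prepared sort orders might do, here is a drastically simplified Python sketch of two sort keys. The rules encoded are only these: German dictionary order files an umlauted vowel with its base letter, and Swedish treats å, ä, ö as separate letters after z. The names de_order and se_order mirror the sortBlock names above; the implementations are hypothetical, and real collation has many more rules.

```python
import unicodedata

def de_order(word: str) -> str:
    """Simplified German dictionary key: strip diacritics, ignore case."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Swedish alphabet: a-ring, a-umlaut, o-umlaut come after z.
SE_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def se_order(word: str) -> list:
    """Simplified Swedish key: rank each letter by its alphabet position."""
    composed = unicodedata.normalize("NFC", word.casefold())
    return [
        SE_ALPHABET.index(ch) if ch in SE_ALPHABET else len(SE_ALPHABET) + ord(ch)
        for ch in composed
    ]

words = ["zebra", "\u00e4rlig", "apa"]
print(sorted(words, key=de_order))   # a-umlaut word files next to "a" words
print(sorted(words, key=se_order))   # a-umlaut word files after "z" words
```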

If a word is either a Utf8SortableString, or a well-formed Utf8String,
then we can perform equivalence testing on them trivially.

To make sure a Utf8String is well formed, we would need a way of
cleaning up any convenience codepoints which are valid, but which
stand for a character that has several equally valid alternative
convenience codepoints, where the string currently holds the "wrong"
one.  (i.e. for any character with valid alternative convenience
codepoints, we would choose one to be in the well-formed Utf8String,
and we would need a method for cleaning the alternative convenience
codepoints out of the string and replacing them with the chosen
approved convenience codepoint.)

aUtf8String cleanUtf8String
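
Unicode's NFC normalisation already behaves much like this clean-up for characters that have more than one single-codepoint spelling. A Python sketch, with the function name clean as a hypothetical stand-in for cleanUtf8String, and on the assumption that NFC's choice of codepoint is an acceptable "approved" one:

```python
import unicodedata

def clean(text: str) -> str:
    """Hypothetical cleanUtf8String: settle on one codepoint per character (NFC)."""
    return unicodedata.normalize("NFC", text)

angstrom_sign = "\u212b"   # ANGSTROM SIGN
a_with_ring = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
a_plus_ring = "A\u030a"    # 'A' followed by COMBINING RING ABOVE

# All three spellings collapse to the same single codepoint.
print(clean(angstrom_sign) == a_with_ring)
print(clean(a_plus_ring) == a_with_ring)

# Same story for OHM SIGN versus GREEK CAPITAL LETTER OMEGA.
print(clean("\u2126") == "\u03a9")
```

After cleaning, plain equality on the strings is enough for these cases.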

With WideString, a lot of the issues disappear - except
round-tripping (although I'm sure I have seen something recently about
4-byte strings that also have an additional bit, which would make
some Unicode characters 5 bytes long).
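
On the question of width: as far as the current standard goes, codepoints stop at U+10FFFF, and RFC 3629 limits UTF-8 to four bytes per codepoint. The original UTF-8 design did allow longer sequences, which may be where the memory of 5-byte characters comes from. A quick check in Python:

```python
import sys

# Highest codepoint Python (and Unicode) will accept.
print(hex(sys.maxunicode))

highest = chr(sys.maxunicode)

# Even the highest codepoint fits in four bytes in every encoding form.
utf8_size = len(highest.encode("utf-8"))
utf16_size = len(highest.encode("utf-16-le"))
utf32_size = len(highest.encode("utf-32-le"))
print(utf8_size, utf16_size, utf32_size)
```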

(I'm starting to zone out now - if I've overlooked anything - obvious,
subtle, or somewhere in between, please let me know)

Cheers,
    Euan