[squeak-dev] Re: [Pharo-dev] Unicode Support

EuanM euanmee at gmail.com
Mon Dec 7 03:21:29 UTC 2015


This is a long email.  A *lot* of it is summarised in the Venn diagram, which appears both at:
http://smalltalk.uk.to/unicode-utf8.html
and on my Smalltalk in Small Steps blog at:
http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html

My current thinking, and understanding.
==============================

0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
    b) UTF-8 encodes the ASCII characters in 1 byte each, with the same
byte values.  The ISO-8859-1 characters above 7F hex keep their
character codes as Unicode codepoints, but UTF-8 encodes them as
sequences of 2 bytes.  All other characters are encoded as sequences
of 2 to 4 bytes.
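
To illustrate that split, here is a tiny hand-rolled encoder for
codepoints up to 07FF hex (a workspace sketch only, not proposed code,
and deliberately not using any image-specific encoder):

  | encode |
  encode := [:codePoint |
      codePoint < 16r80
          ifTrue: [ ByteArray with: codePoint ]
          ifFalse: [ ByteArray
              with: (16rC0 bitOr: (codePoint bitShift: -6))
              with: (16r80 bitOr: (codePoint bitAnd: 16r3F)) ] ].
  encode value: 16r41.    "#[65] - 'A' stays one byte, 41 hex"
  encode value: 16rC7.    "#[195 135] - C cedilla becomes two bytes, C3 87 hex"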

1) Smalltalk has long had multiple String classes.

2) Any Unicode codepoint in the range 0000 hex to 007F hex is encoded
in UTF-8 as the single byte nn hex.  Codepoints 0080 hex to 00FF hex
keep the same numeric values as the ISO-8859-1 character codes, but
are encoded in UTF-8 as two bytes each.

3) All valid ISO-8859-1 characters have a character code between 20
hex and 7E hex, or between A0 hex and FF hex.
https://en.wikipedia.org/wiki/ISO/IEC_8859-1

4) All valid ASCII characters have a character code between 00 hex and 7F hex.
https://en.wikipedia.org/wiki/ASCII


5) a) All character codes which are defined in both ISO-8859-1 and
ASCII (i.e. character codes 20 hex to 7E hex) are defined identically
in both.

b) In other words, all printable ASCII characters are defined
identically in both ASCII and ISO-8859-1.

6) All character codes defined in ASCII (00 hex to 7F hex) are
encoded identically in UTF-8, as one byte each.

7) All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex
- FF hex) map 1:1 to Unicode codepoints with the same values, but the
A0 hex - FF hex range is encoded in UTF-8 as two bytes per character.

8) => some Unicode codepoints map to both ASCII and ISO-8859-1;
         all ASCII maps 1:1, byte for byte, to UTF-8;
         all ISO-8859-1 maps 1:1 to Unicode codepoints, though not
byte for byte to UTF-8.

9) Every ByteString element which is a valid ASCII character is also
a valid single-byte UTF-8 sequence.  Elements in the A0 hex - FF hex
range are valid Unicode characters, but must be transcoded to two
bytes to become valid UTF-8.
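
A quick workspace check of this (assuming Pharo's #utf8Encoded; Squeak
has #squeakToUtf8 instead):

  | s |
  s := String with: (Character value: 16rE9).    "e-acute: a valid one-element ISO-8859-1 ByteString"
  s size.               "1 - one element in the ByteString"
  s utf8Encoded.        "#[195 169] - C3 A9 hex, two bytes once UTF-8 encoded"
  'ABC' utf8Encoded.    "#[65 66 67] - pure ASCII is byte-for-byte unchanged"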

10) ISO-8859-1 characters representing a character with a diacritic,
or a two-character ligature, have no ASCII equivalent.  In Unicode,
the single codepoints which represent such compound glyphs are called
precomposed characters (referred to in this email as compatibility or
convenience codepoints).

11) The decomposed Unicode representation of such a character is a
short sequence of codepoints: a base character followed by one or more
combining characters, which together form the same glyph as the
precomposed (convenience) codepoint.


12) Some concrete examples:

A - aka Upper Case A
In ASCII, in ISO 8859-1
ASCII A - 41 hex
ISO-8859-1 A - 41 hex
UTF-8 A - 41 hex

BEL (a bell sound, often invoked by a Ctrl-g keyboard chord)
In ASCII, not in ISO 8859-1
ASCII : BEL  - 07 hex
ISO-8859-1 : 07 hex is not a valid character code
UTF-8 : BEL - 07 hex

£ (GBP currency symbol)
In ISO-8859-1, not in ASCII
ASCII : A3 hex is not a valid ASCII code
ISO-8859-1 : £ - A3 hex
Unicode codepoint : £ - 00A3 hex, encoded in UTF-8 as the two bytes C2 A3 hex

Upper Case C cedilla
In ISO-8859-1, not in ASCII; in Unicode both as a precomposed
codepoint *and* as a decomposed sequence of codepoints
ASCII : C7 hex is not a valid ASCII character code
ISO-8859-1 : Upper Case C cedilla - C7 hex
Unicode precomposed codepoint : Upper Case C cedilla - 00C7 hex,
encoded in UTF-8 as the two bytes C3 87 hex
Unicode decomposed sequence of codepoints :
   Upper case C - 0043 hex
       followed by
   combining cedilla - 0327 hex
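
The equality pitfall this creates shows up with nothing more than the
raw codepoints (a sketch; a real implementation would consult the
Unicode decomposition data):

  | precomposed decomposed |
  precomposed := Array with: 16rC7.                  "one codepoint: 00C7 hex"
  decomposed  := Array with: 16r43 with: 16r327.     "'C' followed by combining cedilla, 0327 hex"
  precomposed = decomposed.    "false - plain codepoint equality misses canonical equivalence"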

13) For any valid ASCII string *and* for any valid ISO-8859-1 string,
aByteString is completely adequate for editing and display.

14) When sorting any valid ASCII string *or* any valid ISO-8859-1
string, upper and lower case versions of the same character will be
treated differently.

15) When sorting any valid ISO-8859-1 string containing
letter+diacritic combination glyphs or ligature combination glyphs,
the glyphs in combination will be treated differently to a "plain"
glyph of the character,
i.e. "C" and "C cedilla" will be treated very differently.  "ß" and
"ss" will be treated very differently.

16) Different nations have different rules about where diacritic-ed
characters and ligature pairs should be placed when in alphabetical
order.

17) Some nations even have multiple standards - e.g.  surnames
beginning either "M superscript-c" or "M superscript-a superscript-c"
are treated as beginning equivalently in UK phone directories, but not
in other situations.
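
A toy version of that UK phone-book rule, as a sortBlock (purely
illustrative; real directory collation has many more rules):

  | ukPhoneBook names |
  ukPhoneBook := [:a :b |
      | expand |
      expand := [:s | (s beginsWith: 'Mc')
          ifTrue: ['Mac' , (s allButFirst: 2)]
          ifFalse: [s]].
      (expand value: a) <= (expand value: b)].
  names := SortedCollection sortBlock: ukPhoneBook.
  names addAll: #('Mackay' 'McAllister' 'Martin').
  names asArray.    "#('McAllister' 'Mackay' 'Martin') - 'Mc' sorted as if it were 'Mac'"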


Some practical upshots
==================

1) Cuis and its ISO-8859-1 encoding agree with Unicode, codepoint for
codepoint, for any single character it considers valid, or any
ByteString it has made up of characters it considers valid.  For the
characters up to 7F hex that agreement is also byte for byte with
UTF-8; characters in the A0 hex - FF hex range must be re-encoded as
two UTF-8 bytes.

2) Any ByteString containing only ASCII characters is already valid
UTF-8, in any of Squeak, Pharo, Cuis or any other Smalltalk with a
single-byte ByteString following ASCII or ISO-8859-1.

3) Any Smalltalk (or derivative language) using ByteString can
immediately treat its ByteStrings as valid UTF-8, as long as it also
treats them as valid ASCII.  ByteStrings containing ISO-8859-1
characters above 7F hex need that small transcoding step first.

4) With that in place, all of those can be successfully exported to
any system using UTF-8 (e.g. HTML).

5) To successfully *accept* all UTF-8 we must be able to do either:
a) accept UTF-8 strings with composed characters
b) convert UTF-8 strings with composed characters into UTF-8 strings
that use *only* compatibility codepoints.
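
Option b) is essentially composition normalisation.  A toy sketch of
the idea, folding one known base+combining pair back to its single
codepoint (a real version would be driven by the Unicode composition
tables; the Integer key packing is just to keep the sketch short):

  | composeMap normalise |
  composeMap := Dictionary new.
  composeMap at: ((16r43 bitShift: 16) + 16r327) put: 16rC7.    "C + combining cedilla -> C cedilla"
  normalise := [:codePoints | | out i key |
      out := OrderedCollection new.
      i := 1.
      [i <= codePoints size] whileTrue: [
          key := i < codePoints size
              ifTrue: [((codePoints at: i) bitShift: 16) + (codePoints at: i + 1)]
              ifFalse: [nil].
          (key notNil and: [composeMap includesKey: key])
              ifTrue: [out add: (composeMap at: key).  i := i + 2]
              ifFalse: [out add: (codePoints at: i).  i := i + 1]].
      out asArray].
  normalise value: #(16r43 16r327 16r41).    "#(199 65) - C cedilla followed by 'A' (41 hex)"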


Class + protocol proposals
====================

a Utf8CompatibilityString class.

   asByteString  - ensures only compatibility codepoints are used,
and that it does not encode characters above 00FF hex.

   asIso8859String - ensures only compatibility codepoints are used,
and that each character is a valid ISO-8859-1 character.

   asAsciiString - ensures only characters 00 hex - 7F hex are used.

   asUtf8ComposedIso8859String - ensures all compatibility codepoints
are expanded into small OrderedCollections of codepoints
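
A rough sketch of what I imagine one of these conversions looking like
(the class and its #codePoints accessor are only placeholders at this
stage):

  asAsciiString
      "Answer a ByteString containing only characters 00 hex - 7F hex.
       Intended for the proposed Utf8CompatibilityString; #codePoints is an
       assumed accessor answering the decoded codepoint values."
      ^String withAll: (self codePoints collect: [:cp |
          cp <= 16r7F
              ifTrue: [Character value: cp]
              ifFalse: [self error: 'not representable as ASCII']])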

a Utf8ComposedIso8859String class - will provide sortable and
comparable UTF-8 forms of all ASCII and ISO-8859-1 strings.

Then a Utf8SortableCollection class - a collection of
Utf8ComposedIso8859String words and phrases.

Custom sortBlocks will define the applicable sort order.

We can create a collection - a Dictionary, on reflection - of named,
prefabricated sortBlocks.
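
For instance (the sortBlock shown here is only a placeholder ordering,
not a real national collation):

  | sortBlocks words |
  sortBlocks := Dictionary new.
  sortBlocks at: #deOrder put: [:a :b | a asLowercase <= b asLowercase].
  words := SortedCollection sortBlock: (sortBlocks at: #deOrder).
  words addAll: #('Zug' 'Apfel' 'Ofen').
  words asArray.    "#('Apfel' 'Ofen' 'Zug')"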

This will work for all UTF-8 strings derived from ISO-8859-1 and ASCII strings.

If anyone has better names for the classes, please let me know.

If anyone else wants to help
    - build these,
    - create SUnit tests for these
    - write documentation for these
Please let me know.

n.b. I have had absolutely no experience of Ropes.

My own background with this stuff:  In the early 90's I was a Project
Manager implementing office automation systems across a global
company, with offices in the Americas, in Western, Central and Eastern
European nations (including Slavic- and Cyrillic-using countries),
Japan and China.  The mission-critical application was word-processing.

Our offices were spread around the globe, and we needed those offices
to successfully exchange documents with their sister offices, and with
the customers in each region the offices were in.

Unicode was then new, and our platform supplier was the NeXT
Corporation, who had been a founder member of the Unicode Consortium
in 1990.

So far: I've read the latest version of the Unicode Standard (v8.0).
This is freely downloadable.
I've purchased a paper copy of an earlier release.  The typical change
with each release is the addition of further codespaces (i.e.
alphabets, more or less), so old copies remain useful, as well as
cheap.  (Paper copies of version 4.0 are available second-hand for
< $10 / €10.)

(I'll be going through my V4.0 just to make sure.)

Cheers,
   Euan




On 5 December 2015 at 13:08, stepharo <stepharo at free.fr> wrote:
> Hi EuanM
>
> Le 4/12/15 12:42, EuanM a écrit :
>>
>> I'm currently groping my way to seeing how feature-complete our
>> Unicode support is.  I am doing this to establish what still needs to
>> be done to provide full Unicode support.
>
>
> this is great. Thanks for pushing this. I wrote and collected some roadmap
> (analyses on different topics)
> on the pharo github project feel free to add this one there.
>>
>>
>> This seems to me to be an area where it would be best to write it
>> once, and then have the same codebase incorporated into the Smalltalks
>> that most share a common ancestry.
>>
>> I am keen to get: equality-testing for strings; sortability for
>> strings which have ligatures and diacritic characters; and correct
>> round-tripping of data.
>
> Go!
> My suggestion is
>     start small
>     make steady progress
>     write tests
>     commit often :)
>
> Stef
>
> What is the French phone book ordering? This is the first time I have
> heard of it.
>
>>
>> Call to action:
>> ==========
>>
>> If you have comments on these proposals - such as "but we already have
>> that facility" or "the reason we do not have these facilities is
>> because they are dog-slow" - please let me know them.
>>
>> If you would like to help out, please let me know.
>>
>> If you have Unicode experience and expertise, and would like to be, or
>> would be willing to be, in the  'council of experts' for this project,
>> please let me know.
>>
>> If you have comments or ideas on anything mentioned in this email,
>> please let me know.
>>
>> In the first instance, the initiative's website will be:
>> http://smalltalk.uk.to/unicode.html
>>
>> I have created a SqueakSource.com project called UnicodeSupport
>>
>> I want to avoid re-inventing any facilities which already exist.
>> Except where they prevent us reaching the goals of:
>>    - sortable UTF8 strings
>>    - sortable UTF16 strings
>>    - equivalence testing of 2 UTF8 strings
>>    - equivalence testing of 2 UTF16 strings
>>    - round-tripping UTF8 strings through Smalltalk
>>    - roundtripping UTF16 strings through Smalltalk.
>> As I understand it, we have limited Unicode support atm.
>>
>> Current state of play
>> ===============
>> ByteString gets converted to WideString when need is automagically
>> detected.
>>
>> Is there anything else that currently exists?
>>
>> Definition of Terms
>> ==============
>> A quick definition of terms before I go any further:
>>
>> Standard terms from the Unicode standard
>> ===============================
>> a compatibility character : an additional encoding of a *normal*
>> character, for compatibility and round-trip conversion purposes.  For
>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>
>> Made-up terms
>> ============
>> a convenience codepoint :  a single codepoint which represents an item
>> that is also encoded as a string of codepoints.
>>
>> (I tend to use the terms compatibility character and compatibility
>> codepoint interchangeably.  The standard only refers to them as
>> compatibility characters.  However, the standard is determined to
>> emphasise that characters are abstract and that codepoints are
>> concrete.  So I think it is often more useful and productive to think
>> of compatibility or convenience codepoints).
>>
>> a composed character :  a character made up of several codepoints
>>
>> Unicode encoding explained
>> =====================
>> A convenience codepoint can therefore be thought of as a code point
>> used for a character which also has a composed form.
>>
>> The way Unicode works is that sometimes you can encode a character in
>> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>> sometimes not.
>>
>> You can therefore have a long stream of ASCII which is single-byte
>> Unicode.  If there is an occasional Cyrillic or Greek character in the
>> stream, it would be represented either by a compatibility character or
>> by a multi-byte combination.
>>
>> Using compatibility characters can prevent proper sorting and
>> equivalence testing.
>>
>> Using "pure" Unicode, ie. "normal encodings", can cause compatibility
>> and round-tripping problems.  Although avoiding them can *also* cause
>> compatibility issues and round-tripping problems.
>>
>> Currently my thinking is:
>>
>> a Utf8String class
>> an Ordered collection, with 1 byte characters as the modal element,
>> but short arrays of wider strings where necessary
>> a Utf16String class
>> an Ordered collection, with 2 byte characters as the modal element,
>> but short arrays of wider strings
>> beginning with a 2-byte endianness indicator.
>>
>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>> compatible.
>>
>> So my thinking is that Utf8String will contain convenience codepoints,
>> for round-tripping.  And where there are multiple convenience
>> codepoints for a character, that it standardises on one.
>>
>> And that there is a Utf8SortableString which uses *only* normal
>> characters.
>>
>> We then need methods to convert between the two.
>>
>> aUtf8String asUtf8SortableString
>>
>> and
>>
>> aUtf8SortableString asUtf8String
>>
>>
>> Sort orders are culture and context dependent - Sweden and Germany
>> have different sort orders for the same diacritic-ed characters.  Some
>> countries have one order in general usage, and another for specific
>> usages, such as phone directories (e.g. UK and France)
>>
>> Similarly for Utf16 :  Utf16String and Utf16SortableString and
>> conversion methods
>>
>> A list of sorted words would be a SortedCollection, and there could be
>> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>> seOrder, ukOrder, etc
>>
>> along the lines of
>> aListOfWords := SortedCollection sortBlock: deOrder
>>
>> If a word is either a Utf8SortableString, or a well-formed Utf8String,
>> then we can perform equivalence testing on them trivially.
>>
>> To make sure a Utf8String is well formed, we would need to have a way
>> of cleaning up any convenience codepoints which were valid, but which
>> were for a character which has multiple equally-valid alternative
>> convenience codepoints, and for which the string currently had the
>> "wrong" convenience codepoint.  (i.e for any character with valid
>> alternative convenience codepoints, we would choose one to be in the
>> well-formed Utf8String, and we would need a method for cleaning the
>> alternative convenience codepoints out of the string, and replacing
>> them with the chosen approved convenience codepoint.)
>>
>> aUtf8String cleanUtf8String
>>
>> With WideString, a lot of the issues disappear - except
>> round-tripping (although I'm sure I have seen something recently about
>> 4-byte strings that also have an additional bit.  Which would make
>> some Unicode characters 5-bytes long.)
>>
>>
>> (I'm starting to zone out now - if I've overlooked anything - obvious,
>> subtle, or somewhere in between, please let me know)
>>
>> Cheers,
>>      Euan
>>
>>
>
>

