[squeak-dev] Re: [Pharo-dev] Unicode Support

Mon Dec 7 11:27:47 UTC 2015

Hi all,

First of all, I'm sorry for leaving the Squeak m17n work incomplete.
Things are degrading bit by bit, and unfortunately many things are not
working as well as before.

That said, there are a few things I'd like to mention:

On Sun, Dec 6, 2015 at 7:21 PM, EuanM <euanmee at gmail.com> wrote:
> This is a long email.  A *lot* of it is encapsulated in the Venn diagram, both at:
> http://smalltalk.uk.to/unicode-utf8.html
> and my Smalltalk in Small Steps blog at:
> http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html
>
> My current thinking, and understanding.
> ==============================
>
> 0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
>     b) UTF-8 can encode all of those characters in 1 byte, but can
> prefer some of them to be encoded as sequences of multiple bytes.  And
> can encode additional characters as sequences of multiple bytes.
>
> 1) Smalltalk has long had multiple String classes.

Yes, but the distinction was never meant to be user visible, in the
same sense that a typical user does not (always) have to worry about
the difference between SmallInteger and LargeInteger.

> 2) Any UTF16 Unicode codepoint which has a codepoint of 00nn hex
>     is encoded as a UTF-8 codepoint of nn hex.

Modulo endianness, but yes.

> 7) All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex
> - FF hex ) are defined identically in UTF-8.

3) to 6) are more or less correct, but 7) is not right, if you mean
what I think you mean.

> 8) => some Unicode codepoints map to both ASCII and ISO-8859-1.
>          all ASCII maps 1:1 to Unicode UTF-8
>          all ISO-8859-1 maps 1:1 to Unicode UTF-8

So this is not correct, for the same reason.
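
To make the objection to 7) and 8) concrete, here is a minimal sketch in
Python (not Squeak code; the helper name same_in_both is made up for this
illustration).  It compares the single ISO-8859-1 byte with the UTF-8
byte sequence for every code point below 100 hex.  Only the ASCII range
comes out byte-identical; everything from 80 hex upward takes two bytes
in UTF-8.

```python
# Compare the ISO-8859-1 encoding with the UTF-8 encoding for every
# code point below 0x100. same_in_both is a hypothetical helper name.
def same_in_both(cp):
    ch = chr(cp)
    return ch.encode("latin-1") == ch.encode("utf-8")

identical = [cp for cp in range(0x100) if same_in_both(cp)]
differing = [cp for cp in range(0x100) if not same_in_both(cp)]

print(hex(identical[0]), hex(identical[-1]))   # 0x0 0x7f
print(hex(differing[0]), hex(differing[-1]))   # 0x80 0xff
print(chr(0xE9).encode("utf-8").hex())         # c3a9 (e acute, E9 in ISO-8859-1)
```

So the code point numbers of ISO-8859-1 carry over to Unicode unchanged,
but the bytes do not carry over to UTF-8 once you leave ASCII.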

> 9) All ByteStrings elements which are either a valid ISO-8859-1
> character  or a valid ASCII character are *also* a valid UTF-8
> character.

No.  ByteStrings are meant to be ISO-8859-1.  Unfortunately, Squeak
does use ByteString to store UTF-8 (my intention was that this would
only be transient; in hindsight, using ByteArray for this transient
UTF-8 data would have been the better convention).

> 11) The preferred Unicode representation of the characters which have
> compatibility codepoints is as a short set of codepoints
> representing the characters which are combined together to form the
> glyph of the convenience codepoint, as a sequence of bytes
> representing the component characters.
>
>
> 12) Some concrete examples:
>
> £ (GBP currency symbol)
> In ISO-8859-1, not in ASCII
> ASCII : A3 hex is not a valid ASCII code
> UTF-8: £ - A3 hex

In UTF-8 this is 0xC2 0xA3, not A3.
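
A small Python sketch of the pound sign case, for anyone who wants to
check the bytes themselves (Python rather than Squeak, and the variable
names are only illustrative).  It also shows what happens when UTF-8
bytes are read as if they were ISO-8859-1 text, which is the risk of
keeping UTF-8 data in a ByteString.

```python
pound = "\u00a3"                          # POUND SIGN, code point U+00A3

latin1_bytes = pound.encode("latin-1")    # one byte:  A3
utf8_bytes = pound.encode("utf-8")        # two bytes: C2 A3

print(latin1_bytes.hex(), utf8_bytes.hex())   # a3 c2a3

# Reading the UTF-8 bytes as ISO-8859-1 yields two characters,
# U+00C2 followed by U+00A3, instead of one pound sign.
misread = utf8_bytes.decode("latin-1")
print(len(misread))                           # 2

# A lone A3 byte is a continuation byte, so it is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    lone_a3_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_a3_is_valid_utf8 = False
print(lone_a3_is_valid_utf8)                  # False
```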

> Upper Case C cedilla
> In ISO-8859-1, not in ASCII, in UTF-8 as a compatibility codepoint
> *and* a composed set of codepoints
> ASCII : C7 hex is not a valid ASCII character code
> ISO-8859-1 : Upper Case C cedilla - C7 hex
> UTF-8 : Upper Case C cedilla (compatibility codepoint) - C7 hex

no, and,

> Unicode preferred Upper Case C cedilla  (composed set of codepoints)
>    Upper case C 0043 hex (Upper case C)
>        followed by
>    cedilla 00B8 hex (cedilla)

No.  The codepoint that follows is U+0327 (COMBINING CEDILLA), or
0xCC 0xA7 in UTF-8.
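
This can be checked with Python's standard unicodedata module; the
following is only a sketch in Python, not a statement about how Squeak
does or should do it.  It decomposes the precomposed C cedilla (U+00C7)
with NFD and prints the resulting code points and their UTF-8 bytes.

```python
import unicodedata

precomposed = "\u00c7"   # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = unicodedata.normalize("NFD", precomposed)

print([f"U+{ord(c):04X}" for c in decomposed])   # ['U+0043', 'U+0327']
print(precomposed.encode("utf-8").hex())          # c387
print(decomposed[1].encode("utf-8").hex())        # cca7

# Recomposing with NFC gives back the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True

# U+0327 is a combining mark (non-zero combining class);
# U+00B8 is the spacing cedilla and does not combine (class 0).
is_combining_0327 = unicodedata.combining("\u0327") != 0
is_combining_00b8 = unicodedata.combining("\u00b8") != 0
print(is_combining_0327, is_combining_00b8)       # True False
```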

> 13) For any valid ASCII string *and* for any valid ISO-8859-1 string,
> aByteString is completely adequate for editing and display.

So unfortunately this is not true.

> 14) When sorting any valid ASCII string *or* any valid ISO-8859-1
> string, upper and lower case versions of the same character will be
> treated differently.
>
> 15) When sorting any valid ISO-8859-1 string containing
> letter+diacritic combination glyphs or ligature combination glyphs,
> the glyphs in combination will be treated differently to a "plain" glyph
> of the character
> i.e. "C" and "C cedilla" will be treated very differently.  "ß" and
> "fs" will be treated very differently.

The statement is true, but perhaps you mean "ss" instead of "fs"?
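
To illustrate 14) and 15), here is a rough Python sketch of plain code
point sorting next to a deliberately naive sort key.  The name naive_key
and the word list are invented for this example, and the key (case
folding plus dropping combining marks after NFD) is only an assumption
about what a simple sort block might do; it is not real
language-sensitive collation.

```python
import unicodedata

words = ["cat", "Zebra", "\u00c7a", "Cat", "zebra"]   # "\u00c7a" is "Ça"

# Plain code point order: upper case before lower case, and the
# C cedilla (U+00C7) after every ASCII letter.
by_code_point = sorted(words)
print(by_code_point)    # ['Cat', 'Zebra', 'cat', 'zebra', 'Ça']

def naive_key(s):
    # Hypothetical key: fold case, decompose, drop combining marks.
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

by_naive_key = sorted(words, key=naive_key)
print(by_naive_key)     # ['Ça', 'cat', 'Cat', 'Zebra', 'zebra']

# Case folding maps sharp s to "ss", not "fs".
print("\u00df".casefold())   # ss
```

The point is only that a sort key applied to ordinary strings already
gets quite far, without a separate string class per encoding.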

> a Utf8CompatibilityString class.
>
>    asByteString  - ensure only compatibility codepoints are used.
> Ensure it does not encode characters above 00FF hex.
>
>    asIso8859String - ensures only compatibility codepoints are used,
> and that the characters are each valid ISO 8859-1
>
>    asAsciiString - ensures only characters 00hex - 7F hex are used.
>
>    asUtf8ComposedIso8859String - ensures all compatibility codepoints
> are expanded into small OrderedCollections of codepoints
>
> a Utf8ComposedIso8859String class - will provide sortable and
> comparable UTF8 strings of all ASCII and ISO 8859-1 strings.
>
> Then a Utf8SortableCollection class - a collection of
> Utf8ComposedIso8859Strings words and phrases.
>
> Custom sortBlocks will define the applicable sort order.
>
> We can create a collection...  a Dictionary, thinking about it, of
> named, prefabricated sortBlocks.
>
> This will work for all UTF8 strings of ISO-8859-1 and ASCII strings.
>
> If anyone has better names for the classes, please let me know.
>
> If anyone else wants to help
>     - build these,
>     - create SUnit tests for these
>     - write documentation for these
> Please let me know.

My feeling is that these extra classes are total overkill and not
necessary.  Unfortunately, I have not been following the discussion
very closely, but what is the problem that is being solved here?

-- 
-- Yoshiki