[squeak-dev] Re: [Pharo-dev] Unicode Support

EuanM euanmee at gmail.com
Mon Dec 7 11:10:45 UTC 2015


Hi Sven, okay I'm plodding my way through https://tools.ietf.org/html/rfc3629 and
https://en.wikipedia.org/wiki/UTF-8#Examples
to see what's what.
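
A quick way to see the raw bytes from a workspace (a sketch in Pharo-style
Smalltalk, using Zinc's utf8Encoded/utf8Decoded; selector names may differ
in other dialects):

'Les élèves' utf8Encoded.
"#[76 101 115 32 195 169 108 195 168 118 101 115] - é and è each become two bytes"
'Les élèves' utf8Encoded utf8Decoded.
"'Les élèves' - decoding the bytes round-trips back to the original string"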




On 7 December 2015 at 11:01, Sven Van Caekenberghe <sven at stfx.eu> wrote:
>
>> On 07 Dec 2015, at 11:51, EuanM <euanmee at gmail.com> wrote:
>>
>> Verifying assumptions is the key reason why you should put documents
>> like this out for review.
>
> Fair enough, discussion can only help.
>
>> Sven -
>>
>> Cuis is encoded with ISO 8859-15  (aka ISO Latin 9)
>>
>> Sven, this is *NOT* as you state, ISO 99591, (and not as I stated, 8859-1).
>
> Ah, that was a typo, I meant, of course (and sorry for the confusion):
>
> 'Les élèves Français' encodeWith: #iso88591.
>
> "#[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]"
>
> 'Les élèves Français' utf8Encoded
>
> "#[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 97 105 115]"
>
> Or shorter, $é is encoded in ISO-8859-1 as #[233], but as #[195 169] in UTF-8.
>
> That Cuis chose ISO-8859-15 makes no real difference.
>
> The thing is: you started talking about UTF-8 encoded strings in the image, and then the difference between code point and encoding is really important.
>
> Only in ASCII is the encoding identical, not for anything else.
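>
> For example (a sketch; Pharo-style selectors, where asInteger answers a
> character's code point):
>
> $é asInteger.    "233 - the code point, and the single ISO-8859-1 byte"
> 'é' utf8Encoded. "#[195 169] - the same character as two UTF-8 bytes"
> $A asInteger.    "65 - in the ASCII range the code point and the UTF-8 byte coincide"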
>
>> We caught the right specification bug for the wrong reason.
>>
>> Juan: "Cuis: Chose not to use Squeak approach. Chose to make the base
>> image include and use only 1-byte strings. Chose to use ISO-8859-15"
>>
>> I have double-checked - each character encoded in ISO Latin 9 (ISO
>> 8859-15) is exactly the character represented by the corresponding
>> 1-byte codepoint in Unicode 0000 to 00FF,
>>
>> with the following exceptions:
>>
>> codepoint 20ac - Euro symbol
>> character code a4 (replaces codepoint 00a4 generic currency symbol)
>>
>> codepoint 0160 Latin Upper Case S with Caron
>> character code a6  (replaces codepoint 00A6 was ¦ broken bar character)
>>
>> codepoint 0161 Latin Lower Case s with Caron
>> character code a8 (replaces codepoint 00A8 was diaeresis)
>>
>> codepoint 017d Latin Upper Case Z with Caron
>> character code b4 (replaces codepoint 00b4 was Acute accent)
>>
>> codepoint 017e Latin Lower Case Z with Caron
>> character code b8 (replaces codepoint 00b8 was cedilla)
>>
>> codepoint 0152 Upper Case OE ligature = Ethel
>> character code bc (replaces codepoint 00bc was 1/4 symbol)
>>
>> codepoint 0153 Lower Case oe ligature = ethel
>> character code bd (replaces codepoint 00bd was 1/2 symbol)
>>
>> codepoint 0178 Upper Case Y diaeresis
>> character code be (replaces codepoint 00be was 3/4 symbol)
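>>
>> (As a quick workspace check of the first exception - a sketch, using
>> Pharo-style selectors:
>>
>> $€ asInteger.    "16r20AC = 8364, the Unicode codepoint of the Euro sign"
>> '€' utf8Encoded. "#[226 130 172] - three bytes in UTF-8"
>>
>> whereas in ISO-8859-15 that same character is the single byte 16rA4.)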
>>
>> Juan - I don't suppose we could persuade you to change to ISO  Latin-1
>> from ISO Latin-9 ?
>>
>> It means we could run the same 1 byte string encoding across  Cuis,
>> Squeak, Pharo, and, as far as I can make out so far, Dolphin Smalltalk
>> and Gnu Smalltalk.
>>
>> The downside would be that users of the French Y diaeresis would lose
>> that character, along with users of oe, OE, and s, S, z, Z with
>> caron.  Along with the Euro.
>>
>> https://en.wikipedia.org/wiki/ISO/IEC_8859-15.
>>
>> I'm confident I understand the use of UTF-8 in principle.
>>
>>
>> On 7 December 2015 at 08:27, Sven Van Caekenberghe <sven at stfx.eu> wrote:
>>> I am sorry but one of your basic assumptions is completely wrong:
>>>
>>> 'Les élèves Français' encodeWith: #iso99591.
>>>
>>> => #[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]
>>>
>>> 'Les élèves Français' utf8Encoded.
>>>
>>> => #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 97 105 115]
>>>
>>> ISO-9959-1 (~Latin1) is NOT AT ALL identical to UTF-8 in its upper, non-ASCII part !!
>>>
>>> Or shorter, $é is encoded in ISO-9959-1 as #[233], but as #[195 169] in UTF-8.
>>>
>>> So more than half the points you make, or the facts that you state, are thus plain wrong.
>>>
>>> The only thing that is correct is that the code points are equal, but that is not the same as the encoding !
>>>
>>> From this I am inclined to conclude that you do not fundamentally understand how UTF-8 works, which does not strike me as good basis to design something called a UTF8String.
>>>
>>> Sorry.
>>>
>>> PS: Note also that Cuis' choice to use ISO-9959-1 only is pretty limiting in a Unicode world.
>>>
>>>> On 07 Dec 2015, at 04:21, EuanM <euanmee at gmail.com> wrote:
>>>>
>>>> This is a long email.  A *lot* of it is encapsulated in the Venn diagram at both:
>>>> http://smalltalk.uk.to/unicode-utf8.html
>>>> and my Smalltalk in Small Steps blog at:
>>>> http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html
>>>>
>>>> My current thinking, and understanding.
>>>> ==============================
>>>>
>>>> 0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
>>>>  b) UTF-8 can encode all of those characters in 1 byte, but can
>>>> prefer some of them to be encoded as sequences of multiple bytes.  And
>>>> can encode additional characters as sequences of multiple bytes.
>>>>
>>>> 1) Smalltalk has long had multiple String classes.
>>>>
>>>> 2) Any UTF16 Unicode codepoint which has a codepoint of 00nn hex
>>>>  is encoded as a UTF-8 codepoint of nn hex.
>>>>
>>>> 3) All valid ISO-8859-1 characters have a character code between 20
>>>> hex and 7E hex, or between A0 hex and FF hex.
>>>> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>>>>
>>>> 4) All valid ASCII characters have a character code between 00 hex and 7E hex.
>>>> https://en.wikipedia.org/wiki/ASCII
>>>>
>>>>
>>>> 5) a) All character codes which are defined within both ISO-8859-1
>>>> and ASCII (i.e. character codes 20 hex to 7E hex) are
>>>> defined identically in both.
>>>>
>>>> b) All printable ASCII characters are defined identically in both
>>>> ASCII and ISO-8859-1
>>>>
>>>> 6) All character codes defined in ASCII  (00 hex to 7E hex) are
>>>> defined identically in Unicode UTF-8.
>>>>
>>>> 7) All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex
>>>> - FF hex ) are defined identically in UTF-8.
>>>>
>>>> 8) => some Unicode codepoints map to both ASCII and ISO-8859-1.
>>>>       all ASCII maps 1:1 to Unicode UTF-8
>>>>       all ISO-8859-1 maps 1:1 to Unicode UTF-8
>>>>
>>>> 9) All ByteString elements which are either a valid ISO-8859-1
>>>> character or a valid ASCII character are *also* a valid UTF-8
>>>> character.
>>>>
>>>> 10) ISO-8859-1 characters representing a character with a diacritic,
>>>> or a two-character ligature, have no ASCII equivalent.  In Unicode
>>>> UTF-8, those character codes which represent compound glyphs
>>>> are called "compatibility codepoints".
>>>>
>>>> 11) The preferred Unicode representation of the characters which have
>>>> compatibility codepoints is as a short sequence of codepoints
>>>> representing the characters which are combined together to form the
>>>> glyph of the convenience codepoint, encoded as a sequence of bytes
>>>> representing the component characters.
>>>>
>>>>
>>>> 12) Some concrete examples:
>>>>
>>>> A - aka Upper Case A
>>>> In ASCII, in ISO 8859-1
>>>> ASCII A - 41 hex
>>>> ISO-8859-1 A - 41 hex
>>>> UTF-8 A - 41 hex
>>>>
>>>> BEL (a bell sound, often invoked by a Ctrl-g keyboard chord)
>>>> In ASCII, not in ISO 8859-1
>>>> ASCII : BEL  - 07 hex
>>>> ISO-8859-1 : 07 hex is not a valid character code
>>>> UTF-8 : BEL - 07 hex
>>>>
>>>> £ (GBP currency symbol)
>>>> In ISO-8859-1, not in ASCII
>>>> ASCII : A3 hex is not a valid ASCII code
>>>> UTF-8: £ - A3 hex
>>>> ISO-8859-1: £ - A3 hex
>>>>
>>>> Upper Case C cedilla
>>>> In ISO-8859-1, not in ASCII, in UTF-8 as a compatibility codepoint
>>>> *and* a composed set of codepoints
>>>> ASCII : C7 hex is not a valid ASCII character code
>>>> ISO-8859-1 : Upper Case C cedilla - C7 hex
>>>> UTF-8 : Upper Case C cedilla (compatibility codepoint) - C7 hex
>>>> Unicode preferred Upper Case C cedilla  (composed set of codepoints)
>>>> Upper case C 0043 hex (Upper case C)
>>>>     followed by
>>>> combining cedilla 0327 hex (combining cedilla)
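>>>>
>>>> (Each of these is easy to check in a workspace - a sketch, with
>>>> Pharo-style selectors that may vary by dialect:
>>>>
>>>> $Ç asInteger.    "answers the character's code point"
>>>> 'Ç' utf8Encoded. "answers the bytes UTF-8 actually uses for it"
>>>>
>>>> and the same two expressions work for A, BEL and £ above.)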
>>>>
>>>> 13) For any valid ASCII string *and* for any valid ISO-8859-1 string,
>>>> aByteString is completely adequate for editing and display.
>>>>
>>>> 14) When sorting any valid ASCII string *or* any valid ISO-8859-1
>>>> string, upper and lower case versions of the same character will be
>>>> treated differently.
>>>>
>>>> 15) When sorting any valid ISO-8859-1 string containing
>>>> letter+diacritic combination glyphs or ligature combination glyphs,
>>>> the glyphs in combination will be treated differently to a "plain" glyph
>>>> of the character,
>>>> i.e. "C" and "C cedilla" will be treated very differently.  "ß" and
>>>> "ss" will be treated very differently.
>>>>
>>>> 16) Different nations have different rules about where diacritic-ed
>>>> characters and ligature pairs should be placed when in alphabetical
>>>> order.
>>>>
>>>> 17) Some nations even have multiple standards - e.g.  surnames
>>>> beginning either "M superscript-c" or "M superscript-a superscript-c"
>>>> are treated as beginning equivalently in UK phone directories, but not
>>>> in other situations.
>>>>
>>>>
>>>> Some practical upshots
>>>> ==================
>>>>
>>>> 1) Cuis and its ISO-8859-1 encoding is *exactly* the same as UTF-8,
>>>> for any single character it considers valid, or any ByteString it has
>>>> made up of characters it considers valid.
>>>>
>>>> 2) Any ByteString is valid UTF-8 in any of Squeak, Pharo, Cuis or any
>>>> other Smalltalk with a single byte ByteString following ASCII or
>>>> ISO-8859-1.
>>>>
>>>> 3) Any Smalltalk (or derivative language) using ByteString can
>>>> immediately consider its ByteStrings as valid UTF-8, as long as it
>>>> also considers the ByteString as valid ASCII and/or ISO-8859-1.
>>>>
>>>> 4) All of those can be successfully exported to any system using UTF-8
>>>> (e.g. HTML).
>>>>
>>>> 5) To successfully *accept* all UTF-8 we must be able to do either:
>>>> a) accept UTF-8 strings with composed characters
>>>> b) convert UTF-8 strings with composed characters into UTF-8 strings
>>>> that use *only* compatibility codepoints.
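>>>>
>>>> (A sketch of where (b) might start, in Pharo-style Smalltalk - the
>>>> range test only covers the basic combining diacritical marks block,
>>>> U+0300 to U+036F, so it is an illustration rather than a full check,
>>>> and aString is just a placeholder variable:
>>>>
>>>> hasCombining := aString contains:
>>>>     [ :each | each asInteger between: 16r0300 and: 16r036F ].
>>>>
>>>> Strings answering true would then need their composed sequences folded
>>>> back into single compatibility codepoints.)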
>>>>
>>>>
>>>> Class + protocol proposals
>>>>
>>>>
>>>>
>>>> a Utf8CompatibilityString class.
>>>>
>>>> asByteString  - ensure only compatibility codepoints are used.
>>>> Ensure it does not encode characters above 00FF hex.
>>>>
>>>> asIso8859String - ensures only compatibility codepoints are used,
>>>> and that the characters are each valid ISO 8859-1
>>>>
>>>> asAsciiString - ensures only characters 00hex - 7F hex are used.
>>>>
>>>> asUtf8ComposedIso8859String - ensures all compatibility codepoints
>>>> are expanded into small OrderedCollections of codepoints
>>>>
>>>> a Utf8ComposedIso8859String class - will provide sortable and
>>>> comparable UTF8 strings of all ASCII and ISO 8859-1 strings.
>>>>
>>>> Then a Utf8SortableCollection class - a collection of
>>>> Utf8ComposedIso8859Strings words and phrases.
>>>>
>>>> Custom sortBlocks will define the applicable sort order.
>>>>
>>>> We can create a collection...  a Dictionary, thinking about it, of
>>>> named, prefabricated sortBlocks.
>>>>
>>>> This will work for all UTF8 strings of ISO-8859-1 and ASCII strings.
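>>>>
>>>> (In Pharo/Squeak syntax the first of those might start out roughly
>>>> like this - purely a sketch of the proposal, nothing here exists yet:
>>>>
>>>> Object subclass: #Utf8CompatibilityString
>>>>     instanceVariableNames: 'bytes'
>>>>     classVariableNames: ''
>>>>     category: 'UnicodeSupport'
>>>>
>>>> with asByteString, asIso8859String, asAsciiString and
>>>> asUtf8ComposedIso8859String filled in as the conversion methods above.)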
>>>>
>>>> If anyone has better names for the classes, please let me know.
>>>>
>>>> If anyone else wants to help
>>>>  - build these,
>>>>  - create SUnit tests for these
>>>>  - write documentation for these
>>>> Please let me know.
>>>>
>>>> n.b. I have had absolutely no experience of Ropes.
>>>>
>>>> My own background with this stuff:  In the early 90s I was a Project
>>>> Manager implementing office automation systems across a global
>>>> company, with offices in the Americas, in Western, Eastern and Central
>>>> European nations (including Slavic and Cyrillic users), and in Japan and
>>>> China. The mission-critical application was word-processing.
>>>>
>>>> Our offices were spread around the globe, and we needed those offices
>>>> to successfully exchange documents with their sister offices, and with
>>>> the customers in each region the offices were in.
>>>>
>>>> Unicode was then new, and our platform supplier was the NeXT
>>>> Corporation, which had been a founder member of the Unicode Consortium
>>>> in 1990.
>>>>
>>>> So far: I've read the latest version of the Unicode Standard (v8.0).
>>>> This is freely downloadable.
>>>> I've purchased a paper copy of an earlier release.  New releases
>>>> typically consist of additional codespaces (i.e. alphabets).  So old
>>>> copies are useful, as well as cheap.  (Paper copies of version 4.0
>>>> are available second-hand for < $10 / €10).
>>>>
>>>> The typical change with each release is the addition of further
>>>> codespaces (i.e. alphabets, more or less), so you don't lose a lot.
>>>> (I'll be going through my V4.0 just to make sure.)
>>>>
>>>> Cheers,
>>>> Euan
>>>>
>>>>
>>>>
>>>>
>>>> On 5 December 2015 at 13:08, stepharo <stepharo at free.fr> wrote:
>>>>> Hi EuanM
>>>>>
>>>>> Le 4/12/15 12:42, EuanM a écrit :
>>>>>>
>>>>>> I'm currently groping my way to seeing how feature-complete our
>>>>>> Unicode support is.  I am doing this to establish what still needs to
>>>>>> be done to provide full Unicode support.
>>>>>
>>>>>
>>>>> This is great. Thanks for pushing this. I wrote and collected some roadmaps
>>>>> (analyses of different topics)
>>>>> on the Pharo GitHub project; feel free to add this one there.
>>>>>>
>>>>>>
>>>>>> This seems to me to be an area where it would be best to write it
>>>>>> once, and then have the same codebase incorporated into the Smalltalks
>>>>>> that most share a common ancestry.
>>>>>>
>>>>>> I am keen to get: equality-testing for strings; sortability for
>>>>>> strings which have ligatures and diacritic characters; and correct
>>>>>> round-tripping of data.
>>>>>
>>>>> Go!
>>>>> My suggestion is
>>>>>  start small
>>>>>  make steady progress
>>>>>  write tests
>>>>>  commit often :)
>>>>>
>>>>> Stef
>>>>>
>>>>> What is the French phonebook ordering? This is the first time I have
>>>>> heard about it.
>>>>>
>>>>>>
>>>>>> Call to action:
>>>>>> ==========
>>>>>>
>>>>>> If you have comments on these proposals - such as "but we already have
>>>>>> that facility" or "the reason we do not have these facilities is
>>>>>> because they are dog-slow" - please let me know them.
>>>>>>
>>>>>> If you would like to help out, please let me know.
>>>>>>
>>>>>> If you have Unicode experience and expertise, and would like to be, or
>>>>>> would be willing to be, in the  'council of experts' for this project,
>>>>>> please let me know.
>>>>>>
>>>>>> If you have comments or ideas on anything mentioned in this email,
>>>>>> please let me know.
>>>>>>
>>>>>> In the first instance, the initiative's website will be:
>>>>>> http://smalltalk.uk.to/unicode.html
>>>>>>
>>>>>> I have created a SqueakSource.com project called UnicodeSupport
>>>>>>
>>>>>> I want to avoid re-inventing any facilities which already exist.
>>>>>> Except where they prevent us reaching the goals of:
>>>>>> - sortable UTF8 strings
>>>>>> - sortable UTF16 strings
>>>>>> - equivalence testing of 2 UTF8 strings
>>>>>> - equivalence testing of 2 UTF16 strings
>>>>>> - round-tripping UTF8 strings through Smalltalk
>>>>>> - roundtripping UTF16 strings through Smalltalk.
>>>>>> As I understand it, we have limited Unicode support atm.
>>>>>>
>>>>>> Current state of play
>>>>>> ===============
>>>>>> ByteString gets converted to WideString when the need is
>>>>>> automagically detected.
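>>>>>>
>>>>>> For instance (class names as in current Pharo/Squeak images):
>>>>>>
>>>>>> 'abc' class.  "ByteString - every character fits in one byte"
>>>>>> 'abcπ' class. "WideString - the Greek letter forces the wide representation"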
>>>>>>
>>>>>> Is there anything else that currently exists?
>>>>>>
>>>>>> Definition of Terms
>>>>>> ==============
>>>>>> A quick definition of terms before I go any further:
>>>>>>
>>>>>> Standard terms from the Unicode standard
>>>>>> ===============================
>>>>>> a compatibility character : an additional encoding of a *normal*
>>>>>> character, for compatibility and round-trip conversion purposes.  For
>>>>>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>>>>>
>>>>>> Made-up terms
>>>>>> ============
>>>>>> a convenience codepoint :  a single codepoint which represents an item
>>>>>> that is also encoded as a string of codepoints.
>>>>>>
>>>>>> (I tend to use the terms compatibility character and compatibility
>>>>>> codepoint interchangeably.  The standard only refers to them as
>>>>>> compatibility characters.  However, the standard is determined to
>>>>>> emphasise that characters are abstract and that codepoints are
>>>>>> concrete.  So I think it is often more useful and productive to think
>>>>>> of compatibility or convenience codepoints).
>>>>>>
>>>>>> a composed character :  a character made up of several codepoints
>>>>>>
>>>>>> Unicode encoding explained
>>>>>> =====================
>>>>>> A convenience codepoint can therefore be thought of as a code point
>>>>>> used for a character which also has a composed form.
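>>>>>>
>>>>>> Concretely, for é (a sketch; WideString is used so the combining mark,
>>>>>> which is above 16rFF, fits in any dialect):
>>>>>>
>>>>>> WideString with: (Character value: 16rE9).
>>>>>> "é as a single convenience codepoint, U+00E9"
>>>>>> WideString with: (Character value: 16r65) with: (Character value: 16r0301).
>>>>>> "the same character composed as e (U+0065) plus combining acute accent (U+0301)"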
>>>>>>
>>>>>> The way Unicode works is that sometimes you can encode a character in
>>>>>> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>>>>>> sometimes not.
>>>>>>
>>>>>> You can therefore have a long stream of ASCII which is single-byte
>>>>>> Unicode.  If there is an occasional Cyrillic or Greek character in the
>>>>>> stream, it would be represented either by a compatibility character or
>>>>>> by a multi-byte combination.
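>>>>>>
>>>>>> For example (a sketch using Pharo's Zinc selector utf8Encoded):
>>>>>>
>>>>>> 'abc' utf8Encoded. "#[97 98 99] - pure ASCII stays one byte per character"
>>>>>> 'π' utf8Encoded.   "#[207 128] - the Greek letter pi becomes two bytes"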
>>>>>>
>>>>>> Using compatibility characters can prevent proper sorting and
>>>>>> equivalence testing.
>>>>>>
>>>>>> Using "pure" Unicode, i.e. "normal encodings", can cause compatibility
>>>>>> and round-tripping problems.  Although avoiding them can *also* cause
>>>>>> compatibility issues and round-tripping problems.
>>>>>>
>>>>>> Currently my thinking is:
>>>>>>
>>>>>> a Utf8String class:
>>>>>> an ordered collection, with 1-byte characters as the modal element,
>>>>>> but short arrays of wider characters where necessary;
>>>>>> a Utf16String class:
>>>>>> an ordered collection, with 2-byte characters as the modal element,
>>>>>> but short arrays of wider characters,
>>>>>> beginning with a 2-byte endianness indicator.
>>>>>>
>>>>>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>>>>>> compatible.
>>>>>>
>>>>>> So my thinking is that Utf8String will contain convenience codepoints,
>>>>>> for round-tripping.  And where there are multiple convenience
>>>>>> codepoints for a character, that it standardises on one.
>>>>>>
>>>>>> And that there is a Utf8SortableString which uses *only* normal
>>>>>> characters.
>>>>>>
>>>>>> We then need methods to convert between the two.
>>>>>>
>>>>>> aUtf8String asUtf8SortableString
>>>>>>
>>>>>> and
>>>>>>
>>>>>> aUtf8SortableString asUtf8String
>>>>>>
>>>>>>
>>>>>> Sort orders are culture and context dependent - Sweden and Germany
>>>>>> have different sort orders for the same diacritic-ed characters.  Some
>>>>>> countries have one order in general usage, and another for specific
>>>>>> usages, such as phone directories (e.g. UK and France)
>>>>>>
>>>>>> Similarly for Utf16 :  Utf16String and Utf16SortableString and
>>>>>> conversion methods
>>>>>>
>>>>>> A list of sorted words would be a SortedCollection, and there could be
>>>>>> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>>>>>> seOrder, ukOrder, etc
>>>>>>
>>>>>> along the lines of
>>>>>> aListOfWords := SortedCollection sortBlock: deOrder
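>>>>>>
>>>>>> where a sortBlock is just a two-argument block (this sketch only folds
>>>>>> case; it is a placeholder rather than real German collation):
>>>>>>
>>>>>> deOrder := [ :a :b | a asLowercase <= b asLowercase ].
>>>>>> aListOfWords := SortedCollection sortBlock: deOrder.
>>>>>> aListOfWords addAll: #('Zebra' 'apfel' 'Ähre').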
>>>>>>
>>>>>> If a word is either a Utf8SortableString, or a well-formed Utf8String,
>>>>>> then we can perform equivalence testing on them trivially.
>>>>>>
>>>>>> To make sure a Utf8String is well formed, we would need to have a way
>>>>>> of cleaning up any convenience codepoints which were valid, but which
>>>>>> were for a character which has multiple equally-valid alternative
>>>>>> convenience codepoints, and for which the string currently had the
>>>>>> "wrong" convenience codepoint.  (i.e. for any character with valid
>>>>>> alternative convenience codepoints, we would choose one to be in the
>>>>>> well-formed Utf8String, and we would need a method for cleaning the
>>>>>> alternative convenience codepoints out of the string, and replacing
>>>>>> them with the chosen approved convenience codepoint.)
>>>>>>
>>>>>> aUtf8String cleanUtf8String
>>>>>>
>>>>>> With WideString, a lot of the issues disappear - except
>>>>>> round-tripping (although I'm sure I have seen something recently about
>>>>>> 4-byte strings that also have an additional bit, which would make
>>>>>> some Unicode characters 5 bytes long).
>>>>>>
>>>>>>
>>>>>> (I'm starting to zone out now - if I've overlooked anything - obvious,
>>>>>> subtle, or somewhere in between, please let me know)
>>>>>>
>>>>>> Cheers,
>>>>>>   Euan
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>
>
>

