[squeak-dev] [Unicode] Summary (Re: [Pharo-dev] Unicode Support // e acute example --> decomposition in Pharo?)

Andres Valloud avalloud at smalltalk.comcastbiz.net
Sat Dec 19 02:04:45 UTC 2015


So a lot of Windows APIs require UTF-16.  What's up with UTF-8 being the 
only choice mentioned for external communication?

Unicode string encodings like UTF-* and strings of "characters" (that 
is, sequences of Unicode code points) should be clearly distinguished. 
Do you really mean "UTF-32", or do you mean "UCS-4"?  Even those two are 
not exactly the same.
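
To make the distinction concrete, here is the same single code point in two encodings, as a Pharo sketch assuming the Zinc encoder classes (ZnUTF8Encoder / ZnUTF16Encoder) are available; the exact byte order of the UTF-16 result depends on the encoder's endianness:

| s |
s := String with: (Character value: 16rE9).  "LATIN SMALL LETTER E WITH ACUTE"
s first asInteger.                   "=> 233: one code point, U+00E9"
ZnUTF8Encoder new encodeString: s.   "=> 2 bytes in UTF-8"
ZnUTF16Encoder new encodeString: s.  "=> 2 bytes, i.e. one 16-bit code unit"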

On 12/18/15 5:47, H. Hirzel wrote:
> Hello Sven
>
> Thank you for your report about your experimental, proof of
> concept, prototype project that aims to improve Unicode support.
> Please include me in the loop.
>
> Below is my attempt at summarizing the Unicode discussion of the last few weeks.
> Corrections /comments / additions are welcome.
>
> Kind regards
>
> Hannes
>
>
> 1) There is a need for improved Unicode support implemented _within_
> the image, probably as a library.
>
> 1a) This follows the example of the Twitter CLDR library (i.e. a
> re-implementation of ICU components for Ruby).
> https://github.com/twitter/twitter-cldr-rb
>
> Other languages/libraries have similar approaches
> - .NET, https://msdn.microsoft.com/en-us/library/System.Globalization.CharUnicodeInfo%28v=vs.110%29.aspx
> - Python https://docs.python.org/3/howto/unicode.html
> - Go http://blog.golang.org/strings
> - Swift, https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
> - Perl https://perldoc.perl.org/perlunicode
>
> 1b) ICU is _not_ the way to go (http://site.icu-project.org/). This
> is because of security and portability concerns (Eliot Miranda) and
> because the Smalltalk approach wants to expose the basic algorithms
> in Smalltalk code. In addition, the 16-bit-based ICU library does
> not fit well with the Squeak/Pharo UTF-32 model.
>
> 2) The Unicode infrastructure (21(32)-bit wide Characters as immediate
> objects, use of UTF-32 internally, indexable strings, UTF-8 for outside
> communication, support for code converters) is a very valuable
> foundation which makes algorithms more straightforward, at the expense
> of more memory usage. It is not yet used to its full potential,
> though a lot of hard work has been done.
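>
> A few workspace expressions illustrating this model (standard
> Character/String protocol; #utf8Encoded is Pharo-specific and is an
> assumption for the other dialects):
>
> $é asInteger.                               "=> 233, code point U+00E9"
> (Character value: 16r1F600) asString size.  "=> 1, even outside the BMP"
> 'café' utf8Encoded.                         "=> ByteArray for outside communication"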
>
> 3) The Unicode algorithms are mostly table / database driven. This
> means that dictionary lookup is a prominent part of the algorithms.
> The essential building block for this is making the Unicode Character
> Database (UCD, http://www.unicode.org/ucd/) available _within_ the
> image, with the full content needed for the target languages / scripts
> one wants to deal with. The process of loading the UCD should be made
> configurable.
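>
> A minimal sketch of that lookup (the dictionary and its single entry
> are hand-built for illustration; the real table would be loaded from
> the UCD):
>
> | decompositions |
> decompositions := Dictionary new.
> "from UnicodeData.txt: U+00E9 decomposes to U+0065 U+0301"
> decompositions at: 16rE9 put: #(16r65 16r301).
> decompositions at: 16rE9 ifAbsent: [ #() ].   "=> #(101 769)"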
>
> 3a) Many people are interested only in the Latin script (and scripts
> of similar complexity).
> 3b) The UCD data in XML form
> (http://www.unicode.org/Public/8.0.0/ucdxml/) is available for
> download with or without the CJK characters.
>
> 4) The next step is to implement normalization
> (http://www.unicode.org/reports/tr15/#Norm_Forms). Glad to read that
> you have reached results here with the test data:
> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt.
>
> 5) Pharo offers nice inspectors to view dictionaries and ordered
> collections (table view, drill down), which facilitates the development
> of table-driven algorithms. The data structures and algorithms do
> not depend on a particular dialect, though, and may be ported to Squeak
> or Cuis.
>
> 6) After normalization has been implemented, comparison may be
> implemented. This needs access to collation data from the CLDR
> (Unicode Common Locale Data Repository, http://cldr.unicode.org/).
>
>
> 7) An architecture has the following subsystems
>
> 7a) Basic character handling (21(32)-bit characters in indexable
> strings; point 2)
> 7b) Runtime access to the Unicode Character Database (point 3)
> 7c) Converters
> 7d) Normalization (point 4)
> 7e) CLDR access (point 6)
>
>
> 8) The implementation should be driven by the current needs.
>
> An attainable next goal is to release
>
> 8a) a StringBuilder utility class for easier construction of test strings
> i.e. instead of
>
>> normalizer composeString:
>>     (#(68 117 776 115 115 101 108 100 111 114 102 32 75 111 776 110 105 103 115 97 108 108 101 101)
>>         collect: #asCharacter as: String).
>
> do
>
> normalizer composeString:
>     (StringBuilder construct: 'Du\u0308sseldorf Ko\u0308nigsallee')
>
> and construct some test cases with it which illustrate some basic
> Unicode issues.
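>
> A possible workspace sketch of that proposed behaviour, written as a
> block since the StringBuilder class does not exist yet (it handles
> only \uXXXX escapes):
>
> | construct |
> construct := [ :template | | in out |
>     in := ReadStream on: template.
>     out := WriteStream on: String new.
>     [ in atEnd ] whileFalse: [ | ch |
>         ch := in next.
>         (ch = $\ and: [ in peek = $u ])
>             ifTrue: [
>                 in next. "skip the u"
>                 out nextPut: (Character value:
>                     (Integer readFrom: (in next: 4) base: 16)) ]
>             ifFalse: [ out nextPut: ch ] ].
>     out contents ].
> construct value: 'Du\u0308sseldorf Ko\u0308nigsallee'.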
>
> 8b) identity testing for major languages (e.g. French, German,
> Spanish) and scripts of similar complexity.
>
> 8c) to provide some more documentation of past and current efforts.
>
> Note: This summary has only covered string manipulation, not rendering
> on the screen, which is a different issue.
>
>
> On 12/16/15, Sven Van Caekenberghe <sven at stfx.eu> wrote:
>> Hi Hannes,
>>
>> My detailed comments/answers below, after quoting 2 of your emails:
>>
>>> On 10 Dec 2015, at 22:17, H. Hirzel <hannes.hirzel at gmail.com> wrote:
>>>
>>> Hello Sven
>>>
>>> On 12/9/15, Sven Van Caekenberghe <sven at stfx.eu> wrote:
>>>
>>>> The simplest example in a common language (the French letter é) is
>>>>
>>>> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>>>>
>>>> which can also be written as
>>>>
>>>> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT
>>>> [U+0301]
>>>>
>>>> The former being a composed normal form, the latter a decomposed normal
>>>> form. (And yes, it is even much more complicated than that, it goes on
>>>> for
>>>> 1000s of pages).
>>>>
>>>> In the above example, the concept of character/string is indeed fuzzy.
>>>>
>>>> HTH,
>>>>
>>>> Sven
>>>
>>> Thanks for this example. I have created a wiki page with it.
>>>
>>> I wonder what the Pharo equivalent is of the following Squeak expression
>>>
>>>     $é asString asDecomposedUnicode
>>>
>>> Regards
>>>
>>> Hannes
>>
>> You also wrote:
>>
>>> The text below shows how to deal with the Unicode e-acute example
>>> brought up by Sven in terms of comparing strings. Currently, Pharo and
>>> Cuis do not normalize strings; limited support exists in Squeak.
>>> It will be shown how NFD normalization may be implemented.
>>>
>>>
>>> Swift programming language
>>> -----------------------------------------
>>>
>>> How does the Swift programming language [1] deal with Unicode strings?
>>>
>>> // "Voulez-vous un café?" using LATIN SMALL LETTER E WITH ACUTE
>>>     let eAcuteQuestion = "Voulez-vous un caf\u{E9}?"
>>>
>>>     // "Voulez-vous un cafe&#769;?" using LATIN SMALL LETTER E and
>>> COMBINING ACUTE ACCENT
>>>     let combinedEAcuteQuestion = "Voulez-vous un caf\u{65}\u{301}?"
>>>
>>>     if eAcuteQuestion == combinedEAcuteQuestion {
>>>     print("These two strings are considered equal")
>>>     }
>>>     // prints "These two strings are considered equal"
>>>
>>> The equality operator uses the NFD (Normalization Form Decomposed)[2]
>>> form for the comparison, applying the method
>>> #decomposedStringWithCanonicalMapping[3].
>>>
>>>
>>> Squeak / Pharo
>>> -----------------------
>>>
>>> Comparison without NFD [3]
>>>
>>>
>>> "Voulez-vous un café?"
>>> eAcuteQuestion  := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter
>>> asString, '?'.
>>>
>>>
>>> eAcuteQuestion = combinedEAcuteQuestion
>>> false
>>>
>>> eAcuteQuestion == combinedEAcuteQuestion
>>> false
>>>
>>> The result is false. A Unicode-conformant application, however,
>>> should return *true*.
>>>
>>> The reason is that Squeak / Pharo strings are not put into NFD
>>> before being tested for equality with =.
>>>
>>>
>>> Squeak Unicode strings may be tested for Unicode conformant equality
>>> by converting them to NFD before testing.
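>>>
>>> A tiny String extension capturing this (the selector is hypothetical;
>>> it builds on #asDecomposedUnicode [4]):
>>>
>>> unicodeEquals: aString
>>>     "Answer true when the receiver and aString are canonically
>>>     equivalent, i.e. when their NFD forms are equal."
>>>     ^ self asDecomposedUnicode = aString asDecomposedUnicode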
>>>
>>>
>>>
>>> Squeak using NFD
>>>
>>> asDecomposedUnicode [4] transforms a string into NFD for the cases
>>> where a Unicode code point, if decomposed at all, decomposes into no
>>> more than two code points [5]. This is a limitation imposed by the
>>> code which initializes [6] the Unicode Character Database in Squeak
>>> by reading UnicodeData.txt [7][8]. It is not a necessary limitation:
>>> the code may be rewritten at the price of a more complex
>>> implementation of #asDecomposedUnicode.
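>>>
>>> For reference, the decomposition is the sixth semicolon-separated
>>> field of a UnicodeData.txt line; a reader without the two-code-point
>>> limit could be sketched as follows (using Pharo's #split: and
>>> #substrings:, not the actual Squeak loader):
>>>
>>> | line fields codes |
>>> line := '00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9'.
>>> fields := $; split: line.   "preserves empty fields"
>>> codes := ((fields at: 6) substrings: ' ')
>>>     collect: [ :each | Integer readFrom: each base: 16 ].
>>> codes.   "=> 101 and 769, i.e. e followed by combining acute"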
>>>
>>> "Voulez-vous un café?"
>>> eAcuteQuestion  := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter
>>> asString, '?'.
>>>
>>>
>>> eAcuteQuestion asDecomposedUnicode =
>>>     combinedEAcuteQuestion  asDecomposedUnicode
>>>
>>> true
>>>
>>>
>>>
>>> Conclusion
>>> ------------------
>>>
>>> Implementing a method like #decomposedStringWithCanonicalMapping
>>> (Swift) which puts a string into NFD (Normalization Form D) is an
>>> important building block towards better Unicode compliance. A Squeak
>>> proposal is given by [4]. It needs to be reviewed and extended,
>>> probably for the cases where there are more than two code points
>>> (three or more) in the decomposed form.
>>>
>>> Implementing NFD comparison gives us an equality test for
>>> comparatively small effort; the simple cases it handles cover a large
>>> number of use cases (languages using the Latin script).
>>>
>>> The algorithm is table driven by the UCD [8]. From this follows a
>>> simple but important fact: conformant implementations need runtime
>>> access to information from the Unicode Character Database [UCD][9].
>>>
>>>
>>> [1]
>>> https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html#//apple_ref/doc/uid/TP40014097-CH7-ID285
>>> [2] http://www.unicode.org/glossary/#normalization_form_d
>>> [3]
>>> https://developer.apple.com/library/ios/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/index.html#//apple_ref/occ/instm/NSString/decomposedStringWithCanonicalMapping
>>> [4] String asDecomposedUnicode http://wiki.squeak.org/squeak/6250
>>> [5] http://www.unicode.org/glossary/#code_point
>>> [6] Unicode initialize http://wiki.squeak.org/squeak/6248
>>> [7] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>>> [8] Unicode Character Database documentation
>>> http://unicode.org/reports/tr44/
>>> [9] http://www.unicode.org/reports/tr23/
>>
>>
>> Today, we have a Unicode and a CombinedCharacter class in Pharo, and there
>> is different but similar Unicode code in Squeak. These are too simple (even
>> though they might work, partially).
>>
>> The scope of the original threads is way too wide: a new string type,
>> normalisation, collation, being cross dialect, mixing all kinds of character
>> and encoding definitions. All interesting, but not much will come out of it.
>> But the point that we cannot leave proper text string handling to an outside
>> library is indeed key.
>>
>> That is why a couple of people in the Pharo community (myself included)
>> started an experimental, proof of concept, prototype project, that aims to
>> improve Unicode support. We will announce it to a wider public when we feel
>> we have something to show for. The goal is in the first place to understand
>> and implement the fundamental algorithms, starting with the 4 forms of
>> Normalisation. But we're working on collation/sorting too.
>>
>> This work is of course being done for/in Pharo, using some of the facilities
>> only available there. It probably won't be difficult to port, but we can't
>> be bothered with portability right now.
>>
>> What we started with is loading the UCD data and making it available as
>> nice objects (30,000 of them).
>>
>> So now you can do things like
>>
>> $é unicodeCharacterData.
>>
>>   => "U+00E9 LATIN SMALL LETTER E WITH ACUTE (LATIN SMALL LETTER E ACUTE)"
>>
>> $é unicodeCharacterData uppercase asCharacter.
>>
>>   => "$É"
>>
>> $é unicodeCharacterData decompositionMapping.
>>
>>   => "#(101 769)"
>>
>> There is also a cool GT Inspector view [screenshot not preserved in
>> the list archive].
>>
>> Next we started implementing a normaliser. It was rather easy to get support
>> for simpler languages going. The next code snippets use explicit code
>> arrays, because copying decomposed diacritics to my mail client does not
>> work (they get automatically composed); in a Pharo Workspace this works
>> nicely with plain strings. The higher numbers are the diacritics.
>>
>> (normalizer decomposeString: 'les élèves Français')
>>     collect: #codePoint as: Array.
>>
>>   => "#(108 101 115 32 101 769 108 101 768 118 101 115 32 70 114 97 110 99 807 97 105 115)"
>>
>> (normalizer decomposeString: 'Düsseldorf Königsallee')
>>     collect: #codePoint as: Array.
>>
>>   => "#(68 117 776 115 115 101 108 100 111 114 102 32 75 111 776 110 105 103 115 97 108 108 101 101)"
>>
>> normalizer composeString:
>>     (#(108 101 115 32 101 769 108 101 768 118 101 115 32 70 114 97 110 99 807 97 105 115)
>>         collect: #asCharacter as: String).
>>
>>   => "'les élèves Français'"
>>
>> normalizer composeString:
>>     (#(68 117 776 115 115 101 108 100 111 114 102 32 75 111 776 110 105 103 115 97 108 108 101 101)
>>         collect: #asCharacter as: String).
>>
>>   => "'Düsseldorf Königsallee'"
>>
>> However, the real algorithm following the official specification (and other
>> elements of Unicode that interact with it) is way more complicated (think
>> about all those special languages/scripts out there). We're focused on
>> understanding/implementing that now.
>>
>> Next, unit tests were added (of course), as well as a test that uses
>> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt to run about
>> 75,000 individual test cases checking conformance to the official Unicode
>> Normalization specification.
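>>
>> Each line of that file holds five code-point sequences
>> (source;NFC;NFD;NFKC;NFKD). A sketch of a single such check against
>> the normaliser shown above (the parsing helper is illustrative and
>> assumes Pharo protocol):
>>
>> | parse columns source nfd |
>> parse := [ :field |
>>     (field substrings: ' ')
>>         collect: [ :hex | (Integer readFrom: hex base: 16) asCharacter ]
>>         as: String ].
>> columns := $; split: '00E9;00E9;0065 0301;00E9;0065 0301'.
>> source := parse value: (columns at: 1).
>> nfd := parse value: (columns at: 3).
>> (normalizer decomposeString: source) = nfd.   "expected: true"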
>>
>> Right now (with super cool hangul / jamo code by Henrik), we hit the
>> following stats:
>>
>> #testNFC 16998/18593 (91.42%)
>> #testNFD 16797/18593 (90.34%)
>> #testNFKC 13321/18593 (71.65%)
>> #testNFKD 16564/18593 (89.09%)
>>
>> Way better than the naive implementations, but not yet there.
>>
>> We are also experimenting and thinking a lot about how to best implement all
>> this, trying out different models/ideas/apis/representations.
>>
>> It will move slowly, but you will hear from us again in the coming
>> weeks/months.
>>
>> Sven
>>
>> PS: Pharo developers with a good understanding of this subject area that
>> want to help, let me know and we'll put you in the loop. Hacking and
>> specification reading are required ;-)
>>
>>
>
>

