[squeak-dev] [Unicode] Summary (Re: [Pharo-dev] Unicode Support // e acute example --> decomposition in Pharo?)

H. Hirzel hannes.hirzel at gmail.com
Fri Dec 18 13:47:24 UTC 2015


Hello Sven

Thank you for your report about your experimental, proof-of-concept
prototype project that aims to improve Unicode support.
Please include me in the loop.

Below is my attempt at summarizing the Unicode discussion of the last few weeks.
Corrections / comments / additions are welcome.

Kind regards

Hannes


1) There is a need for improved Unicode support implemented _within_
the image, probably as a library.

1a) This follows the example of the Twitter CLDR library (i.e. a
re-implementation of ICU components for Ruby).
https://github.com/twitter/twitter-cldr-rb

Other languages/libraries have similar approaches:
- .NET https://msdn.microsoft.com/en-us/library/System.Globalization.CharUnicodeInfo%28v=vs.110%29.aspx
- Python https://docs.python.org/3/howto/unicode.html
- Go http://blog.golang.org/strings
- Swift https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
- Perl

1b) ICU (http://site.icu-project.org/) is _not_ the way to go. This is
for security and portability reasons (Eliot Miranda) and because the
Smalltalk approach favours exposing the basic algorithms in Smalltalk
code. In addition, the 16-bit-based ICU library does not fit well with
the Squeak/Pharo UTF-32 model.

2) The Unicode infrastructure (21(32)-bit wide Characters as immediate
objects, use of UTF-32 internally, indexable strings, UTF-8 for outside
communication, support for code converters) is a very valuable
foundation which makes algorithms more straightforward at the expense
of higher memory usage. It is not yet used to its full potential,
though a lot of hard work has been done.
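
A few expressions illustrating this foundation (Pharo selectors are
assumed; #utf8Encoded comes with the Zinc encoders bundled with Pharo,
Squeak spells the conversions differently):

  (Character value: 16r1F600) asString.  "one code point above U+FFFF, in a one-element string"
  'café' size.                           "4 -- strings are indexable per code point"
  ('café' at: 4) asInteger.              "233, i.e. U+00E9"
  'café' utf8Encoded.                    "#[99 97 102 195 169] -- UTF-8 only at the boundary"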

3) The Unicode algorithms are mostly table / database driven. This
means that dictionary lookup is a prominent part of the algorithms.
The essential building block for this is that the Unicode Character
Database (UCD, http://www.unicode.org/ucd/) is made available
_within_ the image, with the full content as needed by the target
languages / scripts one wants to deal with. The process of loading the
UCD should be made configurable (a loading sketch follows point 3b).

3a) A lot of people are interested only in the Latin script (and
scripts of similar complexity).
3b) The UCD data in XML form
(http://www.unicode.org/Public/8.0.0/ucdxml/) is available as downloads
with and without the CJK characters.
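
A minimal sketch of such a table-driven lookup (Pharo selectors; it
assumes a local copy of UnicodeData.txt and, being a sketch, does not
expand the <..., First> / <..., Last> range entries used for the CJK
blocks):

  | ucd |
  ucd := Dictionary new.
  'UnicodeData.txt' asFileReference contents linesDo: [ :line |
      line isEmpty ifFalse: [
          | fields |
          "semicolon-separated fields: 1 = code point (hex), 2 = name,
           6 = decomposition mapping, ..."
          fields := line splitOn: $;.
          ucd at: (Integer readFrom: fields first base: 16) put: fields ] ].
  (ucd at: 16rE9) second.  "'LATIN SMALL LETTER E WITH ACUTE'"
  (ucd at: 16rE9) sixth.   "'0065 0301' -- the canonical decomposition of é"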

4) The next step is to implement normalization
(http://www.unicode.org/reports/tr15/#Norm_Forms). Glad to read that
you have reached results here with the test data:
http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt.
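
A sketch of such a conformance check, limited to the NFD column (it
assumes a local copy of NormalizationTest.txt, Pharo selectors, and a
normalizer responding to #decomposeString: as in the snippets quoted
below):

  | failures |
  failures := 0.
  'NormalizationTest.txt' asFileReference contents linesDo: [ :line |
      (line isEmpty or: [ '#@' includes: line first ]) ifFalse: [
          | columns source nfd |
          "data lines: source;NFC;NFD;NFKC;NFKD; each column a list of hex code points"
          columns := ((line splitOn: $;) first: 5) collect: [ :column |
              ((column substrings: ' ') collect: [ :hex |
                  (Integer readFrom: hex base: 16) asCharacter ]) as: String ].
          source := columns first.
          nfd := columns third.
          (normalizer decomposeString: source) = nfd
              ifFalse: [ failures := failures + 1 ] ] ].
  failures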

5) Pharo offers nice inspectors for viewing dictionaries and ordered
collections (table view, drill-down), which facilitates the development
of table-driven algorithms. The data structures and algorithms do not
depend on a particular dialect, though, and may be ported to Squeak
or Cuis.

6) After normalization has been implemented, comparison may be
implemented. This needs access to collation data from the CLDR
(Unicode Common Locale Data Repository, http://cldr.unicode.org/).


7) An architecture has the following subsystems

7a) Basic character handling (21(32)-bit characters in indexable
strings, point 2)
7b) Runtime access to the Unicode Character Database (point 3)
7c) Converters
7d) Normalization (point 4)
7e) CLDR access (point 6)


8) The implementation should be driven by the current needs.

An attainable next goal is to release

8a) a StringBuilder utility class for easier construction of test
strings, i.e. instead of

> normalizer composeString: (#(68 117 776 115 115 101 108 100 111 114 102 32
> 75 111 776 110 105 103 115 97 108 108 101 101) collect: #asCharacter as:
> String).

write

  normalizer composeString:
      (StringBuilder construct: 'Du\u0308sseldorf Ko\u0308nigsallee')

and construct some test cases with it which illustrate some basic
Unicode issues.
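
A minimal sketch of such a (hypothetical) class-side constructor: it
copies ordinary characters and replaces every \uXXXX escape (exactly
four hex digits) by the character with that code point.

  StringBuilder class >> construct: aString
      "Answer a copy of aString with every \uXXXX escape replaced by
       the character with that hexadecimal code point."
      ^ String streamContents: [ :out |
          | in char |
          in := aString readStream.
          [ in atEnd ] whileFalse: [
              char := in next.
              (char = $\ and: [ in peek = $u ])
                  ifTrue: [
                      in next. "consume the u"
                      out nextPut:
                          (Integer readFrom: (in next: 4) base: 16) asCharacter ]
                  ifFalse: [ out nextPut: char ] ] ]

With that in place, (StringBuilder construct: 'caf\u00E9') = ('caf',
16rE9 asCharacter asString) answers true.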

8b) identity testing for major languages (e.g. French, German,
Spanish) and scripts of similar complexity.
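
Such a test could take roughly the following shape (hypothetical test
class; normalizer is the prototype's normalizer, StringBuilder the
utility sketched under 8a):

  UnicodeEqualityTest >> testFrenchEAcute
      | composed decomposed |
      composed := StringBuilder construct: 'Voulez-vous un caf\u00E9?'.
      decomposed := StringBuilder construct: 'Voulez-vous un cafe\u0301?'.
      self deny: composed = decomposed.  "plain = compares code points only"
      self
          assert: (normalizer decomposeString: composed)
          equals: (normalizer decomposeString: decomposed)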

8c) some more documentation of past and ongoing efforts.

Note: This summary only covers string manipulation, not rendering on
the screen, which is a different issue.


On 12/16/15, Sven Van Caekenberghe <sven at stfx.eu> wrote:
> Hi Hannes,
>
> My detailed comments/answers below, after quoting 2 of your emails:
>
>> On 10 Dec 2015, at 22:17, H. Hirzel <hannes.hirzel at gmail.com> wrote:
>>
>> Hello Sven
>>
>> On 12/9/15, Sven Van Caekenberghe <sven at stfx.eu> wrote:
>>
>>> The simplest example in a common language is (the French letter é) is
>>>
>>> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>>>
>>> which can also be written as
>>>
>>> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT
>>> [U+0301]
>>>
>>> The former being a composed normal form, the latter a decomposed normal
>>> form. (And yes, it is even much more complicated than that, it goes on
>>> for
>>> 1000s of pages).
>>>
>>> In the above example, the concept of character/string is indeed fuzzy.
>>>
>>> HTH,
>>>
>>> Sven
>>
>> Thanks for this example. I have created a wiki page with it
>>
>> I wonder what the Pharo equivalent is of the following Squeak expression
>>
>>    $é asString asDecomposedUnicode
>>
>> Regards
>>
>> Hannes
>
> You also wrote:
>
>> The text below shows how to deal with the Unicode e acute example
>> brought up by Sven in terms of comparing strings. Currently Pharo and
>> Cuis do not do normalization of strings. Limited support exists in Squeak.
>> It will be shown how NFD normalization may be implemented.
>>
>>
>> Swift programming language
>> -----------------------------------------
>>
>> How does the Swift programming language [1] deal with Unicode strings?
>>
>> // "Voulez-vous un café?" using LATIN SMALL LETTER E WITH ACUTE
>>    let eAcuteQuestion = "Voulez-vous un caf\u{E9}?"
>>
>>    // "Voulez-vous un cafe&#769;?" using LATIN SMALL LETTER E and
>> COMBINING ACUTE ACCENT
>>    let combinedEAcuteQuestion = "Voulez-vous un caf\u{65}\u{301}?"
>>
>>    if eAcuteQuestion == combinedEAcuteQuestion {
>>    print("These two strings are considered equal")
>>    }
>>    // prints "These two strings are considered equal"
>>
>> The equality operator uses the NFD (Normalization Form Decomposed) [2]
>> form for the comparison, applying the method
>> #decomposedStringWithCanonicalMapping [3].
>>
>>
>> Squeak / Pharo
>> -----------------------
>>
>> Comparison without NFD [3]
>>
>>
>> "Voulez-vous un café?"
>> eAcuteQuestion  := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter
>> asString, '?'.
>>
>>
>> eAcuteQuestion = combinedEAcuteQuestion
>> false
>>
>> eAcuteQuestion == combinedEAcuteQuestion
>> false
>>
>> The result is false. A Unicode-conformant application, however, should
>> return *true*.
>>
>> The reason is that Squeak / Pharo strings are not put into NFD
>> before testing for equality with =.
>>
>>
>> Squeak Unicode strings may be tested for Unicode conformant equality
>> by converting them to NFD before testing.
>>
>>
>>
>> Squeak using NFD
>>
>> asDecomposedUnicode [4] transforms a string into NFD for cases where a
>> Unicode code point, if decomposed, decomposes into at most two code
>> points [5]. This limitation is imposed by the code which reads
>> UnicodeData.txt [7][8] when initializing [6] the Unicode Character
>> Database in Squeak. It is not a necessary limitation; the code may be
>> rewritten at the price of a more complex implementation of
>> #asDecomposedUnicode.
>>
>> "Voulez-vous un café?"
>> eAcuteQuestion  := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter
>> asString, '?'.
>>
>>
>> eAcuteQuestion asDecomposedUnicode =
>>    combinedEAcuteQuestion  asDecomposedUnicode
>>
>> true
>>
>>
>>
>> Conclusion
>> ------------------
>>
>> Implementing a method like #decomposedStringWithCanonicalMapping
>> (Swift) which puts a string into NFD (Normalization Form D) is an
>> important building block towards better Unicode compliance. A Squeak
>> proposal is given by [4]. It needs to be reviewed and extended.
>>
>> It should probably be extended for cases where there are more than
>> two code points in the decomposed form (3 or more?).
>>
>> Implementing NFD comparison gives us an equality test for a
>> comparatively small effort for simple cases, covering a large number of
>> use cases (languages using the Latin script).
>>
>> The algorithm is table driven by the UCD [8]. From this follows a
>> simple but important fact: conformant implementations need runtime
>> access to information from the Unicode Character Database (UCD) [9].
>>
>>
>> [1]
>> https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html#//apple_ref/doc/uid/TP40014097-CH7-ID285
>> [2] http://www.unicode.org/glossary/#normalization_form_d
>> [3]
>> https://developer.apple.com/library/ios/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/index.html#//apple_ref/occ/instm/NSString/decomposedStringWithCanonicalMapping
>> [4] String asDecomposedUnicode http://wiki.squeak.org/squeak/6250
>> [5] http://www.unicode.org/glossary/#code_point
>> [6] Unicode initialize http://wiki.squeak.org/squeak/6248
>> [7] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>> [8] Unicode Character Database documentation
>> http://unicode.org/reports/tr44/
>> [9] http://www.unicode.org/reports/tr23/
>
>
> Today, we have a Unicode and CombinedCharacter class in Pharo, and there is
> different but similar Unicode code in Squeak. These are too simple (even
> though they might work, partially).
>
> The scope of the original threads is way too wide: a new string type,
> normalisation, collation, being cross dialect, mixing all kinds of character
> and encoding definitions. All interesting, but not much will come out of it.
> But the point that we cannot leave proper text string handling to an outside
> library is indeed key.
>
> That is why a couple of people in the Pharo community (myself included)
> started an experimental, proof of concept, prototype project, that aims to
> improve Unicode support. We will announce it to a wider public when we feel
> we have something to show for. The goal is in the first place to understand
> and implement the fundamental algorithms, starting with the 4 forms of
> Normalisation. But we're working on collation/sorting too.
>
> This work is of course being done for/in Pharo, using some of the facilities
> only available there. It probably won't be difficult to port, but we can't
> be bothered with portability right now.
>
> What we started with is loading UCD data and making it available as nice
> objects (30.000 of them).
>
> So now you can do things like
>
> $é unicodeCharacterData.
>
>  => "U+00E9 LATIN SMALL LETTER E WITH ACUTE (LATIN SMALL LETTER E ACUTE)"
>
> $é unicodeCharacterData uppercase asCharacter.
>
>  => "$É"
>
> $é unicodeCharacterData decompositionMapping.
>
>  => "#(101 769)"
>
> There is also a cool GT Inspector view:
>
>
>
> Next we started implementing a normaliser. It was rather easy to get support
> for simpler languages going. The next code snippets use explicit code
> arrays, because copying decomposed diacritics to my mail client does not
> work (they get automatically composed), in a Pharo Workspace this does work
> nicely with plain strings. The higher numbers are the diacritics.
>
> (normalizer decomposeString: 'les élèves Français') collect: #codePoint as:
> Array.
>
>  => "#(108 101 115 32 101 769 108 101 768 118 101 115 32 70 114 97 110 99
> 807 97 105 115)"
>
> (normalizer decomposeString: 'Düsseldorf Königsallee') collect: #codePoint
> as: Array.
>
>  => "#(68 117 776 115 115 101 108 100 111 114 102 32 75 111 776 110 105 103
> 115 97 108 108 101 101)"
>
> normalizer composeString: (#(108 101 115 32 101 769 108 101 768 118 101 115
> 32 70 114 97 110 99 807 97 105 115) collect: #asCharacter as: String).
>
>  => "'les élèves Français'"
>
> normalizer composeString: (#(68 117 776 115 115 101 108 100 111 114 102 32
> 75 111 776 110 105 103 115 97 108 108 101 101) collect: #asCharacter as:
> String).
>
>  => "'Düsseldorf Königsallee'"
>
> However, the real algorithm following the official specification (and other
> elements of Unicode that interact with it) is way more complicated (think
> about all those special languages/scripts out there). We're focused on
> understanding/implementing that now.
>
> Next, unit tests were added (of course), as well as a test that uses
> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt to run about
> 75.000 individual test cases to check conformance to the official Unicode
> Normalization specification.
>
> Right now (with super cool hangul / jamo code by Henrik), we hit the
> following stats:
>
> #testNFC 16998/18593 (91.42%)
> #testNFD 16797/18593 (90.34%)
> #testNFKC 13321/18593 (71.65%)
> #testNFKD 16564/18593 (89.09%)
>
> Way better than the naive implementations, but not yet there.
>
> We are also experimenting and thinking a lot about how to best implement all
> this, trying out different models/ideas/apis/representations.
>
> It will move slowly, but you will hear from us again in the coming
> weeks/months.
>
> Sven
>
> PS: Pharo developers with a good understanding of this subject area that
> want to help, let me know and we'll put you in the loop. Hacking and
> specification reading are required ;-)
>
>

