[squeak-dev] Re: [Pharo-dev] Unicode Support

EuanM euanmee at gmail.com
Fri Dec 11 02:43:49 UTC 2015


I agree with all of that, Ben.

I'm currently fairly certain that "fully-composed abstract character"
is a term that maps 1:1 to the term "grapheme cluster" (i.e. one is
an older Unicode description of what the newer Unicode term names).

And once we create these, I think this sort of implementation is
straightforward.  For particular values of "straightforward", of
course :-)

i.e. the Swift approach is equivalent to the approach I originally
proposed and asked for critiques of.

One thing I don't understand: why does the fact that the composed
abstract character (aka grapheme cluster) is a sequence mean that an
array cannot be used to hold the sequence?
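
To make the question concrete, here is a minimal sketch in Python (not
Smalltalk; the helper name `clusters` is hypothetical).  It only groups a
base codepoint with the combining marks that follow it, which is a
simplification of the full grapheme-cluster rules in UAX #29, but it
shows each cluster being a short sequence and the string being an
ordinary array of them:

```python
import unicodedata


def clusters(text):
    """Group each codepoint with the combining marks that follow it.

    Simplified assumption: a cluster is one base codepoint plus any
    following codepoints whose canonical combining class is non-zero.
    Real extended grapheme clusters (UAX #29) have more rules.
    """
    result = []
    for ch in text:
        if result and unicodedata.combining(ch) != 0:
            # Combining mark: attach it to the cluster before it.
            result[-1] += ch
        else:
            # Anything else starts a new cluster.
            result.append(ch)
    return result


# "e" + COMBINING ACUTE ACCENT, then "t", then precomposed U+00E9
sample = "e\u0301t\u00e9"
as_array = clusters(sample)

print(len(sample))    # number of codepoints
print(len(as_array))  # number of clusters
print([[hex(ord(c)) for c in cluster] for cluster in as_array])
```

The result is a plain list that can be indexed by cluster; building it
still needs one pass over the codepoints.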

If people then also want a compatibility-codepoints-only UTF-8
representation, it is simple to provide comparable (i.e.
equivalence-testable) versions of any UTF-8 string - because we are
creating them from composed forms by a *single* defined method.
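
A rough illustration of that idea, again in Python.  I am assuming here
that the "single defined method" is Unicode normalisation to NFC; that
choice and the helper name `canonically_equal` are mine, not something
settled in this thread:

```python
import unicodedata


def canonically_equal(a, b):
    """Compare two strings after normalising both to NFC.

    Assumption: NFC stands in for the single defined method used to
    derive the comparable form.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# The raw codepoint sequences differ...
print(composed == decomposed)
# ...but they compare equal once both go through the same normalisation.
print(canonically_equal(composed, decomposed))

# The UTF-8 bytes of the normalised forms are then identical as well.
utf8_a = unicodedata.normalize("NFC", composed).encode("utf-8")
utf8_b = unicodedata.normalize("NFC", decomposed).encode("utf-8")
print(utf8_a == utf8_b)
```

Because both strings pass through the same normalisation step before
encoding, equivalence testing reduces to a byte comparison.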

For my part, the reason I think we ought to implement it *in*
Smalltalk is ...  this is the String class of the new age.  I want
Smalltalk to be able to handle Strings as native objects.


On 10 December 2015 at 23:41, Ben Coman <btc at openinworld.com> wrote:
> On Wed, Dec 9, 2015 at 5:35 PM, Guillermo Polito
> <guillermopolito at gmail.com> wrote:
>>
>>> On 8 dic 2015, at 10:07 p.m., EuanM <euanmee at gmail.com> wrote:
>>>
>>> "No. a codepoint is the numerical value assigned to a character. An
>>> "encoded character" is the way a codepoint is represented in bytes
>>> using a given encoding."
>>>
>>> No.
>>>
>>> A codepoint may represent a component part of an abstract character,
>>> or may represent an abstract character, or it may do both (but not
>>> always at the same time).
>>>
>>> Codepoints represent a single encoding of a single concept.
>>>
>>> Sometimes that concept represents a whole abstract character.
>>> Sometimes it represents part of an abstract character.
>>
>> Well. I do not agree with this. I agree with the quote.
>>
>> Can you explain a bit more about what you mean by abstract character and concept?
>
> This seems to be what Swift is doing, where Strings are composed
> not of codepoints but of graphemes.
>
>>>> "Every instance of Swift’s Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence** of one or more Unicode scalars that (when combined) produce a single human-readable character. [1]
>
> ** i.e. not an array
>
>>>> Here’s an example. The letter é can be represented as the single Unicode scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same letter can also be represented as a pair of scalars—a standard letter e (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT scalar (U+0301). The COMBINING ACUTE ACCENT scalar is graphically applied to the scalar that precedes it, turning an e into an é when it is rendered by a Unicode-aware text-rendering system. [1]
>
>>>> In both cases, the letter é is represented as a single Swift Character value that represents an extended grapheme cluster. In the first case, the cluster contains a single scalar; in the second case, it is a cluster of two scalars:" [1]
>
>>>> Swiftʼs string implementation makes working with Unicode easier and significantly less error-prone. As a programmer, you still have to be aware of possible edge cases, but this probably cannot be avoided completely considering the characteristics of Unicode. [2]
>
> Indeed I've tried searching for what problems it causes and got a null
> result.  So I read  *all*good*  things about Swift's unicode
> implementation reducing common errors dealing with Unicode.  Can
> anyone point to complaints about Swift's unicode implementation?
> Maybe this...
>
>>>> An argument could be made that the implementation of String as a sequence that requires iterating over characters from the beginning of the string for many operations poses a significant performance problem but I do not think so. My guess is that Appleʼs engineers have considered the implications of their implementation and apps that do not deal with enormous amounts of text will be fine. Moreover, the idea that you could get away with an implementation that supports random access of characters is an illusion given the complexity of Unicode. [2]
>
> Considering our common pattern: Make it work, Make it right, Make it
> fast  -- maybe Strings as arrays are a premature optimisation, that
> was the right choice in the past prior to Unicode, but considering
> Moore's Law versus programmer time, is not the best choice now.
> Should we at least start with a UnicodeString and UnicodeCharacter
> that operates like Swift, and over time *maybe* move the tools to use
> them?
>
> [1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
> [2] http://oleb.net/blog/2014/07/swift-strings/
>
> cheers -ben
>
>>
>>>
>>> This is the key difference between Unicode and most character encodings.
>>>
>>> A codepoint does not always represent a whole character.
>>>
>>> On 7 December 2015 at 13:06, Henrik Johansen
>

