[Pharo-dev] [squeak-dev] Re: Unicode Support

EuanM euanmee at gmail.com
Fri Dec 11 20:20:21 UTC 2015


Eliot - thank you for explaining to me why my original idea was bad.  :-)

I always assumed it would be.  Otherwise I'd've just built it the way
I proposed.

I'm thoroughly delighted to have more knowledgeable people contributing.

On 11 December 2015 at 09:29, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> Hi Euan,
>
>> On Dec 10, 2015, at 6:43 PM, EuanM <euanmee at gmail.com> wrote:
>>
>> I agree with all of that, Ben.
>>
>> I'm currently fairly certain that "fully-composed abstract character"
>> is a term that maps 1:1 onto the term "grapheme cluster" (i.e. one is
>> an older Unicode description of the newer Unicode term).
>>
>> And once we create these, I think this sort of implementation is
>> straightforward.  For particular values of "straightforward", of
>> course :-)
>>
>> i.e. the Swift approach is equivalent to the approach I originally
>> proposed and asked for critiques of.
>>
>> One thing I don't understand: why does the fact that a composed
>> abstract character (aka grapheme cluster) is a sequence mean that an
>> array cannot be used to hold the sequence?
>
> Of course an Array can be used, but one good reason to use bits organized as four-byte units is that the garbage collector spends no time scanning them, whereas, as far as it's concerned, the Array representation is all objects and must be scanned.  Another reason is that foreign code may find the bits representation compatible, so it can be passed through the FFI to other languages, whereas the Array of tagged characters will always require conversion.  Yet another reason is that in 64 bits the Array takes twice the space of the bits object.
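>
> For concreteness, a minimal sketch of the bits layout (the class name is illustrative; modulo protocol, this is essentially how WideString is already defined):
>
>     String variableWordSubclass: #CodePointString
>         instanceVariableNames: ''
>         classVariableNames: ''
>         poolDictionaries: ''
>         category: 'Unicode-Sketch'
>
>     CodePointString >> at: index
>         "Answer the Character whose 32-bit code point is stored at index."
>         ^ Character value: (self basicAt: index)
>
>     CodePointString >> at: index put: aCharacter
>         "Store the character's code point as a raw 32-bit word."
>         self basicAt: index put: aCharacter value.
>         ^ aCharacter
>
> Instances are pure bits, so the GC skips their contents and the words can go straight through the FFI as UTF-32.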
>
>> If people then also want a compatibility-codepoints-only UTF-8
>> representation, it is simple to provide comparable (i.e.
>> equivalence-testable) versions of any UTF-8 string - because we are
>> creating them from composed forms by a *single* defined method.
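>>
>> (As a sketch: if normalizeComposed names that single method (the
>> selector is hypothetical), then for any two strings a and b
>>
>>     (a normalizeComposed utf8Encoded) = (b normalizeComposed utf8Encoded)
>>
>> is a sound equivalence test on the UTF-8 views, since both sides are
>> derived from the same canonical composed form.  utf8Encoded is
>> Pharo's String-to-ByteArray conversion; a Squeak image would use
>> squeakToUtf8 instead.)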
>>
>> For my part, the reason I think we ought to implement it *in*
>> Smalltalk is ...  this is the String class of the new age.  I want
>> Smalltalk to handle Strings as native objects.
>
> There's little if any difference in convenience of use between an Array of characters and a bits array with the string at:/at:put: primitives, since both require at:/at:put: to access.  But the latter is (efficiently) type-checked by the VM, whereas there's nothing to prevent storing anything other than characters in the Array unless one introduces the overhead of explicit type checks in Smalltalk; and the Array starts life as a sequence of nils (invalid until every element is set to a character), whereas the bits representation begins fully initialized with 0 asCharacter.  So there's nothing more "natively objecty" about the Array.  Smalltalk objects hide their representation from clients, and externally the two behave the same, except for space and time.
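>
> For comparison, a sketch of what the Array-backed variant has to do for itself (names illustrative):
>
>     Object subclass: #CharacterArrayString
>         instanceVariableNames: 'chars'
>         classVariableNames: ''
>         poolDictionaries: ''
>         category: 'Unicode-Sketch'
>
>     CharacterArrayString class >> new: size
>         "Every slot starts out as nil, not as a Character."
>         ^ self basicNew setChars: (Array new: size)
>
>     CharacterArrayString >> setChars: anArray
>         chars := anArray
>
>     CharacterArrayString >> at: index
>         ^ chars at: index
>
>     CharacterArrayString >> at: index put: aCharacter
>         "The explicit check the VM performs for free on the bits representation."
>         aCharacter isCharacter ifFalse: [self error: 'Strings store only Characters'].
>         ^ chars at: index put: aCharacter
>
> Until every slot is set, at: answers nil rather than a Character, which is exactly the invalid intermediate state the bits representation never has.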
>
> Given that this is a dynamically-typed language there's nothing to prevent one providing both implementations beyond maintenance cost and complexity/confusion.  So at least it's easy to do performance comparisons between the two.   But I still think the bits representation is superior if what you want is a sequence of Characters.
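>
> E.g., with the CodePointString sketch above loaded, a first-order comparison fits in a workspace:
>
>     | n bits objs |
>     n := 1000000.
>     bits := CodePointString new: n.        "every slot is 0 asCharacter"
>     objs := Array new: n withAll: $a.
>     Transcript
>         show: (Time millisecondsToRun: [1 to: n do: [:i | bits at: i]]) printString; cr;
>         show: (Time millisecondsToRun: [1 to: n do: [:i | objs at: i]]) printString; cr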
>
>>> On 10 December 2015 at 23:41, Ben Coman <btc at openinworld.com> wrote:
>>> On Wed, Dec 9, 2015 at 5:35 PM, Guillermo Polito
>>> <guillermopolito at gmail.com> wrote:
>>>>
>>>>> On 8 dic 2015, at 10:07 p.m., EuanM <euanmee at gmail.com> wrote:
>>>>>
>>>>> "No. a codepoint is the numerical value assigned to a character. An
>>>>> "encoded character" is the way a codepoint is represented in bytes
>>>>> using a given encoding."
>>>>>
>>>>> No.
>>>>>
>>>>> A codepoint may represent a component part of an abstract character,
>>>>> or may represent an abstract character, or it may do both (but not
>>>>> always at the same time).
>>>>>
>>>>> Codepoints represent a single encoding of a single concept.
>>>>>
>>>>> Sometimes that concept represents a whole abstract character.
>>>>> Sometimes it represents part of an abstract character.
>>>>
>>>> Well. I do not agree with this. I agree with the quote.
>>>>
>>>> Can you explain a bit more about what you mean by abstract character and concept?
>>>
>>> This seems to be what Swift is doing, where Strings are composed
>>> not of codepoints but of graphemes.
>>>
>>>>>> "Every instance of Swift’s Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence** of one or more Unicode scalars that (when combined) produce a single human-readable character. [1]
>>>
>>> ** i.e. not an array
>>>
>>>>>> Here’s an example. The letter é can be represented as the single Unicode scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same letter can also be represented as a pair of scalars—a standard letter e (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT scalar (U+0301). The COMBINING ACUTE ACCENT scalar is graphically applied to the scalar that precedes it, turning an e into an é when it is rendered by a Unicode-aware text-rendering system. [1]
>>>
>>>>>> In both cases, the letter é is represented as a single Swift Character value that represents an extended grapheme cluster. In the first case, the cluster contains a single scalar; in the second case, it is a cluster of two scalars:" [1]
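>>>
>>> The same distinction shows up at code-point level in a workspace; without normalization, plain #= already tells the two spellings apart:
>>>
>>>     | composed decomposed |
>>>     composed := WideString with: (Character value: 16rE9).
>>>     decomposed := WideString with: (Character value: 16r65)
>>>                              with: (Character value: 16r0301).
>>>     composed = decomposed    "false: sizes 1 and 2, though both render as é"
>>>
>>> A grapheme-based String in the Swift style would normalize (or compare cluster by cluster) and answer true.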
>>>
>>>>>> Swiftʼs string implementation makes working with Unicode easier and significantly less error-prone. As a programmer, you still have to be aware of possible edge cases, but this probably cannot be avoided completely considering the characteristics of Unicode. [2]
>>>
>>> Indeed, I've searched for what problems it causes and got a null
>>> result.  So I read *all* *good* things about Swift's unicode
>>> implementation reducing common errors dealing with Unicode.  Can
>>> anyone point to complaints about Swift's unicode implementation?
>>> Maybe this...
>>>
>>>>>> An argument could be made that the implementation of String as a sequence that requires iterating over characters from the beginning of the string for many operations poses a significant performance problem but I do not think so. My guess is that Appleʼs engineers have considered the implications of their implementation and apps that do not deal with enormous amounts of text will be fine. Moreover, the idea that you could get away with an implementation that supports random access of characters is an illusion given the complexity of Unicode. [2]
>>>
>>> Considering our common pattern -- make it work, make it right, make
>>> it fast -- maybe Strings as arrays are a premature optimisation: one
>>> that was the right choice in the past, prior to Unicode, but that,
>>> considering Moore's Law versus programmer time, is not the best
>>> choice now.  Should we at least start with a UnicodeString and
>>> UnicodeCharacter that operate like Swift, and over time *maybe* move
>>> the tools to use them?
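>>>
>>> As a sketch of that shape (every name here is hypothetical, and the combining-mark test is grossly simplified; real segmentation follows the rules in UAX #29):
>>>
>>>     Object subclass: #UnicodeString
>>>         instanceVariableNames: 'clusters'
>>>         classVariableNames: ''
>>>         poolDictionaries: ''
>>>         category: 'Unicode-Sketch'
>>>
>>>     UnicodeString class >> fromCodePoints: aString
>>>         "Fold each combining mark into the preceding base character."
>>>         | groups |
>>>         groups := OrderedCollection new.
>>>         aString do: [:ch |
>>>             ((self isCombining: ch) and: [groups notEmpty])
>>>                 ifTrue: [groups addLast: groups removeLast , (WideString with: ch)]
>>>                 ifFalse: [groups addLast: (WideString with: ch)]].
>>>         ^ self basicNew setClusters: groups asArray
>>>
>>>     UnicodeString class >> isCombining: aCharacter
>>>         "Only the Combining Diacritical Marks block; real code checks the category."
>>>         ^ aCharacter value between: 16r0300 and: 16r036F
>>>
>>>     UnicodeString >> setClusters: anArray
>>>         clusters := anArray
>>>
>>>     UnicodeString >> size
>>>         "Answer the number of grapheme clusters, not of code points."
>>>         ^ clusters size
>>>
>>> With that, both the one-scalar and the two-scalar spellings of é come out with size 1, which is what Swift's character count does.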
>>>
>>> [1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
>>> [2] http://oleb.net/blog/2014/07/swift-strings/
>>>
>>> cheers -ben
>>>
>>>>
>>>>>
>>>>> This is the key difference between Unicode and most character encodings.
>>>>>
>>>>> A codepoint does not always represent a whole character.
>>>>>
>>>>> On 7 December 2015 at 13:06, Henrik Johansen
>
> _,,,^..^,,,_ (phone)

