[squeak-dev] Re: [Pharo-dev] Unicode Support

Wed Dec 9 13:16:18 UTC 2015

"To encode Unicode for external representation as bytes, we use UTF-8
like the rest of the modern world.

So far, so good.

Why all the confusion ?"

The confusion arises because simply providing *a* valid UTF-8 encoding
of a string does not ensure sortability, nor equivalence testability.

It might provide sortable strings. It might not.

It might provide a string that can be compared to another string
successfully.  It might not.

So being able to perform valid UTF-8 encoding is *necessary*, but *not
sufficient*.

i.e. the confusion arises because UTF-8 can provide for several
competing encodings of even a single abstract character, and those
encodings neither compare equal nor sort together.  This means that
*valid* UTF-8 cannot be relied upon to provide these facilities
*unless* all the UTF-8 strings can be relied upon to have been encoded
to UTF-8 by the same specified process, i.e. *unless* each string has
been converted by *a specific* valid method of encoding to UTF-8.
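
A minimal sketch of the point above, assuming Python 3 and its standard
unicodedata module (neither appears in this thread; they are used here
only for illustration). It takes the two spellings of é quoted further
down, shows that both yield valid but different UTF-8, and then applies
one agreed conversion (NFC is an arbitrary choice here; NFD would serve
equally well) before encoding. The helper name to_nfc_utf8 is
hypothetical.

```python
import unicodedata

# Two code point sequences for the same abstract character "é".
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Both encode to valid UTF-8, but to different byte sequences.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")
bytes_equal = composed_bytes == decomposed_bytes

# Sorting by raw bytes puts the two spellings on opposite sides of "f".
words = ["f", composed, decomposed]
raw_sorted = sorted(w.encode("utf-8") for w in words)


def to_nfc_utf8(text):
    """Hypothetical helper: normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


# Once both strings go through the same process, they compare equal.
normalized_equal = to_nfc_utf8(composed) == to_nfc_utf8(decomposed)

print(bytes_equal, raw_sorted, normalized_equal)
```

Normalization only makes the byte strings comparable; a language-aware
sort order is a separate problem (collation) that this sketch does not
attempt.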

Understanding the concept of abstract character is, imo, key to
understanding the differences between the various valid UTF-8 forms of
a given abstract character.
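
To make the codepoint / abstract character distinction concrete, here
is a small sketch, again assuming Python 3 and its unicodedata module
rather than anything used in this thread. It lists the code points
behind each spelling of é; the helper name describe is hypothetical.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# One code point that represents a whole abstract character.
composed = describe("\u00e9")

# Two code points, each representing part of the same abstract character.
decomposed = describe("e\u0301")

for label, parts in (("composed", composed), ("decomposed", decomposed)):
    print(label, len(parts), parts)
```

The reader sees the same é in both cases, while the number of code
points differs.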

Cheers,
    Euan

On 9 December 2015 at 10:45, Sven Van Caekenberghe <sven at stfx.eu> wrote:
>
>> On 09 Dec 2015, at 10:35, Guillermo Polito <guillermopolito at gmail.com> wrote:
>>
>>
>>> On 8 dic 2015, at 10:07 p.m., EuanM <euanmee at gmail.com> wrote:
>>>
>>> "No. a codepoint is the numerical value assigned to a character. An
>>> "encoded character" is the way a codepoint is represented in bytes
>>> using a given encoding."
>>>
>>> No.
>>>
>>> A codepoint may represent a component part of an abstract character,
>>> or may represent an abstract character, or it may do both (but not
>>> always at the same time).
>>>
>>> Codepoints represent a single encoding of a single concept.
>>>
>>> Sometimes that concept represents a whole abstract character.
>>> Sometimes it represents part of an abstract character.
>>
>> Well. I do not agree with this. I agree with the quote.
>>
>> Can you explain a bit more about what you mean by abstract character and concept?
>
> I am pretty sure that this whole discussion does more harm than good for most people's understanding of Unicode.
>
> It is best and (mostly) correct to think of a Unicode string as a sequence of Unicode characters, each defined/identified by a code point (out of tens of thousands covering all languages). That is what we have today in Pharo (with the distinction between ByteString and WideString as mostly invisible implementation details).
>
> To encode Unicode for external representation as bytes, we use UTF-8 like the rest of the modern world.
>
> So far, so good.
>
> Why all the confusion ? Because the world is a complex place and the Unicode standard tries to cover all possible things. Citing all these exceptions and special cases will make people crazy and give up. I am sure that most stopped reading this thread.
>
> Why then is there confusion about the seemingly simple concept of a character ? Because Unicode allows different ways to say the same thing. The simplest example in a common language (the French letter é) is
>
> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>
> which can also be written as
>
> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301]
>
> The former being a composed normal form, the latter a decomposed normal form. (And yes, it is even much more complicated than that, it goes on for 1000s of pages).
>
> In the above example, the concept of character/string is indeed fuzzy.
>
> HTH,
>
> Sven
>
>