[squeak-dev] Re: [Pharo-dev] Unicode Support

EuanM euanmee at gmail.com
Mon Dec 7 12:05:12 UTC 2015


Hi Henry,

To be honest, at some point I'm going to long for the much more
succinct semantics of healthcare systems and sports scoring and
administration systems again.  :-)

Codepoints are any one of:
  - the representation of a component of an abstract character
    (e.g. "A" #(0041) as a component of Å), *or*
  - the sole representation of the whole of an abstract character, *or*
  - a representation of an abstract character provided for backwards
    compatibility, which is more properly represented by a series of
    codepoints representing a composed character

e.g.

The "A" #(0041) as a codepoint can be:
  - the sole representation of the whole of an abstract character
    "A" #(0041)
  - the representation of a component of the composed (i.e. preferred)
    version of the abstract character Å #(0041 030a)

Å #(00C5) represents one valid compatibility form of the abstract
character Å which is most properly represented by #(0041 030a).

Å #(212b) also represents one valid compatibility form of the abstract
character Å which is most properly represented by #(0041 030a).
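
As a rough illustration of the three Å representations, here is a
minimal sketch using Python's standard unicodedata module (Python is
used purely for convenience and is not part of this thread; the helper
name codepoints is hypothetical). Note that the Unicode standard itself
classes these three as canonically equivalent: NFD normalization yields
the two-codepoint sequence for all of them, while NFC yields the single
codepoint 00C5.

```python
import unicodedata

def codepoints(text):
    """Return the codepoints of text as uppercase hex strings."""
    return [f"{ord(ch):04X}" for ch in text]

composed_sequence = "\u0041\u030A"  # A + COMBINING RING ABOVE
precomposed = "\u00C5"              # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom_sign = "\u212B"            # ANGSTROM SIGN

forms = [composed_sequence, precomposed, angstrom_sign]

# Decompose every form fully (NFD), then recompose every form (NFC).
nfd = [codepoints(unicodedata.normalize("NFD", s)) for s in forms]
nfc = [codepoints(unicodedata.normalize("NFC", s)) for s in forms]

print(nfd)  # each entry is ['0041', '030A']
print(nfc)  # each entry is ['00C5']
```

Whichever form is treated as "most proper", comparing strings only after
normalizing both sides to the same form avoids treating these three as
different characters.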

With any luck, this satisfies both our semantic understandings of the
concept of "codepoint".

Would you agree with that?

In Unicode, codepoints are *NOT* an abstract numerical representation
of a text character.

At least not as we generally understand the term "text character" from
our experience of non-Unicode character mappings.

codepoints represent "*encoded characters*" and "a *text element* ...
is represented by a sequence of one or more codepoints".  (And the
term "text element" is deliberately left undefined in the Unicode
standard)

Individual codepoints are very often *not* the encoded form of an
abstract character that we are interested in, unless we are
communicating to or from another system (which in some cases is the
Smalltalk ByteString class).

In other words:

*Some* individual codepoints *may* be a representation of a specific
*abstract character*, but only in special cases.

The general case in Unicode is that Unicode defines (a)
representation(s) of a Unicode *abstract character*.

The Unicode standard representation of an abstract character is a
composed sequence of codepoints, where in some cases that sequence is
as short as 1 codepoint.

In other cases, Unicode has a compatibility alias of a single
codepoint which is *also* a representation of an abstract character.

There are some cases where an abstract character can be represented by
more than one single-codepoint compatibility alias.
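
To make the difference between a sequence of codepoints and its encoded
bytes concrete, here is a small sketch in Python (chosen only for
convenience; the variable names are illustrative). It takes the
two-codepoint and single-codepoint representations of Å and shows how
many codepoints each has, and what bytes each produces under the UTF-8
and UTF-16 (big-endian) encoding forms.

```python
sequence = "\u0041\u030A"  # A followed by COMBINING RING ABOVE
single = "\u00C5"          # LATIN CAPITAL LETTER A WITH RING ABOVE

# Python's len() on a str counts codepoints, not bytes.
code_point_counts = (len(sequence), len(single))

# The same codepoints serialized by two different encoding forms.
utf8 = (sequence.encode("utf-8"), single.encode("utf-8"))
utf16 = (sequence.encode("utf-16-be"), single.encode("utf-16-be"))

print(code_point_counts)               # (2, 1)
print([b.hex(" ") for b in utf8])      # ['41 cc 8a', 'c3 85']
print([b.hex(" ") for b in utf16])     # ['00 41 03 0a', '00 c5']
```

The same abstract character therefore has different lengths depending on
whether one counts codepoints or encoded bytes, and on which
representation and encoding form were chosen.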

Cheers,
  Euan

On 7 December 2015 at 11:11, Henrik Johansen
<henrik.s.johansen at veloxit.no> wrote:
>
>> On 07 Dec 2015, at 11:51 , EuanM <euanmee at gmail.com> wrote:
>>
>> And indeed, in principle.
>>
>> On 7 December 2015 at 10:51, EuanM <euanmee at gmail.com> wrote:
>>> Verifying assumptions is the key reason why you should put documents
>>> like this out for review.
>>>
>>> Sven -
>>>
>>> I'm confident I understand the use of UTF-8 in principal.
>
> I can only second Sven's sentiment that you need to better differentiate code points (an abstract numerical representation of a character, where a set of such mappings
> define a charset, such as Unicode), and character encoding forms (which are how code points are represented in bytes by a defined process such as UTF-8, UTF-16 etc.).
>
> I know you'll probably think I'm arguing semantics again, but these are *important* semantics ;)
>
> Cheers,
> Henry

