[squeak-dev] Re: [Pharo-dev] Unicode Support

Ben Coman btc at openinworld.com
Mon Dec 7 15:04:24 UTC 2015


On Mon, Dec 7, 2015 at 10:48 PM, Henrik Johansen
<henrik.s.johansen at veloxit.no> wrote:
>
> On 07 Dec 2015, at 2:06 , Henrik Johansen <henrik.s.johansen at veloxit.no>
> wrote:
>
>
> codepoints represent "*encoded characters*"
>
>
> No. A codepoint is the numerical value assigned to a character. An "encoded
> character" is the way a codepoint is represented in bytes using a given
> encoding.
>
>
> You were right on this point, I see I remembered the terminology of this
> incorrectly.
> http://www.unicode.org/versions/Unicode8.0.0/ch02.pdf figure 2.8 does use
> "encoded characters" for the mapping of abstract characters to their
> equivalent codepoints (or codepoint sequences). What I thought it meant is better
> described as a codepoint's byte output using an "encoding scheme".
>
> An accurate description following that terminology would be that
> Pharo/Squeak Strings keep data in UTF32 encoding form, where 1 codepoint = 1
> code unit, dynamically switched between Latin1 (ByteStrings) and UTF32
> (WideStrings) encoding schemes as needed.
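
To make the quoted description concrete, here is a rough sketch in Python (not Smalltalk, and not how Pharo/Squeak actually implement it). The function name storage_for is hypothetical; it only illustrates choosing a one-byte-per-character Latin1 layout when every codepoint fits in a byte, and a four-byte-per-character UTF-32 layout otherwise.

```python
def storage_for(text):
    """Pick a storage layout the way the quoted paragraph describes.

    Hypothetical helper for illustration only: 'byte' stands in for a
    ByteString, 'wide' for a WideString.
    """
    if all(ord(c) < 256 for c in text):
        # Every codepoint fits in one byte: Latin1, 1 byte per character.
        return "byte", text.encode("latin-1")
    # At least one codepoint needs more room: UTF-32, 4 bytes per character.
    return "wide", text.encode("utf-32-le")


narrow_kind, narrow_data = storage_for("abc")
wide_kind, wide_data = storage_for("a\u03bb")  # 'a' followed by GREEK SMALL LETTER LAMDA

print(narrow_kind, len(narrow_data))  # byte 3
print(wide_kind, len(wide_data))      # wide 8
```

In both layouts one codepoint is still one code unit; only the width of the code unit changes.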

The implication from Joel's Unicode article (linked from my other
post) is that whatever encoding we use to store strings, the encoding
should not be implicit (i.e. by convention defined outside the image).
*Every* string needs to record its encoding.  Maybe we should follow
Swift [1] and have Characters comprised of multiple codepoints, and/or
a String be able to handle a sequence of differently encoded
Characters, i.e. String being a mixed sequence of UTF-8, UTF-16,
UTF-32 Characters.  I have no idea what that would do for efficiency,
but maybe let Moore's Law handle that.

[1] http://oleb.net/blog/2014/07/swift-strings/
>> A Swift Character represents one perceived character (what a person thinks of as a single character, called a grapheme). Since Unicode often uses two or more code points (called a grapheme cluster) to form one perceived character, this implies that a Character can be composed of multiple Unicode scalar values if they form a single grapheme cluster.
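
A small illustration of the point in that quote, sketched in Python rather than Swift. It uses "e" followed by U+0301 COMBINING ACUTE ACCENT, which a reader perceives as a single "é". Note that Python's len counts codepoints, not graphemes, so it behaves like our current Strings, not like Swift's Character.

```python
import unicodedata

# One perceived character written as two codepoints:
# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single codepoint U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0301']
print(len(decomposed))                          # 2
print(len(composed))                            # 1
print(decomposed == composed)                   # False, although both display as "é"
```

Normalization only helps where a precomposed codepoint exists; many grapheme clusters have no single-codepoint form, which is presumably why Swift models Character as a cluster.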

cheers -ben

> With the same terminology, the concepts that are important to distinguish
> are a code point, a code unit, and how an encoding scheme represents a
> codepoint as code units/bytes.
> Quite a mouthful though!
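
To show the three concepts side by side, here is a short Python sketch using U+1D11E MUSICAL SYMBOL G CLEF, a codepoint outside the Basic Multilingual Plane. The helper code_units is hypothetical and simply divides the encoded length by the code unit width.

```python
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF: one codepoint

# Code point: the number assigned to the character.
code_point = ord(ch)


def code_units(text, encoding, unit_width):
    """Count code units by dividing the encoded byte length by the unit width."""
    return len(text.encode(encoding)) // unit_width


# Encoding forms: how many code units represent this one codepoint.
utf8_units = code_units(ch, "utf-8", 1)       # 8-bit code units
utf16_units = code_units(ch, "utf-16-le", 2)  # 16-bit code units (a surrogate pair)
utf32_units = code_units(ch, "utf-32-le", 4)  # 32-bit code units

# Encoding schemes: how those code units are serialized to bytes,
# which is where byte order comes in.
utf16_be = ch.encode("utf-16-be")
utf16_le = ch.encode("utf-16-le")

print(f"U+{code_point:X}")                   # U+1D11E
print(utf8_units, utf16_units, utf32_units)  # 4 2 1
print(utf16_be.hex(), utf16_le.hex())        # d834dd1e 34d81edd
```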

