[squeak-dev] how to create an UTF-8 character

Yoshiki Ohshima yoshiki at vpri.org
Wed Sep 24 09:26:43 UTC 2008


At Wed, 24 Sep 2008 10:49:18 +0200,
Damien Pollet wrote:
> 
> Is there a reason (other than history) why Strings are not collections
> of unicode characters (at least as viewed from outside) rather than
> bytes in some unknown encoding (which should be encapsulated and only
> appear when text goes in and out the image) ? Or is it already like
> that ?

  I think the answer is that it is already *like that*, although I
can't tell what you mean by "from outside".

  In the image, a ByteString or WideString is a sequence of characters
that hold Unicode code points.  (Note that a Unicode code point is
21-bit.)  If all the code points in a string fit within 8 bits, we use
ByteString; if they don't, we use WideString, but the distinction is
more or less hidden from a casual user.  Conversion is only needed
when the String is interfacing with the outside of the image.
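
  As a rough illustration of the rule above (in Python rather than
Smalltalk, and not Squeak's actual implementation), the choice between
the two representations can be sketched like this.  The function names
storage_class_for and export_as_utf8 are made up for the example; only
the class names ByteString and WideString come from Squeak.

```python
def storage_class_for(text: str) -> str:
    """Return the name of the narrowest representation that can hold text."""
    # Code points 0..255 fit in one byte each; anything larger needs a wide slot.
    if all(ord(ch) <= 0xFF for ch in text):
        return "ByteString"
    return "WideString"


def export_as_utf8(text: str) -> bytes:
    """Encode only at the boundary, when the string leaves the 'image'."""
    return text.encode("utf-8")
```

  Inside the program both kinds of string are just sequences of code
points; an encoding such as UTF-8 shows up only in export_as_utf8.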

  A Unicode code point doesn't really correspond to the concept of a
character, if you think of an accented character as a "character".  The
original concept of Unicode was that such a "character" should always
be represented as a sequence of code points: one base character, and
one or more accent marks.  It was at least pure and fair.

  But, they got the "Latin-1 compatibility" idea around 1990 in a
retrofitted way; so the original idea of "Let us make a universal
character set for everybody in the world" was turned to: "Let us make
a universal character set for everybody in the world, but let's treat
Westerners nicer."  But of course this turn created the situation where a
simple accented character has two (precomposed and decomposed)
representations.  Squeak is still way behind and prefers the
precomposed "normalization", but the normalization is really lax
there.
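
  To make the two representations concrete, here is a small sketch
using Python's standard unicodedata module (Python is used only for
illustration; this is not how Squeak handles it).  The helper name
same_text is an assumption of the example.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT, two code points


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the precomposed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

  A plain comparison of precomposed and decomposed reports them as
different, while same_text treats them as equal, which is why a lax
normalization policy can cause surprises.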

  To me, the Han unification is another piece of evidence for the
"Westerners first" idea.  If tracing back to the origin of characters
is the concept, i and j should perhaps be unified as well (just kidding).

  But, Unicode is the standard now, and it does solve a lot of
problems.  So using it as the base but putting necessary information
around it to support it is a good way in principle.

  If so, one could argue that we can just hold every string in
decomposed UTF-8 in the image, and have a couple of variants of at:
and at:put:.  The requirement of O(1) random access is not that
important.  I might go that direction if I redo it now.
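
  A minimal sketch of what such a string might look like, again in
Python and purely hypothetical: the class DecomposedUtf8String and its
methods byte_at and code_point_at stand in for the "couple of variants
of at:" and are not part of Squeak.  Indices are 1-based, as in
Smalltalk.

```python
import unicodedata


class DecomposedUtf8String:
    """Hypothetical string that stores NFD-normalized UTF-8 bytes."""

    def __init__(self, text: str):
        self._bytes = unicodedata.normalize("NFD", text).encode("utf-8")

    def byte_at(self, index: int) -> int:
        """O(1): raw byte at a 1-based index."""
        if index < 1:
            raise IndexError(index)
        return self._bytes[index - 1]

    def code_point_at(self, index: int) -> int:
        """O(n): code point at a 1-based index, found by scanning lead bytes."""
        if index < 1:
            raise IndexError(index)
        seen = 0
        for pos, byte in enumerate(self._bytes):
            # Continuation bytes look like 10xxxxxx; any other byte starts a code point.
            if byte & 0xC0 != 0x80:
                seen += 1
                if seen == index:
                    end = pos + 1
                    while end < len(self._bytes) and self._bytes[end] & 0xC0 == 0x80:
                        end += 1
                    return ord(self._bytes[pos:end].decode("utf-8"))
        raise IndexError(index)
```

  The byte-level accessor stays O(1), while the code-point accessor
gives up random access and scans from the start, which is the
trade-off mentioned above.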

-- Yoshiki


