UTF8 Squeak

Masashi UMEZAWA masashi.umezawa at gmail.com
Wed Jun 13 01:26:42 UTC 2007

Hi Janko,

> >> 1. internally everything is in 16bit Unicode, without any additional
> >>     encoding info attached to strings
> >
> >   If they use 16-bit per char, how do they deal with surrogate pairs?
> I looked once again and there is actually a FourByteString too. This
> probably answers your question. VW also supports the Japanese locale well.

Just a correction: VW does not support "surrogate pairs" well. A
Character whose value is greater than 65535 can easily crash the
image. Here is a quote from the Character class comment:

For character codes between 0 and 65535 (16rFFFF), the Unicode
Character Code Standard is used.  Characters with codes between 0 and
255 also coincide with the ISO 8859-1 standard. At present, mappings
for Characters greater than 65535 are undefined, and such characters
are not fully supported. In time, these will probably be defined to
conform to the ISO 10646 superset of Unicode.
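
For readers unfamiliar with the term, here is a minimal sketch of what a surrogate pair is, written in Python rather than Smalltalk so it can be run anywhere. The function name is my own and has nothing to do with VW's classes. It only shows the standard UTF-16 arithmetic for a code point above 65535 (16rFFFF).

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points in 0x10000..0x10FFFF use a surrogate pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


# U+20B9F is a kanji outside the 16-bit range.
high, low = to_surrogate_pair(0x20B9F)
print(hex(high), hex(low))
```

A string class that stores exactly one 16-bit value per character has no slot for such a character unless it treats the two halves as a unit, which is the part that is hard to get right.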

In VW, a Japanese string is represented as a TwoByteString, so it
cannot handle the Japanese characters whose codes are above 65535.
In practice, though, this is enough in most cases, and it also helps
reduce memory consumption.
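
To put rough numbers on the memory trade-off, here is a small Python sketch. The two function names are hypothetical, and this is only an illustration of fixed-width storage in general, not how VW actually chooses between its string classes.

```python
def bytes_per_char(text):
    """Smallest fixed width (1, 2 or 4 bytes) that can hold every code point in text."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1   # fits an ISO 8859-1 style byte string
    if highest <= 0xFFFF:
        return 2   # fits a 16-bit-per-character string
    return 4       # needs a 32-bit-per-character string


def fixed_width_size(text):
    """Payload size in bytes if every character uses the same width."""
    return len(text) * bytes_per_char(text)


print(fixed_width_size("\u65e5\u672c\u8a9e"))      # three common kanji
print(fixed_width_size("\U00020B9F\u308b"))        # one kanji above 0xFFFF plus one kana
```

The point is that a single character above 65535 doubles the per-character cost of the whole string in a fixed-width scheme, while most everyday Japanese text stays within two bytes per character.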

[:masashi | ^umezawa]
