UTF8 Squeak

Mon Jun 11 18:27:46 UTC 2007

  Janko,

> It seems that this was already a Yoshiki idea with WideString, so I'm 
> just extending that idea with a TwoByteString to cover 16 bits too.
> 
> Yoshiki, am I right?

  For storing the bare Unicode code points, I think so.  I'm not
convinced that adding a 16-bit variation solves any real problems.  But
there may be something.

  My first few questions are:

  - While the vast majority of strings in, say, Japanese can be
    represented with characters in the BMP, you would use
    FourByteString for Chinese/Japanese/Korean and some others.  Does
    this mean that you would *always* use FourByteString for these
    "languages" (and not scripts)?

  - Suppose you would like to use different line wrapping algorithms
    for different languages.  How would you keep that information?
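
  To make the first question concrete, here is a minimal sketch in
Python (not Squeak) of picking the narrowest fixed-width representation
from the code points alone.  The function name bytes_per_char is
hypothetical, and the thresholds assume that a one-byte string holds
code points up to U+00FF, a two-byte string holds the BMP, and a
four-byte string holds everything else.

```python
def bytes_per_char(text: str) -> int:
    """Return the narrowest fixed width (1, 2, or 4 bytes) that can
    hold every code point in text.  Hypothetical helper for illustration."""
    highest = max((ord(ch) for ch in text), default=0)
    if highest <= 0xFF:
        return 1   # fits a one-byte string
    if highest <= 0xFFFF:
        return 2   # fits entirely in the BMP
    return 4       # at least one code point beyond the BMP


ascii_text = "Squeak"
bmp_text = "\u65e5\u672c\u8a9e"               # Japanese text, all inside the BMP
supplementary_text = bmp_text + "\U00020B9F"  # plus one CJK Extension B ideograph

for sample in (ascii_text, bmp_text, supplementary_text):
    print(len(sample), "code points ->", bytes_per_char(sample), "bytes each")
```

  A width chosen this way depends only on the code points, so the same
Japanese text could land in either a two-byte or a four-byte string,
and neither choice records which language the text is in.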

-- Yoshiki