UTF8 Squeak

Sun Jun 10 10:55:20 UTC 2007

Janko Mivšek wrote:
> I would propose a hybrid solution: three subclasses of String:
> 
> 1. ByteString for ASCII (native English speakers)
> 2. TwoByteString for most other languages
> 3. FourByteString (WideString) for Japanese/Chinese/and others

Let me be more exact about that proposal.

This concerns the internal representation only; for interfacing with the 
external world we need to convert to/from (at least) UTF-8.

1. ByteString for ASCII (English) and ISO-8859-1 (Western Europe).
    ByteString is therefore always regarded as encoded in ISO-8859-1,
    whose 256 codes are the same as the first 256 Unicode code points,
    i.e. Basic Latin plus Latin-1 Supplement (1).

2. TwoByteString for East European Latin, Greek, Cyrillic and many more
    (the so-called Basic Multilingual Plane (2)). The encoding of that
    string would correspond to UCS-2, even though it is considered
    obsolete (3).

3. FourByteString for characters outside the Basic Multilingual Plane,
    such as rarer Chinese/Japanese/Korean ideographs and some others.
    The encoding of that string would therefore correspond to
    UCS-4/UTF-32 (4).
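
To illustrate how the narrowest of the three classes could be picked for a
given text, here is a minimal sketch in Python (not Squeak code). The
function names narrowest_width and class_name_for are hypothetical and only
serve the illustration; the thresholds follow from the three encodings
listed above.

```python
def narrowest_width(text: str) -> int:
    """Return the bytes per character (1, 2 or 4) needed to store
    the text at a fixed width."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:        # fits ISO-8859-1
        return 1
    if highest <= 0xFFFF:      # fits the Basic Multilingual Plane (UCS-2)
        return 2
    return 4                   # needs UCS-4/UTF-32


def class_name_for(text: str) -> str:
    """Map the width to the (proposed) String subclass name."""
    names = {1: "ByteString", 2: "TwoByteString", 4: "FourByteString"}
    return names[narrowest_width(text)]
```

For example, class_name_for("Janko") gives "ByteString", while
class_name_for("Mivšek") gives "TwoByteString" because š is U+0161.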

I think that this way we can achieve space-efficient yet fast support for 
all languages of the world. Because of their fixed width, those strings 
are also easy to manipulate, in contrast to variable-length UTF-8 ones.
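
The difference in manipulation can be sketched as follows (Python, with
hypothetical helper names char_at_fixed, char_at_utf8 and utf8_len; valid
input is assumed and not checked). With a fixed width, character i starts
at byte i * width, while in UTF-8 all earlier characters have to be skipped
first.

```python
def char_at_fixed(buf: bytes, index: int, width: int) -> str:
    """Constant time: the character starts at byte index * width.
    Big-endian code units are assumed for this sketch."""
    start = index * width
    return chr(int.from_bytes(buf[start:start + width], "big"))


def utf8_len(lead: int) -> int:
    """Length of a UTF-8 sequence, derived from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    return 4


def char_at_utf8(buf: bytes, index: int) -> str:
    """Linear time: walk over all preceding characters first."""
    pos = 0
    for _ in range(index):
        pos += utf8_len(buf[pos])
    return buf[pos:pos + utf8_len(buf[pos])].decode("utf-8")
```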

Conversion to/from UTF-8 could probably also be simpler with the help of 
bit arithmetic algorithms, which would be tailored differently for each 
of the three proposed string subclasses above.
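
A rough sketch of such tailoring, in Python rather than Squeak and with
hypothetical function names: a ByteString never needs more than two UTF-8
bytes per character, so its encoder can be much shorter than the general
one. The general encoder below does not reject surrogate values; that check
is left out for brevity.

```python
def utf8_from_bytestring(data: bytes) -> bytes:
    """ISO-8859-1 input: one or two UTF-8 bytes per character."""
    out = bytearray()
    for cp in data:
        if cp < 0x80:
            out.append(cp)
        else:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)


def utf8_from_code_points(code_points) -> bytes:
    """General case: up to four UTF-8 bytes per code point.
    A two-byte string would never reach the last branch."""
    out = bytearray()
    for cp in code_points:
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes((0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)))
        elif cp < 0x10000:
            out += bytes((0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)))
        else:
            out += bytes((0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)))
    return bytes(out)
```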

(1) Wikipedia Unicode: Storage, transfer, and processing
     http://en.wikipedia.org/wiki/Unicode#Storage.2C_transfer.2C_and_processing
(2) Wikipedia Basic Multilingual Plane
     http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
(3) Wikipedia UTF-16/UCS-2
     http://en.wikipedia.org/wiki/UCS-2
(4) Wikipedia UTF-32/UCS-4
     http://en.wikipedia.org/wiki/UTF-32/UCS-4

Best regards
Janko

-- 
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si