UTF8 Squeak

Sun Jun 10 21:14:13 UTC 2007

Hi Janko -

Just as a comment from the sidelines: I think that concentrating on the
size of the character in an encoding is a mistake. It is really the
encoding that matters, and if it weren't impractical I would rename
ByteString to Latin1String and WideString to UTF32String or so.

This makes it much clearer that we are interested more in the encoding
than in the number of bytes per character (although of course some
encodings imply a character size), and this "encoding-driven" view of
strings makes it perfectly natural to think of a UTF8String, which has a
variable-sized encoding and can live in perfect harmony with the other
"byte encoded strings".

In your case, I would rather suggest having a class UTF16String instead
of TwoByteString. A good starting point (if you are planning to spend
any time on this) would be to create a class EncodedString which
captures the basics of conversions between differently encoded strings,
and to start defining a few (trivial) subclasses like those mentioned
above. From there, you could extend this to UTF-8, UTF-16 and whatever
other encoding you need.
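
To make the EncodedString idea concrete, here is a minimal sketch in
Python rather than Squeak, purely for illustration. Only the class names
come from the message above; the method names (from_code_points,
code_points, convert_to) and the use of Python's codecs as the conversion
engine are assumptions, not an existing Squeak API. Each subclass only
names its encoding, and conversion goes through a common sequence of
Unicode code points.

```python
class EncodedString:
    """Hypothetical base class: raw bytes plus the encoding that interprets them."""

    encoding = None  # Python codec name, set by each subclass

    def __init__(self, data: bytes):
        self.data = bytes(data)

    @classmethod
    def from_code_points(cls, code_points):
        # Build the string in this class's encoding from Unicode code points.
        text = "".join(chr(cp) for cp in code_points)
        return cls(text.encode(cls.encoding))

    def code_points(self):
        # Decode to the encoding-independent view: a list of code points.
        return [ord(ch) for ch in self.data.decode(self.encoding)]

    def convert_to(self, target_class):
        # Conversion between any two encodings goes through code points.
        return target_class.from_code_points(self.code_points())

    def __len__(self):
        # Length in characters, not in bytes.
        return len(self.code_points())


# The "trivial" subclasses: each one only names its encoding.
class Latin1String(EncodedString):
    encoding = "latin-1"      # fixed width, 1 byte per character


class UTF8String(EncodedString):
    encoding = "utf-8"        # variable width, 1 to 4 bytes per character


class UTF16String(EncodedString):
    encoding = "utf-16-be"    # variable width, 2 or 4 bytes per character


class UTF32String(EncodedString):
    encoding = "utf-32-be"    # fixed width, 4 bytes per character
```

With this sketch, a six-character name containing U+0161 occupies 7 bytes
as a UTF8String, 12 as a UTF16String and 24 as a UTF32String, while
len() reports 6 in every case. Converting it to Latin1String raises
UnicodeEncodeError, because U+0161 has no Latin-1 representation.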

Cheers,
   - Andreas

Janko Mivšek wrote:
> Janko Mivšek wrote:
>> I would propose a hybrid solution: three subclasses of String:
>>
>> 1. ByteString for ASCII (native English speakers)
>> 2. TwoByteString for most other languages
>> 3. FourByteString (WideString) for Japanese/Chinese and others
> 
> Let me be more exact about that proposal:
> 
> This is for internal representation only; for interfacing with the
> external world we need to convert to/from (at least) UTF-8.
> 
> 1. ByteString for ASCII (English) and ISO-8859-1 (West Europe).
>    ByteString is therefore always regarded as encoded in ISO-8859-1,
>    whose 256 values coincide with the first 256 Unicode code points (1).
> 
> 2. TwoByteString for East European Latin, Greek, Cyrillic and many more
>    (the so-called Basic Multilingual Plane (2)). The encoding of that
>    string would correspond to UCS-2, even though it is considered
>    obsolete (3).
> 
> 3. FourByteString for Chinese/Japanese/Korean and some others. Encoding
>    of that string would therefore correspond to UCS-4/UTF-32 (4)
> 
> 
> I think that this way we can achieve the most space-efficient yet fast
> support for all languages of the world. Because of their fixed length,
> those strings are also easy to manipulate, unlike variable-length
> UTF-8 ones.
> 
> Conversion to/from UTF-8 could probably also be made simpler with the
> help of bit arithmetic algorithms, tailored differently for each of
> the three proposed string subclasses above.
> 
> 
> (1) Wikipedia Unicode: Storage, transfer, and processing
>     http://en.wikipedia.org/wiki/Unicode#Storage.2C_transfer.2C_and_processing
> (2) Wikipedia Basic Multilingual Plane
>     http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
> (3) Wikipedia UTF-16/UCS-2:
>     http://en.wikipedia.org/wiki/UCS-2
> (4) Wikipedia UTF-32/UCS-4
>     http://en.wikipedia.org/wiki/UTF-32/UCS-4
> 
> Best regards
> Janko
>
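
The quoted three-tier proposal and its "bit arithmetic" conversion to
UTF-8 can be sketched as follows. This is an illustration in Python, not
Squeak code; the function names narrowest_width and utf8_encode are
hypothetical. The first function picks the smallest fixed character
width that can hold every code point of a string, and the second encodes
code points to UTF-8 using only shifts and masks.

```python
def narrowest_width(code_points):
    """Bytes per character (1, 2 or 4) needed by a fixed-width representation."""
    highest = max(code_points, default=0)
    if highest <= 0xFF:
        return 1   # ByteString: ISO-8859-1 range
    if highest <= 0xFFFF:
        return 2   # TwoByteString: Basic Multilingual Plane (UCS-2)
    return 4       # FourByteString / WideString: UCS-4 / UTF-32


def utf8_encode(code_points):
    """Encode Unicode code points to UTF-8 bytes with shifts and masks."""
    out = bytearray()
    for cp in code_points:
        if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: %r" % (cp,))
        if cp < 0x80:                       # 1 byte:  0xxxxxxx
            out.append(cp)
        elif cp < 0x800:                    # 2 bytes: 110xxxxx 10xxxxxx
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:                  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:                               # 4 bytes: 11110xxx + three 10xxxxxx
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)
```

The width of the source string bounds which branches can be taken: a
one-byte string never needs more than the first two, and a two-byte
string never reaches the last, which is where per-class tailoring could
come in. Note also that the most commonly used CJK ideographs lie inside
the Basic Multilingual Plane, so narrowest_width([0x65E5, 0x672C])
returns 2; only supplementary-plane characters such as U+20000 force the
four-byte tier.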