UTF8 Squeak

Janko Mivšek janko.mivsek at eranova.si
Mon Jun 11 10:35:11 UTC 2007


Hi Andreas,

Let me start with the statement that Unicode is a generalization of ASCII. 
ASCII code points are all < 128 and therefore always fit in one byte, 
while Unicode code points can be two, three or even four bytes wide.

No one treats ASCII strings as ASCII "encoded", so no one should treat 
Unicode strings as encoded either. That is the idea behind my proposal: 
to have Unicode strings as collections of character code points, with 
different byte widths.
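
For example, in a Squeak workspace (assuming an m17n image where 
Character value: accepts code points above 255):

   $A asInteger.                           "65  -- fits in one byte"
   (Character value: 16r010D) asInteger.   "269 -- č, no longer fits in one byte"
   (Character value: 16r1D11E) asInteger.  "119070 -- needs more than two bytes"

A string is then simply a sequence of such code points, whatever byte 
width the image chooses to store them in.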

Unicode actually starts with ASCII, then Latin-1 (ISO8859-1), both of 
which fit in one byte. ByteStrings which contain plain ASCII are 
therefore already Unicode! The same holds for Latin-1 ones. It is 
therefore only natural to extend Unicode from byte strings to two- and 
four-byte strings to cover all code points. For the user such a string 
is still a string, just as it was when it was plain ASCII. This approach 
is therefore also the most consistent one.
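
A small sketch of that observation (plain Squeak, nothing hypothetical 
except that your image keeps the default Latin-1 ByteString behaviour):

   | s |
   s := String with: (Character value: 16rE9).  "é, U+00E9"
   s class.            "=> ByteString -- one byte per character"
   s first asInteger.  "=> 233, which is already the Unicode code point"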

When we are talking about Unicode "encodings" we mean UTF (Unicode 
Transformation Format). There are UTF-8, UTF-16 and UTF-32. The first 
two are variable-length formats, which means that the character count is 
not the same as the byte count and cannot simply be calculated from it. 
In UTF-8 each character may take 1, 2, 3 or 4 bytes (in UTF-16, 2 or 4 
bytes), depending on the width of its code point.
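
To make the variable length concrete, here are the UTF-8 byte sequences 
for a few code points, written out by hand:

   U+0041  A   ->  16r41                       1 byte
   U+010D  č   ->  16rC4 16r8D                 2 bytes
   U+20AC  €   ->  16rE2 16r82 16rAC           3 bytes
   U+1D11E     ->  16rF0 16r9D 16r84 16r9E     4 bytes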

Because of that variable length those encodings are not useful for 
general string manipulation but only for communication and storage. 
String manipulation would be very inefficient (just consider the speed 
of #size, which is used everywhere).
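
For example, #size on a UTF-8 encoded ByteString would have to scan the 
whole string on every call (a workspace sketch, counting every byte that 
is not a 2r10xxxxxx continuation byte):

   | utf8 count |
   utf8 := #[16r41 16rC4 16r8D 16rE2 16r82 16rAC] asString.  "A č € as UTF-8 bytes"
   count := 0.
   utf8 do: [:ch |
       (ch asInteger bitAnd: 16rC0) = 16r80 ifFalse: [count := count + 1]].
   count.  "=> 3 characters in 6 bytes, found only after a full scan"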

I would therefore use strings with pure Unicode content internally and 
put all encoding/decoding on the periphery of the image, at the 
interfaces to the external world. As Subbukk already suggested, we could 
put that into a UTF8Stream?
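
A rough sketch of what such a hypothetical UTF8Stream would do on 
reading: decode the bytes into code points right at the boundary, so the 
rest of the image only ever sees plain strings (error handling for 
malformed sequences left out):

   | in codePoints |
   in := ReadStream on: #[16r41 16rC4 16r8D 16rE2 16r82 16rAC].
   codePoints := OrderedCollection new.
   [in atEnd] whileFalse: [ | lead extra cp |
       lead := in next.
       lead < 16r80
           ifTrue: [cp := lead]
           ifFalse: [
               "the leading byte tells how many 2r10xxxxxx continuation bytes follow"
               extra := lead < 16rE0
                   ifTrue: [1]
                   ifFalse: [lead < 16rF0 ifTrue: [2] ifFalse: [3]].
               cp := lead bitAnd: (16r3F bitShift: extra negated).
               extra timesRepeat: [
                   cp := (cp bitShift: 6) bitOr: (in next bitAnd: 16r3F)]].
       codePoints add: cp].
   codePoints.  "=> OrderedCollection(65 269 8364), i.e. A č €"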

VW and Gemstone also keep encodings out of strings, in separate Encoders 
and EncodedStreams. They are also deprecating the use of encoded byte 
strings like ISO88591String, MACString etc. Why should we then introduce 
them to Squeak now?

UTF-8 encoding/decoding is very efficient by design, so we must make it 
efficient in Squeak too. It must be almost as fast as a simple copy.
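
The key to that is the ASCII fast path: for a plain-ASCII ByteString the 
UTF-8 "encoding" is the identity, so the whole job degenerates into a 
copy. A hypothetical encoder (method and selector names made up for this 
sketch) could check for that first:

   encodeUtf8: aString
       "Fast path: pure ASCII means the UTF-8 bytes equal the string's own bytes."
       (aString detect: [:ch | ch asInteger >= 128] ifNone: [nil]) isNil
           ifTrue: [^ aString copy].
       ^ self slowEncodeUtf8: aString  "hypothetical general, bit-twiddling case"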

And those who still want to keep UTF-8 encoded strings can store them in 
a plain ByteString anyway...

I hope this clarifies my ideas a bit.

Best regards
Janko

Andreas Raab wrote:
> Hi Janko -
> 
> Just as a comment from the sidelines I think that concentrating on the 
> size of the character in an encoding is a mistake. It is really the 
> encoding that matters and if it weren't impractical I would rename 
> ByteString to Latin1String and WideString to UTF32String or so.
> 
> This makes it much clearer that we are interested more in the encoding 
> than the number of bytes per character (although of course some 
> encodings imply a character size) and this "encoding-driven" view of 
> strings makes it perfectly natural to think of a UTF8String which has a 
> variable sized encoding and can live in perfect harmony with the other 
> "byte encoded strings".
> 
> In your case, I would rather suggest having a class UTF16String instead 
> of TwoByteString. A good starting point (if you are planning to spend 
> any time on this) would be to create a class EncodedString which 
> captures the basics of conversions between differently encoded strings 
> and start defining a few (trivial) subclasses like mentioned above. From 
> there, you could extend this to UTF-8, UTF-16 and whatever else encoding 
> you need.
> 
> Cheers,
>   - Andreas
> 
> Janko Mivšek wrote:
>> Janko Mivšek wrote:
>>> I would propose a hybrid solution: three subclasses of String:
>>>
>>> 1. ByteString for ASCII (native English speakers)
>>> 2. TwoByteString for most of other languages
>>> 3. FourByteString(WideString) for Japanese/Chinese/and others
>>
>> Let me be more exact about that proposal:
>>
>> This is for internal representation only; for interfacing to the 
>> external world we need to convert to/from (at least) the UTF-8 
>> representation.
>>
>> 1. ByteString for ASCII (English) and ISO-8859-1 (West Europe).
>>    ByteString  is therefore always regarded as encoded in ISO8859-1
>>    codepage, which is the same as Unicode Basic Latin (1).
>>
>> 2. TwoByteString for East European Latin, Greek, Cyrillic and many more
>>    (the so-called Basic Multilingual Plane (2)). Encoding of that string
>>    would correspond to UCS-2, even though it is considered obsolete (3)
>>
>> 3. FourByteString for Chinese/Japanese/Korean and some others. Encoding
>>    of that string would therefore correspond to UCS-4/UTF-32 (4)
>>
>>
>> I think that this way we can achieve the most compact yet fast support 
>> for all languages in the world. Because of their fixed length those 
>> strings are also easy to manipulate, contrary to variable-length UTF-8 
>> ones.
>>
>> Conversion to/from UTF-8 could probably also be made simpler with the 
>> help of bit-arithmetic algorithms, tailored differently for each of the 
>> three proposed string subclasses above.
>>
>>
>> (1) Wikipedia Unicode: Storage, transfer, and processing
>>     http://en.wikipedia.org/wiki/Unicode#Storage.2C_transfer.2C_and_processing
>> (2) Wikipedia Basic Multilingual Plane
>>     http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
>> (3) Wikipedia UTF-16/UCS-2:
>>     http://en.wikipedia.org/wiki/UCS-2
>> (4) Wikipedia UTF-32/UCS-4
>>     http://en.wikipedia.org/wiki/UTF-32/UCS-4
>>
>> Best regards
>> Janko
>>


-- 
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


