UTF8 Squeak

Janko Mivšek janko.mivsek at eranova.si
Mon Jun 11 12:04:09 UTC 2007


Bert Freudenberg wrote:
> So except for the missing 16-bit optimization this is exactly what we 
> already have now, right? So what is the actual proposal?

Exactly. There is already a WideString and my proposal is just to 
introduce a TwoByteString and rename WideString to FourByteString for 
consistency.

That way we cover all Unicode strings as efficiently as possible while 
keeping them as easy to manipulate as ordinary strings.

But the main point of my proposal is to treat internal strings as 
Unicode and nothing else. All other encodings must be converted to 
Unicode at the borders of the image. Those conversions could be done 
with separate Encoders or EncodedStreams.
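
Roughly sketched, work at an image border could then look something like 
this (UTF8Encoder and the surrounding names are purely illustrative, 
nothing that exists yet):

  | encoder externalBytes internalString |
  encoder := UTF8Encoder new.                       "hypothetical encoder class"
  externalBytes := aSocketStream upToEnd.           "UTF-8 bytes coming from outside"
  internalString := encoder decode: externalBytes.  "pure Unicode string inside the image"
  "... manipulate internalString like any other String ..."
  aSocketStream nextPutAll: (encoder encode: internalString)   "back to UTF-8 on the way out"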

It seems that this was already Yoshiki's idea with WideString, so I'm 
just extending that idea with a TwoByteString to cover the 16-bit case too.

Yoshiki, am I right?

Best regards
Janko


> 
> - Bert -
> 
> On Jun 11, 2007, at 12:35, Janko Mivšek wrote:
> 
>> Hi Andreas,
>>
>> Let me start with the statement that Unicode is a generalization of 
>> ASCII. ASCII has code points < 128 and therefore always fits in one 
>> byte, while Unicode code points can be 2, 3 or even 4 bytes wide.
>>
>> No one treats ASCII strings as ASCII "encoded", so no one should 
>> treat Unicode strings as encoded either. And this is the idea behind my 
>> proposal: to have Unicode strings as collections of character code 
>> points, with different byte widths.
>>
>> Unicode actually starts with ASCII and then Latin 1 (ISO8859-1), both 
>> of which fit in one byte. ByteStrings which contain plain ASCII are 
>> therefore already Unicode! The same goes for Latin 1 ones. It is 
>> therefore only natural to extend Unicode from byte to two- and four-byte 
>> strings to cover all code points. For a user such a string is still a 
>> string, just as it was when it was plain ASCII. This approach is 
>> therefore also the most consistent one.
>>
>> When we talk about Unicode "encodings" we mean UTF (Unicode 
>> Transformation Format). There are UTF-8, UTF-16 and UTF-32. The first 
>> two are variable-length formats, which means that the character count 
>> is not the same as the byte count and cannot simply be calculated from 
>> it. Each character may take 1, 2, 3 or 4 bytes depending on the width 
>> of its code point.
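>>
>> Just to illustrate the variable width: the UTF-8 byte count of a single 
>> code point boils down to something like this (a sketch, not code from 
>> the image):
>>
>>   utf8ByteCountFor: codePoint
>>       "Number of UTF-8 bytes needed for one code point"
>>       codePoint < 16r80 ifTrue: [^1].      "ASCII"
>>       codePoint < 16r800 ifTrue: [^2].     "Latin supplements, Greek, Cyrillic, ..."
>>       codePoint < 16r10000 ifTrue: [^3].   "rest of the Basic Multilingual Plane"
>>       ^4                                   "supplementary planes"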
>>
>> Because of their variable length those encodings are not suitable for 
>> general string manipulation but only for communication and storage. 
>> String manipulation would be very inefficient (just consider the cost of 
>> #size, which is used everywhere).
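>>
>> For example, finding the character count of UTF-8 encoded bytes means 
>> scanning all of them and skipping the continuation bytes (10xxxxxx), so 
>> it is O(n) instead of O(1) as for a fixed-width string. Roughly:
>>
>>   utf8SizeOf: aByteArray
>>       "Character count of UTF-8 encoded bytes - needs a full scan"
>>       ^aByteArray inject: 0 into: [:count :byte |
>>           (byte bitAnd: 16rC0) = 16r80
>>               ifTrue: [count]          "continuation byte, same character"
>>               ifFalse: [count + 1]]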
>>
>> I would therefore use strings with pure Unicode content internally and 
>> put all encoding/decoding on the periphery of the image - at the 
>> interfaces to the external world. As Subbukk already suggested, we 
>> could put that into a UTF8Stream?
>>
>> VW and Gemstone also keep encodings out of the string, in separate 
>> Encoders and EncodedStreams. They are also deprecating the use of 
>> EncodedByteStrings like ISO88591String, MACString etc. Why should we 
>> then introduce them into Squeak now?
>>
>> UTF-8 encoding/decoding is very efficient by design, so we must 
>> make it efficient in Squeak too. It must be almost as fast as a simple 
>> copy.
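>>
>> The cheapest case is pure ASCII, where the UTF-8 bytes are identical to 
>> the string's own bytes, so an encoder can take a fast path like this 
>> (only a sketch; slowEncodeUtf8: stands for the hypothetical general case):
>>
>>   encodeUtf8: aByteString
>>       "Pure ASCII encodes to itself, so a plain copy is enough"
>>       (aByteString detect: [:each | each asInteger > 127] ifNone: [nil]) isNil
>>           ifTrue: [^aByteString asByteArray].
>>       ^self slowEncodeUtf8: aByteString    "general, multi-byte case"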
>>
>> And those who still want a UTF-8 encoded string can always store it in 
>> a plain ByteString anyway...
>>
>> I hope this clarifies my ideas a bit.
>>
>> Best regards
>> Janko
>>
>> Andreas Raab wrote:
>>> Hi Janko -
>>> Just as a comment from the sidelines I think that concentrating on 
>>> the size of the character in an encoding is a mistake. It is really 
>>> the encoding that matters and if it weren't impractical I would 
>>> rename ByteString to Latin1String and WideString to UTF32String or so.
>>> This makes it much clearer that we are interested more in the 
>>> encoding than in the number of bytes per character (although of course 
>>> some encodings imply a character size), and this "encoding-driven" 
>>> view of strings makes it perfectly natural to think of a UTF8String 
>>> which has a variable-sized encoding and can live in perfect harmony 
>>> with the other "byte-encoded strings".
>>> In your case, I would rather suggest having a class UTF16String 
>>> instead of TwoByteString. A good starting point (if you are planning 
>>> to spend any time on this) would be to create a class EncodedString 
>>> which captures the basics of conversions between differently encoded 
>>> strings, and start defining a few (trivial) subclasses like those 
>>> mentioned above. From there, you could extend this to UTF-8, UTF-16 and 
>>> whatever other encoding you need.
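>>> Something along these lines, just as a rough illustration (class and 
>>> category names made up):
>>>
>>>   Object subclass: #EncodedString
>>>       instanceVariableNames: 'bytes'    "the encoded bytes, whatever the encoding"
>>>       classVariableNames: ''
>>>       poolDictionaries: ''
>>>       category: 'Collections-Strings-Encoded'.
>>>
>>>   EncodedString subclass: #UTF16String
>>>       instanceVariableNames: ''
>>>       classVariableNames: ''
>>>       poolDictionaries: ''
>>>       category: 'Collections-Strings-Encoded'.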
>>> Cheers,
>>>   - Andreas
>>> Janko Mivšek wrote:
>>>> Janko Mivšek wrote:
>>>>> I would propose a hybrid solution: three subclasses of String:
>>>>>
>>>>> 1. ByteString for ASCII (native English speakers)
>>>>> 2. TwoByteString for most other languages
>>>>> 3. FourByteString (WideString) for Japanese/Chinese/and others
>>>>
>>>> Let me be more exact about that proposal (a small class-selection 
>>>> sketch follows the list below):
>>>>
>>>> This is for internal representation only; for interfacing with the 
>>>> external world we need to convert to/from (at least) a UTF-8 
>>>> representation.
>>>>
>>>> 1. ByteString for ASCII (English) and ISO-8859-1 (Western Europe).
>>>>    A ByteString is therefore always regarded as encoded in the ISO8859-1
>>>>    code page, which coincides with the first 256 Unicode code points (1).
>>>>
>>>> 2. TwoByteString for Eastern European Latin, Greek, Cyrillic and many more
>>>>    (the so-called Basic Multilingual Plane (2)). The encoding of that string
>>>>    would correspond to UCS-2, even though it is considered obsolete (3).
>>>>
>>>> 3. FourByteString for Chinese/Japanese/Korean and some others. The encoding
>>>>    of that string would therefore correspond to UCS-4/UTF-32 (4).
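>>>>
>>>> A small sketch of how the narrowest representation could be picked 
>>>> (TwoByteString and FourByteString do not exist yet, of course):
>>>>
>>>>    stringClassForMaxCodePoint: maxCodePoint
>>>>        "Answer the narrowest string class able to hold the given code point"
>>>>        maxCodePoint < 256 ifTrue: [^ByteString].
>>>>        maxCodePoint < 65536 ifTrue: [^TwoByteString].
>>>>        ^FourByteString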
>>>>
>>>>
>>>> I think that this way we can achieve the most space-efficient yet fast 
>>>> support for all languages in the world. Because of their fixed length 
>>>> those strings are also easy to manipulate, in contrast to variable-length 
>>>> UTF-8 ones.
>>>>
>>>> Conversion to/from UTF-8 could probably also be made simpler with the 
>>>> help of bit-arithmetic algorithms, tailored differently for each of the 
>>>> three proposed string subclasses above.
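>>>>
>>>> For example, a single TwoByteString code point goes to UTF-8 with just 
>>>> a few shifts and masks (a sketch only):
>>>>
>>>>    utf8BytesForTwoByteCodePoint: cp
>>>>        "2 UTF-8 bytes for U+0080..U+07FF, 3 bytes for U+0800..U+FFFF"
>>>>        cp < 16r800 ifTrue: [
>>>>            ^ByteArray
>>>>                with: 16rC0 + (cp bitShift: -6)
>>>>                with: 16r80 + (cp bitAnd: 16r3F)].
>>>>        ^ByteArray
>>>>            with: 16rE0 + (cp bitShift: -12)
>>>>            with: 16r80 + ((cp bitShift: -6) bitAnd: 16r3F)
>>>>            with: 16r80 + (cp bitAnd: 16r3F)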
>>>>
>>>>
>>>> (1) Wikipedia Unicode: Storage, transfer, and processing
>>>>     http://en.wikipedia.org/wiki/Unicode#Storage.2C_transfer.2C_and_processing
>>>> (2) Wikipedia Basic Multilingual Plane
>>>>     http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
>>>> (3) Wikipedia UTF-16/UCS-2
>>>>     http://en.wikipedia.org/wiki/UCS-2
>>>> (4) Wikipedia UTF-32/UCS-4
>>>>     http://en.wikipedia.org/wiki/UTF-32/UCS-4
>>>>
>>>> Best regards
>>>> Janko
>>>>

-- 
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


