UTF8 Squeak

Bert Freudenberg bert at freudenbergs.de
Mon Jun 11 11:35:00 UTC 2007

So except for the missing 16-bit optimization this is exactly what we
have now, right? So what is the actual proposal?

- Bert -

On Jun 11, 2007, at 12:35 , Janko Mivšek wrote:

> Hi Andreas,
> Let me start with a statement: Unicode is a generalization of
> ASCII. ASCII has code points < 128 and therefore always fits in one
> byte, while Unicode code points can be 2, 3 or even 4 bytes wide.
> No one treats ASCII strings as ASCII "encoded", so no one should
> treat Unicode strings as encoded either. And this is the idea
> behind my proposal - to have Unicode strings as collections of
> character code points, with different byte widths.
> Unicode actually starts with ASCII, then Latin 1 (ISO 8859-1),
> both of which fit in one byte. ByteStrings which contain plain ASCII
> are therefore already Unicode! The same goes for Latin 1 ones. It is
> therefore only natural to extend Unicode strings from one byte to
> two- and four-byte strings to cover all code points. For a user such
> a string is still a string, just as it was when it was plain ASCII.
> This approach is therefore also the most consistent one.
> When we talk about Unicode "encodings" we mean UTF (Unicode
> Transformation Format). There are UTF-8, UTF-16 and UTF-32. The
> first two are variable-length formats, which means that the
> character count is not the same as the byte count and cannot simply
> be calculated from it. Each character may be 1, 2, 3 or 4 bytes,
> depending on the width of its code point.
> Because of their variable length those encodings are not useful for
> general string manipulation, but only for communication and storage.
> String manipulation would be very inefficient (just consider the
> speed of #size, which is used everywhere).
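To illustrate the cost Janko mentions: a rough workspace sketch (purely illustrative, not existing image code) of the only way to count characters in a UTF-8 byte sequence - scan every byte and skip continuation bytes:

```smalltalk
"Workspace sketch: #size on a UTF-8 string means a full scan.
A byte matching 2r10xxxxxx is a continuation byte, not a new character."
| bytes count |
bytes := #[16rC5 16rBD 16r65].	"UTF-8 for 'Že': Ž = U+017D = C5 BD"
count := 0.
bytes do: [:b |
	(b bitAnd: 2r11000000) = 2r10000000
		ifFalse: [count := count + 1]].
count	"2 characters, but 3 bytes - not computable in O(1)"
```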
> I would therefore use strings with pure Unicode content internally
> and put all encoding/decoding on the periphery of the image - at the
> interfaces to the external world. As Subbukk already suggested, we
> could put that into a UTF8Stream?
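Such a UTF8Stream might decode on the fly, something like this hypothetical #next (the class, its byteStream variable, and the method are all a sketch, nothing in the current image; only 1- and 2-byte sequences shown for brevity):

```smalltalk
next
	"Hypothetical sketch for a UTF8Stream: decode the next code point
	from the underlying byte stream. 3- and 4-byte sequences would
	follow the same shift-and-mask pattern."
	| b1 |
	b1 := byteStream next.
	b1 < 16r80 ifTrue: [^Character value: b1].
	(b1 bitAnd: 16rE0) = 16rC0 ifTrue: [
		^Character value:
			((b1 bitAnd: 16r1F) bitShift: 6)
				+ (byteStream next bitAnd: 16r3F)].
	^self error: 'sequences longer than 2 bytes omitted in this sketch'
```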
> VW and GemStone also keep encodings out of the string, in separate
> Encoders and EncodedStreams. They are also deprecating the usage of
> EncodedByteStrings like ISO88591String, MACString etc. Why then
> introduce them to Squeak now?
> UTF-8 encoding/decoding is very efficient by design, so we must
> make it efficient in Squeak too. It should be almost as fast as a
> simple copy.
> And those who still want a UTF-8 encoded string can store it in a
> plain ByteString anyway...
> I hope this clarifies my ideas a bit.
> Best regards
> Janko
> Andreas Raab wrote:
>> Hi Janko -
>> Just as a comment from the sidelines, I think that concentrating
>> on the size of the character in an encoding is a mistake. It is
>> really the encoding that matters, and if it weren't impractical I
>> would rename ByteString to Latin1String and WideString to
>> UTF32String or so.
>> This makes it much clearer that we are interested more in the
>> encoding than in the number of bytes per character (although of
>> course some encodings imply a character size), and this "encoding-
>> driven" view of strings makes it perfectly natural to think of a
>> UTF8String which has a variable-sized encoding and can live in
>> perfect harmony with the other "byte-encoded strings".
>> In your case, I would rather suggest having a class UTF16String
>> instead of TwoByteString. A good starting point (if you are
>> planning to spend any time on this) would be to create a class
>> EncodedString which captures the basics of conversion between
>> differently encoded strings, and to start defining a few (trivial)
>> subclasses like those mentioned above. From there, you could extend
>> this to UTF-8, UTF-16 and whatever other encoding you need.
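A skeleton of that starting point might look like this (the class names come from the suggestion above; the conversion protocol, #codePoints and #fromCodePoints:, is a hypothetical sketch):

```smalltalk
"Sketch: an abstract EncodedString converting via code points."
ArrayedCollection subclass: #EncodedString
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Collections-Encodings'.

"Trivial fixed-width subclasses first, variable-width ones later:"
EncodedString subclass: #Latin1String ...
EncodedString subclass: #UTF32String ...

EncodedString>>asEncoding: anEncodedStringClass
	"Convert via the common currency of code points; each subclass
	would implement #codePoints and class-side #fromCodePoints:."
	^anEncodedStringClass fromCodePoints: self codePoints
```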
>> Cheers,
>>   - Andreas
>> Janko Mivšek wrote:
>>> Janko Mivšek wrote:
>>>> I would propose a hybrid solution: three subclasses of String:
>>>> 1. ByteString for ASCII (native English speakers)
>>>> 2. TwoByteString for most other languages
>>>> 3. FourByteString (WideString) for Japanese/Chinese/and others
>>> Let me be more exact about that proposal:
>>> This is for internal representation only; for interfacing with
>>> the external world we need to convert to/from (at least) the
>>> UTF-8 representation.
>>> 1. ByteString for ASCII (English) and ISO 8859-1 (Western Europe).
>>>    ByteString is therefore always regarded as encoded in the
>>>    ISO 8859-1 code page, which is the same as Unicode Basic Latin (1).
>>> 2. TwoByteString for East European Latin, Greek, Cyrillic and
>>>    many more (the so-called Basic Multilingual Plane (2)). Encoding
>>>    of that string would correspond to UCS-2, even though it is
>>>    considered obsolete (3).
>>> 3. FourByteString for Chinese/Japanese/Korean and some others.
>>>    Encoding of that string would therefore correspond to
>>>    UCS-4/UTF-32 (4).
>>> I think that this way we can achieve space-efficient yet fast
>>> support for all the languages of the world. Because of their fixed
>>> length those strings are also easy to manipulate, contrary to
>>> variable-length UTF-8 ones.
>>> Conversion to/from UTF-8 could probably also be made simple with
>>> bit-arithmetic algorithms, tailored differently for each of the
>>> three proposed string subclasses above.
>>> (1) Wikipedia Unicode: Storage, transfer, and processing
>>>     http://en.wikipedia.org/wiki/Unicode#Storage.2C_transfer.2C_and_processing
>>> (2) Wikipedia Basic Multilingual Plane
>>>     http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
>>> (3) Wikipedia UTF-16/UCS-2
>>>     http://en.wikipedia.org/wiki/UCS-2
>>> (4) Wikipedia UTF-32/UCS-4
>>>     http://en.wikipedia.org/wiki/UTF-32/UCS-4
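The bit-arithmetic conversion mentioned above could indeed be small. A workspace sketch (illustrative only) of encoding a single code point into UTF-8, covering the one-, two- and three-byte cases; the four-byte case follows the same pattern:

```smalltalk
| encode |
"Encode one code point (< 16r10000 here) into UTF-8 bytes
using only shifts and masks - no table lookups."
encode := [:cp |
	cp < 16r80
		ifTrue: [ByteArray with: cp]
		ifFalse: [cp < 16r800
			ifTrue: [ByteArray
				with: 16rC0 + (cp bitShift: -6)
				with: 16r80 + (cp bitAnd: 16r3F)]
			ifFalse: [ByteArray
				with: 16rE0 + (cp bitShift: -12)
				with: 16r80 + ((cp bitShift: -6) bitAnd: 16r3F)
				with: 16r80 + (cp bitAnd: 16r3F)]]].
encode value: 16r17D	"#[16rC5 16rBD] - the two bytes for Ž"
```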
>>> Best regards
>>> Janko
> -- 
> Janko Mivšek
> AIDA/Web
> Smalltalk Web Application Server
> http://www.aidaweb.si

More information about the Squeak-dev mailing list