UTF8 Squeak

Mon Jun 11 16:08:27 UTC 2007

On Mon, 2007-06-11 at 12:35 +0200, Janko Mivšek wrote:
> Hi Andreas,
> 
> Let me start with the statement that Unicode is a generalization of
> ASCII. ASCII has code points < 128 and therefore always fits in one
> byte, while Unicode code points can be 2, 3 or even 4 bytes wide.
> 
> No one treats ASCII strings as ASCII "encoded", therefore no one should
> treat Unicode strings as encoded either. And this is the idea behind my
> proposal: to have Unicode strings as collections of character code
> points, with different byte widths.
> 
> Unicode actually starts with ASCII, then continues with Latin 1
> (ISO 8859-1), both of which fit in one byte. ByteStrings which contain
> plain ASCII are therefore already Unicode! The same goes for Latin 1
> ones. It is therefore just natural to extend Unicode strings from one
> byte to two- and four-byte strings to cover all code points. For a user
> such a string is still a string, as it was when it was just ASCII. This
> approach is therefore also the most consistent one.
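
A minimal sketch of the width idea described above, written in Python only
for illustration (it is not Squeak code, and `narrowest_width` is a
hypothetical name, not part of any proposal): pick the narrowest fixed
width of 1, 2 or 4 bytes that holds every code point of a string.

```python
def narrowest_width(text: str) -> int:
    """Return the narrowest fixed width (1, 2 or 4 bytes) for all code points."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:       # ASCII and Latin 1 fit in one byte
        return 1
    if highest < 0x10000:     # Basic Multilingual Plane fits in two bytes
        return 2
    return 4                  # supplementary planes need four bytes


# ASCII and Latin 1 byte values are identical to their Unicode code points,
# so such byte strings need no conversion at all.
assert [ord(c) for c in b"Squeak".decode("latin-1")] == list(b"Squeak")

print(narrowest_width("Squeak"))            # 1
print(narrowest_width("Miv\u0161ek"))       # 2, U+0161 is outside Latin 1
print(narrowest_width("\U0001D11E clef"))   # 4, U+1D11E is outside the BMP
```

In such a scheme the width is a storage detail; the string still behaves
as a sequence of characters.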
> 
> When we are talking about Unicode "encodings" we mean UTF (Unicode
> Transformation Format). There are UTF-8, UTF-16 and UTF-32. The first
> two are variable-length formats, which means that the size in
> characters is not the same as the size in bytes and cannot be simply
> calculated from it. In UTF-8 each character may take 1, 2, 3 or 4 bytes
> (2 or 4 in UTF-16), depending on the width of its code point.
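
To make the variable lengths concrete, here is a small Python sketch (again
only an illustration, with `encoded_lengths` as a hypothetical helper) that
reports how many bytes one character occupies in each format. The "-le"
codec names are used so that no byte order mark is counted.

```python
def encoded_lengths(ch: str) -> dict:
    """Bytes needed for one character in UTF-8, UTF-16 and UTF-32."""
    return {name: len(ch.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


# U+0061 (a), U+00E9 (e acute), U+20AC (euro sign), U+1D11E (G clef)
for ch in ["a", "\u00e9", "\u20ac", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

Only the UTF-32 column stays constant at 4 bytes; the other two depend on
the code point.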
> 
> Because of their variable length those encodings are not useful for
> general string manipulation but just for communication and storage.
> String manipulation would be very inefficient (just consider the speed
> of #size, which is used everywhere).
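
The cost of #size on UTF-8 data can be sketched as follows, in Python and
assuming valid UTF-8 input (`utf8_size` and `utf8_char_at` are hypothetical
names). Both have to walk the bytes, whereas a fixed-width string answers
the same questions in constant time.

```python
def utf8_size(data: bytes) -> int:
    """Count characters in valid UTF-8 by scanning every byte: O(n)."""
    # Continuation bytes look like 0b10xxxxxx; any other byte starts a character.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def utf8_char_at(data: bytes, index: int) -> str:
    """Return the character at a 0-based index, again by scanning from the start."""
    starts = [i for i, byte in enumerate(data) if (byte & 0xC0) != 0x80]
    end = starts[index + 1] if index + 1 < len(starts) else len(data)
    return data[starts[index]:end].decode("utf-8")


text = "Miv\u0161ek \u20ac"
encoded = text.encode("utf-8")
print(len(encoded), utf8_size(encoded), len(text))   # 11 8 8
print(utf8_char_at(encoded, 3) == text[3])           # True
```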
> 
> I would therefore use strings with pure Unicode content internally and
> put all encoding/decoding on the periphery of the image, in the
> interfaces to the external world. As Subbukk already suggested, could
> we put that into a UTF8Stream?
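
A rough Python sketch of encoding at the periphery, under the assumption
that a UTF8Stream simply wraps a byte stream. The class below is
hypothetical and only borrows the name from the suggestion above; it is not
an existing Squeak class.

```python
import io


class UTF8Stream:
    """Hypothetical wrapper: UTF-8 bytes outside, plain Unicode strings inside."""

    def __init__(self, raw: io.BufferedIOBase):
        self._raw = raw

    def write(self, text: str) -> int:
        # Encode only at the moment the string leaves the program.
        return self._raw.write(text.encode("utf-8"))

    def read_all(self) -> str:
        # Decode only at the moment the bytes enter the program.
        return self._raw.read().decode("utf-8")


raw = io.BytesIO()
UTF8Stream(raw).write("Miv\u0161ek")
print(raw.getvalue())            # b'Miv\xc5\xa1ek', 7 bytes on the outside
raw.seek(0)
name = UTF8Stream(raw).read_all()
print(len(name))                 # 6 characters on the inside
```

Everything between the read and the write works with code points only and
never sees the byte layout.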
> 
> VW and Gemstone also put encodings outside of strings, into separate
> Encoders and the EncodedStream. They are also deprecating the usage of
> EncodedByteStrings like ISO88591String, MACString etc. Why should we
> then introduce them to Squeak now?
> 
> UTF-8 encoding/decoding is very efficient by design, therefore we must
> make it efficient in Squeak too. It must be almost as fast as a simple copy.
> 
> And those who still want to have UTF-8 encoded strings can store
> them in a plain ByteString anyway...
> 
> I hope this clarifies my ideas a bit.
> 
Yes, absolutely. And this time I'd like to fully agree in public :)

Norbert