UTF8 Squeak

Norbert Hartl norbert at hartl.name
Mon Jun 11 16:08:27 UTC 2007

On Mon, 2007-06-11 at 12:35 +0200, Janko Mivšek wrote:
> Hi Andreas,
> Let me start with the statement that Unicode is a generalization of ASCII.
> ASCII code points are below 128 and therefore always fit in one byte,
> while Unicode code points can be 2, 3, or even 4 bytes wide.
> No one treats ASCII strings as ASCII "encoded", so no one should treat
> Unicode strings as encoded either. That is the idea behind my
> proposal: to have Unicode strings as collections of character code
> points, with different byte widths.
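The "collection of code points" model above can be illustrated in Python, whose `str` type happens to work this way (Python here is purely an illustration, not Squeak code):

```python
# A str is a sequence of Unicode code points, independent of any
# byte encoding -- the model proposed above for Squeak strings.
s = "a\u00e9\u20ac"  # 'a' (ASCII), 'é' (Latin-1), '€' (3 bytes in UTF-8)
print([ord(c) for c in s])  # code points: [97, 233, 8364]
print(len(s))               # 3 characters, regardless of byte widths
```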
> Unicode actually starts with ASCII, then Latin-1 (ISO 8859-1), both of
> which fit in one byte. ByteStrings that contain plain ASCII are
> therefore already Unicode! The same holds for Latin-1 ones. It is
> therefore only natural to extend Unicode from byte strings to two- and
> four-byte strings to cover all code points. For a user such a string is
> still a string, just as it was when it was plain ASCII. This approach
> is therefore also the most consistent one.
> When we talk about Unicode "encodings" we mean UTF (Unicode
> Transformation Format). There are UTF-8, UTF-16, and UTF-32. The first
> two are variable-length formats, which means that the character count
> is not the same as the byte count and cannot simply be calculated from
> it. Each character may be 1, 2, 3, or 4 bytes, depending on the width
> of its code point.
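To make the variable-length point concrete, a quick sketch (again Python, for illustration only) showing UTF-8 spending one to four bytes per code point:

```python
# Each code point encodes to a different number of UTF-8 bytes.
for ch in ["a", "\u00e9", "\u20ac", "\U0001d11e"]:  # a, é, €, 𝄞
    print(repr(ch), len(ch.encode("utf-8")))  # 1, 2, 3, 4 bytes respectively
```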
> Because of their variable length, those encodings are not useful for
> general string manipulation but only for communication and storage.
> String manipulation would be very inefficient (just consider the speed
> of #size, which is used everywhere).
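The cost behind the #size argument shows up in any language that exposes both views (sketch in Python, for illustration only): a fixed-width internal string knows its character count immediately, while a UTF-8 byte sequence must be scanned or decoded to count characters:

```python
data = "na\u00efve caf\u00e9".encode("utf-8")  # "naïve café" as UTF-8 bytes
print(len(data))                  # 12 bytes -- cheap, but not the character count
print(len(data.decode("utf-8")))  # 10 characters -- requires a full scan
```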
> I would therefore use strings with pure Unicode content internally and
> put all encoding/decoding on the periphery of the image - at the
> interfaces to the external world. As Subbukk already suggested, we
> could put that in a UTF8Stream?
> VW and GemStone also keep encodings out of strings, in separate
> Encoders and EncodedStreams. They are also deprecating the use of
> encoded ByteStrings like ISO88591String, MACString, etc. Why then
> introduce them to Squeak now?
> UTF-8 encoding/decoding is very efficient by design, so we must make
> it efficient in Squeak too. It must be almost as fast as a simple copy.
> And those who still want a UTF-8 encoded string can store it in a
> plain ByteString anyway...
> I hope this clarifies my ideas a bit.
Yes, absolutely. And this time I'd like to fully agree in public :)


More information about the Squeak-dev mailing list