janko.mivsek at eranova.si
Mon Jun 11 10:35:11 UTC 2007
Let me start with a statement: Unicode is a generalization of ASCII.
ASCII has code points below 128 and therefore always fits in one byte,
while Unicode code points can be 2, 3, or even 4 bytes wide.
No one treats ASCII strings as ASCII "encoded", so no one should treat
Unicode strings as encoded either. This is the idea behind my proposal:
to have Unicode strings as collections of character code points, with
different byte widths.
Unicode actually starts with ASCII, then with Latin-1 (ISO 8859-1),
both of which fit in one byte. ByteStrings which contain plain ASCII
are therefore already Unicode! The same goes for Latin-1 ones. It is
therefore only natural to extend Unicode from byte strings to two- and
four-byte strings to cover all code points. For a user such a string is
still a string, just as it was when it was plain ASCII. This approach
is therefore also the most transparent one.
When we are talking about Unicode "encodings" we mean UTF (Unicode
Transformation Format). There are UTF-8, UTF-16, and UTF-32. The first
two are variable-length formats, which means that character count is
not the same as byte count and cannot simply be calculated from it.
Each character may be 1, 2, 3, or 4 bytes, depending on the width of
its code point.
Because of their variable length, those encodings are not useful for
general string manipulation but only for communication and storage.
String manipulation would be very inefficient (just consider the speed
of #size, which is used everywhere).
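To illustrate the variable-length point (a Python sketch, since Python exposes UTF-8 byte counts directly; the widths shown are a property of UTF-8 itself, not of any particular language):

```python
# Bytes per character in UTF-8 vary with the code point's width,
# so character count cannot be derived from byte count alone.
samples = [
    ("A", 1),   # U+0041, ASCII
    ("é", 2),   # U+00E9, Latin-1 range
    ("€", 3),   # U+20AC, Basic Multilingual Plane
    ("𐍈", 4),   # U+10348, outside the BMP
]
for ch, width in samples:
    assert len(ch.encode("utf-8")) == width

# A 6-character word is not 6 bytes once encoded:
word = "Mivšek"
assert len(word) == 6
assert len(word.encode("utf-8")) == 7
```

This is why a #size on a UTF-8 encoded string would have to scan the whole byte sequence rather than read a single instance variable.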
I would therefore use strings with pure Unicode content internally and
put all encoding/decoding on the periphery of the image, in the
interfaces to the external world. As Subbukk already suggested, we
could put that into a UTF8Stream.
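A sketch of the "encoding at the periphery" idea (in Python for illustration; UTF8Stream is only a proposed name, and io.TextIOWrapper stands in for it here):

```python
import io

def read_decoded(raw: bytes) -> str:
    """Decode once at the image boundary; internally the result is a
    plain fixed-width sequence of code points."""
    # io.TextIOWrapper plays the role of the proposed UTF8Stream:
    # it hides the variable-length encoding from the rest of the program.
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
    return stream.read()

text = read_decoded("žlica".encode("utf-8"))
assert len(text) == 5   # character count, not the 6-byte encoded length
```

All the UTF-8 logic lives in the stream; string code never sees encoded bytes.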
VW and GemStone also keep encodings out of the string, in separate
Encoders and EncodedStreams. They are also deprecating the use of
EncodedByteStrings like ISO88591String, MACString, etc. Why, then,
should we introduce them to Squeak now?
UTF-8 encoding/decoding is very efficient by design, so we must make it
efficient in Squeak too. It must be almost as fast as a simple copy.
And those who still want a UTF-8 encoded string can store it in a plain
ByteString anyway...
I hope this clarifies my ideas a bit.
Andreas Raab wrote:
> Hi Janko -
> Just as a comment from the sidelines I think that concentrating on the
> size of the character in an encoding is a mistake. It is really the
> encoding that matters and if it weren't impractical I would rename
> ByteString to Latin1String and WideString to UTF32String or so.
> This makes it much clearer that we are interested more in the encoding
> than the number of bytes per character (although of course some
> encodings imply a character size) and this "encoding-driven" view of
> strings makes it perfectly natural to think of an UTF8String which has a
> variable sized encoding and can live in perfect harmony with the other
> "byte encoded strings".
> In your case, I would rather suggest having a class UTF16String instead
> of TwoByteString. A good starting point (if you are planning to spend
> any time on this) would be to create a class EncodedString which
> captures the basics of conversions between differently encoded strings
> and start defining a few (trivial) subclasses like mentioned above. From
> there, you could extend this to UTF-8, UTF-16 and whatever else encoding
> you need.
> - Andreas
> Janko Mivšek wrote:
>> Janko Mivšek wrote:
>>> I would propose a hybrid solution: three subclasses of String:
>>> 1. ByteString for ASCII (native English speakers)
>>> 2. TwoByteString for most other languages
>>> 3. FourByteString (WideString) for Japanese/Chinese/and others
>> Let me be more exact about that proposal:
>> This is for internal representation only, for interfacing to external
>> world we need to convert to/from (at least) UTF-8 representation.
>> 1. ByteString for ASCII (English) and ISO-8859-1 (Western Europe).
>> ByteString is therefore always regarded as encoded in the ISO 8859-1
>> code page, which matches the first 256 Unicode code points (1).
>> 2. TwoByteString for East European Latin, Greek, Cyrillic and many more
>> (the so-called Basic Multilingual Plane (2)). Encoding of that string
>> would correspond to UCS-2, even though it is considered obsolete (3)
>> 3. FourByteString for Chinese/Japanese/Korean and some others. Encoding
>> of that string would therefore correspond to UCS-4/UTF-32 (4)
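The choice among the three proposed classes comes down to a scan for the widest code point in the string (a Python sketch for illustration; the class names are from the proposal above, and note that the most common CJK code points actually still fit in two bytes):

```python
def bytes_per_character(text: str) -> int:
    """Pick the narrowest fixed-width class that can hold every code
    point, mirroring the proposed ByteString / TwoByteString /
    FourByteString split."""
    top = max((ord(ch) for ch in text), default=0)
    if top < 0x100:
        return 1   # ByteString: ASCII and Latin-1
    if top < 0x10000:
        return 2   # TwoByteString: Basic Multilingual Plane
    return 4       # FourByteString: supplementary planes

assert bytes_per_character("hello") == 1
assert bytes_per_character("Mivšek") == 2
assert bytes_per_character("日本語") == 2   # common CJK is inside the BMP
assert bytes_per_character("𠀀") == 4       # U+20000, outside the BMP
```

A string would be promoted to the wider class only when a wide character is actually stored into it.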
>> I think that this way we can achieve the most space-efficient yet
>> fast support for all languages in the world. Because of their fixed
>> length, those strings are also easy to manipulate, contrary to
>> variable-length UTF-8 ones. Conversion to/from UTF-8 could probably
>> also be made simpler with the help of bit-arithmetic algorithms,
>> tailored differently for each of the three proposed string
>> subclasses above.
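The bit-arithmetic conversion hinted at above can be sketched like this (in Python for illustration; a Squeak version would presumably be a tight loop or primitive per subclass, and each fixed-width class would only ever reach one or two of these branches):

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the standard UTF-8 bit layout."""
    if code_point < 0x80:                        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),     # 11110xxx 10xxxxxx ...
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Cross-check against the built-in codec:
for ch in "Ažç€𐍈":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```

Since the branch taken depends only on the code point's magnitude, a ByteString encoder needs just the first two cases, a TwoByteString encoder the first three, and only FourByteString needs all four.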
>> (1) Wikipedia Unicode: Storage, transfer, and processing
>> (2) Wikipedia Basic Multilingual Plane
>> (3) Wikipedia UTF-16/UCS-2:
>> (4) Wikipedia UTF-32/UCS-4
>> Best regards
Smalltalk Web Application Server
More information about the Squeak-dev mailing list