UTF8 Squeak
Bert Freudenberg
bert at freudenbergs.de
Mon Jun 11 11:35:00 UTC 2007
So except for the missing 16-bit optimization this is exactly what we
have now, right? So what is the actual proposal?
- Bert -
On Jun 11, 2007, at 12:35, Janko Mivšek wrote:
> Hi Andreas,
>
> Let me start with a statement that Unicode is a generalization of
> ASCII. ASCII code points are all < 128 and therefore always fit in
> one byte, while Unicode code points can be 2, 3 or even 4 bytes wide.
>
> No one treats ASCII strings as ASCII "encoded", so no one should
> treat Unicode strings as encoded either. And this is the idea
> behind my proposal - to have Unicode strings as collections of
> character code points, with different byte widths.
>
> Unicode actually starts with ASCII, then Latin 1 (ISO 8859-1),
> both of which fit in one byte. ByteStrings which contain plain
> ASCII are therefore already Unicode! The same goes for Latin 1
> ones. It is therefore only natural to extend Unicode from byte
> strings to two and four byte strings to cover all code points.
> For a user such a string is still a string, just as it was when
> it was plain ASCII. This approach is therefore also the most
> consistent one.
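>
> This is directly observable in the image today: a Character's
> value in a ByteString already equals its Unicode code point for
> the whole Latin-1 range. For example (evaluated in a workspace):
>
>     $A asInteger.   "65, the same as Unicode U+0041"
>     $é asInteger.   "233, the same as Unicode U+00E9"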
>
> When we talk about Unicode "encodings" we mean UTF (Unicode
> Transformation Format). There are UTF-8, UTF-16 and UTF-32. The
> first two are variable length formats, which means that character
> size is not the same as byte size and cannot simply be calculated
> from it. Each character may be 1, 2, 3 or 4 bytes, depending on
> the width of its code point.
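>
> The variable width follows directly from the code point ranges. A
> rough Smalltalk sketch of the UTF-8 rule (a hypothetical method,
> not something already in the image):
>
>     utf8ByteCountFor: codePoint
>         "Answer how many bytes UTF-8 needs for this code point."
>         codePoint < 16r80 ifTrue: [^ 1].        "ASCII"
>         codePoint < 16r800 ifTrue: [^ 2].
>         codePoint < 16r10000 ifTrue: [^ 3].     "rest of the BMP"
>         ^ 4                                     "supplementary planes"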
>
> Because of their variable length those encodings are not useful
> for general string manipulation, but only for communication and
> storage. String manipulation would be very inefficient (just
> consider the speed of #size, which is used everywhere).
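>
> With a UTF-8 encoded ByteArray, #size would have to scan the
> whole sequence and count only the lead bytes, roughly like this
> (a sketch, not existing code):
>
>     characterCountOf: utf8Bytes
>         "Continuation bytes have the form 2r10xxxxxx; skip them."
>         ^ utf8Bytes inject: 0 into: [:count :byte |
>             (byte bitAnd: 2r11000000) = 2r10000000
>                 ifTrue: [count]
>                 ifFalse: [count + 1]]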
>
> I would therefore use strings with pure Unicode content internally
> and put all encoding/decoding on the periphery of the image - in
> the interfaces to the external world. As Subbukk already suggested,
> we could put that into a UTF8Stream?
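>
> Such a stream would decode at the image boundary, so that code
> inside the image only ever sees Characters. A hypothetical usage
> sketch (the class name UTF8ReadStream is just an assumption):
>
>     | stream |
>     stream := UTF8ReadStream on: aSocket binaryStream.
>     [stream atEnd] whileFalse:
>         [self process: (stream next: 1024)]  "answers decoded Characters"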
>
> VW and GemStone also move encodings out of the string, into
> separate Encoders and EncodedStreams. They are also deprecating
> usage of EncodedByteStrings like ISO88591String, MACString etc.
> Why should we then introduce them to Squeak now?
>
> UTF-8 encoding/decoding is very efficient by design, therefore we
> must make it efficient in Squeak too. It must be almost as fast as
> a simple copy.
>
> And those who still want to have UTF-8 encoded strings can
> store them in a plain ByteString anyway...
>
> I hope this clarifies my ideas a bit.
>
> Best regards
> Janko
>
> Andreas Raab wrote:
>> Hi Janko -
>> Just as a comment from the sidelines I think that concentrating on
>> the size of the character in an encoding is a mistake. It is
>> really the encoding that matters and if it weren't impractical I
>> would rename ByteString to Latin1String and WideString to
>> UTF32String or so.
>> This makes it much clearer that we are interested more in the
>> encoding than in the number of bytes per character (although of
>> course some encodings imply a character size), and this "encoding-
>> driven" view of strings makes it perfectly natural to think of a
>> UTF8String which has a variable sized encoding and can live in
>> perfect harmony with the other "byte encoded strings".
>> In your case, I would rather suggest having a class UTF16String
>> instead of TwoByteString. A good starting point (if you are
>> planning to spend any time on this) would be to create a class
>> EncodedString which captures the basics of conversions between
>> differently encoded strings and start defining a few (trivial)
>> subclasses like mentioned above. From there, you could extend this
>> to UTF-8, UTF-16 and whatever else encoding you need.
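>> Such a hierarchy might begin like this (a sketch only; the
>> selectors are hypothetical):
>>
>>     ArrayedCollection subclass: #EncodedString
>>         instanceVariableNames: 'bytes'
>>         classVariableNames: ''
>>         category: 'Collections-Strings'.
>>
>>     EncodedString subclass: #Latin1String ...  "fixed 1 byte per character"
>>     EncodedString subclass: #UTF8String ...    "variable 1-4 bytes"
>>
>>     EncodedString >> asEncoding: anEncodedStringClass
>>         "Convert via code points, the common currency of all encodings."
>>         ^ anEncodedStringClass fromCodePoints: self codePoints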
>> Cheers,
>> - Andreas
>> Janko Mivšek wrote:
>>> Janko Mivšek wrote:
>>>> I would propose a hybrid solution: three subclasses of String:
>>>>
>>>> 1. ByteString for ASCII (native English speakers)
>>>> 2. TwoByteString for most other languages
>>>> 3. FourByteString (WideString) for Japanese/Chinese and others
>>>
>>> Let me be more exact about that proposal:
>>>
>>> This is for internal representation only, for interfacing to
>>> external world we need to convert to/from (at least) UTF-8
>>> representation.
>>>
>>> 1. ByteString for ASCII (English) and ISO-8859-1 (West Europe).
>>> ByteString is therefore always regarded as encoded in the
>>> ISO 8859-1 codepage, which is the same as Unicode Basic Latin (1).
>>>
>>> 2. TwoByteString for East European Latin, Greek, Cyrillic and
>>> many more (the so-called Basic Multilingual Plane (2)). Encoding
>>> of that string would correspond to UCS-2, even though it is
>>> considered obsolete (3).
>>>
>>> 3. FourByteString for Chinese/Japanese/Korean and some others.
>>> Encoding of that string would therefore correspond to
>>> UCS-4/UTF-32 (4).
>>>
>>>
>>> I think that this way we can achieve the most space-efficient
>>> yet fast support for all languages in the world. Because of
>>> their fixed length those strings are also easy to manipulate,
>>> unlike variable length UTF-8 ones.
>>>
>>> Conversion to/from UTF-8 could probably also be made simpler
>>> with the help of bit arithmetic algorithms, tailored differently
>>> for each of the three proposed string subclasses above.
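>>>
>>> For a TwoByteString, for instance, every code point fits in at
>>> most three UTF-8 bytes, so the encoder reduces to fixed shifts
>>> and masks (a sketch, not existing image code):
>>>
>>>     utf8Encode: codePoint on: aStream
>>>         codePoint < 16r80 ifTrue: [^ aStream nextPut: codePoint].
>>>         codePoint < 16r800 ifTrue:
>>>             [aStream nextPut: 2r11000000 + (codePoint bitShift: -6).
>>>              ^ aStream nextPut: 2r10000000 + (codePoint bitAnd: 16r3F)].
>>>         aStream nextPut: 2r11100000 + (codePoint bitShift: -12).
>>>         aStream nextPut: 2r10000000 + ((codePoint bitShift: -6) bitAnd: 16r3F).
>>>         ^ aStream nextPut: 2r10000000 + (codePoint bitAnd: 16r3F)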
>>>
>>>
>>> (1) Wikipedia Unicode: Storage, transfer, and processing
>>> http://en.wikipedia.org/wiki/Unicode#Storage.2C_transfer.2C_and_processing
>>> (2) Wikipedia Basic Multilingual Plane
>>> http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
>>> (3) Wikipedia UTF-16/UCS-2:
>>> http://en.wikipedia.org/wiki/UCS-2
>>> (4) Wikipedia UTF-32/UCS-4
>>> http://en.wikipedia.org/wiki/UTF-32/UCS-4
>>>
>>> Best regards
>>> Janko
>>>
>
>
> --
> Janko Mivšek
> AIDA/Web
> Smalltalk Web Application Server
> http://www.aidaweb.si
>
More information about the Squeak-dev mailing list