UTF8 Squeak
Andreas Raab
andreas.raab at gmx.de
Sun Jun 10 21:14:13 UTC 2007
Hi Janko -
Just as a comment from the sidelines I think that concentrating on the
size of the character in an encoding is a mistake. It is really the
encoding that matters and if it weren't impractical I would rename
ByteString to Latin1String and WideString to UTF32String or so.
This makes it much clearer that we are interested more in the encoding
than the number of bytes per character (although of course some
encodings imply a character size) and this "encoding-driven" view of
strings makes it perfectly natural to think of a UTF8String which has a
variable-sized encoding and can live in perfect harmony with the other
"byte encoded strings".
In your case, I would rather suggest having a class UTF16String instead
of TwoByteString. A good starting point (if you are planning to spend
any time on this) would be to create a class EncodedString which
captures the basics of conversions between differently encoded strings
and start defining a few (trivial) subclasses like those mentioned above.
From there, you could extend this to UTF-8, UTF-16 and whatever other
encoding you need.
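
To make the "encoding-driven" view concrete, here is a minimal sketch of
such a hierarchy. It is written in Python rather than Smalltalk purely for
illustration, and every class and method name in it (EncodedString,
from_text, to_text, convert_to) is hypothetical; it is not existing Squeak
code. The point is only that each subclass names an encoding, and that
conversion between any two of them can go through code points.

```python
class EncodedString:
    """Hypothetical base class: raw bytes plus the encoding that interprets them."""

    encoding = None  # Python codec name; each subclass sets its own

    def __init__(self, data: bytes):
        self.data = bytes(data)

    @classmethod
    def from_text(cls, text: str) -> "EncodedString":
        # Raises UnicodeEncodeError if this encoding cannot represent the text.
        return cls(text.encode(cls.encoding))

    def to_text(self) -> str:
        return self.data.decode(self.encoding)

    def convert_to(self, target: type) -> "EncodedString":
        # Conversion goes through code points, so any pair of subclasses works.
        return target.from_text(self.to_text())

    def __len__(self) -> int:
        # Length in characters (code points), not in bytes.
        return len(self.to_text())


class Latin1String(EncodedString):
    encoding = "latin-1"      # always one byte per character


class UTF32String(EncodedString):
    encoding = "utf-32-be"    # always four bytes per character


class UTF16String(EncodedString):
    encoding = "utf-16-be"    # two or four bytes per character


class UTF8String(EncodedString):
    encoding = "utf-8"        # one to four bytes per character
```

With this sketch, Latin1String.from_text("café").convert_to(UTF8String)
yields a five-byte string that still reports a length of four characters,
while Latin1String.from_text("Mivšek") fails because Latin-1 has no "š".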
Cheers,
- Andreas
Janko Mivšek wrote:
> Janko Mivšek wrote:
>> I would propose a hybrid solution: three subclasses of String:
>>
>> 1. ByteString for ASCII (native English speakers)
>> 2. TwoByteString for most of other languages
>> 3. FourByteString(WideString) for Japanese/Chinese/and others
>
> Let me be more exact about that proposal:
>
> This is for internal representation only, for interfacing to external
> world we need to convert to/from (at least) UTF-8 representation.
>
> 1. ByteString for ASCII (English) and ISO-8859-1 (West Europe).
> ByteString is therefore always regarded as encoded in the ISO-8859-1
> code page, which matches the first 256 Unicode code points (1).
>
> 2. TwoByteString for East European Latin, Greek, Cyrillic and many more
> (the so-called Basic Multilingual Plane (2)). Encoding of that string
> would correspond to UCS-2, even though it is considered obsolete (3)
>
> 3. FourByteString for Chinese/Japanese/Korean and some others. Encoding
> of that string would therefore correspond to UCS-4/UTF-32 (4)
>
>
> I think that this way we can achieve the most space-efficient yet fast
> support for all languages in the world. Because of their fixed length,
> those strings are also easy to manipulate, unlike variable-length UTF-8 ones.
>
> Conversion to/from UTF-8 could probably also be simpler with the help of
> bit arithmetic algorithms, which would be tailored differently for each
> of the three proposed string subclasses above.
>
>
> (1) Wikipedia Unicode: Storage, transfer, and processing
> http://en.wikipedia.org/wiki/Unicode#Storage.2C_transfer.2C_and_processing
> (2) Wikipedia Basic Multilingual Plane
> http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
> (3) Wikipedia UTF-16/UCS-2:
> http://en.wikipedia.org/wiki/UCS-2
> (4) Wikipedia UTF-32/UCS-4
> http://en.wikipedia.org/wiki/UTF-32/UCS-4
>
> Best regards
> Janko
>
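
The bit arithmetic mentioned in the quoted proposal for UTF-8 conversion
can be sketched as follows. This is an illustration in Python, not code
from the thread, and the function names are hypothetical. It encodes one
code point at a time with shifts and masks; the branch taken depends only
on the magnitude of the code point.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using shifts and masks."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # up to 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_encode(code_points) -> bytes:
    """Encode an iterable of code points as one UTF-8 byte string."""
    return b"".join(utf8_encode_code_point(cp) for cp in code_points)
```

This also shows where the per-class tailoring would come from: a one-byte
(ISO-8859-1) string never holds a code point above 0xFF, so it only needs
the first two branches; a two-byte (UCS-2) string needs the first three;
only the four-byte string needs all four.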