UTF8 Squeak
Janko Mivšek
janko.mivsek at eranova.si
Sun Jun 10 10:55:20 UTC 2007
Janko Mivšek wrote:
> I would propose a hybrid solution: three subclasses of String:
>
> 1. ByteString for ASCII (native English speakers)
> 2. TwoByteString for most other languages
> 3. FourByteString (WideString) for Japanese/Chinese/and others
Let me be more precise about that proposal:
This concerns the internal representation only; to interface with the
outside world we need to convert to and from (at least) UTF-8.
1. ByteString for ASCII (English) and ISO-8859-1 (Western Europe).
ByteString is therefore always regarded as encoded in ISO-8859-1,
which maps one-to-one onto the first 256 Unicode code points
(Basic Latin plus Latin-1 Supplement) (1).
2. TwoByteString for East European Latin, Greek, Cyrillic and many more
(the so-called Basic Multilingual Plane (2)). The encoding of that string
would correspond to UCS-2, even though it is considered obsolete (3).
3. FourByteString for Chinese/Japanese/Korean and some others. Encoding
of that string would therefore correspond to UCS-4/UTF-32 (4)
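The choice among the three classes can be sketched as picking the
narrowest fixed width that holds every code point in the string. This is
only an illustration, written in Python for brevity rather than Smalltalk;
the function names bytes_per_char and to_fixed_width are hypothetical and
not part of any Squeak API.

```python
def bytes_per_char(text: str) -> int:
    """Return 1, 2 or 4: the narrowest fixed width that fits every code point."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1  # would map to ByteString (ISO-8859-1 range)
    if highest <= 0xFFFF:
        return 2  # would map to TwoByteString (UCS-2 range)
    return 4      # would map to FourByteString (UCS-4/UTF-32)


def to_fixed_width(text: str) -> bytes:
    """Store each code point as a big-endian integer of the chosen width."""
    width = bytes_per_char(text)
    return b"".join(ord(ch).to_bytes(width, "big") for ch in text)
```

For example, bytes_per_char("Janko") selects one byte per character, while
a string containing "š" (U+0161) needs two.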
I think that this way we can achieve the most space-efficient yet fast
support for all languages in the world. Because every character has a
fixed width, those strings are also easy to manipulate, in contrast to
variable-length UTF-8 ones.
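To illustrate the difference in manipulation: with a fixed width, the
character at a given index is found by one multiplication, while in UTF-8
the bytes must be scanned from the start. A rough Python sketch follows;
both function names are hypothetical, and big-endian storage is an
assumption made only for this example.

```python
def char_at_fixed(buf: bytes, width: int, index: int) -> str:
    """Fixed width: jump directly to the character's offset."""
    start = index * width
    return chr(int.from_bytes(buf[start:start + width], "big"))


def char_at_utf8(buf: bytes, index: int) -> str:
    """UTF-8: walk over the preceding characters one by one."""
    pos = 0
    for _ in range(index):
        pos += 1
        # Continuation bytes have the bit pattern 10xxxxxx; skip them.
        while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
            pos += 1
    end = pos + 1
    while end < len(buf) and (buf[end] & 0xC0) == 0x80:
        end += 1
    return buf[pos:end].decode("utf-8")
```

Neither function checks that the index is in range; that is left out to
keep the sketch short.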
Conversion to/from UTF-8 could probably also be made simpler with
bit-arithmetic algorithms, tailored differently for each of the three
proposed string subclasses above.
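As a sketch of the bit arithmetic involved, here is the standard UTF-8
encoding of a single code point, again in Python and with a hypothetical
function name. Under the proposal, a ByteString would only ever reach the
first two branches and a TwoByteString the first three, which is where the
per-class tailoring could come from; that split is an assumption of this
example, not a worked-out design.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using shifts and masks."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```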
(1) Wikipedia Unicode: Storage, transfer, and processing
http://en.wikipedia.org/wiki/Unicode#Storage.2C_transfer.2C_and_processing
(2) Wikipedia Basic Multilingual Plane
http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
(3) Wikipedia UTF-16/UCS-2:
http://en.wikipedia.org/wiki/UCS-2
(4) Wikipedia UTF-32/UCS-4
http://en.wikipedia.org/wiki/UTF-32/UCS-4
Best regards
Janko
--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si
More information about the Squeak-dev
mailing list