UTF8 Squeak
Yoshiki Ohshima
yoshiki at squeakland.org
Thu Jun 7 20:30:50 UTC 2007
It is so true that I should've looked at the class names in VW
before doing everything...
> 1. internally everything is in 16bit Unicode, without any additionally
> encoding info attached to strings
If they use 16 bits per char, how do they deal with surrogate pairs?
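
To make the surrogate pair question concrete, here is a minimal sketch. It is written in Python purely for illustration and is not from VW or Squeak; the helper name to_surrogates is made up. It shows that a character above U+FFFF needs two 16-bit units in UTF-16, so "one 16-bit slot per char" cannot hold it directly.

```python
# A code point above U+FFFF does not fit into a single 16-bit unit.
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

# Encode as big-endian UTF-16 and split the bytes into 16-bit units.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

def to_surrogates(code_point):
    """Hypothetical helper: split a code point >= 0x10000 into (high, low)."""
    v = code_point - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

print([hex(u) for u in units])                     # two units, not one
print([hex(u) for u in to_surrogates(ord(ch))])    # same pair, computed by hand
```

A string class that indexes by 16-bit unit would report a size of 2 for this single character, which is the situation the question is about.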
> 2. there is a class ByteString for pure ASCII(1) and TwoByteString for
> Unicode strings. Conversion from Byte to TwoByteString is automatic
> when you concatenate two mixed-width strings.
This is what Squeak does with ByteString and WideString.
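
As a rough model of that automatic widening, here is a sketch in Python (for illustration only; make_string and concat are hypothetical names, not Squeak or VW selectors). It assumes a narrow representation of one byte per character and a wide one backed by an unsigned int array, and it widens the result only when one of the operands is already wide.

```python
from array import array

def make_string(text):
    """Hypothetical: pick the narrowest storage that can hold every character."""
    points = [ord(c) for c in text]
    if all(p < 256 for p in points):
        return bytes(points)          # narrow: one byte per character
    return array("I", points)         # wide: unsigned int per character

def concat(a, b):
    """Hypothetical: stay narrow if both sides are narrow, otherwise widen."""
    if isinstance(a, bytes) and isinstance(b, bytes):
        return a + b
    return array("I", list(a) + list(b))

narrow = make_string("abc")
wide = make_string("\u3042")          # HIRAGANA LETTER A, needs more than 8 bits

print(type(concat(narrow, narrow)).__name__)
print(type(concat(narrow, wide)).__name__)
```

The caller never asks for a conversion; the result type follows from the operands.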
> 3. streams: external streams(2) are always dealing with
> encodings, internal streams never
In Squeak, to do conversion from/to a file, use MultiByteFileStream. For
memory-based strings, use MultiByteBinaryOrTextStream. Or, you can
manually create an instance of TextConverter and write some logic to
pass chars from/to streams.
> (1) Strings have actually subclasses for 8 bit encodings like
> ISO8859L1String etc. but this seems not used much recently
So, as in Squeak, having only ByteString and WideString (with a
common abstract superclass) is better^^;
> (2) with help of an EncodedStream as a wrapper of original stream. And
> it is helped by StreamEncoders, which actually do en/decoding.
> There is quite a number of them, from Base64StreamEncoder to for us
> more interesting UTF8StreamEncoder.
As I wrote, you can write these variations of streams by yourself
quite easily. I admit that there is no framework for it.
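
To give an idea of how small such a wrapper can be, here is a sketch in Python (for illustration only; the class EncodedStream below is a hypothetical stand-in and is not the VW class of the same name nor any Squeak class). It assumes the wrapped stream deals in bytes and that all encoding and decoding happens in the wrapper.

```python
import io

class EncodedStream:
    """Hypothetical wrapper: text on the caller's side, encoded bytes on the
    wrapped stream. The encoding name plays the role of the stream encoder."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = raw
        self.encoding = encoding

    def write(self, text):
        # Encode at the boundary; the inner stream only ever sees bytes.
        self.raw.write(text.encode(self.encoding))

    def read_all(self):
        # Decode at the boundary; the caller only ever sees text.
        return self.raw.read().decode(self.encoding)

raw = io.BytesIO()
stream = EncodedStream(raw, "utf-8")
stream.write("caf\u00e9")
print(raw.getvalue())      # encoded bytes held by the wrapped stream

raw.seek(0)
print(stream.read_all())   # decoded text handed back to the caller
```

Swapping the encoding argument is all it takes to get another variation, which is why a family of such wrappers is easy to write even without a framework.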
> I find VW approach very simple and elegant and I think Squeak can solve
> Unicode easily by following VW as an example a bit.
Thank you for summarizing it!
-- Yoshiki