UTF8 Squeak

Thu Jun 7 20:30:50 UTC 2007

  It is so true that I should've looked at the class names in VW
before doing everything...

> 1. internally everything is in 16bit Unicode, without any additionally
>     encoding info attached to strings

  If they use 16 bits per char, how do they deal with surrogate pairs?
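
  To make the question concrete, here is a small sketch in Python (not
Smalltalk, and not anything taken from VW or Squeak) showing that a
character outside the Basic Multilingual Plane needs two 16-bit units in
UTF-16.  The helper name utf16_units is made up for illustration.

```python
def utf16_units(ch: str) -> list[int]:
    """Return the 16-bit code units that UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# U+3042 (HIRAGANA LETTER A) is inside the BMP: one 16-bit unit.
bmp = utf16_units("\u3042")

# U+20000 (a CJK Extension B ideograph) is outside the BMP: two units,
# a high surrogate followed by a low surrogate.
astral = utf16_units("\U00020000")

print([hex(u) for u in bmp])     # ['0x3042']
print([hex(u) for u in astral])  # ['0xd840', '0xdc00']
```

  So a string class with fixed 16-bit elements either stores such a
character as two elements (and indexing no longer counts characters) or
cannot represent it at all.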

> 2. there is a class ByteString for pure ASCII(1) and TwoByteString for
>     Unicode strings. Conversion from Byte to TwoByteString is automatic
>     when you concatenate two mixed-width strings.

  This is what Squeak does with ByteString and WideString.

> 3. streams: external streams(2) are always dealing with
>     encodings, internal streams never

  In Squeak, to convert to or from a file, use MultiByteFileStream.  For
memory-based strings, use MultiByteBinaryOrTextStream.  Or, you can
manually create an instance of TextConverter and write some logic to
pass chars from one stream to the other.
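
  The "manual" route has roughly the following shape.  This is only an
analogy written in Python, with Python's incremental decoder standing in
for a TextConverter; it is not the Squeak API, and the function name
copy_decoded and its parameters are made up for illustration.

```python
import codecs
import io

def copy_decoded(src, dst, encoding="utf-8", chunk_size=4):
    """Read raw bytes from src, decode them, and write characters to dst."""
    # The incremental decoder keeps state between calls, so a multi-byte
    # character split across two chunks is still decoded correctly.
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(decoder.decode(chunk))
    # Flush; raises an error if the input ended in the middle of a character.
    dst.write(decoder.decode(b"", final=True))

# "External" side: encoded bytes.  "Internal" side: plain characters.
src = io.BytesIO("Squeak \u3042".encode("utf-8"))
dst = io.StringIO()
copy_decoded(src, dst)
result = dst.getvalue()
```

  The small chunk size is deliberate: it forces the three-byte UTF-8
sequence for U+3042 to straddle two reads, which is the case a converter
has to handle when it sits between two streams.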

> (1) Strings have actually subclasses for 8 bit encodings like
>      ISO8859L1String etc. but this seems not used much recently

  So, as in Squeak, having only ByteString and WideString (with a
common abstract superclass) is better^^;

> (2) with help of an EncodedStream as a wrapper of original stream. And
>      it is helped by StreamEncoders, which actually do en/decoding.
>      There is quite a number of them, from Base64StreamEncoder to for us
>      more interesting UTF8StreamEncoder.

  As I wrote, you can write these variations of Streams yourself
quite easily.  I admit that there is no framework for it.

> I find VW approach very simple and elegant and I think Squeak can solve 
> Unicode easily by following VW as an example a bit.

  Thank you for summarizing it!

-- Yoshiki