UTF8 Squeak

Janko Mivšek janko.mivsek at eranova.si
Thu Jun 7 20:16:35 UTC 2007


Since I come from the VisualWorks world, let me explain a bit how
Unicode support is handled there:

1. internally everything is in 16-bit Unicode, without any additional
    encoding info attached to strings
2. there is a class ByteString for pure ASCII(1) and TwoByteString for
    Unicode strings. Conversion from Byte to TwoByteString is automatic
    when you concatenate two mixed-width strings.
3. streams: external streams(2) always deal with encodings,
    internal streams never do

(1) Strings actually have subclasses for 8-bit encodings, like
     ISO8859L1String etc., but these do not seem to be used much recently
(2) with the help of an EncodedStream as a wrapper around the original
     stream. It is in turn helped by StreamEncoders, which actually do
     the en/decoding. There are quite a number of them, from
     Base64StreamEncoder to the UTF8StreamEncoder, which is more
     interesting for us.
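
To make points 2 and 3 a bit more concrete, here is a rough sketch in
Python rather than Smalltalk. It is not VisualWorks code: the names
narrowest_width and EncodedStream are only illustrative, the one-byte
threshold of 0xFF is an assumption, and code points above 0xFFFF are not
modelled. It only shows the general shape of the idea: strings hold plain
characters, a concatenation ends up as wide as its widest part, and the
encoding appears only where bytes leave or enter through an external
stream.

```python
import io


def narrowest_width(text: str) -> int:
    """Bytes per character needed for text: 1 if every code point fits
    in a byte, otherwise 2 (illustrative stand-in for a byte string vs.
    a two-byte string; the 0xFF threshold is an assumption)."""
    top = max(map(ord, text), default=0)
    return 1 if top <= 0xFF else 2


class EncodedStream:
    """Hypothetical wrapper around a byte stream: callers read and write
    characters, and only the wrapper knows about the encoding."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = raw
        self.encoding = encoding

    def write(self, text: str) -> None:
        # Encode on the way out to the external stream.
        self.raw.write(text.encode(self.encoding))

    def read_all(self) -> str:
        # Decode on the way in from the external stream.
        return self.raw.read().decode(self.encoding)


# Mixed-width concatenation: the result needs the wider representation.
ascii_part = "Janko Miv"
wide_part = "\u0161ek"  # starts with U+0161, which does not fit in a byte
joined = ascii_part + wide_part

# The in-memory string carries no encoding; UTF-8 shows up only here.
external = io.BytesIO()
EncodedStream(external).write(joined)
encoded = external.getvalue()

external.seek(0)
decoded = EncodedStream(external).read_all()
```

The point of the design is that nothing between the two stream calls
needs to know which encoding the outside world uses.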

I find the VW approach very simple and elegant, and I think Squeak can
solve Unicode easily by following VW's example.

Best regards
Janko

Alan Lovejoy wrote:
> Each String object should specify its encoding scheme.  UTF-8 should be the
> default, but all commonly-encountered encodings should be supported, and
> should all be useable at once (in different String instances.) When a
> Character is reified from a String, it should use the Unicode code point
> values (full 32-bit value.)  Ideally, the encoding of a String should be a
> function of an associated Strategy object, and not be based on having
> different subclasses of String.

-- 
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


