UTF8 Squeak

Alan Lovejoy squeak-dev.sourcery at forum-mail.net
Fri Jun 8 03:16:21 UTC 2007



<Alan L>
> Each String object should specify its encoding scheme.  UTF-8 should
> be the default, but all commonly-encountered encodings should be
> supported, and should all be usable at once (in different String
> instances.) When a Character is reified from a String, it should use
> the Unicode code point values (full 32-bit value.)  Ideally, the
> encoding of a String should be a function of an associated Strategy
> object, and not be based on having different subclasses of String.
</Alan L>

<Yoshiki>Is this better than using UTF-32 throughout the image for all Strings?
One reason would be that for some chars in domestic encodings, the
round-trip conversion is not exactly guaranteed; so you can avoid that
problem in this way.  But other than that, encodings only matter when the
system is interfacing with the outside world.  So, the internal
representation can be uniform, I think.

  Would you write all comparison methods for all combinations of
different encodings?
</Yoshiki>

Well, perhaps UTF-32 would be a better default, now that I think about
it--due to the performance cost of accessing characters at an arbitrary
index. But making 32-bit-wide or 16-bit-wide strings the only internal
option would waste memory in many situations, especially for the "Latin-1"
languages.

Having String instances that use specified encodings enables one to avoid
doing conversions unless and until they're needed. It also makes it easy to
deal with the data as it will actually exist when persisted, or when
transported over the network. And it makes it easier to handle the host
platform's native character encodings (there may be more than one), or the
character encodings used by external libraries or applications that either
offer callpoints to, or consume callpoints from, a Squeak process. It also
documents the encoding used by each String.
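
To make that concrete, here is a minimal sketch of the shape I have in
mind (all class and selector names below are illustrative, not an actual
implementation):

    Object subclass: #EncodedString
        instanceVariableNames: 'bytes encoding'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Collections-Text'

    EncodedString class >> fromBytes: aByteArray encoding: anEncoding
        "Answer a string whose bytes are interpreted per anEncoding,
        a Strategy object (e.g. UTF8Encoding, Latin1Encoding)."
        ^self new setBytes: aByteArray encoding: anEncoding

Two instances with different encodings can then coexist in one image:

    a := EncodedString fromBytes: utf8Bytes encoding: UTF8Encoding new.
    b := EncodedString fromBytes: latin1Bytes encoding: Latin1Encoding new.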

If all Strings use UTF-32, and are only converted to other encodings by the
VM, how does one write Smalltalk code to convert text from one character
encoding to another?  I'd rather not make character encodings yet another
bit of magic that only the VM can do.
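
Whereas with per-instance encodings, conversion is just an ordinary
message written in plain Smalltalk--something like this (again, the
selectors are only illustrative):

    asEncoding: anEncoding
        "Answer a copy of the receiver re-encoded per anEncoding.
        Each character round-trips through its Unicode code point."
        | out |
        out := WriteStream on: ByteArray new.
        1 to: self size do:
            [:i | anEncoding encodeCodePoint: (self at: i) value on: out].
        ^EncodedString fromBytes: out contents encoding: anEncoding

    utf8Name := latin1Name asEncoding: UTF8Encoding new.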

It is already the case that accessing individual characters from a String
results in the reification of a Character object.  So, leveraging what is
already the case, conversion between the internal encoding and the canonical
(Unicode) encoding should occur when a Character object is reified from an
encoded character in a String (or in a Stream.)  Character objects that are
"put:" into a String would be converted from the Unicode code point to the
encoding native to that String.  Using Character reification to/from Unicode
as the unification mechanism provides the illusion that all Strings use the
same code points for their characters, even though they in fact do not.
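
In sketch form (glossing over variable-width encodings, where at:put: may
change the receiver's size; the encoding-strategy selectors here are
hypothetical):

    at: index
        "Reify a Character: decode the encoded character at index
        into its Unicode code point."
        ^Character value: (encoding decodeCharacterAt: index in: bytes)

    at: index put: aCharacter
        "Translate the Character's Unicode code point into the
        receiver's native encoding before storing it."
        encoding encodeCharacter: aCharacter value at: index in: bytes.
        ^aCharacter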

Of course, for some encodings (such as UTF-8) there would probably be a
performance penalty for accessing characters at an arbitrary index ("aString
at: n.") But there may be good ways to mitigate that, using clever
implementation tricks; one possibility is sketched below (caveat: I haven't
actually tried it.)  However, with my proposal, one is free to use UTF-16
for all Strings, or UTF-32 for all Strings, or ASCII for all Strings--based
on one's space and performance constraints, and based on the character
repertoire one needs for one's user base.  And the conversion to UTF-16 or
UTF-32 (or whatever) can be done when the String is read from an external
Stream (using the VW stream decorator approach, for example.)
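
For example, a UTF-8 string could cache the (character index, byte offset)
pair from its most recent access, so the common sequential-access pattern
costs O(1) per character instead of a rescan from the start (untried, as
noted; the instance variables and helper selectors are hypothetical):

    at: charIndex
        "Resume scanning from the cached position when possible,
        instead of rescanning from the start of the bytes."
        charIndex < cachedCharIndex ifTrue:
            [cachedCharIndex := 1.  cachedByteOffset := 1].
        [cachedCharIndex < charIndex] whileTrue:
            [cachedByteOffset := cachedByteOffset
                + (self byteCountOfCharacterAt: cachedByteOffset).
            cachedCharIndex := cachedCharIndex + 1].
        ^self decodeCharacterAtByteOffset: cachedByteOffset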

The ASCII encoding would be good for the multitude of legacy applications
that are English-only. ISO 8859-1 would be best for post-1980s/pre-UTF-x
legacy applications that have to deal with non-English languages, or have to
deal with either HTML or pre-Vista Windows. UTF-x would be best for most
other situations.


--Alan