UTF8 Squeak

Yoshiki Ohshima yoshiki at squeakland.org
Fri Jun 8 05:58:01 UTC 2007


  Alan,

> Well, perhaps UTF-32 would be a better default, now that I think about
> it--due to performance issues for accessing characters at an index. But
> using 32-bit-wide or 16-bit-wide strings internally as the only option would
> be a waste of memory in many situations, especially for the "Latin-1"
> languages.

  We do switch in Squeak between different bit-width representations
(8 and 32 bits per character) whenever necessary or favorable.
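
  For example (a rough sketch; the exact fallback code differs between
image versions, but storing a wide character into a ByteString widens
it in place):

    | s |
    s := 'abc' copy.
    s class.                                   "ByteString: one byte per character"
    s at: 1 put: (Character value: 16r3042).   "Hiragana A; at:put: converts the receiver"
    s class.                                   "WideString: four bytes per character"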

> Having String instances that use specified encodings enables one to avoid
> doing conversions unless and until it's needed. It also makes it easy to
> deal with the data as it will actually exist when persisted, or when
> transported over the network. And it makes it easier to handle the host
> platform's native character encodings (there may be more than one), or the
> character encodings used by external libraries or applications that either
> offer callpoints to, or consume callpoints from, a Squeak process. It also
> documents the encoding used by each String.

  Nothing prevents you from using a String as if it were, say, a
ByteArray.  For example, you can pass a String or a ByteArray to a
socket primitive to fill it, and you can keep the bits in it as you
like.
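
  For instance (a sketch written from memory, so double-check the
selector names):

    | socket buffer bytesRead crlf |
    crlf := String with: Character cr with: Character lf.
    socket := Socket newTCP.
    socket connectTo: (NetNameResolver addressForName: 'example.com') port: 80.
    socket sendData: 'GET / HTTP/1.0', crlf, crlf.
    buffer := String new: 4096.
    bytesRead := socket receiveDataInto: buffer.
    "buffer now holds whatever raw octets came off the wire; nothing
     has tried to interpret them yet"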

  However, Smalltalk is not just about holding data; once it comes to
displaying Strings, concatenating them, comparing them, and so on,
you do have to have a canonical form.

> If all Strings use UTF-32, and are only converted to other encodings by the
> VM, how does one write Smalltalk code to convert text from one character
> encoding to another?  I'd rather not make character encodings yet another
> bit of magic that only the VM can do.

  Hmm.  Of course you can convert encodings in memory.  In Squeak,
there are a bunch of subclasses of TextConverter for exactly this.
Did anybody mention or suggest that the conversion has to be VM
magic?
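
  For example, going to UTF-8 bytes and back entirely in the image (a
sketch; UTF8TextConverter is one of those subclasses):

    | converter stream bytes |
    converter := UTF8TextConverter new.
    stream := WriteStream on: String new.
    converter nextPut: (Character value: 16r3042) toStream: stream.
    bytes := stream contents.                    "three octets: E3 81 82"
    converter nextFromStream: bytes readStream.  "and back to the wide character"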

> It is already the case that accessing individual characters from a String
> results in the reification of a Character object.  So, leveraging what is
> already the case, conversion to/from the internal encoding to the canonical
> (Unicode) encoding should occur when a Character object is reified from an
> encoded character in a String (or in a Stream.)  Character objects that are
> "put:" into a String would be converted from the Unicode code point to the
> encoding native to that String.  Using Character reification to/from Unicode
> as the unification mechanism provides the illusion that all Strings use the
> same code points for their characters, even though they in fact do
> not.

  Above, you criticized an approach that nobody advocated as "magic",
but what you wrote here really is magic.  I have a feeling that such
a system would be very hard to debug.
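
  Just to make sure I understand the proposal, it would look roughly
like this (purely hypothetical code, all names made up):

    EncodedString >> at: index
        "reify a canonical Unicode Character out of my own encoding"
        ^ Character value: (self encoding decodedCodePointAt: index of: self)

    EncodedString >> at: index put: aCharacter
        "translate the Unicode code point back into my own encoding"
        self encoding encodeCodePoint: aCharacter asInteger at: index of: self.
        ^ aCharacter

  Every at: and at:put: would cross an encoding boundary behind the
user's back.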

  BTW, what would you do with Symbols?

> Of course, for some encodings (such as UTF-8) there would probably be a
> performance penalty for accessing characters at an arbitrary index ("aString
> at: n.") But there may be good ways to mitigate that, using clever
> implementation tricks (caveat: I haven't actually tried it.)  However, with
> my proposal, one is free to use UTF-16 for all Strings, or UTF-32 for all
> Strings, or ASCII for all Strings--based on one's space and performance
> constraints, and based on the character repertoire one needs for one's user
> base.  And the conversion to UTF-16 or UTF-32 (or whatever) can be done when
> the String is read from an external Stream (using the VW stream decorator
> approach, for example.)

  I *do* see some upsides to this approach, actually, but the
downsides are overwhelmingly bigger if you consider that Smalltalk is
a self-contained system.  Handling keyboard input alone would make
the system really complex.

  IIUC, Matsumoto-san's (Matz) m17n idea for Ruby is sort of along
these lines.  I don't think it is a good approach, but it is slightly
more acceptable in Ruby, because Ruby is not a whole system.

  BTW, current Squeak already allows you to do this.  Within the
32-bit quantity, the first several bits denote the "language"; you
can make up a special language tag and store the code points in a
different encoding.
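
  For example (if I remember the selectors correctly):

    | c |
    c := Character leadingChar: JISX0208 leadingChar code: 16r2422.
    c leadingChar.   "the 'language' tag held in the upper bits of the 32-bit value"
    c charCode.      "the code point within that character set (here Hiragana A in JIS X 0208)"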

> The ASCII encoding would be good for the multitude of legacy applications
> that are English-only. ISO 8859-1 would be best for post-1980s/pre-UTFx
> legacy applications that have to deal with non-English languages, or have to
> deal with either HTML or pre-Vista Windows. UTF-x would be best for most
> other situations.

  Is this your observation?  Where do legacy applications in Japanese
fit?  Why is HTML associated with Latin-1?  What is special about
Windows Vista?  This doesn't make much sense to me.

  One approach I might try in a "new system" would be (a rough sketch
follows the list):

  - the bits of the raw string representation are in UTF-8, but they
    are not really displayable by themselves.
  - you always do stuff through an equivalent of Text, which carries
    enough attributes for the bits.
  - maybe remove the Character object.  A "character" is just a short
    Text.  For the ASCII part, it could be a special case; i.e., a
    naked byte can have implicit text attributes by default.
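
  A very rough sketch of what that might look like (hypothetical
names, of course):

    Object subclass: #RawText
        instanceVariableNames: 'utf8Bytes attributes'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'NewSystem-Text'

    RawText >> displayOn: aCanvas
        "only here do the UTF-8 bytes get interpreted, using the
        attributes (language, font, direction, ...) that travel with them"

    RawText >> , aRawText
        "concatenation combines both the byte sequences and the
        attribute runs"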

-- Yoshiki


