yoshiki at squeakland.org
Fri Jun 8 05:58:01 UTC 2007
> Well, perhaps UTF-32 would be a better default, now that I think about
> it--due to performance issues for accessing characters at an index. But
> using 32-bit-wide or 16-bit-wide strings internally as the only option would
> be a waste of memory in many situations, especially for the "Latin-1"
We do switch between different bit-width representations (8
and 32) in Squeak whenever it is necessary or favorable.
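The switching rule can be sketched in a few lines (Python used purely for illustration; the function name is made up, not Squeak's):

```python
def pick_width(code_points):
    """Choose the narrowest per-character width (in bits) that can hold
    every code point -- mirroring Squeak's switch between 8-bit and
    32-bit string representations."""
    return 8 if all(cp < 256 for cp in code_points) else 32

# Pure Latin-1 text fits in 8-bit elements; text with, say, kanji
# forces the wide representation.
```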
> Having String instances that use specified encodings enables one to avoid
> doing conversions unless and until it's needed. It also makes it easy to
> deal with the data as it will actually exist when persisted, or when
> transported over the network. And it makes it easier to handle the host
> plaform's native character encodings (there may be more than one,) or the
> character encodings used by external libraries or applications that either
> offer callpoints to, or consume callpoints from, a Squeak process. It also
> documents the encoding used by each String.
Nothing prevents you from using a String as if it were, say, a
ByteArray. For example, you can pass a String or a ByteArray to a
socket primitive to fill it, and you can keep the bits in it as you like.
However, Smalltalk is not just about holding data; when it comes to
displaying Strings, concatenating them, comparing them, and so on,
you do have to have a canonical form.
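The need for a canonical form is easy to demonstrate (illustrative Python, not Squeak code): the same text stored in two encodings compares unequal as raw bits, and only becomes comparable once both sides are decoded to a common form.

```python
# The same text "e-acute" in two byte encodings: the raw bits differ...
latin1 = "\u00e9".encode("latin-1")   # one byte
utf8 = "\u00e9".encode("utf-8")       # two bytes
assert latin1 != utf8

# ...but decoding each to the canonical (Unicode) form makes
# comparison, concatenation, etc. meaningful again.
assert latin1.decode("latin-1") == utf8.decode("utf-8")
```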
> If all Strings use UTF-32, and are only converted to other encodings by the
> VM, how does one write Smalltalk code to convert text from one character
> encoding to another? I'd rather not make character encodings yet another
> bit of magic that only the VM can do.
Hmm. Of course you can convert encodings in memory. In Squeak,
there are a bunch of subclasses of TextConverter. Did anybody
mention or suggest that the conversion has to be VM magic?
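What a TextConverter does is plain image-level code; the equivalent in Python (for illustration only, the function name is made up) is just decode-then-encode through the canonical form:

```python
def convert(data: bytes, from_enc: str, to_enc: str) -> bytes:
    """In-memory transcoding, analogous to what Squeak's TextConverter
    subclasses do in the image: decode the bytes to the canonical
    (Unicode) form, then re-encode in the target encoding."""
    return data.decode(from_enc).encode(to_enc)
```

No VM involvement is needed; it is ordinary library code.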
> It is already the case that accessing individual characters from a String
> results in the reification of a Character object. So, leveraging what is
> already the case, convervsion to/from the internal encoding to the canonical
> (Unicode) encoding should occur when a Character object is reified from an
> encoded character in a String (or in a Stream.) Character objects that are
> "put:" into a String would be converted from the Unicode code point to the
> encoding native to that String. Using Character reification to/from Unicode
> as the unification mechanism provides the illusion that all Strings use the
> same code points for their characters, even though they in fact do
You criticized an approach nobody advocated as "magic" above, but
what you wrote here really is magic. I have a feeling that such a
system would be very hard to debug.
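To make the discussion concrete, the quoted scheme could be sketched roughly like this (Python for illustration; the class and method names are made up, and only single-byte encodings are handled -- variable-width encodings like UTF-8 make the index arithmetic much messier):

```python
class EncodedString:
    """Sketch of the quoted proposal: the bytes stay in a declared
    native encoding; indexing reifies a canonical (Unicode) character,
    and storing one re-encodes it into the native encoding."""

    def __init__(self, data: bytes, encoding: str):
        self.data = bytearray(data)
        self.encoding = encoding

    def at(self, i: int) -> str:
        # Smalltalk's at: (1-based); decode one element to Unicode.
        # Only correct for single-byte encodings such as Latin-1.
        return self.data[i - 1:i].decode(self.encoding)

    def at_put(self, i: int, ch: str) -> None:
        # Smalltalk's at:put:; re-encode the Unicode character into
        # this String's native encoding before storing.
        self.data[i - 1:i] = ch.encode(self.encoding)
```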
BTW, what would you do with Symbols?
> Of course, for some encodings (such as UTF-8) there would probably be a
> performance penalty for accessing characters at an arbitrary index ("aString
> at: n.") But there may be good ways to mitigate that, using clever
> implementation tricks (caveat: I haven't actually tried it.) However, with
> my proposal, one is free to use UTF-16 for all Strings, or UTF-32 for all
> Strings, or ASCII for all Strings--based on one's space and performance
> constraints, and based on the character repertoire one needs for one's user
> base. And the conversion to UTF-16 or UTF-32 (or whatever) can be done when
> the String is read from an external Stream (using the VW stream decorator
> approach, for example.)
I *do* see some upsides to this approach, actually, but the
downsides are overwhelmingly bigger if you think of Smalltalk as a
self-contained system. Handling keyboard input alone would make the
system really complex.
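For what it's worth, the index-access penalty the quoted text concedes for UTF-8 is easy to see (an illustrative Python sketch, not Squeak code): `aString at: n` becomes a linear scan, because each character's byte width is only known from its lead byte.

```python
def utf8_char_at(buf: bytes, n: int) -> str:
    """Character access at a 1-based index in UTF-8 bytes: O(n),
    because every preceding character's width must be scanned."""
    def width(lead: int) -> int:
        # UTF-8 lead byte determines the sequence length (1-4 bytes).
        return 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4

    i = 0
    for _ in range(n - 1):
        i += width(buf[i])
    return buf[i:i + width(buf[i])].decode("utf-8")
```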
IIUC, Matsumoto-san's (Matz) m17n idea for Ruby is sort of along
this line. I don't think that is a good approach, but it is slightly
more acceptable in Ruby, because Ruby is not a whole system.
BTW, current Squeak allows you to do this. Within the 32-bit
quantity, the first several bits denote the "language"; you can make
up a special language and store the code point in a different encoding.
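The bit layout can be sketched like this (Python for illustration; the 22-bit code-point field matches Squeak's m17n character layout as I understand it, but treat the exact split as an assumption):

```python
CODE_POINT_BITS = 22  # assumed split: upper bits = language tag
CODE_POINT_MASK = (1 << CODE_POINT_BITS) - 1

def pack(language: int, code_point: int) -> int:
    """Pack a language tag and a code point into one 32-bit value."""
    return (language << CODE_POINT_BITS) | (code_point & CODE_POINT_MASK)

def language_of(value: int) -> int:
    return value >> CODE_POINT_BITS

def code_point_of(value: int) -> int:
    return value & CODE_POINT_MASK
```

A made-up "language" tag would then mark characters whose low bits hold code points from some other encoding.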
> The ASCII encoding would be good for the mutlitude of legacy applications
> that are English-only. ISO 8859-1 would be best for post-1980s/pre-UTFx
> legacy applications that have to deal with non-English languages, or have to
> deal with either HTML or pre-Vista Windows. UTF-x would be best for most
> other situations.
Is this your observation? Where do legacy applications in Japanese
fit? Why is HTML associated with Latin-1? What is special about
Vista Windows? This doesn't make much sense to me.
One approach I might try in a "new system" would be:
- the bits of the raw string representation are in UTF-8, but it is not
- you always do stuff through an equivalent of Text, which carries
enough attributes for the bits.
- maybe remove the Character object. A "character" is just a short Text.
For the ASCII part, it could be a special case; i.e., a naked
byte can have implicit text attributes by default.
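That last idea might look something like this (Python for illustration; all names here are hypothetical, not an existing design):

```python
class Text:
    """Sketch: the raw bits are UTF-8 bytes, attributes ride
    alongside them, and a 'character' is just a Text of length one."""

    def __init__(self, data: bytes, attributes=None):
        self.data = data                    # raw UTF-8 bits
        self.attributes = attributes or {}  # e.g. font, language

    def char_at(self, n: int) -> "Text":
        # There is no separate Character object: indexing yields
        # a short Text carrying the same attributes.
        s = self.data.decode("utf-8")
        return Text(s[n - 1].encode("utf-8"), self.attributes)
```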