At Wed, 24 Sep 2008 10:49:18 +0200, Damien Pollet wrote:
Is there a reason (other than history) why Strings are not collections of unicode characters (at least as viewed from outside) rather than bytes in some unknown encoding (which should be encapsulated and only appear when text goes in and out the image) ? Or is it already like that ?
I think the answer is that it is already *like that*, although I can't tell what you mean by "from outside".
In the image, a ByteString or WideString is a sequence of characters that hold Unicode code points. (Note that a Unicode code point is 21-bit.) If all the code points in a string fit within 8 bits, we use ByteString; if they don't, it uses WideString, but the distinction is more or less hidden from a casual user. The conversion is only needed when the String interfaces with the outside of the image.
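To illustrate the rule (as an analogy in Python, not actual Squeak code): the decision between the two representations just asks whether any code point exceeds 8 bits. The `needs_wide_string` helper below is hypothetical, made up for this sketch.

```python
def needs_wide_string(s: str) -> bool:
    """Return True if any code point in s exceeds 8 bits (0xFF),
    i.e. the string would not fit in a ByteString-like representation."""
    return any(ord(c) > 0xFF for c in s)

print(needs_wide_string("déjà vu"))  # False: all code points <= 0xFF
print(needs_wide_string("日本語"))    # True: CJK code points need more than 8 bits
print(ord(chr(0x10FFFF)))            # 1114111, the 21-bit Unicode maximum
```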
A Unicode code point doesn't really correspond to the concept of a character, if you think of an accented character as a "character". The original concept of Unicode was that such a "character" should always be represented as a sequence of code points: one base character, and one or more accent marks. It was at least pure and fair.
But they got the "Latin-1 compatibility" idea around 1990, in a retrofitted way; so the original idea of "let us make a universal character set for everybody in the world" turned into "let us make a universal character set for everybody in the world, but let's treat Westerners nicer." Of course, this turn created the situation where a simple accented character has two (precomposed and decomposed) representations. Squeak is still way behind and prefers the precomposed "normalization", but the normalization is really lax there.
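The two representations can be seen concretely with Python's `unicodedata` (again just an illustration of the Unicode behavior, not of Squeak's implementation): the precomposed form is one code point, the decomposed form is a base letter plus a combining mark, and they compare unequal unless normalized.

```python
import unicodedata

# Precomposed: single code point U+00E9, LATIN SMALL LETTER E WITH ACUTE.
precomposed = "\u00e9"
# Decomposed: base letter 'e' plus U+0301 COMBINING ACUTE ACCENT,
# the "original" Unicode idea of base character + accent marks.
decomposed = "e\u0301"

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(len(precomposed), len(decomposed))                        # 1 2
```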
To me, Han unification is more evidence of the "Westerners first" idea. If tracing characters back to their origin is the principle, perhaps i and j should be unified as well (just kidding).
But Unicode is the standard now, and it does solve a lot of problems. So using it as the base, and putting the necessary supporting information around it, is a good approach in principle.
If so, one could argue that we could just hold every string in decomposed UTF-8 in the image, and have a couple of variants of at: and at:put:. The requirement of O(1) random access is not that important. I might go in that direction if I were redoing it now.
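A sketch of what such a code-point-indexed at: might cost on a UTF-8 buffer (Python again, purely illustrative; `code_point_at` is a made-up name): byte access stays O(1), but finding the n-th code point needs an O(n) scan past continuation bytes, which is the trade-off mentioned above.

```python
def code_point_at(utf8: bytes, index: int) -> str:
    """Return the index-th code point of a UTF-8 buffer by scanning.
    UTF-8 continuation bytes match the bit pattern 10xxxxxx."""
    seen = 0
    for pos, b in enumerate(utf8):
        if b & 0xC0 != 0x80:            # a lead byte starts a new code point
            if seen == index:
                end = pos + 1           # extend over any continuation bytes
                while end < len(utf8) and utf8[end] & 0xC0 == 0x80:
                    end += 1
                return utf8[pos:end].decode("utf-8")
            seen += 1
    raise IndexError(index)

buf = "héllo".encode("utf-8")   # 6 bytes, but only 5 code points
print(code_point_at(buf, 1))    # é  (code-point indexing, O(n))
print(buf[1])                   # 195 (raw byte indexing, O(1), not a character)
```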
-- Yoshiki