UTF8 Squeak
Colin Putney
cputney at wiresong.ca
Tue Jun 12 23:31:21 UTC 2007
On Jun 12, 2007, at 4:28 AM, Bert Freudenberg wrote:
> On Jun 12, 2007, at 8:29 , Colin Putney wrote:
>
>
>> Your proposal is actually to have strings encoded as ISO 8859-1,
>> UCS-2 or UCS-4.
>>
>
> Actually, the idea is that a String has Unicode throughout, with no
> encoding. A string is simply a flat array of Unicode code points.
>
> To optimize space usage we choose the lowest number of bytes per
> character that can encompass all code points in a String. This is
> implemented as specialized subclasses of String. So for code points
> below 256 we use ByteString (8 bit per char), for all others
> WideString (32 bits per char). This is purely space optimization,
> not a change in encoding.
>
Yes, I understand how m17n was implemented in Squeak. I'm trying to
challenge one of the ideas that underlies Janko's proposal, which you
lay out beautifully above: "String has Unicode throughout, with no
encoding." And again at the end: "This is purely space optimization,
not a change in encoding."

If a String were a flat array of Unicode code points, it would be
implemented in Smalltalk as an array of Characters, wouldn't it? The
fact that you've chosen to hide the internal representation of the
string and use a "variable byte" or "variable word" subclass to store
bytes, rather than objects, is an indication that the strings *are*
encoded. In fact, the encodings have names: ISO 8859-1 and UCS-4.

Janko is proposing to add a string class that internally stores
strings encoded in UCS-2 to the mix.
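
To make the point concrete, here is a rough sketch in Python rather
than Smalltalk. The two function names are made up for illustration,
and the big-endian word order is my assumption; this is not how Squeak
actually lays out ByteString or WideString in memory. It only shows
that "one byte per code point" and "one 32-bit word per code point"
produce the same bytes as the named encodings.

```python
# Hypothetical helpers for illustration only; not Squeak's implementation.

def byte_string_storage(text: str) -> bytes:
    """Store one byte per character; valid only for code points below 256."""
    code_points = [ord(ch) for ch in text]
    if any(cp > 0xFF for cp in code_points):
        raise ValueError("code point does not fit in 8 bits")
    return bytes(code_points)


def wide_string_storage(text: str) -> bytes:
    """Store one 32-bit word per character (big-endian assumed here)."""
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)


sample = "caf\u00e9"  # every code point is below 256

# The "unencoded" byte layout is byte-for-byte ISO 8859-1 ...
assert byte_string_storage(sample) == sample.encode("latin-1")
# ... and the 32-bit layout is byte-for-byte UCS-4 (UTF-32, big-endian).
assert wide_string_storage(sample) == sample.encode("utf-32-be")
```

Both assertions hold for any text whose code points fit the respective
width, which is all I mean by saying the encodings have names.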

So what's so holy about these particular encodings, besides the fact
that they're especially efficient on the VisualWorks VM?

Colin