UTF8 Squeak

Tue Jun 12 23:31:21 UTC 2007

On Jun 12, 2007, at 4:28 AM, Bert Freudenberg wrote:

> On Jun 12, 2007, at 8:29 , Colin Putney wrote:
>
>
>> Your proposal is actually to have strings encoded as ISO 8859-1,  
>> UCS-2 or UCS-4.
>>
>
> Actually, the idea is that a String has Unicode throughout, with no  
> encoding. A string is simply a flat array of Unicode code points.
>
> To optimize space usage we choose the lowest number of bytes per  
> character that can encompass all code points in a String. This is  
> implemented as specialized subclasses of String. So for code points  
> below 256 we use ByteString (8 bits per char), for all others  
> WideString (32 bits per char). This is purely space optimization,  
> not a change in encoding.
>

Yes, I understand how m17n was implemented in Squeak. I'm trying to  
challenge one of the ideas that underlies Janko's proposal, which you  
lay out beautifully above: "String has Unicode throughout, with no  
encoding." And again at the end: "This is purely space optimization,  
not a change in encoding."

If a String were a flat array of Unicode code points, it would be  
implemented in Smalltalk as an array of Characters, wouldn't it? The  
fact that you've chosen to hide the internal representation of the  
string and use a "variable byte" or "variable word" subclass to store  
bytes, rather than objects, is an indication that the strings *are*  
encoded. In fact, the encodings have names: ISO 8859-1 and UCS-4.  
Janko is proposing to add a string class that internally stores  
strings encoded in UCS-2 to the mix.
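
To make that concrete, here is a rough sketch in Python rather than  
Squeak. The function name pack_code_points is made up for  
illustration, and the big-endian word order is an assumption, not a  
claim about how the VM lays out a WideString. It packs code points  
into the narrowest fixed-width unit that fits and compares the result  
with the named encodings:

```python
def pack_code_points(text):
    """Store each code point in the narrowest fixed-width unit that fits all of them.

    Hypothetical helper: mimics the ByteString / WideString split described
    above, using 1 byte per character when every code point is below 256 and
    4 bytes per character (big-endian, by assumption) otherwise.
    """
    code_points = [ord(ch) for ch in text]
    if all(cp < 256 for cp in code_points):
        # "ByteString" case: one byte per code point
        return bytes(code_points)
    # "WideString" case: one 32-bit word per code point
    return b"".join(cp.to_bytes(4, "big") for cp in code_points)


narrow = "caf\u00e9"              # all code points below 256
wide = "caf\u00e9 \u4e16\u754c"   # contains code points above 255

# The "unencoded" byte layout is byte-for-byte ISO 8859-1 ...
assert pack_code_points(narrow) == narrow.encode("latin-1")
# ... and the wide layout is byte-for-byte UCS-4 (UTF-32, big-endian here).
assert pack_code_points(wide) == wide.encode("utf-32-be")
```

Under those assumptions, the stored bytes can't be told apart from  
the output of an ISO 8859-1 or UCS-4 encoder.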

So what's so holy about these particular encodings, besides the fact  
that they're especially efficient on the VisualWorks VM?

Colin