UTF8 Squeak

Wed Jun 13 10:37:40 UTC 2007

On Jun 13, 2007, at 1:31 , Colin Putney wrote:

> On Jun 12, 2007, at 4:28 AM, Bert Freudenberg wrote:
>
>
>> On Jun 12, 2007, at 8:29 , Colin Putney wrote:
>>
>>
>>> Your proposal is actually to have strings encoded as ISO 8859-1,  
>>> UCS-2 or UCS-4.
>>>
>>
>> Actually, the idea is that a String has Unicode throughout, with  
>> no encoding. A string is simply a flat array of Unicode code points.
>>
>> To optimize space usage we choose the lowest number of bytes per  
>> character that can encompass all code points in a String. This is  
>> implemented as specialized subclasses of String. So for code  
>> points below 256 we use ByteString (8 bits per char), for all  
>> others WideString (32 bits per char). This is purely space  
>> optimization, not a change in encoding.
>>
>
> Yes, I understand how m17n was implemented in Squeak. I'm trying to  
> challenge one of the ideas that underlies Janko's proposal, which  
> you lay out beautifully above: "String has Unicode throughout, with  
> no encoding." And again at the end: "This is purely space  
> optimization, not a change in encoding."
>
> If a String were a flat array of Unicode code points, it would be  
> implemented in Smalltalk as an array of Characters wouldn't it?

If that were as efficient as the current implementation, it certainly  
would be. From the outside it does appear as an array of Characters.

> The fact that you've chosen to hide the internal representation of  
> the string and use a "variable byte" or "variable word" subclass to  
> store bytes, rather than objects, is an indication that the strings  
> *are* encoded.

I'd say the main rationale for this was optimization.

The implementation of Strings in Squeak has always been as a  
variableByteSubclass; the numerical value of each byte in the String  
is the Character's value. This means you could only have Characters  
with value 0 to 255 in a String. Now, to extend that range we have  
WideStrings, which are an extension as natural as extending the  
SmallInteger range by LargeIntegers. It still holds that the  
numerical value of each word in a WideString is identical to the  
Character's value at that position. There is no interpretation in the  
mapping between the internal representation and the external appearance.
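
As a rough illustration of the point above (not Squeak code), the following Python sketch stores code points in the narrowest array that holds them unchanged. The helper name `store` is hypothetical, and the `"I"` array type is assumed to be 32 bits wide, which is typical but not guaranteed by Python.

```python
from array import array

def store(code_points):
    """Pick the narrowest storage that holds every code point unchanged.

    Hypothetical helper for illustration; it only mimics the idea behind
    ByteString and WideString, not their actual implementation.
    """
    if all(cp < 256 for cp in code_points):
        return array("B", code_points)  # like ByteString: 8 bits per element
    return array("I", code_points)      # like WideString: typically 32 bits per element

narrow = store([ord(c) for c in "abc"])
wide = store([ord(c) for c in "ab\u20ac"])  # U+20AC EURO SIGN does not fit in 8 bits

# Each stored element is numerically equal to the character's code point;
# only the element width differs between the two containers.
assert list(narrow) == [97, 98, 99]
assert list(wide) == [97, 98, 0x20AC]
assert narrow.itemsize < wide.itemsize
```

Reading an element back requires no decoding step in either case, which is the sense in which the mapping involves no interpretation.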

> In fact, the encodings have names: ISO 8859-1 and UCS-4.

This is an unavoidable coincidence. We just do Unicode. The 8-bit  
subset of Unicode happens to coincide with ISO 8859-1. And UCS-4  
happens to be 32 bits.
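
The coincidence can be checked directly. This Python sketch uses only the standard `latin-1` and `utf-32-be` codecs; the function names are made up for the example.

```python
def latin1_matches_unicode():
    """True if every byte value decodes under ISO 8859-1 to the code point
    with the same number."""
    return all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))

def utf32_bytes_per_code_point(text):
    """Bytes used per code point when text is encoded as UTF-32 without a BOM."""
    return len(text.encode("utf-32-be")) // len(text)

assert latin1_matches_unicode()

# One ASCII letter, one character above 255, one character above 65535.
sample = "a\u20ac\U0001F600"
assert utf32_bytes_per_code_point(sample) == 4
```

So a byte array holding code points below 256 is byte-for-byte identical to the ISO 8859-1 encoding of the same text, even if nobody set out to "encode" anything.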

> Janko is proposing to add a string class that internally stores  
> strings encoded in UCS-2 to the mix.

That's one way to say it. The other way to say it is that we avoid  
storing unnecessary 0-bytes for Unicode characters whose code points  
are below 65536, the same as we already do for characters below 256.
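
To make the space argument concrete, here is a small Python sketch of a three-tier width choice (1, 2 or 4 bytes per character). The function names are hypothetical, and the sketch deliberately ignores questions such as how a 16-bit class would treat surrogate code points.

```python
def bytes_per_char(text):
    """Smallest of 1, 2 or 4 bytes that fits the largest code point in text."""
    largest = max(map(ord, text), default=0)
    if largest < 256:
        return 1
    if largest < 65536:
        return 2
    return 4

def storage_bytes(text):
    """Total bytes needed when every character uses the same chosen width."""
    return bytes_per_char(text) * len(text)

assert storage_bytes("hello") == 5            # all code points below 256
assert storage_bytes("h\u20acllo") == 10      # 20 bytes if stored at 32 bits each
assert storage_bytes("h\U0001F600llo") == 20  # one code point above 65535
```

With only the 8-bit and 32-bit classes, the middle example would cost 20 bytes; the 16-bit tier halves that without changing what any element means.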

> So what's so holy about these particular encodings, besides the  
> fact that they're especially efficient on the VisualWorks VM?

I have no idea how VW came into the discussion. For a discussion of  
why these "encodings" appear natural, see above.

So what are you proposing instead?

- Bert -