[squeak-dev] how to create an UTF-8 character

Colin Putney cputney at wiresong.ca
Wed Sep 24 14:45:38 UTC 2008


On 24-Sep-08, at 2:26 AM, Yoshiki Ohshima wrote:

> At Wed, 24 Sep 2008 10:49:18 +0200,
> Damien Pollet wrote:
>>
>> Is there a reason (other than history) why Strings are not collections
>> of unicode characters (at least as viewed from outside) rather than
>> bytes in some unknown encoding (which should be encapsulated and only
>> appear when text goes in and out the image) ? Or is it already like
>> that ?
>
>  I think the answer is that it is already *like that*, although I
> can't tell what you mean by "from outside".

I think Damien's confusion comes from the fact that the abstractions  
are a bit leaky. For example, if you do something like this:

'ábc' convertToEncoding: 'utf-8'

the result is 'Ã¡bc'. It's a string where the internal, "encapsulated"
state is such that writing it to a socket or file will produce the  
desired bytes, but all in-image behavior is totally broken.
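
To make the leak concrete, here is a rough analogy in Python rather than Squeak. It is only a sketch of the idea, not Squeak's implementation: it assumes the "encoded string" behaves like the UTF-8 bytes reinterpreted one byte per character, which is modelled here by decoding as Latin-1. The variable names are illustrative.

```python
# Analogy only: model a "string holding UTF-8 bytes" by decoding the
# encoded bytes as Latin-1, so that each byte becomes one character.
original = "\u00e1bc"                                # 'ábc', 3 characters
leaky = original.encode("utf-8").decode("latin-1")   # 'Ã¡bc'

print(len(original))        # 3
print(len(leaky))           # 4: one "character" per UTF-8 byte
print(leaky == original)    # False: comparison no longer works
print(leaky[0])             # 'Ã' rather than 'á': indexing sees bytes

# Written out byte-for-byte, the leaky string still yields the right bytes.
print(leaky.encode("latin-1") == original.encode("utf-8"))  # True
```

The last line is the only thing that still works as intended, which is the sense in which the result is fine for output but broken for everything else.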

VisualWorks tends to do a better job of maintaining the abstractions,  
I think. The equivalent of the above example would produce a ByteArray.
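
For comparison, Python 3 happens to take the same approach of returning a distinct byte type from encoding. This is offered only as an illustration of the design, not as a description of VisualWorks itself:

```python
text = "\u00e1bc"                  # 'ábc'
encoded = text.encode("utf-8")     # a bytes object, not a str

print(type(encoded).__name__)      # bytes
print(encoded == text)             # False: bytes never compare equal to a str
print(list(encoded))               # [195, 161, 98, 99]
print(encoded.decode("utf-8") == text)  # True: the round trip is explicit
```

Because the encoded form has its own type, it cannot be mistaken for a sequence of characters.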

> If so, one could argue that we can just hold every string in
> decomposed UTF-8 in the image, and have a couple of variants of at:
> and at:put:.  The requirement of O(1) random access is not that
> important.  I might go that direction if I redo it now.

A UTF8String would be really handy for web applications, where strings  
come in from the net as UTF-8, live in the image for a while, then get  
sent out as UTF-8. O(1) random access isn't very useful, because  
strings are (mostly) uninterpreted, but converting to Squeak's  
internal representation is expensive.

The thing is, as long as the "sequence of characters" abstraction is  
maintained, it doesn't matter (for purposes of correct behavior) what  
the internal representation is. So it's perfectly reasonable to have  
multiple encodings with different performance profiles. UTF8String  
when you need it, WideString when that makes sense.
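
As a minimal sketch of that idea, again in Python rather than Squeak, the class below keeps its contents as UTF-8 bytes while presenting a sequence of characters. Everything here is hypothetical: the class is named after the UTF8String proposed above, its method names are invented for illustration, and it assumes the bytes it is given are already valid UTF-8.

```python
class UTF8String:
    """Hypothetical sketch: a character sequence stored as UTF-8 bytes."""

    def __init__(self, data: bytes):
        # Assumption: data is valid UTF-8; no conversion happens on input.
        self._data = data

    def to_bytes(self) -> bytes:
        # Output is free: the stored bytes are already UTF-8.
        return self._data

    def __iter__(self):
        # Walk the bytes, yielding one decoded character at a time.
        i, n = 0, len(self._data)
        while i < n:
            j = i + 1
            # Continuation bytes have the bit pattern 10xxxxxx.
            while j < n and (self._data[j] & 0xC0) == 0x80:
                j += 1
            yield self._data[i:j].decode("utf-8")
            i = j

    def __len__(self):
        # O(n): count the bytes that start a character.
        return sum(1 for b in self._data if (b & 0xC0) != 0x80)

    def __getitem__(self, index: int) -> str:
        # O(n) random access, traded for cheap input and output.
        if index < 0:
            index += len(self)
        for k, ch in enumerate(self):
            if k == index:
                return ch
        raise IndexError(index)

    def __eq__(self, other):
        if isinstance(other, UTF8String):
            return self._data == other._data
        if isinstance(other, str):
            return self._data == other.encode("utf-8")
        return NotImplemented


s = UTF8String(b"\xc3\xa1bc")      # bytes as they might arrive from the net
print(len(s))                      # 3
print(s[0] == "\u00e1")            # True: indexing yields characters
print(s == "\u00e1bc")             # True: behaves like the character string
print(s.to_bytes())                # b'\xc3\xa1bc'
```

Callers only ever see characters, so swapping this for a fixed-width representation would change the cost of each operation but not the results.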

Colin

