UTF8 Squeak

Bert Freudenberg bert at freudenbergs.de
Tue Jun 12 11:28:55 UTC 2007


On Jun 12, 2007, at 8:29 , Colin Putney wrote:

> Your proposal is actually to have strings encoded as ISO 8859-1,  
> UCS-2 or UCS-4.

Actually, the idea is that a String has Unicode throughout, with no  
encoding. A string is simply a flat array of Unicode code points.

To optimize space usage we choose the lowest number of bytes per  
character that can encompass all code points in a String. This is  
implemented as specialized subclasses of String. So for code points  
below 256 we use ByteString (8 bits per char), for all others  
WideString (32 bits per char). This is purely space optimization, not  
a change in encoding.
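
To make the width selection concrete, here is a minimal sketch in
Python rather than Smalltalk. The names storage_width, pack and unpack
are made up for illustration and are not Squeak selectors; the
big-endian byte order is likewise just an assumption of the sketch.

```python
def storage_width(text):
    """Bytes per character needed to hold every code point in text.

    Mirrors the two cases described above: 1 byte when all code points
    are below 256 (the ByteString case), otherwise 4 (the WideString case).
    """
    return 1 if all(ord(ch) < 256 for ch in text) else 4


def pack(text):
    """Return (width, data): a flat array with `width` bytes per code point."""
    width = storage_width(text)
    data = b"".join(ord(ch).to_bytes(width, "big") for ch in text)
    return width, data


def unpack(width, data):
    """Rebuild the string from a flat array produced by pack()."""
    return "".join(
        chr(int.from_bytes(data[i:i + width], "big"))
        for i in range(0, len(data), width)
    )
```

Whatever width is picked, unpack(*pack(s)) gives back the same
sequence of code points, which is the sense in which this is a space
optimization and not a change in encoding.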

Now, the proposal is to use an intermediate 2-byte representation for  
code points below 65536. Nobody has demonstrated the general  
usefulness of this optimization yet, in particular since the Squeak  
VM does not support 16-bit arrays directly; they would have to be  
emulated using 8-bit or 32-bit words. For the latter, prims 144  
and 145 might help, but the problem of odd lengths would have to  
be addressed.
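
The odd-length problem can be sketched as follows, again in Python
and with made-up names (pack16, at16). This only illustrates emulating
16-bit elements on top of 32-bit words; it does not model the VM
primitives, and putting the first element in the high half is an
arbitrary choice.

```python
def pack16(code_points):
    """Pack 16-bit values two per 32-bit word.

    Returns (words, length). The element count has to be kept
    separately, because an odd count leaves the low half of the last
    word as padding.
    """
    if any(not 0 <= cp < 65536 for cp in code_points):
        raise ValueError("value does not fit in 16 bits")
    words = []
    for i in range(0, len(code_points), 2):
        high = code_points[i]
        # The last word of an odd-length sequence is padded with zero.
        low = code_points[i + 1] if i + 1 < len(code_points) else 0
        words.append((high << 16) | low)
    return words, len(code_points)


def at16(words, length, index):
    """Read element `index` (0-based) from the packed words."""
    if not 0 <= index < length:
        raise IndexError(index)
    word = words[index // 2]
    return (word >> 16) & 0xFFFF if index % 2 == 0 else word & 0xFFFF
```

A 3-element and a 4-element sequence both occupy two words, so the
word count alone cannot tell them apart. That is why the sketch has to
carry the length next to the words.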

Also, the "purity" of Unicode strings does not translate directly  
into the implementation, which reserves the most significant byte in  
a WideString word for a "language code". That byte is otherwise  
unused (code points range from 0 to 16r10FFFF) and is supposed to  
help in choosing between glyph shapes that share a code point but  
differ in appearance depending on the language. I suppose this was  
done to restrict changes to the String hierarchy. A better place for  
language info would be text attributes, but then potentially a lot of  
code would have to be adapted to pass Texts rather than Strings. It  
might be worth revising that design.
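
As a rough illustration of what tagging a word with a language code
means, here is a Python sketch. It takes the description above
literally (top 8 bits for the language code, low 24 bits for the code
point); the exact bit layout in the image may differ, and the function
names and the language code values used with them are hypothetical.

```python
LANGUAGE_SHIFT = 24                        # assumption: language code in the top byte
CODE_POINT_MASK = (1 << LANGUAGE_SHIFT) - 1
MAX_CODE_POINT = 0x10FFFF                  # 16r10FFFF, highest Unicode code point


def tag(code_point, language_code):
    """Combine a code point and a language code into one 32-bit word."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point")
    if not 0 <= language_code <= 0xFF:
        raise ValueError("language code does not fit in one byte")
    return (language_code << LANGUAGE_SHIFT) | code_point


def code_point_of(word):
    """Strip the language code, leaving the plain code point."""
    return word & CODE_POINT_MASK


def language_of(word):
    """Extract the language code from the top byte."""
    return word >> LANGUAGE_SHIFT
```

Note that tag(0x76F4, 5) and tag(0x76F4, 6) are different words even
though they carry the same code point, so anything comparing raw words
has to mask first. That is one way the stored words stop being "pure"
code points.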

For dealing with encodings perhaps it would be useful to wrap a  
ByteArray with a codec into an EncodedString - that way encoded data  
could be passed from a webserver and back unmodified. #asString would  
use the codec to convert to a proper String, which might also be used  
for displaying that EncodedString. I'd not actually make it a String  
subclass, so perhaps a name other than EncodedString would be better.
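
A minimal sketch of such a wrapper, in Python. The class name
EncodedBytes and its attributes are placeholders, not an existing
Squeak class, and the codec is simply a Python codec name such as
"utf-8" or "latin-1".

```python
class EncodedBytes:
    """Raw bytes plus the codec needed to interpret them.

    Deliberately not a str subclass: the bytes are kept unmodified and
    only decoded on request.
    """

    def __init__(self, data, codec):
        self.data = bytes(data)    # original bytes, never re-encoded
        self.codec = codec         # e.g. "utf-8" or "latin-1"

    def as_string(self):
        """Decode to a proper string (the #asString idea above)."""
        return self.data.decode(self.codec)

    def __str__(self):
        # Displaying the wrapper goes through the decoded string.
        return self.as_string()
```

Data received from a webserver could be wrapped as
EncodedBytes(raw, "utf-8") and later written back out from its data
attribute byte for byte, with decoding happening only where a real
string is needed.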

My €/50 ...

- Bert -





