UTF8 Squeak

Tue Jun 12 06:29:24 UTC 2007

On Jun 11, 2007, at 1:31 PM, Janko Mivšek wrote:

>> Is that a fair characterization of your position?
> Yes, or just a bit better said: my position is a separation of  
> internal string representation from encodings. Internal strings  
> should be in pure Unicode while conversions to other encodings  
> should be done separately, probably best with already existing  
> TextEncoders. Those text encoders can be extended to meet wider  
> requirements, but strings shall stay strings - they shall contain  
> characters only.

Well, this is progress, of a sort. Taken literally, what you write
above would imply that Strings should be arrays of pointers to
Character objects. Your actual proposal is to store strings as
ISO 8859-1, UCS-2 or UCS-4. That's a reasonable optimization to save
space, so long as the semantics of strings are preserved: other
objects can't tell what the internal representation is, because all
they see are characters.
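
To make the encapsulation point concrete, here is a minimal sketch in
Python rather than Smalltalk. The class name FixedWidthString and its
methods are hypothetical, not anything in Squeak or VisualWorks; the
sketch only assumes that storage is picked from the widest code point
in the string.

```python
class FixedWidthString:
    """Hypothetical string that hides its fixed-width storage from callers."""

    def __init__(self, text):
        widest = max(map(ord, text), default=0)
        # 1 byte per character covers ISO 8859-1, 2 covers UCS-2, 4 covers UCS-4.
        self._width = 1 if widest <= 0xFF else 2 if widest <= 0xFFFF else 4
        self._data = b"".join(ord(ch).to_bytes(self._width, "big") for ch in text)

    def __len__(self):
        # Length is counted in characters, never in bytes.
        return len(self._data) // self._width

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self._width
        return chr(int.from_bytes(self._data[start:start + self._width], "big"))

    def bytes_per_character(self):
        # Exposed only so the sketch can be inspected; clients should not need it.
        return self._width
```

Callers index and measure the string in characters, so whether a given
instance uses one, two or four bytes per character stays invisible to
them.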

But if encapsulation works for fixed length encodings, why not for  
UTF-8 or UTF-16?
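
As a rough illustration of the same idea with a variable length
encoding, here is a minimal Python sketch. Utf8String is a hypothetical
name, not an existing Squeak class, and the linear scan on every index
is a simplification I'm assuming for brevity, not a recommended
implementation.

```python
class Utf8String:
    """Hypothetical string stored as UTF-8 that still presents only characters."""

    def __init__(self, text):
        self._data = text.encode("utf-8")

    def _starts(self):
        # Continuation bytes have the form 0b10xxxxxx; every other byte
        # begins a new character.
        return [i for i, b in enumerate(self._data) if b & 0xC0 != 0x80]

    def __len__(self):
        # Length in characters, not in bytes.
        return len(self._starts())

    def __getitem__(self, index):
        starts = self._starts()
        if not 0 <= index < len(starts):
            raise IndexError(index)
        begin = starts[index]
        end = starts[index + 1] if index + 1 < len(starts) else len(self._data)
        return self._data[begin:end].decode("utf-8")

    def __iter__(self):
        # Sequential access, as a stream would do it, decodes in one pass.
        return iter(self._data.decode("utf-8"))
```

Indexing costs a scan in this sketch, while sequential iteration does
not, which is why the question of whether clients mostly index or
mostly stream matters to the design.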

> By the way, I'm a web developer too, and porting Aida to Squeak
> actually started my interest in Unicode support here :)

Yeah, I was wondering about that. Does Aida do a whole lot of work on  
string buffers or something? Doesn't it use streams? Why are you so  
dead set against variable length encodings?

One other thing: you seem to be advocating that Squeak just adopt the  
same design that VisualWorks uses. VisualWorks is great, but it does  
have immediate Characters, which Squeak does not. That changes the  
design constraints a bit.

Colin