UTF8 Squeak

Colin Putney cputney at wiresong.ca
Sat Jun 9 18:54:02 UTC 2007


On Jun 9, 2007, at 12:24 AM, stephane ducasse wrote:

> Could you explain the difference between WideString and UTF-8
> (would UTF-8 be a specialized WideString?).

WideString is a fixed-length encoding: each character is 4 bytes
long. UTF-8 is a variable-length encoding: each character takes
between 1 and 4 bytes.

The problem with WideString is that it wastes memory. Most characters
in common use fit into 2 bytes, and every Unicode code point fits
into 3 bytes (a code point needs at most 21 bits).
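
As a rough illustration of the memory trade-off, here is a small
sketch in Python (not Squeak; chosen only for brevity). It assumes
that UTF-32 is a fair stand-in for WideString's 4-bytes-per-character
layout, and the helper name encoded_sizes is made up for this example.

```python
def encoded_sizes(text):
    """Return (fixed_width_bytes, utf8_bytes) for the given string.

    Assumption: UTF-32 (little-endian, no BOM) stands in for a
    4-bytes-per-character representation such as WideString.
    """
    fixed = len(text.encode("utf-32-le"))   # always 4 bytes per code point
    variable = len(text.encode("utf-8"))    # 1 to 4 bytes per code point
    return fixed, variable


if __name__ == "__main__":
    samples = {
        "ASCII": "hello",
        "Latin-1": "h\u00e9llo",           # e with acute accent: 2 bytes in UTF-8
        "CJK": "\u65e5\u672c\u8a9e",       # three BMP characters: 3 bytes each
        "Emoji": "\U0001F600",             # outside the BMP: 4 bytes in UTF-8
    }
    for label, text in samples.items():
        print(label, encoded_sizes(text))
```

For mostly-ASCII text, the kind a web application tends to push
around, the UTF-8 form is about a quarter of the size. For text
outside the Basic Multilingual Plane the two come out the same.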

The problem with UTF-8 is that it makes random access expensive.
Because characters vary in width, UTF8String>>at: would have to scan
linearly from the start of the string to find the character at a
given index.
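
To make the cost concrete, here is a minimal sketch of what such an
accessor has to do, again in Python rather than Squeak. The function
name utf8_char_at is hypothetical, the index is 0-based (Smalltalk's
at: is 1-based), and the input is assumed to be well-formed UTF-8.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the character at code-point offset `index` (0-based).

    Walks the buffer from the start, so the cost grows with `index`.
    Assumes `data` is well-formed UTF-8.
    """
    if index < 0:
        raise IndexError(index)
    pos = 0
    remaining = index
    while pos < len(data):
        lead = data[pos]
        # The lead byte tells us how many bytes this character occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("unexpected byte at offset %d" % pos)
        if remaining == 0:
            return data[pos:pos + width].decode("utf-8")
        remaining -= 1
        pos += width
    raise IndexError(index)


if __name__ == "__main__":
    data = "a\u00e9\u65e5\U0001F600z".encode("utf-8")
    print([utf8_char_at(data, i) for i in range(5)])
```

With a fixed-width representation the same lookup is a single
multiplication and a read. Here every call has to step over all the
characters before the one requested.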

> I got bitten by these encodings problems and having a nice solution  
> would be good.

I don't think there's a single solution that's good for all problems.
For the kind of web applications that I work on, UTF-8 is a clear
win. For other kinds of applications, WideString and maybe
TwoByteString are probably better.

Colin


