UTF8 Squeak
Colin Putney
cputney at wiresong.ca
Sat Jun 9 18:54:02 UTC 2007
On Jun 9, 2007, at 12:24 AM, stephane ducasse wrote:
> Could you explain the difference between WideString and UTF-8? (Would
> UTF-8 be a specialized WideString?)
WideString is a fixed-length encoding: each character is 4 bytes
long. UTF-8 is a variable-length encoding, where each character
takes 1 to 4 bytes.
The problem with WideString is that it wastes memory. Most characters
have code points that fit into 2 bytes, and every code point fits
into 3 bytes (21 bits).
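To make the memory trade-off concrete, here is a rough sketch in Python rather than Squeak. It uses the "utf-32-le" codec as a stand-in for a 4-bytes-per-character layout like WideString; that stand-in and the helper name sizes are assumptions for illustration, not Squeak's actual implementation.

```python
# Compare a fixed 4-bytes-per-character layout with UTF-8.
# "utf-32-le" is only a stand-in for a WideString-like layout (assumption).

def sizes(text):
    """Return (fixed4_bytes, utf8_bytes) for the given text."""
    fixed4 = len(text.encode("utf-32-le"))  # always 4 bytes per code point, no BOM
    utf8 = len(text.encode("utf-8"))        # 1 to 4 bytes per code point
    return fixed4, utf8

samples = {
    "ascii": "hello",
    "latin": "caf\u00e9",              # "cafe" with a precomposed e-acute
    "cjk": "\u65e5\u672c\u8a9e",       # three CJK characters
    "emoji": "\U0001F600",             # one code point outside the BMP
}

for name, text in samples.items():
    fixed4, utf8 = sizes(text)
    print(f"{name}: fixed4={fixed4} bytes, utf8={utf8} bytes")
```

For mostly-ASCII text the fixed layout costs about four times as much; the gap narrows for CJK text and disappears for characters outside the Basic Multilingual Plane.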
The problem with UTF-8 is that it makes random access expensive.
UTF8String>>at: would have to do a linear search through the string
to find the character at a given offset.
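As a sketch of why that lookup is linear, here is roughly what such a method has to do, written in Python rather than Squeak. The function name utf8_at is hypothetical, the index is 0-based (Smalltalk's at: is 1-based), and the input is assumed to be well-formed UTF-8.

```python
def utf8_at(data: bytes, index: int) -> str:
    """Return the character at a 0-based character index in UTF-8 bytes.

    Hypothetical helper: it walks the bytes from the start, so the cost
    grows with the index. Assumes data is well-formed UTF-8.
    """
    if index < 0:
        raise IndexError(index)
    pos = 0
    remaining = index
    while pos < len(data):
        lead = data[pos]
        # The lead byte tells us how many bytes this character occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if remaining == 0:
            return data[pos:pos + width].decode("utf-8")
        remaining -= 1
        pos += width  # skip the whole character, not just one byte
    raise IndexError(index)

data = "a\u00e9\u65e5\U0001F600z".encode("utf-8")
print(utf8_at(data, 4))
```

With a fixed-width layout the same lookup is a single multiplication and a slice, which is the trade-off against the memory cost.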
> I got bitten by these encodings problems and having a nice solution
> would be good.
I don't think there's a single solution that's good for all problems.
For the kind of web applications that I work on, UTF-8 is a clear
win. For other kinds of applications, WideString and maybe
TwoByteString are probably better.
Colin