UTF8 Squeak

Janko Mivšek janko.mivsek at eranova.si
Mon Jun 11 11:04:53 UTC 2007

Hi Colin,

Colin Putney wrote:
> On Jun 10, 2007, at 3:55 AM, Janko Mivšek wrote:
>> I think that this way we can achieve the most efficient yet fast support 
>> for all languages in the world. Because of their fixed length, those 
>> strings are also easy to manipulate, in contrast to variable-length 
>> UTF-8 ones.
> "Most efficient yet fast" is a matter of perspective. For the apps I 
> work on, UTF-8 is better than your scheme because space efficiency is 
> more important than random access, and time spent encoding and decoding 
> UTF-8 would dwarf time spent scanning for random access.

Anyone can certainly stay with UTF-8-encoded strings in a plain ByteString, 
or subclass it as UTF8String themselves. But I don't see why we need to 
have UTF8String as part of the string framework. Just for the sake of 
meaning? By that logic we would also need to introduce an ASCIIString :)
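
To make the random-access trade-off concrete, here is a sketch in Python rather than Smalltalk (the helper name `utf8_char_at` is mine, not from any framework discussed here). Indexing the n-th character of UTF-8 bytes needs a linear scan, because continuation bytes (pattern 0b10xxxxxx) must be skipped while counting; a fixed-width string indexes in constant time.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the index-th character of UTF-8 data by linear scan."""
    count = -1
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:          # lead byte: start of a new character
            count += 1
            if count == index:
                j = i + 1             # extend over this character's continuation bytes
                while j < len(data) and data[j] & 0xC0 == 0x80:
                    j += 1
                return data[i:j].decode('utf-8')
    raise IndexError(index)

# 'š' is two bytes in UTF-8, so byte index and character index diverge:
assert utf8_char_at('Mivšek'.encode('utf-8'), 3) == 'š'
```

For short strings the scan is cheap; the cost only shows up with long strings and heavy random access, which is exactly the "depends on your application" point.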

> As soon as you try to support more than 256 characters, there are 
> trade-offs to be made. The "ideal" solution depends on your application. 
> How important is memory efficiency vs. space efficiency? How about 
> stream processing vs random access? What format is your input and 
> output? Which characters do you need to support, and how many of them 
> are there?
> A good string library will be flexible enough to allow its users to 
> make those trade-offs according to the needs of the application.

I think that preserving simplicity is also an important goal. We need to 
find a general yet simple solution for Unicode strings, one that is good 
enough for most uses, as is the case with numbers, for instance. More 
special cases can be dealt with separately. I claim that pure Unicode 
strings in ByteString, TwoByteString or FourByteString provide such 
general support; UTF8String is already a specialized one.
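
The three-class scheme picks the narrowest fixed width that holds every code point in the string. A sketch of that selection rule in Python (the function name is mine; the class names come from this thread):

```python
def bytes_per_character(s: str) -> int:
    """Width of the fixed-width class needed for s:
    1 -> ByteString, 2 -> TwoByteString, 4 -> FourByteString."""
    highest = max(map(ord, s), default=0)
    if highest < 0x100:       # Latin-1 range fits in one byte
        return 1
    if highest < 0x10000:     # Basic Multilingual Plane fits in two
        return 2
    return 4                  # supplementary planes need four

assert bytes_per_character('hello') == 1
assert bytes_per_character('Mivšek') == 2   # 'š' is U+0161
assert bytes_per_character('𝄞') == 4        # U+1D11E, beyond the BMP
```

Every string then has O(1) indexing, at the price of widening the whole string when a single high code point appears.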

>> Conversion to/from UTF-8 could probably also be made simpler with the 
>> help of bit-arithmetic algorithms, tailored differently for each of the 
>> three proposed string subclasses above.
> Yes, a couple of well designed primitives would help quite a bit.

I have studied UTF-8 conversion, and it is designed to be efficient, 
almost as fast as a plain copy. I have already written those conversion 
methods and am now preparing benchmarks. If conversion really is as fast 
as a copy, then there aren't many arguments left against always 
converting to the inner Unicode representation by default, are there?
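
The benchmark being prepared is not shown in the thread; a minimal sketch of the comparison in Python, assuming a Latin-1-range string where UTF-8 decoding degenerates to a near-copy:

```python
import timeit

latin = 'x' * 10_000                 # every character is a single UTF-8 byte
encoded = latin.encode('utf-8')

copy_time = timeit.timeit(lambda: bytes(encoded), number=1_000)
decode_time = timeit.timeit(lambda: encoded.decode('utf-8'), number=1_000)

print(f'plain copy:    {copy_time:.4f}s')
print(f'UTF-8 decode:  {decode_time:.4f}s')
```

A fair version of this test would also cover two-byte and four-byte code points, where the decoder's bit arithmetic actually does work.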

Best regards

Janko Mivšek
Smalltalk Web Application Server

More information about the Squeak-dev mailing list