UTF8 Squeak

Sun Jun 10 16:54:55 UTC 2007

On Jun 10, 2007, at 3:55 AM, Janko Mivšek wrote:

> I think that this way we can achieve most efficient yet fast  
> support for all languages on that world. Because of fixed length  
> those strings are also easy to manipulate contrary to variable  
> length UTF-8 ones.

"Most efficient yet fast" is a matter of perspective. For the apps I  
work on, UTF-8 is better than your scheme because space efficiency is  
more important than random access, and time spent encoding and  
decoding UTF-8 would dwarf time spent scanning for random access.
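
To make "scanning for random access" concrete: in UTF-8, finding the Nth character means walking the bytes from the start and skipping continuation bytes. A minimal Python sketch follows. The function name utf8_char_offset is made up for illustration, it assumes well-formed UTF-8 input, and it is not Squeak code.

```python
def utf8_char_offset(data: bytes, index: int) -> int:
    """Return the byte offset of the character at position `index`.

    Assumes `data` is well-formed UTF-8. The scan is linear in the
    offset, which is the cost that fixed-width strings avoid.
    """
    if index < 0:
        raise IndexError("negative character index")
    seen = -1
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else
        # (ASCII or a lead byte) starts a new character.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                return offset
    raise IndexError("character index out of range")
```

For example, in the UTF-8 encoding of "Mivšek" the "š" takes two bytes, so the character at index 4 ("e") starts at byte offset 5.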

As soon as you try to support more than 256 characters, there are  
trade-offs to be made. The "ideal" solution depends on your  
application. How important is time efficiency vs. space efficiency?  
How about stream processing vs. random access? What format is your  
input and output? Which characters do you need to support, and how  
many of them are there?
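
As a rough illustration of the space side of that trade-off, the sketch below compares UTF-8 against a fixed-width layout. It assumes the fixed-width scheme picks the narrowest of 1, 2, or 4 bytes per character that fits the largest code point in the string. That is one reading of the proposal, not a confirmed design, and both function names are hypothetical.

```python
def utf8_size(text: str) -> int:
    """Bytes needed to store `text` as UTF-8."""
    return len(text.encode("utf-8"))


def fixed_width_size(text: str) -> int:
    """Bytes needed if every character uses the same width.

    Assumption: the width is the narrowest of 1, 2, or 4 bytes that
    can hold the largest code point in the string.
    """
    if not text:
        return 0
    largest = max(ord(ch) for ch in text)
    if largest < 0x100:
        width = 1
    elif largest < 0x10000:
        width = 2
    else:
        width = 4
    return width * len(text)
```

Which layout is smaller depends on the text. Mostly-ASCII text with a single character above U+00FF doubles in size under the fixed-width layout, while text made up entirely of CJK characters is smaller at two bytes per character than at three bytes per character in UTF-8.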

A good string library will be flexible enough to allow its users to  
make those trade-offs according to the needs of the application.

> Conversion to/from UTF-8 could probably also be simpler with help  
> of bit arithmetic algorithms, which would be tailored differently  
> for each of proposed three string subclasses above.

Yes, a couple of well-designed primitives would help quite a bit.
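
For reference, the bit arithmetic in question is small. The Python sketch below encodes one code point to UTF-8 using the standard bit layout. The name encode_code_point is hypothetical, and this only illustrates the kind of work such a primitive would do. It is not a proposal for the actual Squeak primitive.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 using shifts and masks."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

If the three string classes hold 1-, 2-, and 4-byte characters (an assumption about the proposal), each class only needs a prefix of these branches: a 1-byte string never produces more than two UTF-8 bytes per character, and a 2-byte string never more than three.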

Colin