UTF8 Squeak

Colin Putney cputney at wiresong.ca
Fri Jun 8 22:02:05 UTC 2007


On Jun 7, 2007, at 11:55 PM, Andreas Raab wrote:

> How about trying to improve the speed of conversions? You seem to
> imply that this is the major issue here, so if the conversions
> were blindingly fast (which I think they easily could be by writing
> one or two primitives) this should improve matters.

The conversions could be made faster, yes. But consider the typical
life cycle of a string in a web app:

- comes in over HTTP
- lives in the image for a while, maybe persisted in some way
- gets sent back out over HTTP many times

Even if the conversion *is* blindingly fast, it's still better to
leave the string as UTF-8 the whole time, not only to remove the
overhead of decoding and re-encoding, but also to avoid storing
WideStrings in the image for long periods of time. Also, consider
that building HTML pages mainly involves writing lots of short
strings, some of which include non-ASCII characters, to streams. If
those strings can be pre-encoded, it's another space and time win. On
the other hand, the traditional drawback of UTF-8, slow random access
to characters, doesn't come up much when generating web pages, though
of course a web app may do this kind of thing as part of its domain
functionality.
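
To make the space and time argument concrete, here is a minimal
sketch in Python rather than Smalltalk. It assumes a wide string
costs a fixed 4 bytes per character (an assumption about the
WideString layout), and the function names are hypothetical, chosen
only for illustration.

```python
import io


def stored_size_utf8(text: str) -> int:
    """Bytes needed to keep the text in its UTF-8 wire form."""
    return len(text.encode("utf-8"))


def stored_size_wide(text: str) -> int:
    """Bytes needed at a fixed 4 bytes per character (assumed wide layout)."""
    return len(text.encode("utf-32-le"))


def render_page(fragments: list[bytes]) -> bytes:
    """Write already-encoded fragments to a byte stream.

    No decode or encode step happens here: each fragment is copied as-is.
    """
    out = io.BytesIO()
    for fragment in fragments:
        out.write(fragment)
    return out.getvalue()


# A short string with two non-ASCII characters ("Gruesse" with u-umlaut
# and sharp s), encoded once and then reused for every response.
GREETING_TEXT = "Gr\u00fc\u00dfe"
GREETING_UTF8 = GREETING_TEXT.encode("utf-8")

page = render_page([b"<p>", GREETING_UTF8, b"</p>"])
```

The saving is largest for mostly-ASCII markup, where UTF-8 needs one
byte per character.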

I don't claim that all strings should always be UTF-8, but having a  
UTF8String class would be an excellent thing.
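
As a rough illustration of what such a class might look like, here is
a hypothetical sketch, again in Python rather than Smalltalk. The
class layout and method names (from_text, byte_size, char_at,
write_on) are assumptions made for this example, not an existing
Squeak API. It keeps the UTF-8 bytes as the stored form, writes them
to a stream without any conversion, and pays a linear scan for
character-indexed access.

```python
import io
from typing import BinaryIO


class UTF8String:
    """Hypothetical string class whose stored form is its UTF-8 bytes."""

    def __init__(self, data: bytes) -> None:
        # Validate once on the way in; raises UnicodeDecodeError if invalid.
        data.decode("utf-8")
        self._data = bytes(data)

    @classmethod
    def from_text(cls, text: str) -> "UTF8String":
        return cls(text.encode("utf-8"))

    def byte_size(self) -> int:
        return len(self._data)

    def __len__(self) -> int:
        # Count characters by skipping continuation bytes (10xxxxxx): O(n).
        return sum(1 for b in self._data if (b & 0xC0) != 0x80)

    def char_at(self, index: int) -> str:
        # Random access needs a scan from the start: O(n), not O(1).
        if index < 0:
            raise IndexError(index)
        count = -1
        start = None
        for pos, b in enumerate(self._data):
            if (b & 0xC0) != 0x80:  # first byte of a character
                count += 1
                if count == index:
                    start = pos
                elif count == index + 1:
                    return self._data[start:pos].decode("utf-8")
        if start is not None:
            # The requested character is the last one in the string.
            return self._data[start:].decode("utf-8")
        raise IndexError(index)

    def write_on(self, stream: BinaryIO) -> None:
        # The common web-app case: bytes go out exactly as stored.
        stream.write(self._data)


s = UTF8String.from_text("Gr\u00fc\u00dfe")
buf = io.BytesIO()
s.write_on(buf)
```

The trade-off is the one described above: write_on is a plain byte
copy, while char_at and len walk the bytes.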

Colin


