UTF8 Squeak

Sat Jun 9 19:17:57 UTC 2007

Philippe Marschall wrote:
>> I got bitten by these encodings problems and having a nice solution
>> would be good.
> Well, there is what the evil language with J does: UCS2 everywhere, no
> excuses. This is a bit awkward for characters outside the BMP (which
> are more rare than unicorns) but IIRC the astral planes didn't exist
> when it was created. So you could argue for UCS4. Yes, it's twice the
> size, but who really cares? If you could get rid of all the size hacks
> in Squeak that were cool in the 70s, would you?

All of us who use the image as a database care about space efficiency,
but on the other hand we want all normal string operations to run on
Unicode strings too. That's why a UTF-8 encoded string is not
appropriate, even though it is the most space efficient: string
operations on it are not fast enough.
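
To illustrate the speed concern, here is a minimal Python sketch (not
Squeak code) of indexed access on UTF-8 bytes. The function names
utf8_seq_len and utf8_char_at are made up for this example. Because
characters take one to four bytes, finding the n-th character means
walking over all the characters before it, while a fixed-width string
can jump straight to it.

```python
def utf8_seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte: %#x" % lead)


def utf8_char_at(data: bytes, index: int) -> str:
    """Return the character at 0-based `index`; cost grows with `index`."""
    pos = 0
    for _ in range(index):
        # Skip one whole character at a time; no way to jump directly.
        pos += utf8_seq_len(data[pos])
    return data[pos:pos + utf8_seq_len(data[pos])].decode("utf-8")


encoded = "čšž abc".encode("utf-8")
third = utf8_char_at(encoded, 2)   # has to step over č and š first
```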

I would propose a hybrid solution: three subclasses of String:

1. ByteString for ASCII (native English speakers)
2. TwoByteString for most other languages
3. FourByteString (WideString) for Japanese, Chinese and others

And even for the second group, plain ASCII suffices in many cases for
short strings. For Slovenian I would say for 80% of short strings (we
have only čšžČŠŽ as non-ASCII chars). I think most of Latin-script
Europe is in a similar situation.

Conversion between strings should be automatic, as it is with numbers.
You start with an ASCII-only ByteString, and when you first encounter a
character >255 you convert to TwoByteString, and later to FourByteString
when a character no longer fits in two bytes.
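
The automatic conversion could look roughly like the following Python
sketch (again not Squeak code). WideningString and its methods are
hypothetical names invented here, and the sketch assumes that a one-byte
string holds code points up to 255 and a two-byte string up to 65535. A
single object changes its element width in place, standing in for the
switch between the three proposed classes.

```python
class WideningString:
    """Hypothetical fixed-width string that widens itself on demand."""

    def __init__(self, text=""):
        self.width = 1              # bytes per character: 1, 2 or 4
        self.data = bytearray()
        for ch in text:
            self.append(ch)

    @staticmethod
    def _width_for(code_point):
        # Assumed limits: one byte up to 255, two bytes up to 65535.
        if code_point <= 0xFF:
            return 1
        if code_point <= 0xFFFF:
            return 2
        return 4

    def _code_at(self, index):
        start = index * self.width
        return int.from_bytes(self.data[start:start + self.width], "little")

    def _widen(self, new_width):
        # Re-encode every stored character at the new, larger width.
        code_points = [self._code_at(i) for i in range(len(self))]
        self.width = new_width
        self.data = bytearray()
        for cp in code_points:
            self.data += cp.to_bytes(new_width, "little")

    def __len__(self):
        return len(self.data) // self.width

    def __getitem__(self, index):
        # Constant-time access, whatever the current width is.
        if not 0 <= index < len(self):
            raise IndexError(index)
        return chr(self._code_at(index))

    def append(self, ch):
        needed = self._width_for(ord(ch))
        if needed > self.width:
            self._widen(needed)
        self.data += ord(ch).to_bytes(self.width, "little")


s = WideningString("abc")   # stays at one byte per character
s.append("č")               # first character above 255: widens to two
s.append("😀")              # outside the BMP: widens to four
```

The string pays for the wider representation only after the first
character that needs it, which is the same idea as small integers
turning into large ones on overflow.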

Best regards
Janko

>> Stef
>>
>> On 9 juin 07, at 00:02, Colin Putney wrote:
>>
>> >
>> > On Jun 7, 2007, at 11:55 PM, Andreas Raab wrote:
>> >
>> >> How about trying to improve the speed of conversions? You seem to
>> >> imply that this is the major issue here, so if the conversions
>> >> were blindingly fast (which I think they easily could by writing
>> >> one or two primitives) this should improve matters.
>> >
>> > The conversions could be made faster, yes. But consider this: the
>> > life-cycle of a string in a web app is very often something like this:
>> >
>> > - comes in over HTTP
>> > - lives in the image for a while, maybe persisted in some way
>> > - gets sent back out over HTTP many times
>> >
>> > Even if the conversion *is* blindingly fast, it's still better to
>> > leave it as UTF-8 the whole time, not only to remove the overhead
>> > of decoding and reencoding, but also to avoid storing WideStrings
>> > in the image for long periods of time. Also, consider that building
>> > html pages mainly involves writing lots of short strings to
>> > streams, which sometimes include non-ASCII characters. If they can
>> > be pre-encoded it's another space and time win. On the other hand,
>> > the traditional drawback to UTF-8, random access to characters,
>> > doesn't come up much with generating web pages, though of course a
>> > web app may do this kind of thing as part of its domain functionality.
>> >
>> > I don't claim that all strings should always be UTF-8, but having a
>> > UTF8String class would be an excellent thing.
>> >
>> > Colin
>> >
>> >
>>
>>
>>
> 
> 

-- 
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si