UTF8 Squeak

Sat Jun 9 19:24:59 UTC 2007

2007/6/9, Janko Mivšek <janko.mivsek at eranova.si>:
> Philippe Marschall wrote:
> >> I got bitten by these encodings problems and having a nice solution
> >> would be good.
> > Well, there is what the evil language with J does: UCS-2 everywhere, no
> > excuses. This is a bit awkward for characters outside the BMP (which
> > are rarer than unicorns), but IIRC the astral planes didn't exist
> > when it was created. So you could argue for UCS-4. Yes, it's twice the
> > size, but who really cares? If you could get rid of all the size hacks
> > in Squeak that were cool in the '70s, would you?
>
> All of us who use the image as a database care about space efficiency, but
> on the other hand we want all normal string operations to run on Unicode
> strings too.

The image is not an efficient database. It stores all kinds of "crap"
like Morphs. And it sucks as a database (ACID transactions, anyone?).
Don't even get me started on migration (like the Squeak Chronology
classes).

Philippe

> That's why a UTF-8 encoded string is not appropriate, even though it is
> the most space efficient: string operations on it are not fast enough.
>
> I would propose a hybrid solution: three subclasses of String:
>
> 1. ByteString for ASCII (native English speakers)
> 2. TwoByteString for most other languages
> 3. FourByteString (WideString) for Japanese/Chinese and others
>
> Even for the second group, plain ASCII is enough for short strings in
> many cases. For Slovenian I would say for 80% of short strings (we have
> only čšžČŠŽ as non-ASCII chars). I think most of Latin Europe is in a
> similar situation.
>
> Conversion between strings should be automatic, as it is with numbers. You
> start with an ASCII-only ByteString, and when you first encounter a
> character > 255 you convert to TwoByteString, and then to FourByteString.
>
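
The automatic promotion described above can be sketched in a few lines.
This is only an illustration in Python, not Squeak code: the class name
AdaptiveString and the helper width_for are hypothetical, and the sketch
assumes the one-byte form holds code points 0 to 255 and the two-byte form
holds the rest of the BMP.

```python
def width_for(code_point: int) -> int:
    """Smallest fixed width (1, 2, or 4 bytes) that can hold one code point."""
    if code_point <= 0xFF:
        return 1
    if code_point <= 0xFFFF:
        return 2
    return 4


class AdaptiveString:
    """Hypothetical string that widens its storage only when it has to."""

    def __init__(self, text: str = "") -> None:
        self.width = 1          # start as the narrow "ByteString" form
        self.codes = []         # code points; a real version would pack them
        for ch in text:
            self.append(ch)

    def append(self, ch: str) -> None:
        # Promote on the first character that does not fit, much like
        # SmallInteger arithmetic overflowing into a large integer.
        self.width = max(self.width, width_for(ord(ch)))
        self.codes.append(ord(ch))

    def storage_bytes(self) -> int:
        """Bytes needed if every character is stored at the current width."""
        return self.width * len(self.codes)
```

With these assumptions, AdaptiveString("caj") stays at width 1, while
AdaptiveString("čaj") is promoted to width 2 as soon as the č is seen, and a
single character outside the BMP forces width 4 for the whole string.
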
> Best regards
> Janko
>
>
>
> >> Stef
> >>
> >> On 9 June 07, at 00:02, Colin Putney wrote:
> >>
> >> >
> >> > On Jun 7, 2007, at 11:55 PM, Andreas Raab wrote:
> >> >
> >> >> How about trying to improve the speed of conversions? You seem to
> >> >> imply that this is the major issue here, so if the conversions
> >> >> were blindingly fast (which I think they easily could be by writing
> >> >> one or two primitives) this should improve matters.
> >> >
> >> > The conversions could be made faster, yes. But consider this: the
> >> > life-cycle of a string in a web app is very often something like this:
> >> >
> >> > - comes in over HTTP
> >> > - lives in the image for a while, maybe persisted in some way
> >> > - gets sent back out over HTTP many times
> >> >
> >> > Even if the conversion *is* blindingly fast, it's still better to
> >> > leave it as UTF-8 the whole time, not only to remove the overhead
> >> > of decoding and reencoding, but also to avoid storing WideStrings
> >> > in the image for long periods of time. Also, consider that building
> >> > html pages mainly involves writing lots of short strings to
> >> > streams, which sometimes include non-ASCII characters. If they can
> >> > be pre-encoded it's another space and time win. On the other hand,
> >> > the traditional drawback to UTF-8, random access to characters,
> >> > doesn't come up much with generating web pages, though of course a
> >> > web app may do this kind of thing as part of its domain functionality.
> >> >
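
The random-access drawback mentioned above can be made concrete with a small
sketch. It is written in Python rather than Smalltalk, and the function name
utf8_char_at is hypothetical. It assumes the input is valid UTF-8 and finds
the n-th character by scanning from the start, because characters occupy a
variable number of bytes.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the character at code-point position `index` (0-based).

    Cost is linear in the length of `data`: every byte before the target
    has to be inspected to find where characters begin.
    """
    if index < 0:
        raise IndexError(index)
    seen = -1       # position of the most recent character start
    start = None    # byte offset where the wanted character begins
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:      # next character begins: slice ends
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")  # wanted character was the last one
```

By contrast, the web-app life cycle in the list above never needs this
lookup: bytes that arrive as UTF-8 can be stored and written back out
unchanged, with no decoding step at all.
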
> >> > I don't claim that all strings should always be UTF-8, but having a
> >> > UTF8String class would be an excellent thing.
> >> >
> >> > Colin
> >> >
> >> >
> >>
> >>
> >>
> >
> >
>
> --
> Janko Mivšek
> AIDA/Web
> Smalltalk Web Application Server
> http://www.aidaweb.si
>
>