UTF8 Squeak

Philippe Marschall philippe.marschall at gmail.com
Sat Jun 9 15:04:45 UTC 2007


2007/6/9, stephane ducasse <stephane.ducasse at free.fr>:
> Colin
>
> Could you explain the difference between WideString and UTF-8 (would
> UTF-8 be a specialized WideString?).
The way I understand it, UTF8String would be a subclass of ByteString
and would probably have methods like #size, #first:, #last: and #at:
overridden so that they work in characters rather than bytes.
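
To make that concrete, here is a rough sketch of the idea, in Python
rather than Smalltalk so it can be run anywhere. The class Utf8String
and its methods byte_size, at and first are made up for illustration;
this is not an existing Squeak class, just my reading of what the
overrides would have to do. Indices are 1-based, as with #at:.

```python
class Utf8String:
    """Hypothetical sketch: keeps UTF-8 bytes as storage, answers in characters."""

    def __init__(self, data: bytes):
        data.decode("utf-8")  # validate only; raises UnicodeDecodeError if malformed
        self.data = data

    def byte_size(self) -> int:
        # What a plain ByteString-style #size would answer.
        return len(self.data)

    def __len__(self) -> int:
        # Overridden "size": count every byte that is not a
        # continuation byte (continuation bytes look like 10xxxxxx).
        return sum(1 for b in self.data if (b & 0xC0) != 0x80)

    def _byte_offset(self, char_index: int) -> int:
        # Byte offset at which the character with 0-based index
        # char_index starts. Linear scan: this is the cost of UTF-8.
        seen = 0
        for offset, byte in enumerate(self.data):
            if (byte & 0xC0) != 0x80:
                if seen == char_index:
                    return offset
                seen += 1
        if seen == char_index:
            return len(self.data)  # one past the last character
        raise IndexError(char_index)

    def at(self, index: int) -> str:
        # Overridden "at:", 1-based, O(n) instead of O(1).
        start = self._byte_offset(index - 1)
        end = self._byte_offset(index)
        return self.data[start:end].decode("utf-8")

    def first(self, n: int) -> "Utf8String":
        # Overridden "first:": the first n characters, not the first n bytes.
        return Utf8String(self.data[: self._byte_offset(n)])
```

The point of the sketch is that size and indexed access stop being
constant-time, because every access has to scan for character
boundaries.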

> I got bitten by these encodings problems and having a nice solution
> would be good.
Well, there is what the evil language with J does: UCS2 everywhere, no
excuses. This is a bit awkward for characters outside the BMP (which
are rarer than unicorns), but IIRC the astral planes didn't exist
when it was created. So you could argue for UCS4. Yes, it's twice the
size of UCS2, but who really cares? If you could get rid of all the
size hacks in Squeak that were cool in the 70s, would you?
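
For a feel of the size trade-off, a small sketch in Python. The
function names encoded_sizes and is_outside_bmp are made up for this
example. It uses Python's standard "utf-16-be" and "utf-32-be" codecs
as stand-ins for 16-bit and 32-bit storage (the -be variants just
avoid a byte order mark being counted).

```python
def encoded_sizes(text: str) -> dict:
    """Size in bytes of text under each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "ucs-4": len(text.encode("utf-32-be")),
    }


def is_outside_bmp(ch: str) -> bool:
    """True if a single character lies beyond the Basic Multilingual Plane."""
    return ord(ch) > 0xFFFF


ascii_sizes = encoded_sizes("hello")
# A character from the astral planes: U+1D11E MUSICAL SYMBOL G CLEF.
clef = "\U0001D11E"
clef_sizes = encoded_sizes(clef)
```

For plain ASCII text the 32-bit form is twice the 16-bit form and four
times UTF-8. For a character outside the BMP, 16-bit storage needs two
code units (a surrogate pair), which is exactly the awkward case.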

Cheers
Philippe


> Stef
>
> On 9 juin 07, at 00:02, Colin Putney wrote:
>
> >
> > On Jun 7, 2007, at 11:55 PM, Andreas Raab wrote:
> >
> >> How about trying to improve the speed of conversions? You seem to
> >> imply that this is the major issue here, so if the conversions
> >> were blindingly fast (which I think they easily could be by writing
> >> one or two primitives) this should improve matters.
> >
> > The conversions could be made faster, yes. But consider this: the
> > life-cycle of a string in a web app is very often something like this:
> >
> > - comes in over HTTP
> > - lives in the image for a while, maybe persisted in some way
> > - gets sent back out over HTTP many times
> >
> > Even if the conversion *is* blindingly fast, it's still better to
> > leave it as UTF-8 the whole time, not only to remove the overhead
> > of decoding and reencoding, but also to avoid storing WideStrings
> > in the image for long periods of time. Also, consider that building
> > html pages mainly involves writing lots of short strings to
> > streams, which sometimes include non-ASCII characters. If they can
> > be pre-encoded it's another space and time win. On the other hand,
> > the traditional drawback to UTF-8, random access to characters,
> > doesn't come up much with generating web pages, though of course a
> > web app may do this kind of thing as part of its domain functionality.
> >
> > I don't claim that all strings should always be UTF-8, but having a
> > UTF8String class would be an excellent thing.
> >
> > Colin
> >
> >
>
>
>



More information about the Squeak-dev mailing list