UTF8 Squeak
Javier Diaz-Reinoso
javier_diaz_r at mac.com
Mon Jun 11 19:29:54 UTC 2007
On 11/06/2007, at 13:27, Yoshiki Ohshima wrote:
> Janko,
>
>> It seems that this was already a Yoshiki idea with WideString, so I'm
>> just extending that idea with a TwoByteString to cover 16 bits too.
>>
>> Yoshiki, am I right?
>
> For storing the bare Unicode code points, I think so. I'm not
> convinced that adding 16-bit variation solves any real problems. But
> there may be something.
>
> My first few questions are:
>
> - While the vast majority of strings for, say, Japanese can be
> represented with the characters in the BMP, you would use
> FourByteString for Chinese/Japanese/Korean and some others. Does
> this mean that you would *always* use FourByteString for these
> "languages" (and not scripts)?
>
> - Suppose you would like to use different line wrapping algorithms
> for different languages, how would you keep that information?
>
> -- Yoshiki
>
About two months ago the OpenMCL mailing list had this UTF-16 vs.
UTF-32 discussion:
> how many angels can dance on a unicode character?
> http://thread.gmane.org/gmane.lisp.openmcl.devel/1756/focus=1763
>
Gary Byers (OpenMCL's developer) finished with this conclusion:
> If these numbers are roughly accurate and if the sketch of what
> a displaced SIMPLE-STRING object would look like is realistic,
> then I'd say that using UTF-16 to represent arbitrary Unicode
> characters in a realistic way costs about as much memory-wise
> as using UTF-32 does, is somewhat slower in the simplest cases
> and much slower in general, has very complex boundary
> cases once we step outside the BMP, and just generally doesn't
> seem to have many socially-redeeming qualities that I can see.
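
To make the "boundary cases once we step outside the BMP" concrete,
here is a minimal Python sketch (not from the OpenMCL thread; the
helper name utf16_units is made up for illustration). It shows that a
BMP character takes one 16-bit code unit, while a character outside
the BMP takes two, so indexing by code unit no longer matches indexing
by character.

```python
def utf16_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]

bmp_char = "\u65e5"         # U+65E5, a CJK ideograph inside the BMP
astral_char = "\U00020000"  # U+20000, CJK Extension B, outside the BMP

for ch in (bmp_char, astral_char):
    units = utf16_units(ch)
    print("U+%04X -> %d code unit(s): %s"
          % (ord(ch), len(units), [hex(u) for u in units]))
```

A 32-bit representation stores both characters in one element each,
which is the simplicity argument made in the quote above.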
Perhaps Squeak is different (no alignment?), but if I doIt:
(ByteString allInstances collect: [:s | s size]) sum asFloat
in a basic 3.8.1 image, I obtain 1.943098e6 (63672 strings at 30.5
bytes average). So is all of this talk about roughly 4 MB extra (in
that image Squeak takes 26.8 MB at startup)?
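
The arithmetic behind those figures can be checked with a short Python
sketch. It uses only the two numbers reported above. It also assumes
(this is a guess at the reasoning, not something stated in the thread)
that every string is widened uniformly, and it ignores object headers
and padding. Under that assumption, "about 4 MB" matches the gap
between 16-bit and 32-bit storage.

```python
total_bytes = 1_943_098  # summed ByteString sizes reported above
string_count = 63_672    # number of ByteString instances reported above

average = total_bytes / string_count

# Extra payload if every 1-byte character were widened (assumption:
# uniform widening, headers and alignment ignored).
extra_16bit = total_bytes * (2 - 1)
extra_32bit = total_bytes * (4 - 1)
gap_32_vs_16 = extra_32bit - extra_16bit

print("average string size: %.1f bytes" % average)
print("extra for 16-bit:    %.2f MB" % (extra_16bit / 1e6))
print("extra for 32-bit:    %.2f MB" % (extra_32bit / 1e6))
print("32-bit over 16-bit:  %.2f MB" % (gap_32_vs_16 / 1e6))
```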