UTF8 Squeak

Javier Diaz-Reinoso javier_diaz_r at mac.com
Mon Jun 11 19:29:54 UTC 2007


On 11/06/2007, at 13:27, Yoshiki Ohshima wrote:

>   Janko,
>
>> It seems that this was already a Yoshiki idea with WideString, so I'm
>> just extending that idea with a TwoByteString to cover 16 bits too.
>>
>> Yoshiki, am I right?
>
>   For storing the bare Unicode code points, I think so.  I'm not
> convinced that adding 16-bit variation solves any real problems.  But
> there may be something.
>
>   My first a few questions are:
>
>   - While vast majority of strings for, say, Japanese can be
>     represented with in the characters in BMP, you would use
>     FourByteString for Chinese/Japanese/Korean and some others.  Does
>     this mean that you would *always* use FourByteString for these
>     "languages" (and not scripts?)
>
>   - Suppose you would like to use different line wrapping algorithms
>     for different languages, how would you keep that information?
>
> -- Yoshiki
>
About two months ago the OpenMCL mailing list had this UTF-16 vs.
UTF-32 discussion:
> how many angels can dance on a unicode character?
> http://thread.gmane.org/gmane.lisp.openmcl.devel/1756/focus=1763
>

Gary Byers (OpenMCL's developer) finished with this conclusion:
> If these numbers are roughly accurate and if the sketch of what
> a displaced SIMPLE-STRING object would look like is realistic,
> then I'd say that using UTF-16 to represent arbitrary Unicode
> characters in a realistic way costs about as much memory-wise
> as using UTF-32 does, is somewhat slower in the simplest cases
> and much slower in general, has very complex boundary
> cases once we step outside the BMP, and just generally doesn't
> seem to have many socially-redeeming qualities that I can see.

Perhaps things are different in Squeak (no alignment?), but if I doIt:
(ByteString allInstances collect:[:s | s size] ) sum asFloat (in a
basic 3.8.1 image), I obtain:

1.943098e6, (63672 strings at 30.5 bytes average)

So, is all of this talk about roughly 4 MB extra (in that image, Squeak
takes 26.8 MB at startup)?
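
The arithmetic behind that estimate can be sketched as follows. This is a rough back-of-the-envelope check, not a measurement: it assumes every ByteString in the image would be widened, ignores object headers and alignment, and takes 1 MB as 10^6 bytes. The names (`TOTAL_CHARS`, `extra_bytes`, and so on) are made up for illustration.

```python
# Figures reported above for a basic Squeak 3.8.1 image.
TOTAL_CHARS = 1_943_098   # summed size of all ByteString instances (1 byte per character)
STRING_COUNT = 63_672     # number of ByteString instances
IMAGE_MB = 26.8           # image memory at startup, in MB

def extra_bytes(bytes_per_char):
    """Extra bytes needed if every ByteString used bytes_per_char per character.

    Assumption: worst case, all strings are widened; headers and
    alignment padding are ignored.
    """
    return TOTAL_CHARS * (bytes_per_char - 1)

avg = TOTAL_CHARS / STRING_COUNT                 # average string length in bytes
extra_16 = extra_bytes(2)                        # 16-bit characters
extra_32 = extra_bytes(4)                        # 32-bit characters
gap_mb = (extra_32 - extra_16) / 1e6             # cost of 32-bit over 16-bit

print(f"average length: {avg:.1f} bytes")
print(f"extra for 16-bit: {extra_16 / 1e6:.2f} MB")
print(f"extra for 32-bit: {extra_32 / 1e6:.2f} MB")
print(f"32-bit over 16-bit: {gap_mb:.2f} MB ({100 * gap_mb / IMAGE_MB:.1f}% of the image)")
```

Under these assumptions the gap between a 16-bit and a 32-bit representation is twice the current ByteString total, which is where a figure of about 4 MB comes from.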






