UTF8 Squeak

Janko Mivšek janko.mivsek at eranova.si
Mon Jun 11 22:06:24 UTC 2007


Hi Javier,

Javier Diaz-Reinoso wrote:

> About 2 months ago the OpenMCL mailing list had this UTF16 vs. UTF32 
> discussion:
>> how many angels can dance on a unicode character?
>> http://thread.gmane.org/gmane.lisp.openmcl.devel/1756/focus=1763
>>
> 
> Gary Byers (OpenMCL's developer) finished with this conclusion:
>> If these numbers are roughly accurate and if the sketch of what
>> a displaced SIMPLE-STRING object would look like is realistic,
>> then I'd say that using UTF-16 to represent arbitrary Unicode
>> characters in a realistic way costs about as much memory-wise
>> as using UTF-32 does, is somewhat slower in the simplest cases
>> and much slower in general, has very complex boundary
>> cases once we step outside the BMP, and just generally doesn't
>> seem to have many socially-redeeming qualities that I can see.

> Perhaps it is different in Squeak (no alignment?), but if I doIt: 
> (ByteString allInstances collect:[:s | s size] ) sum asFloat (in a 3.8.1 
> basic image), I obtain:
> 
> 1.943098e6, (63672 strings at 30.5 bytes average)
> 
> So, all of this talk is over about 4 MB extra (in that image Squeak takes 
> 26.8 MB at startup)?

Consider the image as a database where you store strings from your 
application. In that case, space-efficient but still manipulable strings 
really matter. For instance, I run one 380 MB VW image full of 
TwoByteStrings, and this image would probably grow to 760 MB with only 
FourByteStrings ...
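
The arithmetic behind both estimates can be sketched roughly as follows. This is only a back-of-the-envelope illustration in Python, not a measurement: the helper name payload_mb is made up for the example, the character count is the one Javier quoted, and object headers and alignment are ignored, so real image sizes will differ. The VW figure is an upper bound that assumes the whole 380 MB is two-byte string payload.

```python
# Rough estimate of string payload at fixed character widths.
# Assumption: only character payload is counted; object headers and
# alignment padding are ignored.

def payload_mb(total_chars: int, bytes_per_char: int) -> float:
    """Payload size in MB (10**6 bytes) for a fixed-width representation."""
    return total_chars * bytes_per_char / 1e6

# Total characters in all ByteStrings of the Squeak 3.8.1 image (quoted above).
squeak_chars = 1_943_098

for width in (1, 2, 4):
    print(f"{width} byte(s)/char: {payload_mb(squeak_chars, width):.2f} MB")

# VW case, assuming (as an upper bound) that the 380 MB image is all
# two-byte string payload: four bytes per character doubles it.
vw_two_byte_mb = 380
vw_four_byte_mb = vw_two_byte_mb * 4 / 2
print(f"VW image with four-byte strings: up to about {vw_four_byte_mb:.0f} MB")
```

For a small development image the difference is a few MB either way; it only becomes significant when the image is dominated by application strings, as in the VW case.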

Best regards
Janko

-- 
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


