UTF8 Squeak

Andreas Raab andreas.raab at gmx.de
Mon Jun 11 22:38:35 UTC 2007


Janko Mivšek wrote:
>> so, all of this talk is about 4 MB extra (in that image Squeak 
>> takes 26.8 MB at startup)?
> 
> Consider image as a database where you store strings from your 
> application. In that case space efficient but still manipulable strings 
> really matter. For instance, I run one 380MB VW image full of 
> TwoByteStrings and this image would probably have 760M with only 
> FourByteStrings ...

Actually, I would be very interested in a more accurate answer than 
"probably", since the 2x figure assumes that the whole image consists of 
2-byte strings and that there is zero overhead for headers etc., neither 
of which is the case. If you wouldn't mind, could you run a little 
script that computes the number of characters that are actually stored 
as 2 bytes? Something like:

   TwoByteString allInstances inject: 0 into: [:sum :str | sum + str size].

This counts strictly the characters that "matter", i.e., those that 
would be affected by an encoding change. I'd be interested in a data 
point on how that looks in a real application (e.g., whether it is in 
the 10%, 25%, or 50% range). VW probably uses the most compact form by 
default, there is probably quite a bit of application code in the image, 
and the data probably holds more than just strings, so I'm really 
curious how much of it ends up being relevant for the 2-byte encoding.
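
To turn that count into a percentage, a rough sketch along the following 
lines might do. It rests on two assumptions of mine: that widening a 
character from 2 to 4 bytes costs exactly 2 extra bytes (headers and 
alignment are ignored), and that the image size is the 380MB figure 
quoted above, entered by hand rather than queried from the system. The 
variable names are only illustrative.

   | twoByteChars extraBytes imageBytes |
   "Total number of characters currently stored as 2 bytes each."
   twoByteChars := TwoByteString allInstances
      inject: 0 into: [:sum :str | sum + str size].
   "Assumption: going from 2 to 4 bytes adds 2 bytes per character."
   extraBytes := twoByteChars * 2.
   "Assumption: image size taken from the 380MB figure quoted above."
   imageBytes := 380 * 1024 * 1024.
   "Estimated growth of the image, in percent."
   (extraBytes * 100 / imageBytes) asFloat

A result near 100 would support the 2x estimate; anything much lower 
would mean that most of the image is not 2-byte string data.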

Thanks,
   - Andreas


