UTF8 Squeak

Janko Mivšek janko.mivsek at eranova.si
Tue Jun 12 22:49:25 UTC 2007

Hi Andreas,

Here is an analysis of one VW "image as a database", which runs a 
document portal for the quality management system of one of our 
pharmaceutical distributors.

image size: 113MB

                instances   size total   avg size      byte size
                               (chars)    (bytes) (with headers)
ByteString        355.068    8.562.097         24      9.982.369
TwoByteString      19.848    5.372.602        541     10.824.596

If I remember correctly, byte-indexed objects have a 4-byte header in 
VW, so:

byte size = 4 bytes per header + size (2*size for TwoByteString)

Each object's size should also be rounded up to a multiple of 4 bytes, 
which I ignored for now.
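
The formula above can be checked against the table with a few lines. 
This is a minimal Python sketch, not something run inside the image: 
the 4-byte header is the value recalled above rather than a confirmed 
VW figure, and the function name byte_size is made up for illustration.

```python
HEADER_BYTES = 4  # per-object header as recalled in the message (assumption)

def byte_size(instances, total_chars, bytes_per_char):
    """Headers plus character data, ignoring 4-byte alignment padding."""
    return HEADER_BYTES * instances + bytes_per_char * total_chars

# (instances, total characters) taken from the table
byte_string = byte_size(355_068, 8_562_097, 1)
two_byte_string = byte_size(19_848, 5_372_602, 2)

print(byte_string)      # 9982369
print(two_byte_string)  # 10824596

# Rounding each object up to a 4-byte boundary adds at most 3 bytes per
# object, so the ignored padding is bounded by:
max_padding = 3 * (355_068 + 19_848)
print(max_padding)      # 1124748
```

The padding bound is only an upper limit; the real figure depends on 
the individual string lengths, which the aggregate numbers do not show.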

Strings therefore make up approx. 20% of the whole image. If 4B strings 
were used instead of 2B ones (with ByteStrings left as they are), the 
string space would increase as follows:

	byte size with 2BString: 20.806.965
	byte size with 4BString: 31.552.169
	             increase %:        52%
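
These totals can be reproduced from the table under the same 
assumptions: a 4-byte header per object, no alignment padding, and only 
the TwoByteString instances widened to 4 bytes per character while the 
ByteString instances stay at 1 byte. A minimal Python sketch, with 
variable names chosen for illustration:

```python
HEADER_BYTES = 4  # per-object header as recalled in the message (assumption)

# (instances, total characters) taken from the table
byte_strings = (355_068, 8_562_097)
two_byte_strings = (19_848, 5_372_602)

def byte_size(instances, total_chars, bytes_per_char):
    """Headers plus character data, ignoring 4-byte alignment padding."""
    return HEADER_BYTES * instances + bytes_per_char * total_chars

narrow = byte_size(*byte_strings, 1)                 # unchanged in both cases
with_2b = narrow + byte_size(*two_byte_strings, 2)   # current layout
with_4b = narrow + byte_size(*two_byte_strings, 4)   # wide strings at 4 bytes/char

extra_bytes = with_4b - with_2b
increase_pct = 100 * extra_bytes / with_2b

print(with_2b)              # 20806965
print(with_4b)              # 31552169
print(extra_bytes)          # 10745204
print(round(increase_pct))  # 52
```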

So a 2x bigger image really was an exaggerated statement, but you can 
see from these results that the image would still grow considerably if 
4B strings were used instead of 2B ones.

Best regards

Andreas Raab wrote:
> Janko Mivšek wrote:
>>> So, all of this talk is about roughly 4 MB extra (in that image 
>>> Squeak takes 26.8 MB at startup)?
>> Consider the image as a database where you store strings from your 
>> application. In that case space-efficient but still manipulable 
>> strings really matter. For instance, I run one 380MB VW image full of 
>> TwoByteStrings and this image would probably take 760MB with only 
>> FourByteStrings ...
> Actually, I would be very interested in a more accurate answer than 
> "probably" since the 2x answer assumes that the whole image consists of 
> 2-byte strings and that there is zero overhead for headers etc., 
> neither of which is the case. If you wouldn't mind, could you run a 
> little script that computes the number of characters that are actually 
> stored as 2 bytes? Something like:
>   TwoByteString allInstances inject: 0 into:[:sum :str| sum + str size].
> This strictly counts the number of characters that "matter", i.e., that 
> are affected by an encoding change and I'd be interested in getting some 
> data point about how that looks in a real application (e.g., whether 
> that is in the 10%, 25%, or 50% range). In particular considering that 
> VW probably uses the most compact form by default and that there is 
> probably quite a bit of application code running and that there is 
> probably more than just strings to keep in the data, I'm really curious 
> how much of that ends up to be relevant for the 2-byte encoding.
> Thanks,
>   - Andreas

Janko Mivšek
Smalltalk Web Application Server
