UTF8 Squeak

Andreas Raab andreas.raab at gmx.de
Wed Jun 13 07:00:04 UTC 2007


Hi Janko,

Thanks for the numbers, that's quite interesting to see. It seems that 
the total increase in image size would be roughly 10% all things 
considered (from approx. 113 MB to approx. 123 MB). That's actually less 
than I would have intuitively expected (my guess was in the range of 20% 
total image size).
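
For reference, the roughly 10% figure can be reproduced from the numbers quoted below. This is only a minimal sketch in Python, not part of the original measurement: the function and variable names are illustrative, and it assumes the 4-byte header Janko recalls, ignores the 4-byte rounding as he did, reads "113MB" as decimal megabytes, and widens only the TwoByteStrings while leaving ByteStrings unchanged.

```python
HEADER_BYTES = 4  # assumed per-object header size, as recalled in the quoted message


def string_bytes(instances, total_chars, bytes_per_char):
    """Header bytes for all instances plus character payload (no 4-byte rounding)."""
    return instances * HEADER_BYTES + total_chars * bytes_per_char


# Instance counts and total sizes taken from the quoted table.
byte_strings = string_bytes(355_068, 8_562_097, 1)
two_byte_strings = string_bytes(19_848, 5_372_602, 2)
four_byte_strings = string_bytes(19_848, 5_372_602, 4)  # same strings, 4 bytes per char

current = byte_strings + two_byte_strings
widened = byte_strings + four_byte_strings
growth = widened - current

IMAGE_BYTES = 113_000_000  # assumption: "113MB" read as decimal megabytes

print(current, widened)                       # string space before and after
print(round(100 * growth / current))          # string space increase in %
print(round(100 * growth / IMAGE_BYTES, 1))   # increase relative to the whole image in %
```

Under these assumptions the string space grows by about half, while the image as a whole grows by a bit under 10%.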

Of course, a single data point is no proof of anything, but thanks again 
for taking the time and getting us a few numbers.

Cheers,
   - Andreas

Janko Mivšek wrote:
> Hi Andreas,
> 
> Here is an analysis of one VW "image as a database" which runs a 
> document portal for the quality management system of one of our 
> pharmaceutical distributors.
> 
> image size: 113MB
> 
>                 instances  size total  avg size   byte size
> ByteString      355.068     8.562.097     24      9.982.369
> TwoByteString    19.848     5.372.602    541     10.824.596
> 
> If I remember correctly, byte-indexed objects have a 4-byte header in VW, 
> therefore:
> 
> byte size = 4 bytes per instance (header) + size total
>             (2 * size total for TwoByteString)
> 
> Each object should also be rounded up to a multiple of 4 bytes, which I 
> ignored for now.
> 
> Strings therefore make up approx. 20% of the whole image. If 4B strings 
> were used instead of 2B ones, the string space increase would be:
> 
>     byte size with 2BString: 20.806.965
>     byte size with 4BString: 31.552.169
>                  increase %:        52%
> 
> So, a 2x bigger image was really an exaggerated statement, but you can see 
> from those results that the image would grow quite extensively if 4B 
> strings were used instead of 2B ones.
> 
> Best regards
> Janko
> 
> 
> Andreas Raab wrote:
>> Janko Mivšek wrote:
>>>> So, all of this talk is for about 4 MB extra (in that image Squeak 
>>>> takes 26.8 MB at startup)?
>>>
>>> Consider the image as a database where you store strings from your 
>>> application. In that case space-efficient but still manipulable 
>>> strings really matter. For instance, I run one 380MB VW image full of 
>>> TwoByteStrings and this image would probably be 760MB with only 
>>> FourByteStrings ...
>>
>> Actually, I would be very interested in a more accurate answer than 
>> "probably" since the 2x answer assumes that the whole image consists 
>> of 2-byte strings and that there is zero overhead for headers etc., 
>> both of which are obviously not the case. If you wouldn't mind, could 
>> you run a little script that computes the number of characters that 
>> are actually stored as 2 bytes? Something like:
>>
>>   TwoByteString allInstances inject: 0 into:[:sum :str| sum + str size].
>>
>> This strictly counts the number of characters that "matter", i.e., 
>> that are affected by an encoding change and I'd be interested in 
>> getting some data point about how that looks in a real application 
>> (e.g., whether that is in the 10%, 25%, or 50% range). In particular 
>> considering that VW probably uses the most compact form by default and 
>> that there is probably quite a bit of application code running and 
>> that there is probably more than just strings to keep in the data, I'm 
>> really curious how much of that ends up being relevant for the 2-byte 
>> encoding.
>>
>> Thanks,
>>   - Andreas
>>
>>
> 



