UTF8 Squeak
Andreas Raab
andreas.raab at gmx.de
Wed Jun 13 07:00:04 UTC 2007
Hi Janko,
Thanks for the numbers, that's quite interesting to see. It seems that
the total increase in image size would be roughly 10% all things
considered (from approx. 113 MB to approx. 123 MB). That's actually less
than I would have intuitively expected (my guess was in the range of 20%
total image size).
Of course a single data point is no proof of anything but thanks again
for taking the time and getting us a few numbers.
Cheers,
- Andreas
Janko Mivšek wrote:
> Hi Andreas,
>
> Here is an analysis of one VW "image as a database" which runs a
> document portal for the quality management system of one of our
> pharmaceutical distributors.
>
> image size: 113MB
>
>                 instances  size total  avg size   byte size
> ByteString        355.068   8.562.097        24   9.982.369
> TwoByteString      19.848   5.372.602       541  10.824.596
>
> If I remember correctly, byte-indexed objects have a 4-byte header in VW,
> therefore:
>
> byte size = 4 bytes per header + size (2*size for TwoByteString)
>
> This should also be rounded up to a multiple of 4 bytes, which I ignored
> for now.
>
> Strings therefore account for approx. 20% of the whole image. If 4B strings
> were used instead of 2B ones, the increase in string space would be:
>
> byte size with 2BString: 20.806.965
> byte size with 4BString: 31.552.169
> increase %: 52%
>
> So a 2x bigger image was really an exaggeration, but you can see from
> these results that the image would grow considerably if 4B strings
> were used instead of 2B ones.
>
> Best regards
> Janko
>
>
> Andreas Raab wrote:
>> Janko Mivšek wrote:
>>>> So all of this talk is about roughly 4 MB extra (in that image Squeak
>>>> takes 26.8 MB at startup)?
>>>
>>> Consider the image as a database where you store strings from your
>>> application. In that case space-efficient but still manipulable
>>> strings really matter. For instance, I run one 380MB VW image full of
>>> TwoByteStrings and this image would probably be 760MB with only
>>> FourByteStrings ...
>>
>> Actually, I would be very interested in a more accurate answer than
>> "probably" since the 2x answer assumes that the whole image consists
>> of 2-byte strings and that there is zero overhead for headers etc.
>> both of which are obviously not the case. If you wouldn't mind, could
>> you run a little script that computes the number of characters that
>> are actually stored as 2 bytes? Something like:
>>
>> TwoByteString allInstances inject: 0 into:[:sum :str| sum + str size].
>>
>> This strictly counts the number of characters that "matter", i.e.,
>> that are affected by an encoding change and I'd be interested in
>> getting some data point about how that looks in a real application
>> (e.g., whether that is in the 10%, 25%, or 50% range). In particular
>> considering that VW probably uses the most compact form by default and
>> that there is probably quite a bit of application code running and
>> that there is probably more than just strings to keep in the data, I'm
>> really curious how much of that ends up to be relevant for the 2-byte
>> encoding.
>>
>> Thanks,
>> - Andreas
>>
>>
>
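
The arithmetic in Janko's analysis can be reproduced with a short script. This is only a sketch under the assumptions stated in the thread: a 4-byte header per string object (which Janko recalls from memory), no rounding of object sizes up to a multiple of 4 bytes (he ignored that too), and "113MB" read as 113,000,000 bytes. The function and variable names are made up for illustration and do not come from VW or Squeak.

```python
# Sketch of the string-space estimate from the thread (names are hypothetical).
HEADER_BYTES = 4  # assumed per-object header for byte-indexed objects in VW


def string_bytes(instances, total_chars, bytes_per_char):
    """Header bytes for every instance plus the character payload."""
    return instances * HEADER_BYTES + total_chars * bytes_per_char


# Instance counts and total sizes as reported in the analysis above.
byte_strings = string_bytes(355_068, 8_562_097, 1)   # ByteString
two_byte = string_bytes(19_848, 5_372_602, 2)        # TwoByteString today
four_byte = string_bytes(19_848, 5_372_602, 4)       # same strings, 4 bytes/char

current_total = byte_strings + two_byte
widened_total = byte_strings + four_byte
growth = widened_total - current_total

IMAGE_BYTES = 113_000_000  # assumption: "113MB" taken as decimal megabytes

string_increase_pct = 100 * growth / current_total
image_increase_pct = 100 * growth / IMAGE_BYTES

print(f"string bytes with 2B strings: {current_total}")
print(f"string bytes with 4B strings: {widened_total}")
print(f"string space increase: {string_increase_pct:.0f}%")
print(f"whole image increase:  {image_increase_pct:.1f}%")
```

Only the TwoByteString payload changes between the two totals, which is why a 52% growth in string space works out to roughly a 10% growth of the whole image.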