UTF8 Squeak
Janko Mivšek
janko.mivsek at eranova.si
Tue Jun 12 22:49:25 UTC 2007
Hi Andreas,
Here is an analysis of one VW "image as a database", which runs a
document portal for the quality management system of one of our
pharmaceutical distributors.
image size: 113MB
                instances   size total   avg size    byte size
ByteString        355.068    8.562.097         24    9.982.369
TwoByteString      19.848    5.372.602        541   10.824.596
If I remember correctly, byte-indexed objects have a 4-byte header in VW,
therefore:
byte size = 4 bytes per header + size (2*size for TwoByteString)
This should also be rounded up to a multiple of 4 bytes, which I ignored for now.
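
As a rough check, the formula above can be replayed against the two table
rows. This is a minimal Python sketch, not the script that produced the
numbers: the name byte_size and its parameters are made up for illustration,
and the 4-byte header is the assumption stated above. Rounding up to 4 bytes
is ignored here as well.

```python
HEADER_BYTES = 4  # assumed VW header size for byte-indexed objects

def byte_size(instances, total_chars, bytes_per_char):
    """Headers plus character payload, ignoring the round-up to 4 bytes."""
    return instances * HEADER_BYTES + total_chars * bytes_per_char

# Instance counts and total sizes taken from the table above.
byte_string = byte_size(355_068, 8_562_097, 1)
two_byte_string = byte_size(19_848, 5_372_602, 2)

print(byte_string)      # 9982369
print(two_byte_string)  # 10824596
```

Both results match the "byte size" column.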
Strings therefore make up approx. 20% of the whole image. If 4B strings were
used instead of 2B ones, the increase in string space would be:
byte size with 2BString: 20.806.965
byte size with 4BString: 31.552.169
increase %: 52%
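
The comparison can be reproduced with a few lines of Python. This is only a
sketch under the same assumptions as the formula above (4-byte header, no
rounding). The helper byte_size is a made-up name, and only the
TwoByteStrings are widened to 4 bytes per character, while the ByteStrings
stay as they are.

```python
HEADER_BYTES = 4  # assumed VW header size for byte-indexed objects

def byte_size(instances, total_chars, bytes_per_char):
    """Headers plus character payload, ignoring the round-up to 4 bytes."""
    return instances * HEADER_BYTES + total_chars * bytes_per_char

# ByteStrings are the same in both scenarios.
byte_strings = byte_size(355_068, 8_562_097, 1)

# Wide strings stored with 2 bytes vs. 4 bytes per character.
with_2b = byte_strings + byte_size(19_848, 5_372_602, 2)
with_4b = byte_strings + byte_size(19_848, 5_372_602, 4)

increase_pct = round((with_4b / with_2b - 1) * 100)

print(with_2b)       # 20806965
print(with_4b)       # 31552169
print(increase_pct)  # 52
```

Note that the 52% refers to string space only. Measured against the whole
113MB image, the extra 10.7MB or so is roughly 9%.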
So, a 2x bigger image was really an exaggerated statement, but you can see
from these results that the image would still grow considerably if 4B strings
were used instead of 2B ones.
Best regards
Janko
Andreas Raab wrote:
> Janko Mivšek wrote:
>>> so, all of this talk is about 4 MB extra (in that image Squeak
>>> takes 26.8 MB at startup)?
>>
>> Consider image as a database where you store strings from your
>> application. In that case space efficient but still manipulable
>> strings really matter. For instance, I run one 380MB VW image full of
>> TwoByteStrings and this image would probably have 760M with only
>> FourByteStrings ...
>
> Actually, I would be very interested in a more accurate answer than
> "probably" since the 2x answer assumes that the whole image consists of
> 2-byte strings and that there is zero overhead for headers etc. both of
> which are obviously not the case. If you wouldn't mind, could you run a
> little script that computes the number of characters that are actually
> stored as 2 bytes? Something like:
>
> TwoByteString allInstances inject: 0 into:[:sum :str| sum + str size].
>
> This strictly counts the number of characters that "matter", i.e., that
> are affected by an encoding change and I'd be interested in getting some
> data point about how that looks in a real application (e.g., whether
> that is in the 10%, 25%, or 50% range). In particular considering that
> VW probably uses the most compact form by default and that there is
> probably quite a bit of application code running and that there is
> probably more than just strings to keep in the data, I'm really curious
> how much of that ends up to be relevant for the 2-byte encoding.
>
> Thanks,
> - Andreas
>
>
--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si