UTF8 Squeak
Andreas Raab
andreas.raab at gmx.de
Mon Jun 11 22:38:35 UTC 2007
Janko Mivšek wrote:
>> so, all of this talk is for about 4 MB extra (in that image squeak
>> take 26.8 MB at startup)?.
>
> Consider image as a database where you store strings from your
> application. In that case space efficient but still manipulable strings
> really matter. For instance, I run one 380MB VW image full of
> TwoByteStrings and this image would probably have 760M with only
> FourByteStrings ...
Actually, I would be very interested in a more accurate answer than
"probably", since the 2x answer assumes that the whole image consists of
2-byte strings and that there is zero overhead for headers etc., neither
of which is the case. If you wouldn't mind, could you run a
little script that computes the number of characters that are actually
stored as 2 bytes? Something like:
TwoByteString allInstances inject: 0 into: [:sum :str | sum + str size].
This strictly counts the number of characters that "matter", i.e., that
are affected by an encoding change, and I'd be interested in getting a
data point about how that looks in a real application (e.g., whether
that is in the 10%, 25%, or 50% range). VW probably uses the most compact
form by default, there is probably quite a bit of application code
running, and there is probably more than just strings in the data, so
I'm really curious how much of the image ends up being relevant for the
2-byte encoding.
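A minimal sketch of how that count could be turned into a share of the
image, assuming the 380 MB figure quoted above is used as the image size
(the variable names and the hard-coded size are illustrative assumptions,
not measurements):

```smalltalk
| twoByteChars imageBytes extraBytes |
"Characters currently stored in 2-byte strings (same expression as above)."
twoByteChars := TwoByteString allInstances
    inject: 0
    into: [:sum :str | sum + str size].
"Assumption: image size taken from the quoted 380 MB figure."
imageBytes := 380 * 1024 * 1024.
"Widening each character from 2 to 4 bytes adds 2 bytes per character."
extraBytes := twoByteChars * 2.
Transcript
    show: 'Extra bytes with 4-byte strings: ', extraBytes printString; cr;
    show: 'Share of image (%): ',
        (extraBytes * 100 / imageBytes) asFloat printString; cr.
```

This ignores object headers, which stay the same size under either
encoding, so it only estimates the growth from the characters themselves.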
Thanks,
- Andreas