Beyond gzip

Dan Ingalls Dan.Ingalls at disney.com
Sat May 26 18:31:58 UTC 2001


Here's a challenge for anyone who knows the squeak image format, and has a little time on their hands...

Gzip gets just short of 50% out of a typical squeak image.  It doesn't have a clue about squeak images.  What do you think is possible with a scheme that has a clue?

Here's the challenge:
Take a Mini image (majorShrunk and abandoned sources) or the 3.0 release image, or a 3.1 alpha image full of content, and figure how small you could squeeze it if you're clever.

Things that would be useful:
Run printSpaceAnalysis to find out how much is in Strings, CompiledMethods, etc.
Figure how many words are pointers
For the pointers, figure how many are SmallInts (what ranges?), nils and other pointers.
For the other pointers, how local are they (stats on + - distances)

Things you'll probably find:
The strings (and symbols) should be comressible nearly 4:1 using gzip
The byteArrays are probably not easily compressed since most (except for sounds) are bitmaps that have already been compressed by run-coding.
The bytes of CompiledMethods are probably also hard to compress, since they are already Huffman coded in an ad-hoc sort of way.
An image full of content may have lots of secondary patterns, like points and rectangles that could be encoded specially.

Hint:
The InterpreterSimulator is a convenient tool for looking at images as data.  It can read the image format, and it has a number of methods for enumerating all the objects and all their fields.

It would be nice to know what's possible.  If we could get a factor of four, it might be worth implementing a special compressor.  If it's not that good then gzip is probably the best bet for simplicity, speed and the fact that we already have it.

	- Dan






More information about the Squeak-dev mailing list