Beyond gzip

Mon May 28 16:46:45 UTC 2001

Dan Ingalls <Dan.Ingalls at disney.com> wrote:
> 
> In this same vein, you can see where I was going with pointer locality.  It would be a very quick loop to make every pointer reference be self-relative, which might well give us a bunch of small numbers, meaning lots of zeroes, and hence better compression.
> 

A fun problem!

I took a try this weekend at rearranging objects to be topologically
sorted.  If you do "(InterpreterSimulator new openOn: 'mini.image')
topologicalSort", then you get a simulator back with all the objects
stored in the order of a depth-first search.  By doing this sort, the
#stats method reports that #oops3k improves from 3921 to 16968, out of a
total of 41k objects.  That means over a third of the references can use
16 bits instead of 32 -- not as much as I expected, but still a lot.  I
didn't check, but probably a lot of the references are even pointing to
the *next* object, which means you only need *0* bits to encode the
distance!

(Funny story: last time I played with a variable length integer
encoding, it turned out in fact that the optimal length for the "small"
integers was 0 bits, and that having "medium"-length integers didn't
help!)

With this in mind, it would seem a good approach for encoding oops is to
have 3 tag bits:

	bit 0:  whether it's an integer or not
	bit 1-2:  for regular oops, how the oop is encoded:
		00:  the object immediately follows this one
		01:  16 bits of reference follow
		10:  max bits of reference follow (size is log(image size))
		11:  a special reference follows, eg "nil"

Alternatively, 00 could be used for "3 bits, indicating the number of
*objects* past this one to get to the next one".  "number of objects" is
more concise than "number of words", but I don't see a way for it to be
practical *except* for small numbers.  (And oh yes: drop two low bits
from those distances to begin with!  Sure they compress well, but the
best compression is to get rid of the data to start with.)

A few other ideas:

	1. Separate out the word or bytes data of objects that have it,
especially with strings -- this should take Andreas's idea about putting
strings next to each other, even further.  Store the "words" data in the
same order as the regular "objects" data, and there is no space needed
to help zip the objects back together later.  Pragmatically, one could
do the separation by diving the snapshot file into interleaved 64k
segments, each of which is either all object data or all word/bytes
data.

	2. Possibly, put the hash bits somewhere else, too, so that repetitions
among objects will be caught by gzip.  (If we really want to go far,
perhaps the hashes could be dumped and images could do a full rehash
when they load back up?  How long does it take?)

	3. Redo object headers completely.  There are lots of possibilities,
but for starters: Put the main header first, to help the decoder.  Leave
out the GC and root bits.  Leave out the compact size and class bits,
and instead put the size and class after the header using some clever
encoding.

That's what I came up with.  Unfortunately, I can't quite test these
things myself because my #topologicalSort has a wierd bug.  I'll attach
it -- maybe there is some problem that screams out to people?

The bug is brought out by the garbage collector: it changes the header
type bits of activeContext to be binary 10, but it never manages to
change them back.  I put a #halt in #upward, and it seems that, bizarre
as it sounds, the GC terminates just fine but that it ends up at a
different object from the one it started at!  It's not even a real
object: the header is CACACACA.  I haven't traced through the GC enough
to figure out exactly what is going on.

This problem is especially strange, because  the data looks so pristine.
 #printLong: works on every oop I try.  #stats reports the same number
of objects and integers and nil's as before.  All the debugging messages
in Interpreter, including #allAccessibleObjectsOkay, report no errors.
And the number of bytes in the object memory is the same before and
after.  I can't figure it out, and the weekend is up.

So I'm sending this around, in case it helps someone else get started. 
Cheers!

Lex
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ImageCompact.3.cs
Type: application/octet-stream
Size: 9479 bytes
Desc: not available
Url : http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20010528/d8fe6fd7/ImageCompact.3.obj