Beyond gzip

Alan Kay Alan.Kay at disney.com
Mon May 28 17:39:20 UTC 2001


A scheme like this was used in BBN Lisp in the late sixties .... 
worked well given the propensity of most lists to run in the cdr 
direction ....

Cheers,

Alan

-----

At 11:46 AM -0500 5/28/01, Lex Spoon wrote:
>Dan Ingalls <Dan.Ingalls at disney.com> wrote:
>>
>>  In this same vein, you can see where I was going with pointer 
>>locality.  It would be a very quick loop to make every pointer 
>>reference be self-relative, which might well give us a bunch of 
>>small numbers, meaning lots of zeroes, and hence better compression.
>>
>
>A fun problem!
>
>I took a try this weekend at rearranging objects to be topologically
>sorted.  If you do "(InterpreterSimulator new openOn: 'mini.image')
>topologicalSort", then you get a simulator back with all the objects
>stored in the order of a depth-first search.  By doing this sort, the
>#stats method reports that #oops3k improves from 3921 to 16968, out of a
>total of 41k objects.  That means over a third of the references can use
>16 bits instead of 32 -- not as much as I expected, but still a lot.  I
>didn't check, but probably a lot of the references are even pointing to
>the *next* object, which means you only need *0* bits to encode the
>distance!
>
>(Funny story: last time I played with a variable length integer
>encoding, it turned out in fact that the optimal length for the "small"
>integers was 0 bits, and that having "medium"-length integers didn't
>help!)
>
>With this in mind, it would seem a good approach for encoding oops is to
>have 3 tag bits:
>
>	bit 0:  whether it's an integer or not
>	bit 1-2:  for regular oops, how the oop is encoded:
>		00:  the object immediately follows this one
>		01:  16 bits of reference follow
>		10:  max bits of reference follow (size is log(image size))
>		11:  a special reference follows, eg "nil"
>
>
>Alternatively, 00 could be used for "3 bits, indicating the number of
>*objects* past this one to get to the next one".  "number of objects" is
>more concise than "number of words", but I don't see a way for it to be
>practical *except* for small numbers.  (And oh yes: drop two low bits
>from those distances to begin with!  Sure they compress well, but the
>best compression is to get rid of the data to start with.)
>
>
>A few other ideas:
>
>	1. Separate out the word or bytes data of objects that have it,
>especially with strings -- this should take Andreas's idea about putting
>strings next to each other, even further.  Store the "words" data in the
>same order as the regular "objects" data, and there is no space needed
>to help zip the objects back together later.  Pragmatically, one could
>do the separation by diving the snapshot file into interleaved 64k
>segments, each of which is either all object data or all word/bytes
>data.
>
>	2. Possibly, put the hash bits somewhere else, too, so that repetitions
>among objects will be caught by gzip.  (If we really want to go far,
>perhaps the hashes could be dumped and images could do a full rehash
>when they load back up?  How long does it take?)
>
>	3. Redo object headers completely.  There are lots of possibilities,
>but for starters: Put the main header first, to help the decoder.  Leave
>out the GC and root bits.  Leave out the compact size and class bits,
>and instead put the size and class after the header using some clever
>encoding.
>
>
>That's what I came up with.  Unfortunately, I can't quite test these
>things myself because my #topologicalSort has a wierd bug.  I'll attach
>it -- maybe there is some problem that screams out to people?
>
>The bug is brought out by the garbage collector: it changes the header
>type bits of activeContext to be binary 10, but it never manages to
>change them back.  I put a #halt in #upward, and it seems that, bizarre
>as it sounds, the GC terminates just fine but that it ends up at a
>different object from the one it started at!  It's not even a real
>object: the header is CACACACA.  I haven't traced through the GC enough
>to figure out exactly what is going on.
>
>This problem is especially strange, because  the data looks so pristine.
>  #printLong: works on every oop I try.  #stats reports the same number
>of objects and integers and nil's as before.  All the debugging messages
>in Interpreter, including #allAccessibleObjectsOkay, report no errors.
>And the number of bytes in the object memory is the same before and
>after.  I can't figure it out, and the weekend is up.
>
>So I'm sending this around, in case it helps someone else get started.
>Cheers!
>
>Lex
>Content-type: application/octet-stream
>Content-disposition: attachment;filename="ImageCompact.3.cs"
>
>Attachment converted: Macintosh HD:ImageCompact.3.cs (????/----) (00022289)





More information about the Squeak-dev mailing list