A scheme like this was used in BBN Lisp in the late sixties .... worked well given the propensity of most lists to run in the cdr direction ....
Cheers,
Alan
-----
At 11:46 AM -0500 5/28/01, Lex Spoon wrote:
Dan Ingalls Dan.Ingalls@disney.com wrote:
In this same vein, you can see where I was going with pointer locality. It would be a very quick loop to make every pointer reference be self-relative, which might well give us a bunch of small numbers, meaning lots of zeroes, and hence better compression.
A fun problem!
I took a try this weekend at rearranging objects to be topologically sorted. If you do "(InterpreterSimulator new openOn: 'mini.image') topologicalSort", then you get a simulator back with all the objects stored in the order of a depth-first search. By doing this sort, the #stats method reports that #oops3k improves from 3921 to 16968, out of a total of 41k objects. That means over a third of the references can use 16 bits instead of 32 -- not as much as I expected, but still a lot. I didn't check, but probably a lot of the references are even pointing to the *next* object, which means you only need *0* bits to encode the distance!
(Funny story: last time I played with a variable length integer encoding, it turned out in fact that the optimal length for the "small" integers was 0 bits, and that having "medium"-length integers didn't help!)
With this in mind, it would seem a good approach for encoding oops is to have 3 tag bits:
bit 0: whether it's an integer or not bit 1-2: for regular oops, how the oop is encoded: 00: the object immediately follows this one 01: 16 bits of reference follow 10: max bits of reference follow (size is log(image size)) 11: a special reference follows, eg "nil"
Alternatively, 00 could be used for "3 bits, indicating the number of *objects* past this one to get to the next one". "number of objects" is more concise than "number of words", but I don't see a way for it to be practical *except* for small numbers. (And oh yes: drop two low bits from those distances to begin with! Sure they compress well, but the best compression is to get rid of the data to start with.)
A few other ideas:
- Separate out the word or bytes data of objects that have it,
especially with strings -- this should take Andreas's idea about putting strings next to each other, even further. Store the "words" data in the same order as the regular "objects" data, and there is no space needed to help zip the objects back together later. Pragmatically, one could do the separation by diving the snapshot file into interleaved 64k segments, each of which is either all object data or all word/bytes data.
- Possibly, put the hash bits somewhere else, too, so that repetitions
among objects will be caught by gzip. (If we really want to go far, perhaps the hashes could be dumped and images could do a full rehash when they load back up? How long does it take?)
- Redo object headers completely. There are lots of possibilities,
but for starters: Put the main header first, to help the decoder. Leave out the GC and root bits. Leave out the compact size and class bits, and instead put the size and class after the header using some clever encoding.
That's what I came up with. Unfortunately, I can't quite test these things myself because my #topologicalSort has a wierd bug. I'll attach it -- maybe there is some problem that screams out to people?
The bug is brought out by the garbage collector: it changes the header type bits of activeContext to be binary 10, but it never manages to change them back. I put a #halt in #upward, and it seems that, bizarre as it sounds, the GC terminates just fine but that it ends up at a different object from the one it started at! It's not even a real object: the header is CACACACA. I haven't traced through the GC enough to figure out exactly what is going on.
This problem is especially strange, because the data looks so pristine. #printLong: works on every oop I try. #stats reports the same number of objects and integers and nil's as before. All the debugging messages in Interpreter, including #allAccessibleObjectsOkay, report no errors. And the number of bytes in the object memory is the same before and after. I can't figure it out, and the weekend is up.
So I'm sending this around, in case it helps someone else get started. Cheers!
Lex Content-type: application/octet-stream Content-disposition: attachment;filename="ImageCompact.3.cs"
Attachment converted: Macintosh HD:ImageCompact.3.cs (????/----) (00022289)