Hi Dan,
That'll teach me not to throw comments about randomly ;-). What I actually have in mind is something between an abstract syntax tree (with symbol table) and a compiled method. Create the tree (no transformations or optimizations done in the parsing to create the AST - not even for #ifTrue: and kin) and "flatten" it into a RPN representation encoding the node type and and corresponding identifier string symbol into bytecodes. Actually I guess it _is_ a simple compression scheme -- because you would have a array of identifier strings, each string with a corresponding bytecode for its index. The pseudo-compiled code would simply be a bytearray of node-type/array-index "pairs" (ignoring arrays larger than 256 for the moment). The nice thing would be that this representation is independent of implementation and it is completely shareable -- all info is included for both compilation and for source reconstruction and could be transmitted reasonably efficiently. It wouldn't matter if you were compiling to another bytecode set, native code, threaded code,.... This representation could also be used in the image for a type of "lazy compilation", where the core methods and classes of Squeak are fully compiled, but the rest could be represented in this form, to be fully compiled only when finally sent a message.
On comments: if comments were considered as a peculiar type of language construct rather than just another form of whitespace, then it could be inserted into the tree along with the rest of the nodes; the CommentNodes could act as placeholders and be placed at the proper position in the tree for source reconstruction.
Since there are fewer that 16 node types, an encoding of NNNNIIII IIIIIIII could be used for up to 4K identifiers in a single method (yuck!).
Just some vague notions I've been rolling around a bit. I'm sure there are a number of gotchas to doing it this way that I haven't thought of.
-- DWight
Dan Ingalls wrote:
Dwight Hughes dwighth@ipa.net wrote ...
He is wanting to improve the prettyPrinting to the point that it would be the preferred option (and be customizeable of course) -- allowing most of the source to be drastically reduced in size, down to basically parse trees.
Just to be clear about a couple of things...
The prettyPrinting with comments would save space because METHODS are extremely compact; not because Parse Trees are; they are not. In fact I believe parse trees are the worst of the three (text, parse trees, methods) in compactness. What's nice is that the decompiler can build parse trees from methods, and those, in turn can (re)generate source code.
...and that's not to say I'm opposed to working with parse trees either. The original Fabrik (and ThingLab before it) built parse trees from its graphical structure and then generated code from the result.
- Dan
P.S. There are other nice things about parse trees, such as being able to execute them for type. A simple type inference mechanism can be built this way. One should be able to execute the methods themselves for type, but the compilation of ifTrue: and the other optimized control constructs prevents it.
Since the muddled commentary of my previous message did not make it clear, I do *not* like the idea of spending a lot of time and effort building tools directly on top of compiled methods. This ties a non-negligible amount of the image to a specific bytecode set and representation and could create considerable inertia to moving to new VM representations, new bytecode sets, native compilation,... if a lot of this must be recreated.
Just my $0.02.
-- Dwight
At 08:10 PM 10/24/98 -0500, Dwight Hughes wrote:
... I do *not* like the idea of spending a lot of time and effort building tools directly on top of compiled methods. This ties a non-negligible amount of the image to a specific bytecode set and representation and could create considerable inertia to moving to new VM representations, new bytecode sets, native compilation,... if a lot of this must be recreated.
I too would prefer minimizing dependencies on the current bytecode set. I've had a couple of alternative bytecode sets in mind for years. It might be nice to actually try them on a complete Smalltalk system like Squeak.
For example, you could huffman compress the current bytecodes. As the frequency of use is not at all even, the most common bytecodes could take only a few bits. Huffman compressed bytecode dispatching for an interpreter would not be much slower than for current bytecodes. You basically just do a table jump on some chunk of the bitcode stream. Redundant entries in the jump table essentially do the huffman decode as part of the table jump. Additional bytecode data could also be of optimal length, instead of being forced to a multiple of 8 bits. My guess is you could get significant compression of a Smalltalk image with this kind of bytecode set. For certain uses, size is rather important.
- Jan
___________________________________________________________________ Paradigm Matrix Inc., San Ramon California "video products and development services for Win32 platforms" Internet: Jan Bottorff janb@pmatrix.com WWW http://www.pmatrix.com Phone: voice (925) 803-9318 fax (925) 803-9397 PGP: public key http://www-swiss.ai.mit.edu/~bal/pks-toplev.html fingerprint 52 CB FF 60 91 25 F9 44 6F 87 23 C9 AB 5D 05 F6 ___________________________________________________________________
Here a link related to the idea to use compressed parse trees instead of byte codes :
http://www.ics.uci.edu/~oberon/research#SlimBinaries
The technique also known as "slim binaries" was presented in an article in Communications of the ACM last year (?).
Markus
squeak-dev@lists.squeakfoundation.org