On Thu, 1 Aug 2002, Swan, Dean wrote:
> The usual cause for "padding" in (capable) assemblers (other than for prefetch alignment, as you point out) is due to dynamic linking where the
Not only prefetch alignment. I have heard, although I don't know the details, that it is sometimes desirable to force a dead cycle in the execution pipeline. I can only guess that this would be to influence the scheduling of the execution units for complex instructions, but I consider myself only a "casual observer" with respect to this kind of thing.
I'm by no means an expert either, but as far as I understand it, higher execution unit utilisation on superscalar processors (e.g., PPC) comes from avoiding data dependencies: grouping together insns whose inputs are independent of each other's outputs, avoiding adjacent loads/stores, and respecting latencies (e.g., load-use delays and moves to/from "special" registers). Introducing NOPs doesn't really help here.
Introducing extra NOPs matters more for throughput in the insn fetch buffer (which ranges from 1 to 8 insns wide across PPC implementations) when there's a control dependency (i.e., a branch target isn't yet known). In this case (at least on the PPC), inserting NOPs to ensure that branches fall in the last word of the prefetch buffer (and that targets fall in the first word) reduces the number of fetched insns (an entire buffer is fetched at once) that are either never executed or cancelled after speculative execution once the branch condition is finally known. This helps maximise the number of fetched insns that are actually executed.
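As a rough sketch of the arithmetic behind this (the buffer width and the helper names here are invented for illustration, not taken from any particular PPC manual):

```python
# Sketch: a fetch buffer of WIDTH insns is fetched as one aligned block.
# Count how many insns in a fetch group are wasted around a taken branch,
# and how many NOPs of padding put an insn in the first slot of a group.

WIDTH = 4  # fetch-buffer width in instructions (1 to 8 on various PPCs)

def wasted_after_branch(branch_index):
    """Insns in the same fetch group after the branch: fetched, discarded."""
    return WIDTH - 1 - (branch_index % WIDTH)

def wasted_before_target(target_index):
    """Insns in the target's fetch group before it: fetched, never used."""
    return target_index % WIDTH

def pad_to_buffer(index):
    """NOPs needed so `index` lands in the first slot of a fetch group."""
    return (-index) % WIDTH

# Placing the branch in the last slot and the target in the first slot
# makes both waste figures zero:
assert wasted_after_branch(WIDTH - 1) == 0
assert wasted_before_target(0) == 0
```

So the padding trades a few extra NOPs in the instruction stream for fewer wasted slots in each fetched block.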
> Well, to kind of tie this back into something relevant to Squeak - I know a few people have talked about "new" VM instruction sets. It seems that it might help to think about a VM where all instructions are the same size.
> This could simplify code generation (i.e., compiling methods), and possibly improve performance and *reduce* code size (at least in the case of long methods).
Now I think it's you who's being the optimist. ;) I'm sure that increasing the size of the "operand" fields (selector/variable indices) would increase the overall size of code (the "compact" encodings are there precisely to pack the common cases into as few bytes as possible). The only time you'd win is when there are a lot of out-of-range branches (to or around other branches), but my intuition tells me you'd have to work hard to concoct a method where the branch density was high enough to offset the space you'd lose in other bytecodes by giving up the highly compact encoding that Squeak has at the moment.
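To make the tradeoff concrete, here's a toy comparison (the encoding rule, opcode names, and operand threshold are invented for illustration; Squeak's real bytecode set is considerably richer):

```python
# Toy model: a Squeak-style compact encoding spends 1 byte when the operand
# fits a small "compact" range and 2 bytes otherwise, versus a uniform
# 4-byte wordcode where every insn is the same size.

def compact_size(insns):
    """1 byte for small operands, 2 bytes for the extended form."""
    return sum(1 if operand < 16 else 2 for _, operand in insns)

def wordcode_size(insns):
    return 4 * len(insns)

# A typical short method: mostly small operand indices.
method = [("pushTemp", 0), ("pushLiteral", 1), ("send", 3), ("return", 0)]
print(compact_size(method), wordcode_size(method))  # 4 16
```

Even in this crude model the fixed-width form quadruples the common case, which is the point about the compact encodings packing frequent operands into single bytes.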
> The anticipated performance improvement would come from forcing aligned fetches more often
I'm not sure alignment is really an issue, since bytecodes are (by definition ;) a byte wide. (This would only have an impact on bizarre architectures, like certain so-called "supercomputers", that don't have byte-addressable memory at all.) You _would_ gain in reduced memory traffic (avoiding the additional fetches of operand bytes) but the working set size (and so cache pressure) would also increase. The optimal balance between the last two depends entirely on the physical resources (and memory speeds) of each individual machine, and on the behaviour of each particular application, so I think it's hard to predict whether you'd win or lose in general.
> and potentially simpler instruction decoding.
Absolutely.
FWIW, all of the Jitters did a pre-pass over the bytecode to convert everything into a "normal" form (a total of 20 or so "abstract insns", all of which were the same size and had the "opcode" in the same place) and to eliminate the convoluted "conditional jumps over unconditional jumps".
Like you say, this simplified compilation immensely. Whether or not (or on which architectures and under which conditions) it would increase interpreted performance would make for a fascinating experiment.
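A minimal sketch of the jump-elimination part of such a pre-pass (the opcode names, the tuple representation, and the omission of target renumbering are all simplifications of mine for illustration):

```python
# Rewrite the pattern "jumpFalse L1; jump L2; L1: ..." into the single
# insn "jumpTrue L2" by inverting the condition. Targets are insn
# indices; renumbering of other jump targets is omitted for brevity.

INVERT = {"jumpFalse": "jumpTrue", "jumpTrue": "jumpFalse"}

def normalize(insns):
    out, i = [], 0
    while i < len(insns):
        op = insns[i]
        # A conditional jump whose target is just past an unconditional
        # jump is the "conditional jump over unconditional jump" pattern.
        if (op[0] in INVERT and i + 1 < len(insns)
                and insns[i + 1][0] == "jump" and op[1] == i + 2):
            out.append((INVERT[op[0]], insns[i + 1][1]))
            i += 2
        else:
            out.append(op)
            i += 1
    return out

code = [("jumpFalse", 2), ("jump", 5), ("pushTemp", 0)]
print(normalize(code))  # [('jumpTrue', 5), ('pushTemp', 0)]
```

The pay-off is that downstream compilation (or interpretation) only ever sees one canonical conditional-branch shape.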
Markus has been talking about making parse trees be the "portable form" of compiled methods and then using a runtime compiler to convert them into bytecodes (for interpretation) or native code. Experimenting with different formats of bytecode (or "wordcode") would be really easy (and lots of fun) in such a context.
> It's really nice to work with an architecture where all instructions take up one word of memory, take one cycle to execute, and sometimes can cause five or six different useful things to occur during that one cycle. (ok, so I write a lot of DSP code...)
I write lots of PPC code, and with its combination of parallel execution and certain insns (like the masking/inserting shifts/rotates, not to mention the tricks you can do having a full range of logical operators on the eight independent condition code registers) that cook you an entire meal in one cycle, the issues are pretty similar.
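For readers who haven't met them, here's a Python stand-in for what PPC's rlwimi ("rotate left word immediate then mask insert") does in a single instruction; the example constants are mine:

```python
# rlwimi combines a rotate, an AND with a mask, and a merge into the
# untouched bits of the target register -- three or four RISC ops in one.

def rotl32(x, n):
    """32-bit rotate left."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def rlwimi(target, source, sh, mask):
    """Rotate `source` left by `sh`, keep the bits under `mask`, and
    insert them into `target`, preserving target bits outside the mask."""
    return (rotl32(source, sh) & mask) | (target & ~mask & 0xFFFFFFFF)

# Insert the low byte of `src` into bits 8-15 of `dst`:
dst, src = 0xAABBCCDD, 0x00000012
print(hex(rlwimi(dst, src, 8, 0x0000FF00)))  # 0xaabb12dd
```

Doing the same thing without it would take a shift, two ANDs, and an OR, which is what "cooking an entire meal in one cycle" buys you.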
> The single cycle thing isn't necessarily relevant for Smalltalk, but those "extended" bytecodes are really distasteful to me.
I find them distasteful too. I also find the multiple object header formats equally distasteful. OTOH, every little bit counts when trying to minimise the size of the image -- and there are more than a few people putting Squeak to work on severely limited machines.
Ian