Progrmaming in Bytecode?

Fri Aug 2 01:49:01 UTC 2002

On Thu, 1 Aug 2002, Swan, Dean wrote:

> >The usual cause for "padding" in (capable) assemblers (other than for
> >prefetch alignment, as you point out) is due to dynamic linking where the
> 
>   Not only prefetch alignment.  I have heard, although I don't know
>   the details that sometimes it's desirable to force a dead cycle in
>   the execution pipeline.  I can only guess that this would be to
>   influence the scheduling of the execution units for complex
>   instructions, but I consider myself to be only a "casual observer"
>   with respect to this kind of thing.

I'm by no means an expert either but as far as I understand it higher
execution unit utilisation on superscalar processors (e.g., PPC) comes
from avoiding data dependencies: grouping together insns that have inputs
independent of outputs, avoiding adjacent load/store, and respecting
latencies (e.g., load-use and moves to/from "special" registers).  
Introducing NOPs doesn't really help here.

Introducing extra NOPs is more important w.r.t. throughput in the insn
fetch buffer (which ranges from 1 to 8 insns wide on PPC) when there's a
control dependency (a branch target isn't yet known).  In this case (at
least on the PPC) introducing NOPs to ensure that branches occur in the
last word of the prefetch buffer (and that targets are in the first word)
reduces the number of insns fetched (an entire buffer is fetched at once)
that are either never executed or that might be cancelled after
speculative execution when the branch condition is finally known.  This
helps maximise the number of fetched insns that are actually executed.

>   Well, to kind of tie this back into something relevant to
>   Squeak - I know a few people have talked about "new" VM
>   instruction sets.  It seems that it might help to think
>   about a VM where all instructions are the same size.
> 
>   This could simplify code generation (i.e. compiling methods),
>   and possibly improve performance and *reduce* code size (at
>   least in the case of long methods).

Now I think it's you who's being the optimist. ;)  I'm sure that
increasing the size of the "operand" fields (selector/variable indices)
will increase the overall size of code (the "compact" encodings are there
precisely to pack the common cases into as few bytes as possible).  The
only time you'd win are when there are a lot of out-of-range branches (to
or around other branches), but my intution tells me you'd have to work
hard to concoct a method where branch density was high enough to offset
the space you'd lose in other bytecodes from giving up the highly compact
encoding that Squeak has at the moment.

>   The anticipated performance
>   improvement would come from forcing aligned fetches more often

I'm not sure alignment is really an issue, since bytecodes are (by
definition ;) a byte wide.  (This would only have an impact on bizarre
architectures, like certain so-called "supercomputers", that don't have
byte-addressable memory at all.)  You _would_ gain in reduced memory
traffic (avoiding the additional fetches of operand bytes) but the working
set size (and so cache pressure) would also increase.  The optimal balance
between the last two depends entirely on the physical resources (and
memory speeds) of each individual machine, and on the behaviour of each
particular application, so I think it's hard to predict whether you'd win
or lose in general.

>   and potentially simpler instruction decoding.

Absolutely.

FWIW, all of the Jitters did a pre-pass over the bytecode to convert
everthing into a "normal" form (a total of 20 or so "abstract insns", all
of which were the same size and had the "opcode" in the same place) and to
eliminate the convoluted "conditional jumps over unconditional jumps".

Like you say, this simplified compilation immensely.  Whether or not (or
on which architectures and under which conditions) it would increase
interpreted performance would make for a fascinating experiment.

Markus has been talking about making parse trees be the "portable form" of
compiled methods and then using a runtime compiler to convert them into
bytecodes (for interpretation) or native code.  Experimenting with
different formats of bytecode (or "wordcode") would be really easy (and
lots of fun) in such a context.

>   It's really nice to work with an architecture where all instructions
>   take up one word of memory, take one cyle to execute, and sometimes
>   can cause five or six different useful things to occur during that
>   one cycle.  (ok, so I write a lot of DSP code...)

I write lots of PPC code, and with its combination of parallel execution
and certain insns (like the masking/inserting shifts/rotates, not to
mention the tricks you can do having a full range of logical operators on
the eight independent condition code registers) that cook you an entire
meal in one cycle, the issues are pretty similar.

>   The single cycle
>   thing isn't necessarily relevant for Smalltalk,  but those "extended"
>   bytecodes are really distasteful to me.

I find them distasteful too.  I also find the multiple object header
formats equally distasteful.  OTOH, every little bit counts when trying to
minimise the size of the image -- and there a more than a few people
putting Squeak to work on severely limited machines.

Ian