Interpreter>>pushReceiverVariableBytecode
Tommy Thorn
thorn at meko.dk
Sat Sep 7 01:07:17 UTC 2002
Ian Piumarta wrote:
[lots of details on manually scheduling interpret to fetch the bytecode
ahead of time deleted]
All true for most (if not all?) modern processors, but it's really only
true because, and for as long as, the compilers aren't clever enough.
On some architectures, I-cache pressure (i.e., code size) might matter
more, in which case this is a loss.
>Because the speedup is measurable, _significantly_ so when using gcc in
>which case the final "break" in each bytecode is converted (manually, by
>an awk script run on the interp.c file) into an explicit dispatch directly
>to the next bytecode's case label
>
>    void *bytecodeDispatchTable[256] = { &&label0, ..., &&label255 };
>    ...
>    case N: labelN:
>      bytecode= fetchNextBytecode();
>      doTheWork();
>      goto *bytecodeDispatchTable[currentBytecode]; /* break */
>
>which entirely eliminates the interpreter's dispatch loop.
>
This is a standard GCC interpreter trick, but I _really_ think this is a
loss for the ARM, because the switch is actually implemented very
efficiently on ARM with just one instruction using PC-relative
addressing(*). Once you use computed gotos you need to hold
bytecodeDispatchTable in a register or (worse still) load the constant
each time. Does saving one (unconditional) jump back to the
interpreter loop pay off the added register pressure?
For the ARM, fetching the bytecode ahead of time barely hurts code size
and is probably a win or a wash.
Alas, I haven't had time to perform any detailed measurements to answer
these questions.
/Tommy
(*): GCC also generates a useless bytecode < 256 comparison that should
be eliminated.
More information about the Squeak-dev
mailing list