Optimizing Squeak

Wed Feb 24 18:18:24 UTC 1999

Ian Piumarta <Ian.Piumarta at inria.fr> wrote...
>The crux of threaded code is that the interpreter loop goes away
>entirely.  Each "opcode" ends by dispatching to the next opcode, so
>there is no return to a central loop -- e.g:
>
>	#define DISPATCH() { \
>	  register void *nextOp= (void *)fetchWord(); \
>	  goto *nextOp; \
>	}
>
>	pushReceiver:
>	  internalPush(receiver);
>	  DISPATCH();
>
>	pushTrue:
>	  internalPush(trueObj);
>	  DISPATCH();
>
>	pushFalse:
>	  internalPush(falseObj);
>	  DISPATCH();
>
>and so on.  (The addresses of the labels are the "opcodes" used to represent a program.)

For those who are following this, I'd like to make one fine point:

One can structure a bytecode interpreter in exactly the above form, so in that sense you can eliminate the interpreter loop with bytecodes as well.  The real difference between bytecodes and threaded code is that a bytecode interpreter has to look the current bytecode up in a table to find the opcode address, whereas a threaded interpreter has the address in hand from the instruction fetch.

While there is no practical way around the table lookup (*), there ARE a couple of further tweaks that we were going to try in the absence of Jitter.  One is to reduce redundant fetches of multiple bytecodes from the same 32-bit word in a method.  There are several ways to do this, but the idea is to keep the full 32 bits from the first fetch in a register, and then subsequent bytecodes can eschew the memory reference in favor of a simple shift and mask.  The challenge is to keep the shift count management overhead less than the memory reference that is being saved.  We have done this in microcode with good results.

Another form of optimization unique to bytcode interpreters is the use of multiple dispatch tables.  So, for instance, at the end of an arithmetic bytecode, one might look the next bytecode up in a table of routines that assume the top of stack to be a small integer.  We did this for small integers, true and false in one of the Apple ST 68k interpreters.

	- Dan

(*) Well, actually there are ways:  If the most frequent opcodes can be coded in, say, 8 words, then you can just shift the bytecode to get an address offset into the block of opcode service routines.  Any routines that don't fit can include a jump out of line where there is more space.