Interpreter speedup possible?

Fri Aug 11 15:54:30 UTC 2000

Henrik Gedenryd <Henrik.Gedenryd at lucs.lu.se> wrote...
>The following described the Dolphin interpreter:
>
>> Another of our most dramatic performance improvements was achieved by
>> dispatching independently from each instruction rather than from a main
>> interpreter loop (threaded dispatch). This actually has very little effect
>> on the code size, as at most four instructions are required. The
>> implementation of many bytecodes is very short, and the dispatch
>> instructions can often be intermixed with the bytecode's instructions in
>> such a way that the overhead is reduced to the cost of the jump.
>
>Could this have any significant impact on Squeak? It would seem relatively
>easy to trick this with the Slang inliner.

Henrik -

We are aware of this technique (and have used it several times before).  If you look at the code we generate right now, we do copy the IP (instruction pointer) increment and the fetch of the next bytecode to the end of every bytecode service routine.  John Malone coded up a test case where we also copied the dispatch itself, and it made very little difference (at least on the PPC).  This is because the PPC (and most modern processors) can execute jumps under the shadow of a memory fetch and, even if it's all inlined, it can't start the bytecode dispatch until the IP increment and fetch are complete.  It might well be that, on some processors (especially older ones), copying the dispatch as well would speed things up.
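
A minimal sketch in C of the arrangement described above: each service routine ends up with its own copy of the IP increment and next-bytecode fetch, while the dispatch itself (the switch) stays shared.  The four-opcode instruction set and the name run_prefetch are hypothetical, made up for illustration; this is not the code Squeak generates.  Copying the dispatch as well would need something like GCC's labels-as-values extension, which standard C does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical four-opcode instruction set, for illustration only. */
enum { OP_HALT, OP_PUSH1, OP_ADD, OP_DUP };

/* Runs `code` until OP_HALT; returns the top of stack (0 if empty). */
int run_prefetch(const uint8_t *code)
{
    int stack[64];
    int sp = 0;                 /* number of values on the stack */
    const uint8_t *ip = code;
    uint8_t op = *ip++;         /* prime the first fetch */

    for (;;) {
        switch (op) {           /* the single, shared dispatch */
        case OP_PUSH1:
            assert(sp < 64);
            op = *ip++;         /* per-routine copy of increment + fetch */
            stack[sp++] = 1;    /* the real work can overlap the fetch */
            break;
        case OP_ADD:
            assert(sp >= 2);
            op = *ip++;
            stack[sp - 2] += stack[sp - 1];
            sp--;
            break;
        case OP_DUP:
            assert(sp >= 1 && sp < 64);
            op = *ip++;
            stack[sp] = stack[sp - 1];
            sp++;
            break;
        case OP_HALT:
        default:                /* unknown opcodes are treated as halt */
            return sp > 0 ? stack[sp - 1] : 0;
        }
    }
}
```

Calling run_prefetch on the bytes PUSH1, PUSH1, ADD, DUP, ADD, HALT leaves 4 on top of the stack.  Whether issuing the fetch early actually helps depends on the compiler and the processor, as the paragraph above says.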

There is almost no end to the tricks you can play with bytecode interpreters.  How about this one: you make four copies of the interpreter.  One of them fetches the next 4 bytes out of the instruction stream as a single word, and the other three merely shift and mask this word instead of having to fetch their bytecode from memory and increment the IP.  Naturally, the dispatch sequence for the first interpreter dispatches into the second interpreter, the second into the third, and so on around to the first again.  Only one in every four executions has to pick up a bytecode from memory (except for jumps and sends).  This one wastes space, so it is sensitive to the cache size.
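
A rough sketch in C of the four-copy idea, under several assumptions.  The opcode set, the Machine struct and the names execute and run_word_fetch are all hypothetical.  The "four copies" are modelled as four call sites of one small routine, which a compiler may or may not inline into four real copies.  The word is assembled byte by byte so the sketch does not depend on byte order or alignment; a real VM would presumably use a single aligned word load.  Jumps and sends, which would have to resynchronise with the instruction stream, are left out, and the code is assumed to be padded with halts to a multiple of 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical four-opcode instruction set, for illustration only. */
enum { BC_HALT, BC_PUSH1, BC_ADD, BC_DUP };

typedef struct {
    int stack[64];
    int sp;                     /* number of values on the stack */
} Machine;

/* Executes one bytecode; returns 0 when the machine should stop. */
static int execute(Machine *m, uint8_t op)
{
    switch (op) {
    case BC_PUSH1:
        assert(m->sp < 64);
        m->stack[m->sp++] = 1;
        return 1;
    case BC_ADD:
        assert(m->sp >= 2);
        m->stack[m->sp - 2] += m->stack[m->sp - 1];
        m->sp--;
        return 1;
    case BC_DUP:
        assert(m->sp >= 1 && m->sp < 64);
        m->stack[m->sp] = m->stack[m->sp - 1];
        m->sp++;
        return 1;
    default:                    /* BC_HALT and unknown opcodes stop */
        return 0;
    }
}

/* `code` must be padded with BC_HALT to a multiple of 4 bytes. */
int run_word_fetch(const uint8_t *code)
{
    Machine m = { {0}, 0 };
    const uint8_t *ip = code;

    for (;;) {
        /* Copy 1: the only one that reads the instruction stream. */
        uint32_t word = (uint32_t)ip[0]
                      | ((uint32_t)ip[1] << 8)
                      | ((uint32_t)ip[2] << 16)
                      | ((uint32_t)ip[3] << 24);
        ip += 4;
        if (!execute(&m, (uint8_t)(word & 0xFF))) break;
        /* Copies 2 to 4: shift and mask only, no memory access. */
        if (!execute(&m, (uint8_t)((word >> 8) & 0xFF))) break;
        if (!execute(&m, (uint8_t)((word >> 16) & 0xFF))) break;
        if (!execute(&m, (uint8_t)(word >> 24))) break;
    }
    return m.sp > 0 ? m.stack[m.sp - 1] : 0;
}
```

Running run_word_fetch on PUSH1, PUSH1, ADD, DUP, ADD, HALT, HALT, HALT touches the instruction stream twice for five executed bytecodes and leaves 4 on top of the stack.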

	- Dan