Optimizing Squeak

Eliot & Linda Miranda elcm at pacbell.net
Wed Feb 24 19:14:49 UTC 1999


Dan Ingalls wrote:

> For those who are following this, I'd like to make one fine point:
>
> One can structure a bytecode interpreter in exactly the above form, so in that sense you can eliminate the interpreter loop with bytecodes as well.  The real difference between bytecodes and threaded code is that a bytecode interpreter has to look the current bytecode up in a table to find the opcode address, whereas a threaded interpreter has the address in hand from the instruction fetch.

> While there is no practical way around the table lookup (*),

(*) Well, actually there are ways:  If the most frequent opcodes can be coded in, say, 8 words, then you can just shift the bytecode to get an address offset into the block of opcode service routines.  Any routines that don't fit can include a jump out of line where there is more space.
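
To make the footnote concrete, here's a minimal sketch in C of the two dispatch styles (all names are invented, and a real implementation would enforce the routine layout with assembler directives, since C gives no such guarantee):

    typedef void (*OpHandler)(void);

    extern OpHandler opcodeTable[256];   /* ordinary dispatch table */
    extern char opcodeBlock[];           /* block of aligned service routines */

    /* Ordinary bytecode dispatch: one memory load to find the address. */
    void dispatchViaTable(unsigned char bytecode)
    {
        opcodeTable[bytecode]();
    }

    /* The trick above: if every frequent routine is padded to 8 words
       (32 bytes on a 32-bit machine), the handler address can be computed
       with a shift and an add instead of loaded from a table. */
    void dispatchViaShift(unsigned char bytecode)
    {
        void (*handler)(void) =
            (void (*)(void))(opcodeBlock + ((unsigned)bytecode << 5));
        handler();
    }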

Fundamentally, where do the speed-ups come from?

a) the dispatch of direct threaded code being approximately twice as fast as bytecode dispatch because a table lookup is eliminated (this has been well discussed).  As Dan pointed out in his footnote, one can do something analogous with bytecode dispatch by aligning bytecode routines on a boundary.  However, one still pays for two arithmetic operations (a shift and an add), and will probably end up dedicating a register to the base address of the bytecode routines.  This register is extremely expensive on register-poor instruction set architectures like the x86.
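
To make a) concrete, here's a toy direct-threaded inner loop in GNU C (using its computed-goto extension); it's an illustration with invented names, not code from any of the engines discussed below:

    #include <stdio.h>

    int main(void)
    {
        /* The "compiled" method is a stream of label addresses rather
           than bytecodes, so dispatch is just: fetch next word, jump. */
        void *code[] = { &&pushOne, &&pushOne, &&add, &&halt };
        void **ip = code;            /* threaded-code instruction pointer */
        int stack[8], *sp = stack;

        goto **ip++;                 /* initial dispatch */

    pushOne:
        *sp++ = 1;
        goto **ip++;                 /* one load, one indirect jump; no table */
    add:
        --sp; sp[-1] += sp[0];
        goto **ip++;
    halt:
        printf("%d\n", sp[-1]);      /* prints 2 */
        return 0;
    }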

but b) the real advantage comes from threaded code having 4 or 8 times the information content of a bytecode.  I.e. in the threaded code designs I vaguely know (Peter Deutsch's threaded code prototype, done before HPS, the current VisualWorks engine that generates native code; BrouHaHa-TC, my own direct threaded code engine; and Ian Piumarta's Jitter) each bytecode is expanded to two 32-bit words.  Hence one has 64 bits per operation instead of 8.  These extra bits are very profitably used to cache the previous method lookup for each send bytecode.  A bytecode send specifies an argument count and a selector index.
Hence to do a bytecoded send one must (see the sketch in C after this list):
    index the stack with the argument count to derive the receiver
    index the method with the selector index to derive the selector
    derive the class of the receiver (which involves tag tests, etc)
    derive the base address of the method lookup cache
    compute a hash of the class and selector (e.g. add their raw addresses together with the cache base address)
    fetch and compare both the class and the selector from the cache against the class of the receiver and the send's selector
    if they match, fetch the method from the cache and activate it (this last bit is also expensive, since the method header must be decoded)
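
In C that path looks roughly like this (a hypothetical sketch; oop, classOf, literal and the rest are invented names, and the real Squeak interpreter differs in detail):

    typedef unsigned long oop;

    typedef struct { oop class; oop selector; oop method; } CacheEntry;

    #define CACHE_SIZE 1024
    CacheEntry methodCache[CACHE_SIZE];

    extern oop  classOf(oop receiver);           /* tag tests etc. */
    extern oop  literal(oop method, int index);  /* index the literal frame */
    extern void activate(oop method, oop rcvr);  /* decode header, build frame */
    extern void lookupAndActivate(oop selector, oop rcvr);  /* slow path */

    void sendBytecode(oop *sp, oop method, int argCount, int selectorIndex)
    {
        oop rcvr     = sp[-argCount - 1];             /* index stack by arg count */
        oop selector = literal(method, selectorIndex);
        oop class    = classOf(rcvr);

        /* hash class and selector into the global method lookup cache */
        CacheEntry *e = &methodCache[(class + selector) & (CACHE_SIZE - 1)];

        if (e->class == class && e->selector == selector)
            activate(e->method, rcvr);          /* hit: still must decode header */
        else
            lookupAndActivate(selector, rcvr);  /* miss: full lookup, refill */
    }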

Using e.g. threaded code, a send bytecode is typically translated into two words, where the first word points to a send routine that knows it has N arguments and the second word is the selector.  Read-ahead means the selector is really cheap to fetch from the word following the threaded code.  Note that both the threaded-code address and the selector are fetched using data accesses.  Once the send has been looked up and translated to threaded code (the method-to-threaded-code-method conversion is cached so one doesn't do this every time, except in Jitter I, which is one of the reasons for its poor performance), one replaces the second word (the selector) with a reference to the threaded-code method, and the first word with a routine that simply compares the class of the receiver against that stored in the header of the threaded-code method.  I.e. (see the sketch in C after this list):
    index the stack with a constant offset to derive the receiver
    derive its class (same unavoidable tag test)
    fetch the new method
    index this address with a constant offset to derive the class
    compare the class against that of the receiver
    if they match, set the threaded-code instruction pointer to point at the prolog routine of the method and continue
    (hence one can have a range of prolog instructions, say 8, which implicitly decode various kinds of method prologs,
    such as has arguments and temps, has no args but temps, etc)
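
A hypothetical C sketch of such a linked send (names invented; BrouHaHa-TC and Jitter differ in detail):

    typedef unsigned long oop;

    typedef struct {
        oop    cachedClass;  /* class of the receiver when the send was linked */
        void **prolog;       /* one of the ~8 shared prolog routines */
        /* ... translated threaded code follows ... */
    } TMethod;

    extern oop    classOf(oop receiver);
    extern void **relink(void **sendSite, oop receiver);  /* lookup + repatch */

    /* First word of a linked one-argument send points here; the following
       word holds the TMethod (it held the selector before linking). */
    void **linkedSend1Arg(void **ip, oop *sp)
    {
        oop rcvr = sp[-2];               /* constant offset: receiver below 1 arg */
        TMethod *m = (TMethod *)*ip++;   /* read-ahead: fetch the linked method */

        if (classOf(rcvr) == m->cachedClass)   /* the same unavoidable tag test */
            return m->prolog;            /* continue at the method's prolog */
        return relink(ip - 1, rcvr);     /* miss: look up again and repatch */
    }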

and c) using native code also provides more information, allowing analogs of the b) approach, but it gets the processor to do all the decode and dispatch and can use its branch-prediction logic to pre-fetch the prolog.  With e.g. HPS a send is translated into two instructions: one that loads a register with the selector, and another that calls a lookup routine that knows the number of arguments.  Once looked up, these are replaced by an instruction that loads the register with the class the receiver had when the send was looked up, and a direct call of the native method for that lookup.  The native method's prolog then derives the receiver's class (the receiver is passed in a register, not on the stack) and compares it with the class/selector register; if they match, execution continues.  The native method's prolog is specific to the particular method, so there's no method decode overhead at all.
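
Schematically it looks something like this (illustrative pseudo-assembly in the comments, the prolog check in C; none of this is HPS's actual code, and all names are invented):

    /* Unlinked send site, two instructions:
           load  rClassSel, #selector     ; selector into a known register
           call  lookup1Arg               ; lookup routine knowing the arg count

       After linking, the same two instructions are patched to:
           load  rClassSel, #cachedClass  ; receiver's class at link time
           call  nativeMethod             ; direct call of the translated method */

    typedef unsigned long oop;
    extern oop  classOf(oop receiver);
    extern void relinkSend(void *sendSite, oop receiver);

    /* The linked method's prolog, in C for clarity; the receiver arrives
       in a register, the cached class in rClassSel. */
    void methodProlog(oop receiver, oop cachedClass, void *sendSite)
    {
        if (classOf(receiver) != cachedClass)
            relinkSend(sendSite, receiver);  /* miss: re-lookup, repatch site */
        /* hit: fall straight into the method body -- the prolog is compiled
           for this exact method, so there is no header decode at all */
    }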

So IMO the fundamental reasons why threaded-code and native code approaches are "better" are that

1. threaded-code and native code approaches use more bits per send/per prolog, which allows them to implement sends and prologs as memo functions that elide costly operations in the common case.  Since typical programs exhibit huge locality, implementing the conversion as a cached translation of a subset of all the methods in the system reclaims a lot of the space trade-off that one has made.

2. native code uses the processor's decode logic, i.e. parallelism.  But one can imagine that super-scalar processors with large indirect branch prediction tables might elide the threaded-code dispatch.

In both Peter Deutsch's and my experience a tuned threaded-code implementation will run at about 2/3 the speed of a tuned native code implementation for macro benchmarks, but at around 1/4 for micro benchmarks.  Tuned bytecode approaches are a further factor of 1/2 behind threaded code.

But whereas a direct threaded-code implementation can be written to be portable in a matter of hours, a native code implementation becomes portable only in a matter of months.  And on small machines the gap between threaded code and native code narrows because there's less instruction cache for the native code to take advantage of.

One last point: native code requires instruction cache flushing when relinking sends on many processors.  It used to be the case that one could write extremely efficient cache flushing logic in the form of a jump table.  One would pre-generate a jump table big enough to shadow the icache and then jump into it at the point that matched the send instructions being flushed.  Hence to flush the cache one would simply execute a few nops and a pair of jumps (e.g. this works for 68020 & 68030 processors).  icache designs are now much more complex (e.g. the 68040 has a 4-way set-associative icache with a random replacement policy), meaning that one has to use the official route.
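
Schematically, the old trick looked something like this (a historical sketch with invented names; it only works on a small direct-mapped icache like the 68020/68030's 256-byte one, and modern processors need the official cache-maintenance instructions):

    enum { ICACHE_BYTES = 256 };   /* 68020/68030 icache size */

    /* pre-generated block of nops ending in a return */
    extern unsigned char flushBlock[ICACHE_BYTES];

    void flushLinesCovering(void *patchedSend)
    {
        /* jump into the block at the offset that maps to the same cache
           lines as the patched send; executing the nops displaces them */
        unsigned long offset = (unsigned long)patchedSend % ICACHE_BYTES;
        ((void (*)(void))(flushBlock + offset))();
    }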

[snip]

> Another form of optimization unique to bytecode interpreters is the use of multiple dispatch tables.  So, for instance, at the end of an arithmetic bytecode, one might look the next bytecode up in a table of routines that assume the top of stack to be a small integer.  We did this for small integers, true and false in one of the Apple ST 68k interpreters.

Interesting.  This is an idea I've not heard before.  What kind of speed-ups did it achieve?
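
If I understand it, the scheme would look something like this in C (a hypothetical sketch; all names are invented):

    typedef void (*OpHandler)(void);

    extern OpHandler normalTable[256];        /* no assumptions about the stack */
    extern OpHandler smallIntTopTable[256];   /* assume SmallInteger on top */

    extern unsigned char *ip;                 /* bytecode instruction pointer */

    void addBytecode(void)
    {
        /* ... pop, add, push; suppose the result is a SmallInteger ... */

        /* Dispatch the next bytecode through the specialized table, whose
           routines can skip the tag test on the top of stack. */
        smallIntTopTable[*ip++]();
    }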

_______________,,,^..^,,,_______________
Eliot Miranda, ParcPlace




