Optimizing Squeak

Mon Feb 22 21:15:06 UTC 1999

>>Bytecode interpreter:
>>One thing costly on modern microprocessors is branch
>>mispredictions.  Since a bytecode interpreter calls
>>or jumps to different code when dispatching each bytecode,
>>it is likely to get a misprediction per bytecode.
>
>Not if the bytecode is used to index into a 256 element jump-table.  In
this case there is no need for an array-bounds check and there are no
conditional branches so no branch mispredictions can occur.

The issue is not array bounds checking. The issue is the processor pipleine
uses the address of the branch instruction (usually an unconditional jump
to a table indexed address) to "predict" what address a branch will go to.
As the branch will go to 256 different places, depending on the bytecode
value, the prediction will be mostly wrong. Really, instead of just using
the branch instruction as a hash to some destination table, actually doing
the branch effective address calculation as early as possible helps. It was
my understansing the PPC processors were much better about this than the
Intel procesors

On a Pentium II, a mispredicted branch means the processor has to throw
away all the instructions it so cleaverly fetched and executed, thinking it
knew where to go. As the processor execution stream can be out of order up
to about 40 instructions, it may have to throw away all the work done for
those 40 instructions, and restart the execution pipeline at the correct
branch address. 

The driving on a curved road analogy works pretty well here. The processor
can ONLY do 200 MPH. So you turn the first curve (bytecode dispatch). The
processor thinks it knows where your going, so zooms way out past the
curve. Eventually, all that driving past the curve has to be thrown away,
as you make you way back onto the road and turn for the new curve (bytecode
dispatch), only to have the processor once again shoot way past the curve
(like as much as 40 instructions past). Over time, you find your only
spending 30% of the time on the road, and 70% driving around on the dirt,
shooting off or getting back on the road.

If it weren't for the much high clock speeds, current Intel processors
would execute byetcodes slower than older simpler Intel processors, like
the 486. The branch prediction really does help run the C code that make up
the primitives. You get nice things like not getting execution stalled
because you have a memory cache miss (which can take like 50 processor
cycles if not hidden by out of order execution).

I very much agree that even a slow interpreter can run programs fast, if
you have lots of appropriate primitives implements in C. This is exactly
what languages like VisualBasic do. My guess is many Smalltalkers would be
upset if things like Collections where implemented as primitives. Even
though you never change that code, it's somehow better to have the
Smalltalk code there to look at.

Mabey a good performance jump could be attained by making many of the
standard classes compatable with the Smalltalk to C translator, and then
just compiling them into the VM. Mabey as part of an application deployment
process.

- Jan
___________________________________________________________________
            Paradigm Matrix Inc., San Ramon California
   "video products and development services for Win32 platforms"
Internet: Jan Bottorff janb at pmatrix.com
          WWW          http://www.pmatrix.com
Phone: voice  (925) 803-9318
       fax    (925) 803-9397
PGP: public key  <http://www-swiss.ai.mit.edu/~bal/pks-toplev.html>
     fingerprint  52 CB FF 60 91 25 F9 44  6F 87 23 C9 AB 5D 05 F6
___________________________________________________________________