Planning for Exupery 0.15

bryce at kampjes.demon.co.uk
Fri Aug 8 21:38:57 UTC 2008


The two areas most in need of improvement before a 1.0 are now
runtime performance and reliability. Hopefully 0.15 will lead to a
decent improvement in both. Runtime performance comes first, as the
end of 0.14 already involved a decent round of testing and debugging.

Here are some benchmarks (interpreted time first, then compiled time,
and the ratio between them):

  arithmaticLoopBenchmark  417 compiled  94 ratio: 4.436
  bytecodeBenchmark        725 compiled 262 ratio: 2.767
  sendBenchmark            692 compiled 403 ratio: 1.717
  doLoopsBenchmark         389 compiled 385 ratio: 1.010
  pointCreation            423 compiled 426 ratio: 0.993
  largeExplorers           198 compiled 199 ratio: 0.995
  compilerBenchmark        245 compiled 249 ratio: 0.984
  Cumulative Time          401 compiled 260 ratio: 1.542

The primary goal is to improve the last two benchmarks, the two macro
benchmarks: largeExplorers and compilerBenchmark. Both use a profiler
to decide what to compile; the aim is to compile enough methods to
make a difference, reasonably quickly, so the benchmark doesn't take
too long to run.

Here's the profile for compilerBenchmark:
CPU: Core 2, speed 3005.67 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
Counted INST_RETIRED.ANY_P events (number of instructions retired) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        samples  %        image name               app name                 symbol name
4122385  62.5654  4860169  58.3687  squeak                   squeak                   interpret
447635    6.7937  715498    8.5928  anon (tgid:6321 range:0xb1c91000-0xb7bf0000) squeak                   (no symbols)
224375    3.4053  412666    4.9560  squeak                   squeak                   exuperyCreateContext
157809    2.3951  269117    3.2320  squeak                   squeak                   exuperyIsNativeContext
126427    1.9188  230642    2.7699  squeak                   squeak                   allocateheaderSizeh1h2h3doFillwith
96316     1.4618  107350    1.2892  squeak                   squeak                   sweepPhase
87506     1.3281  47304     0.5681  squeak                   squeak                   lookupMethodInClass
53014     0.8046  71782     0.8621  squeak                   squeak                   markAndTrace
52262     0.7932  84112     1.0102  squeak                   squeak                   exuperySetupMessageSend
51999     0.7892  76301     0.9163  squeak                   squeak                   exuperyCallMethod
50920     0.7728  84568     1.0156  squeak                   squeak                   instantiateContextsizeInBytes
47231     0.7168  31047     0.3729  no-vmlinux               no-vmlinux               (no symbols)
42841     0.6502  52130     0.6261  MiscPrimitivePlugin      MiscPrimitivePlugin      primitiveStringHash
42560     0.6459  77447     0.9301  squeak                   squeak                   activateNewMethod

Only 14% of the time is going into code compiled by Exupery and its
helper functions; 62% of the time is still in the main interpreter
loop. Interestingly, the ratio between time in native code and time in
exuperyCreateContext is the same as in the send benchmark, so it's
likely that the native code is mostly send processing, either being
called from interpreted code or sending to compiled code. The native
code is executing about 1.6 instructions per cycle (715498 retired
instructions over 447635 cycles in the anon region); the CPU maxes
out at 4 instructions per cycle. 1.6 instructions per cycle would be
excellent for an Athlon, and while Core 2s are more efficient it's
still good.

Half of the time spent in exuperySetupMessageSend is going to
dispatching to unhandled primitives; the other half will be going to
sends to interpreted code.

There are a few obvious things to do to improve performance:
 * Implement more addressing modes
 * Natively compile calls to C primitives
 * Implement the ^ true, ^ false, and ^ nil primitives
 * Remove jumps to jumps

Implementing more addressing modes looks the most promising. It
should speed up most of the benchmarks, as all but the bytecode
benchmark spend significant time in code that suffers badly from a
single missing addressing mode, especially the object creation and
send/return code.

The current send optimisation is PICs, which only work when sending
from compiled code to compiled code. Sends to and from interpreted
code are about the same speed as interpreted-to-interpreted sends, or
a little slower. It's true that this can be avoided by compiling more
methods so that most sends are compiled-to-compiled, but it's much
easier to decide what to compile if compiling anything is likely to
lead to a speed improvement and not risk a speed loss.
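
Purely as an illustration of what a PIC buys: a single monomorphic
case boils down to comparing the receiver's class against the class
recorded when the send site was compiled, then jumping straight to
the cached compiled method on a match. Sketched in the same
s-expression notation as the return-sequence example further down
(the class fetch, the cachedClassOop constant, and the block labels
are hypothetical, not Exupery's actual output):

  (block30
    (mov (ecx) eax)           ; fetch the receiver's class word (glossing over Squeak's header format)
    (cmp cachedClassOop eax)  ; cachedClassOop stands for the class cached at the send site
    (jnz block31)             ; miss: fall back to the full message lookup
    (jmp block32)             ; hit: run the cached compiled method
   )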

Compiling the call to the primitive function into native code will
allow the primitives to be dispatched via PIC instead of needing to go
through exuperySetupMessageSend. Half of the calls to
exuperySetupMessageSend in the compiler benchmark are for primitives;
in the large explorers benchmark three quarters of the calls are. That
time will disappear. Evaluating blocks uses a primitive send, which
takes a large proportion of the block dispatch time.
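
Again only as a hypothetical sketch (the call instruction, the
#successFlag check, and the block labels are mine, not Exupery's
generated code): a PIC case whose target method is a C primitive
could call the primitive function directly and only fall back when it
fails.

  (block33
    (call primitiveStringHash)  ; call the C primitive function directly from compiled code
    (mov #successFlag eax)      ; the interpreter's success flag says whether the primitive failed
    (mov (eax) eax)
    (cmp 0 eax)
    (jnz block34)               ; success: block34 returns the primitive's result
    (jmp block35)               ; failure: fall back to the normal send machinery
   )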

There's a handful of primitives that are implemented inside the main
interpreter loop; ^ true, ^ false, and ^ nil are some of them. They
often show up when they fail to inline, as Exupery cannot yet compile
them. If code uses them, then compiling it will cause a large time
loss due to using a full primitive dispatch compared with the
interpreter. Given how simple they are, implementing them makes sense.

Exupery can create code that jumps directly to an unconditional
jump. This does happen in some inner loops. The jumps should be
modified to go to the target jump's destination. Jumping to a jump
makes life difficult for the CPU's front end. In the compiler
benchmark the reservation stations are full for only 9% of the time,
which indicates that most of the time the front end cannot keep up
with instruction execution.
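
The fix is straightforward jump threading. In the same notation
(block numbers made up), a branch that currently lands on a block
containing nothing but another jump:

    (jumpSignedGreaterEqualThan block26)
    ...
  (block26
    (jmp block28)
   )

would be retargeted so the original branch goes straight to the final
destination:

    (jumpSignedGreaterEqualThan block28)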

Here's an example of the kind of commonly generated code that suffers
from the addressing mode problem. This example is from the method
return sequence; every compiled method goes through a block like this
when returning:
  (block24
    (mov #nilObj eax)
    (mov (eax) eax)
    (mov eax (8 ecx))
    (mov #activeContext eax)
    (mov ebx (eax))
    (mov #youngStart eax)
    (mov #activeContext ebx)
    (mov (ebx) ebx)
    (cmp (eax) ebx)
    (jumpUnsignedGreaterEqualThan block25)
    (mov #activeContext eax)
    (mov (eax) eax)
    (mov (eax) ebx)
    (mov 1073741824 eax)
    (and ebx eax)
    (jnz block25)
    (mov 2400 eax)
    (mov #rootTableCount ecx)
    (cmp eax (ecx))
    (jumpSignedGreaterEqualThan block26)
    (jmp block27)
   )

The problem is instructions like "(mov #nilObj eax)": the address
should be encoded in the memory access that uses it, so there's no
need to move an address into a register before using it. There are
other problems besides not handling literal indirect addressing, but
the literal indirect problem is the largest by a long shot.
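
Concretely, with literal indirect addressing the first two
instructions of block24 above should collapse into one (the exact
syntax for the indirection here is my guess, not Exupery's):

    (mov #nilObj eax)
    (mov (eax) eax)

becomes

    (mov (#nilObj) eax)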

I'm going to add literal indirect addressing first; while it's harder
to estimate what it will do to overall performance, it is a problem
for almost all the benchmarks.

It would also be worthwhile to improve the profiling tools. It should
be relatively easy to get oprofile to show the compiled method names
instead of lumping all compiled code into the "anon" memory bucket. It
would also be worthwhile, and easy, to write some code to read the
oprofile files and compute the ratios rather than calculate them by
hand.

Bryce

