[squeak-dev] Re: jitter (was: The Old Man)

Thu Apr 3 21:55:59 UTC 2008

Andreas Raab writes:
 > bryce at kampjes.demon.co.uk wrote:
 > Indeed. So what's in the way of practical performance improvement at 
 > this point? I was quite surprised that in your corrected benchmarks the 
 > two that were macros wouldn't show any improvement:
 >    bytecodeBenchmark       2111 compiled  460 ratio:  4.589
 >    sendBenchmark           1637 compiled  668 ratio:  2.451
 >    [...]
 >    largeExplorers           728 compiled  715 ratio:  1.018
 >    compilerBenchmark        483 compiled  489 ratio:  0.988
 > 
 > With sends 2.5x faster I would expect *some* noticeable improvement. Any 
 > ideas what the problem is?

Here are the latest benchmarks. There's a 10% gain for the compiler
benchmark. There's a bigger loss for largeExplorers, but I think that's
triggered by compiling more methods and thus hitting a missing
optimisation that previous benchmark runs didn't hit.

   arithmaticLoopBenchmark 1397 compiled 138 ratio: 10.122
   bytecodeBenchmark 2183 compiled 435 ratio: 5.017
   sendBenchmark 1657 compiled 741 ratio: 2.236
   doLoopsBenchmark 1100 compiled 813 ratio: 1.353
   pointCreation 988 compiled 968 ratio: 1.021
   largeExplorers 729 compiled 780 ratio: 0.935
   compilerBenchmark 529 compiled 480 ratio: *1.102*
   Cumulative Time 1113.161 compiled 538.355 ratio 2.068

   ExuperyBenchmarks>>arithmeticLoop 199ms
   SmallInteger>>benchmark 791ms
   InstructionStream>>interpretExtension:in:for: 14266ms
   Average 1309.515
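
For anyone checking the arithmetic, here is a small sketch that recomputes
the ratio column and the summary row from the first table. It assumes the
ratio is the first number divided by the "compiled" number, and that the
"Cumulative Time" row is a geometric mean of each column. Neither is stated
in the post; both are inferred from the figures, which they reproduce to
within rounding. The names RESULTS, ratio and geometric_mean are made up
for this example.

```python
import math

# (benchmark, uncompiled figure, compiled figure), copied from the table above.
RESULTS = [
    ("arithmaticLoopBenchmark", 1397, 138),
    ("bytecodeBenchmark", 2183, 435),
    ("sendBenchmark", 1657, 741),
    ("doLoopsBenchmark", 1100, 813),
    ("pointCreation", 988, 968),
    ("largeExplorers", 729, 780),
    ("compilerBenchmark", 529, 480),
]


def ratio(uncompiled, compiled):
    """Assumed definition: above 1.0 is a gain from compiling, below 1.0 a loss."""
    return uncompiled / compiled


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    for name, uncompiled, compiled in RESULTS:
        print(f"{name:24s} ratio: {ratio(uncompiled, compiled):.3f}")
    gm_uncompiled = geometric_mean(u for _, u, _ in RESULTS)
    gm_compiled = geometric_mean(c for _, _, c in RESULTS)
    print(f"geometric mean {gm_uncompiled:.3f} compiled {gm_compiled:.3f} "
          f"ratio {gm_uncompiled / gm_compiled:.3f}")
```

The recomputed ratios can differ from the posted ones in the last digit,
presumably because the posted ratios came from unrounded timings. A
geometric mean keeps one very large figure from dominating the summary,
which matters when the rows span an order of magnitude.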

There are many reasons why there isn't a larger gain. One is that only
a few methods are being compiled. There was a register allocation
issue. String>>at: and at:put: are not yet compiled, so they take the
slow path through to the interpreter. The same goes for ^ true,
^ false, and ^ nil. All are used by those macro benchmarks.

The problem I'm working on now is that the stack is loaded into
registers when Exupery enters a context, then saved back into the
context when it leaves. This is particularly inefficient if the
registers get spilled to the C stack, as then they're copied into
registers only to be immediately copied to memory at a different
location. This makes real-world sends potentially worse than the send
benchmark.

With earlier benchmarks, compilation was slowing down object allocation
because allocation is inlined into the main interpreter loop, whereas
Exupery was doing a full worst-case send to get to the primitive. (1)

To benefit from PICs, both the sending and receiving methods must be
compiled. That may not be happening as much as it should be.
Compilation can be blocked by some of the few missing bytecodes. Stack
duplication is the only serious missing bytecode.
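
To illustrate why both ends of a send need to be compiled, here is a toy
model of a polymorphic inline cache. It is not Exupery's code: CallSite,
compiled_methods and interpret are hypothetical names, and the cache is a
plain dictionary rather than generated machine code. The assumption is that
the cache lives at a send site inside a compiled method (so the sender must
be compiled) and can only hold entry points of compiled methods (so the
receiver's method must be compiled too).

```python
class CallSite:
    """Toy polymorphic inline cache for one send site (illustrative only)."""

    def __init__(self, selector, max_entries=4):
        self.selector = selector
        self.max_entries = max_entries
        self.cache = {}        # receiver class -> compiled entry point
        self.fast_sends = 0    # sends answered straight from the cache
        self.slow_sends = 0    # sends that needed a lookup or the interpreter

    def send(self, receiver, compiled_methods, interpret):
        cls = type(receiver)
        target = self.cache.get(cls)
        if target is not None:
            # Hit: jump directly to the compiled method.
            self.fast_sends += 1
            return target(receiver)
        # Miss: do the full lookup.
        self.slow_sends += 1
        target = compiled_methods.get((cls, self.selector))
        if target is None:
            # Receiving method is not compiled: nothing to cache, so every
            # send of this kind falls back to the interpreter's send path.
            return interpret(receiver, self.selector)
        if len(self.cache) < self.max_entries:
            self.cache[cls] = target
        return target(receiver)


def demo():
    # Only the int version of "double" counts as compiled in this toy setup.
    compiled_methods = {(int, "double"): lambda r: r * 2}

    def interpret(receiver, selector):
        # Stand-in for the interpreter's full send code.
        return receiver * 2

    site = CallSite("double")
    for _ in range(3):
        site.send(21, compiled_methods, interpret)    # 1 miss, then 2 hits
    for _ in range(3):
        site.send("ab", compiled_methods, interpret)  # always the slow path
    return site.fast_sends, site.slow_sends
```

In this model the int receiver pays for one lookup and is fast afterwards,
while the str receiver never gets cached because its method is not
compiled, which is the situation described above.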

The generated code is still relatively sloppy. Exupery doesn't handle
SIB byte addressing modes, so it can't access literal indirections,
which are used heavily to access interpreter state. Temporaries are
not yet stored in registers; that will wait until the stack register
improvements are finished.

The visible progress since the last release is that the compiler
benchmark now shows a 10% gain and compiling
interpretExtension:in:for: is about 9 times faster.

Bryce

(1) It wasn't creating the context, but it was going through the PIC
and then dropping into a helper method which ran through the
interpreter's send code.