[Vm-dev] Cog and Exupery

Wed Mar 4 22:47:21 UTC 2009

Eliot Miranda writes:
 >  On Sun, Mar 1, 2009 at 1:46 PM, <bryce at kampjes.demon.co.uk> wrote:
 > > I've thought about using a context stack several times over the years.
 > > The key benefits are faster returns
 > 
 > and much faster sends

Well, sends with few arguments could be optimised without a context
stack: pass the arguments in registers, get a context from the list
(which is quick), and have the PIC use an unconditional jump rather
than a call.
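
As a rough illustration of why getting a context from a list can be
cheap, here is a minimal sketch in Python. It is not Exupery or Squeak
VM code; the names Context and ContextPool are hypothetical, and the
assumption is that "the list" behaves like a free list of recycled
contexts.

```python
# Hypothetical sketch: recycling context objects through a free list.
# Context and ContextPool are illustrative names, not VM structures.

class Context:
    __slots__ = ("receiver", "args", "sender")

    def __init__(self):
        self.receiver = None
        self.args = ()
        self.sender = None


class ContextPool:
    def __init__(self):
        self._free = []  # recycled contexts, most recently released last

    def allocate(self, receiver, args, sender):
        # Fast path: pop a recycled context. Slow path: create a new one.
        ctx = self._free.pop() if self._free else Context()
        ctx.receiver, ctx.args, ctx.sender = receiver, args, sender
        return ctx

    def release(self, ctx):
        # Clear references so a recycled context keeps nothing alive.
        ctx.receiver, ctx.args, ctx.sender = None, (), None
        self._free.append(ctx)
```

On the fast path an allocation is one pop plus a few stores, which is
the property the argument above relies on.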

It's even possible to avoid adjusting the size of the stack frame in
most cases by allocating a frame big enough for most contexts. Exupery
only uses the stack frame for spilling internal temporary registers so
its use tends to be small. Stack variables spill directly into the
context. (1)

 > >  > Aim higher :)  I'm hoping for 10x current Squeak performance for
 > >  > Smalltalk-intensive benchmarks some time later this year.
 > >
 > > My original and current aim is double VW's performance or be roughly
 > > equivalent to C.
 > 
 > 
 > And what's the state of this effort?  What metrics lead you to believe you
 > can double VW's performance?  Where are you currently?  What do you mean by
 > "equivalent to C", unoptimized, -O1, -O4 -funroll-loops, gcc, Intel C
 > compiler?  What benchmarks have you focussed on?

Exupery is fairly stable and provides modest gains, but needs more
tuning before it performs as it should. Effort could now go into
either that tuning or adding full method inlining. Until recently,
reliability was the main show-stopper to actual use. There are still
plenty of areas to optimise to improve both compile time and run time.

Here are the current benchmarks:

  arithmaticLoopBenchmark 390 compiled 80 ratio: 4.875
  bytecodeBenchmark 724 compiled 250 ratio: 2.895
  sendBenchmark 663 compiled 385 ratio: 1.722
  doLoopsBenchmark 381 compiled 235 ratio: 1.621
  pointCreation 394 compiled 389 ratio: 1.013
  largeExplorers 269 compiled 210 ratio: 1.280
  compilerBenchmark 273 compiled 250 ratio: 1.092
  Cumulative Time 413.408 compiled 232.706 ratio 1.777

  ExuperyBenchmarks>>arithmeticLoop 103ms
  SmallInteger>>benchmark 341ms
  InstructionStream>>interpretExtension:in:for: 6069ms
  Average 597.360
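
The summary rows are not labelled with how they were computed. The
sketch below checks one reading: that "Cumulative Time" and "Average"
are geometric means of the rows above them. That is an inference from
the printed numbers, not something stated in this post, and the
variable names are made up for the example.

```python
import math

# (interpreted, compiled) times copied from the first table above.
timings = {
    "arithmaticLoopBenchmark": (390, 80),
    "bytecodeBenchmark": (724, 250),
    "sendBenchmark": (663, 385),
    "doLoopsBenchmark": (381, 235),
    "pointCreation": (394, 389),
    "largeExplorers": (269, 210),
    "compilerBenchmark": (273, 250),
}

# Times in ms copied from the second table above.
compile_times = [103, 341, 6069]


def geometric_mean(values):
    # exp of the mean of the logs; assumes all values are positive.
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


interpreted_mean = geometric_mean(t[0] for t in timings.values())
compiled_mean = geometric_mean(t[1] for t in timings.values())
overall_ratio = interpreted_mean / compiled_mean
compile_time_mean = geometric_mean(compile_times)

print(f"{interpreted_mean:.3f} {compiled_mean:.3f} {overall_ratio:.3f}")
print(f"{compile_time_mean:.3f}")
```

Under that assumption the results land very close to the printed
413.408, 232.706 and 597.360. A couple of the per-row ratios differ in
the last digit from the quotient of the printed times (724/250 is
2.896, not 2.895), which suggests the printed times are rounded.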

largeExplorers and compilerBenchmark are both real code. They do vary
depending on what the profiler decides to profile. The rest are micro
benchmarks. The benchmark suite looks much worse since upgrading to
a Core 2, which is very efficient at running the interpreter.

Optimisation so far has been about removing gross inefficiencies. I
avoid any optimisation that's likely to make other cases worse. For
example, the last two optimisations were calling all primitives from
native code, rather than just the few that Exupery compiles, and
adding indirect literal addressing, which is used to access VM
variables.

What do I mean by equivalent to C? Basically being able to fully
optimise predictable code and optimise unpredictable code enough that
the remaining inefficiencies are hidden behind L2/L3 cache misses and
branch mispredicts. C was executing around 1 instruction per clock,
measured from system code, which leaves some room to hide
inefficiencies.

Why do I believe it's possible? With dynamic type information we can
optimise for the common case, and after inlining loops we can often
pull the type checks out of the loops. Then it's just a case of
competing optimiser to optimiser, but most of the gains for most code
are likely to come from a small set of optimisations.

I've also played around with a few thought experiments including
looking at what it would take to compile a dot product with zero
overhead inside the loops.
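
To make "pulling the type checks out of the loops" concrete for the
dot product case, here is a small sketch in Python. It only
illustrates the shape of the transformation and is not Exupery's
output; dot_generic and dot_specialised are hypothetical names, and a
typed array stands in for a homogeneous float array.

```python
from array import array


def dot_generic(xs, ys):
    # Unoptimised shape: the element types are re-checked on every
    # iteration, as a dynamically typed send would have to do.
    total = 0.0
    for x, y in zip(xs, ys):
        if not isinstance(x, float) or not isinstance(y, float):
            raise TypeError("dot_generic expects floats")
        total += x * y
    return total


def dot_specialised(xs, ys):
    # Hoisted guard: one check on the containers before the loop.
    # An array with typecode "d" can only hold C doubles, so the
    # loop body needs no per-element checks.
    if (isinstance(xs, array) and isinstance(ys, array)
            and xs.typecode == "d" and ys.typecode == "d"
            and len(xs) == len(ys)):
        total = 0.0
        for i in range(len(xs)):
            total += xs[i] * ys[i]
        return total
    # Uncommon case: fall back to the fully checked version.
    return dot_generic(xs, ys)
```

The guard costs the same however long the arrays are, so its overhead
is paid once per call rather than once per element.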

Bryce

(1) I really should move temps and arguments into registers
too. Currently only stack values are held in registers. Moving
more variables into registers only made sense after spilling
directly into the context and reducing interference in the
register allocator.