On Wed, Mar 4, 2009 at 2:47 PM, <bryce@kampjes.demon.co.uk> wrote:

Eliot Miranda writes:
 >  On Sun, Mar 1, 2009 at 1:46 PM, <bryce@kampjes.demon.co.uk> wrote:
 > > I've thought about using a context stack several times over the years.
 > > The key benefits are faster returns
 > and much faster sends

Well sends with few arguments could be optimised without a context
stack. Pass the arguments in registers, getting a context from the
list is quick, and the PIC uses an unconditional jump rather than a

Getting anything off a list is not quick compared with a call.  Taking things off a list involves a write.  Calls only write a return address to the stack (and then only on CISCs), which is well optimized by current processors.

PICs don't eliminate calls.  A send is a call whether through a PIC or not.  One calls the PIC (if only to collect a return address) and the PIC jumps.  There's still always a call.

Do some measurements on modern processors and see if there are significant differences between call and unconditional jump.  A call is an unconditional jump plus the pushing of a return address.  When I last looked at this carefully I found there was no significant difference in performance unless one was being deeply recursive.  That is, attempting to use unconditional jumps to improve performance, e.g. by inlining accessors, gained nothing because the processor was taking care to make calls and returns cost the same as unconditional jumps (this on x86 in the 2004 timeframe).

It's even possible to avoid adjusting the size of the stack frame in
most cases by allocating a frame big enough for most contexts. Exupery
only uses the stack frame for spilling internal temporary registers so
its use tends to be small. Stack variables spill directly into the
context. (1)

I don't understand the implication here.  With a stack the frame size is essentially irrelevant.  With a context it certainly is not.  If all contexts are large then one eats through memory faster and pays the price.

 > >  > Aim higher :)  I'm hoping for 10x current Squeak performance for
 > >  > Smalltalk-intensive benchmarks some time later this year.
 > >
 > > My original and current aim is double VW's performance or be roughly
 > > equivalent to C.
 > And what's the state of this effort?  What metrics lead you to believe you
 > can double VW's performance?  Where are you currently?  What do you mean by
 > "equivalent to C", unoptimized, -O1, -O4 -funroll-loops, gcc, Intel C
 > compiler?  What benchmarks have you focussed on?

Exupery is fairly stable and providing modest gains but needs more
tuning before it's performing as it should. Effort could now go into
either tuning the current engine so it performs as it should or adding
full method inlining. Until recently reliability was the main
show-stopper to actual use. There are still plenty of areas to optimise
to improve both compile time and run time.

Here are the current benchmarks:

 arithmaticLoopBenchmark 390     compiled  80     ratio: 4.875
 bytecodeBenchmark       724     compiled 250     ratio: 2.895
 sendBenchmark           663     compiled 385     ratio: 1.722
 doLoopsBenchmark        381     compiled 235     ratio: 1.621
 pointCreation           394     compiled 389     ratio: 1.013
 largeExplorers          269     compiled 210     ratio: 1.280
 compilerBenchmark       273     compiled 250     ratio: 1.092
 Cumulative Time         413.408 compiled 232.706 ratio: 1.777

 ExuperyBenchmarks>>arithmeticLoop 103ms
 SmallInteger>>benchmark 341ms
 InstructionStream>>interpretExtension:in:for: 6069ms
 Average 597.360

largeExplorers and compilerBenchmark are both real code. They do vary
depending on what the profiler decides to profile. The rest are micro
benchmarks. The benchmark suite looks much worse since upgrading to
a Core 2, which is very efficient at running the interpreter.
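
For anyone wanting to recheck the figures, here is a minimal Python sketch that recomputes the ratios from the times listed above. It assumes the "Cumulative Time" row is the geometric mean of each column; that is an inference from the numbers, not something the benchmark output states. The names BENCHMARKS, geometric_mean and summarise are made up for this sketch.

```python
import math

# Times copied from the benchmark listing above: (name, interpreted, compiled).
BENCHMARKS = [
    ("arithmaticLoopBenchmark", 390, 80),
    ("bytecodeBenchmark", 724, 250),
    ("sendBenchmark", 663, 385),
    ("doLoopsBenchmark", 381, 235),
    ("pointCreation", 394, 389),
    ("largeExplorers", 269, 210),
    ("compilerBenchmark", 273, 250),
]


def geometric_mean(values):
    """Return the n-th root of the product of n positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def summarise(benchmarks):
    """Return per-benchmark ratios plus the geometric mean of each column."""
    ratios = {name: interp / comp for name, interp, comp in benchmarks}
    interp_mean = geometric_mean([interp for _, interp, _ in benchmarks])
    comp_mean = geometric_mean([comp for _, _, comp in benchmarks])
    return ratios, interp_mean, comp_mean


if __name__ == "__main__":
    ratios, interp_mean, comp_mean = summarise(BENCHMARKS)
    for name, ratio in ratios.items():
        print(f"{name:26s} ratio: {ratio:.3f}")
    print(f"geometric mean {interp_mean:.3f} compiled {comp_mean:.3f} "
          f"ratio: {interp_mean / comp_mean:.3f}")
```

Under that assumption the recomputed means agree with the reported 413.408 and 232.706 to within rounding, and a couple of the per-benchmark ratios in the listing look truncated rather than rounded.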

Which package version and repository are you taking the benchmarks from?  I'd like to run them in Cog.  Would you be interested in running the compiler language shootout benchmarks I'm using?

Optimisation so far has been about removing gross inefficiencies. I
avoid any optimisation that's likely to make other cases worse. e.g.
the last two optimisations were calling all primitives from native
code rather than just the few that Exupery compiles and adding
indirect literal addressing used to access VM variables.

What do I mean by equivalent to C? Basically being able to fully
optimise predictable code and optimise unpredictable code enough that
the remaining inefficiencies are hidden behind L2/L3 cache misses and
branch mispredicts. C was executing around 1 instruction per clock
measured from system code, which leaves some room to hide inefficiencies.

Why do I believe it's possible? With dynamic type information we can
optimise for the common case, and after inlining loops we can often
pull the type checks out of the loops. Then it's just a case of
competing optimiser to optimiser but most of the gains for most code
are likely to come from a small set of optimisations.
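
As a rough illustration of pulling a type check out of a loop, here is a small Python sketch. It is not Exupery's code or its intermediate form; the function names are hypothetical, and Python's own dynamic dispatch still runs underneath, so this only shows the shape of the transformation, not its speed.

```python
def generic_mul(a, b):
    """Stand-in for the slow, fully dynamic path (a real send)."""
    return a * b


def scale_sum_checked(a, n):
    """Unoptimised shape: the type of `a` is tested on every iteration."""
    total = 0
    for i in range(n):
        if type(a) is int:          # loop-invariant check, repeated n times
            total += a * i          # fast path for the common case
        else:
            total += generic_mul(a, i)
    return total


def scale_sum_hoisted(a, n):
    """Optimised shape: one check guards a loop with no checks inside it."""
    if type(a) is int:              # checked once, before entering the loop
        total = 0
        for i in range(n):
            total += a * i
        return total
    return scale_sum_checked(a, n)  # uncommon case falls back to the general code


if __name__ == "__main__":
    print(scale_sum_checked(3, 5), scale_sum_hoisted(3, 5))
    print(scale_sum_checked(1.5, 5), scale_sum_hoisted(1.5, 5))
```

The check can only move because `a` is never reassigned inside the loop, so its type cannot change between iterations; the fallback keeps the uncommon case correct.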

I've also played around with a few thought experiments including
looking at what it would take to compile a dot product with zero
overhead inside the loops.

I'll ask again ;)  "What metrics lead you to believe you can double VW's performance?"


(1) I really should move temps and arguments into registers
too. Currently only stack values are held in registers. Moving
more variables into registers only made sense after spilling
directly into the context and reducing interference in the
register allocator.
