[Vm-dev] Cog and Exupery

Thu Mar 5 00:08:46 UTC 2009

On Wed, Mar 4, 2009 at 2:47 PM, <bryce at kampjes.demon.co.uk> wrote:

>
> Eliot Miranda writes:
>  >  On Sun, Mar 1, 2009 at 1:46 PM, <bryce at kampjes.demon.co.uk> wrote:
>  > > I've thought about using a context stack several times over the years.
>  > > The key benefits are faster returns
>  >
>  > and much faster sends
>
> Well sends with few arguments could be optimised without a context
> stack. Pass the arguments in registers, getting a context from the
> list is quick, and the PIC uses an unconditional jump rather than a
> call.

Getting anything off a list is not quick compared with a call.  Taking
things off a list involves a write.  Calls only write a return address to
the stack (and then only on CISCs), which is well optimized by current
processors.
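
To make the write-counting argument above concrete, here is a minimal toy model in Python.  It is only a sketch: the class names (Context, ContextPool, CallStack) and the `writes` counters are hypothetical, not taken from Cog or Exupery, and it counts stores in an idealised model rather than measuring real hardware.

```python
class Context:
    """Hypothetical heap context with a free-list link and a sender link."""
    __slots__ = ("next_free", "sender", "slots")

    def __init__(self, size):
        self.next_free = None
        self.sender = None
        self.slots = [None] * size


class ContextPool:
    """Free list of contexts; `writes` counts modelled memory stores."""

    def __init__(self, count, size):
        self.head = None
        self.writes = 0
        for _ in range(count):
            ctx = Context(size)
            ctx.next_free = self.head
            self.head = ctx

    def allocate(self, sender):
        ctx = self.head
        if ctx is None:
            raise MemoryError("context pool exhausted")
        self.head = ctx.next_free   # store: unlink from the free list
        self.writes += 1
        ctx.sender = sender         # store: link the new context to its caller
        self.writes += 1
        return ctx

    def release(self, ctx):
        ctx.next_free = self.head   # store: relink the context
        self.writes += 1
        self.head = ctx             # store: update the list head
        self.writes += 1


class CallStack:
    """A call in this model stores only the return address."""

    def __init__(self):
        self.frames = []
        self.writes = 0

    def call(self, return_address):
        self.frames.append(return_address)  # store: push return address
        self.writes += 1

    def ret(self):
        return self.frames.pop()            # a return only reads
```

In this model an allocate/release pair costs four stores against one for a call/return pair.  The exact counts depend on the assumptions baked into the sketch; the point is only that list manipulation writes memory on both ends.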

PICs don't eliminate calls.  A send is a call whether through a PIC or not.
One calls the PIC (if only to collect a return address) and the PIC jumps.
There's still always a call.
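
For readers unfamiliar with PICs, the dispatch structure can be sketched in Python as below.  This is an illustrative toy, not Cog's or Exupery's implementation: the names (PIC, send, Square, Circle) and the four-entry limit are assumptions, and Python cannot express the call-versus-jump distinction, so the comments mark where native code would jump.

```python
class PIC:
    """Toy polymorphic inline cache for a single send site."""

    def __init__(self, selector, max_entries=4):
        self.selector = selector
        self.max_entries = max_entries
        self.entries = []   # (class, method) pairs, tested in order
        self.misses = 0

    def send(self, receiver, *args):
        # The send site always *calls* into the PIC.
        cls = type(receiver)
        for cached_cls, method in self.entries:
            if cached_cls is cls:
                # Hit: native code would jump to the method here, so the
                # method returns directly to the original send site.
                return method(receiver, *args)
        # Miss: do the full lookup and extend the cache if there is room.
        self.misses += 1
        method = getattr(cls, self.selector)
        if len(self.entries) < self.max_entries:
            self.entries.append((cls, method))
        return method(receiver, *args)


class Square:
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Circle:
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3 * self.radius * self.radius  # crude integer approximation


site = PIC("area")
results = [site.send(shape) for shape in (Square(2), Square(3), Circle(1))]
```

Only the first send for each receiver class takes the miss path; later sends for the same class are dispatched from the cache.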

Do some measurements on modern processors and see if there are significant
differences between call and unconditional jump.  A call is an unconditional
jump plus the pushing of a return address.  When I last looked at this
carefully I found there was no significant difference in performance unless
one was being deeply recursive.  That is, attempting to use unconditional jumps
to gain performance, e.g. by inlining accessors, gained nothing because the
processor was taking care to make calls and returns cost the same as
unconditional jumps (this was on x86 in the 2004 timeframe).

> It's even possible to avoid adjusting the size of the stack frame in
> most cases by allocating a frame big enough for most contexts. Exupery
> only uses the stack frame for spilling internal temporary registers so
> its use tends to be small. Stack variables spill directly into the
> context. (1)

I don't understand the implication here.  With a stack the frame size is
essentially irrelevant.  With a context it certainly is not.  If all
contexts are large then one eats through memory faster and pays the price.

>  > >  > Aim higher :)  I'm hoping for 10x current Squeak performance for
>  > >  > Smalltalk-intensive benchmarks some time later this year.
>  > >
>  > > My original and current aim is double VW's performance or be roughly
>  > > equivalent to C.
>  >
>  >
>  > And what's the state of this effort?  What metrics lead you to believe you
>  > can double VW's performance?  Where are you currently?  What do you mean by
>  > "equivalent to C", unoptimized, -O1, -O4 -funroll-loops, gcc, Intel C
>  > compiler?  What benchmarks have you focussed on?
>
> Exupery is fairly stable and providing modest gains but needs more
> tuning before it's performing as it should. Effort could now go into
> either tuning the current engine so it performs as it should or adding
> full method inlining. Until recently reliability was the main
> show-stopper to actual use. There are still plenty of areas to optimise
> to improve both compile time and run time.
>
> Here are the current benchmarks:
>
>  arithmaticLoopBenchmark 390 compiled 80 ratio: 4.875
>  bytecodeBenchmark 724 compiled 250 ratio: 2.895
>  sendBenchmark 663 compiled 385 ratio: 1.722
>  doLoopsBenchmark 381 compiled 235 ratio: 1.621
>  pointCreation 394 compiled 389 ratio: 1.013
>  largeExplorers 269 compiled 210 ratio: 1.280
>  compilerBenchmark 273 compiled 250 ratio: 1.092
>  Cumulative Time 413.408 compiled 232.706 ratio 1.777
>
>  ExuperyBenchmarks>>arithmeticLoop 103ms
>  SmallInteger>>benchmark 341ms
>  InstructionStream>>interpretExtension:in:for: 6069ms
>  Average 597.360
>
> largeExplorers and compilerBenchmark are both real code. They do vary
> depending on what the profiler decides to profile. The rest are micro
> benchmarks. The benchmark suite looks much worse since upgrading to
> a Core 2 which is very efficient at running the interpreter.
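
A note on reading the quoted tables: each ratio is the first time divided by the compiled time (390 / 80 = 4.875), and the "Cumulative Time" and "Average" lines turn out to match the geometric mean of the rows above them.  That is an inference from the numbers, not something the post states, and reading the first column as interpreted time is likewise an assumption.  A short Python check:

```python
import math

# (name, first column, compiled column) copied from the quoted table.
# Treating the first column as interpreted time is an assumption.
ROWS = [
    ("arithmaticLoopBenchmark", 390, 80),
    ("bytecodeBenchmark", 724, 250),
    ("sendBenchmark", 663, 385),
    ("doLoopsBenchmark", 381, 235),
    ("pointCreation", 394, 389),
    ("largeExplorers", 269, 210),
    ("compilerBenchmark", 273, 250),
]

# Times from the second quoted list, in milliseconds.
PROFILE_TIMES_MS = [103, 341, 6069]


def ratio(interpreted, compiled):
    """Speed-up as reported in the table's ratio column."""
    return interpreted / compiled


def geometric_mean(values):
    """Geometric mean via the mean of logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


interpreted_mean = geometric_mean(row[1] for row in ROWS)
compiled_mean = geometric_mean(row[2] for row in ROWS)
profile_mean = geometric_mean(PROFILE_TIMES_MS)

if __name__ == "__main__":
    for name, interpreted, compiled in ROWS:
        print(f"{name:24s} {ratio(interpreted, compiled):.3f}")
    print(f"{interpreted_mean:.3f} {compiled_mean:.3f} {profile_mean:.3f}")
```

Some recomputed ratios differ from the printed ones in the last digit, presumably because the printed times are rounded.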

Which package version and repository are you taking the benchmarks from?
I'd like to run them in Cog.  Would you be interested in running the
computer language shootout benchmarks I'm using?

> Optimisation so far has been about removing gross inefficiencies. I
> avoid any optimisation that's likely to make other cases worse. e.g.
> the last two optimisations were calling all primitives from native
> code rather than just the few that Exupery compiles and adding
> indirect literal addressing used to access VM variables.
>
> What do I mean by equivalent to C? Basically being able to fully
> optimise predictable code and optimise unpredictable code enough that
> the remaining inefficiencies are hidden behind L2/L3 cache misses and
> branch mispredicts. C was executing around 1 instruction per clock,
> measured from system code; that leaves some room to hide inefficiencies.
>
> Why do I believe it's possible? With dynamic type information we can
> optimise for the common case, and after inlining loops we can often
> pull the type checks out of the loops. Then it's just a case of
> competing optimiser to optimiser but most of the gains for most code
> are likely to come from a small set of optimisations.
>
> I've also played around with a few thought experiments including
> looking at what it would take to compile a dot product with zero
> overhead inside the loops.
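
The idea of pulling type checks out of a loop, applied to the dot product mentioned above, can be sketched in Python as follows.  This is only an illustration of the general technique, not Exupery's generated code: the function names are hypothetical, and Python's array('d') stands in for a container whose element type is known from a single check.

```python
from array import array


def dot_checked(xs, ys):
    """Dot product with a type check on every element inside the loop."""
    if len(xs) != len(ys):
        raise ValueError("length mismatch")
    total = 0.0
    for x, y in zip(xs, ys):
        if not (isinstance(x, float) and isinstance(y, float)):
            raise TypeError("expected floats")
        total += x * y
    return total


def dot_hoisted(xs, ys):
    """Same result, with the type check done once before the loop.

    array('d') guarantees every element is a double, so one check on the
    containers replaces the per-element checks.  Anything else falls back
    to the checked version.
    """
    if len(xs) != len(ys):
        raise ValueError("length mismatch")
    both_double = (
        isinstance(xs, array) and isinstance(ys, array)
        and xs.typecode == "d" and ys.typecode == "d"
    )
    if not both_double:
        return dot_checked(xs, ys)  # slow path for the uncommon case
    total = 0.0
    for i in range(len(xs)):
        total += xs[i] * ys[i]      # no type checks inside the loop
    return total


a = array("d", [1.0, 2.0, 3.0])
b = array("d", [4.0, 5.0, 6.0])
fast_result = dot_hoisted(a, b)
slow_result = dot_hoisted([1.0, 2.0], [3.0, 4.0])
```

The fast path is only valid because the guard before the loop establishes the element type; when the guard fails, the code takes the general path rather than assuming anything.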

I'll ask again ;)  "What metrics lead you to believe you can double VW's
performance?"

> Bryce
>
> (1) I really should move temps and arguments into registers
> too. Currently only stack values are held in registers. Moving
> more variables into registers only made sense after spilling
> directly into the context and reducing interference in the
> register allocator.
>

Cheers
Eliot