[Vm-dev] Cog and Exupery

Eliot Miranda eliot.miranda at gmail.com
Thu Mar 5 01:10:32 UTC 2009


On Wed, Mar 4, 2009 at 4:29 PM, Igor Stasenko <siguctua at gmail.com> wrote:

>
> 2009/3/5 Eliot Miranda <eliot.miranda at gmail.com>:
> >
> >
> >
> > On Wed, Mar 4, 2009 at 2:47 PM, <bryce at kampjes.demon.co.uk> wrote:
> >>
> >> Eliot Miranda writes:
> >>  >  On Sun, Mar 1, 2009 at 1:46 PM, <bryce at kampjes.demon.co.uk> wrote:
> >>  > > I've thought about using a context stack several times over
> >>  > > the years.
> >>  > > The key benefits are faster returns
> >>  >
> >>  > and much faster sends
> >>
> >> Well sends with few arguments could be optimised without a context
> >> stack. Pass the arguments in registers, getting a context from the
> >> list is quick, and the PIC uses an unconditional jump rather than a
> >> call.
> >
> > Getting anything off a list is not quick compared with a call.  Taking
> > things off a list involves a write.  Calls only write a return address
> > to the stack (and then only on CISCs), which is well optimized by
> > current processors.
> > PICs don't eliminate calls.  A send is a call whether through a PIC or
> > not.  One calls the PIC (if only to collect a return address) and the
> > PIC jumps.  There's still always a call.
> > Do some measurements on modern processors and see if there are
> > significant differences between call and unconditional jump.  A call is
> > an unconditional jump plus the pushing of a return address.  When I last
> > looked at this carefully I found there was no significant difference in
> > performance unless one was being deeply recursive; i.e. attempting to
> > use unconditional jumps to save performance, e.g. by inlining accessors,
> > gained nothing because the processor was taking care to make calls and
> > returns of the same cost as unconditional jumps (this on x86 in the
> > 2004 timeframe).
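
The write-count argument above can be made concrete with a toy model. This
is only a sketch under stated assumptions, not how either VM is built: the
class names are hypothetical, the free-list head is assumed to live in
memory, the stack pointer is assumed to live in a register, and initialising
the context or frame fields is ignored on both sides.

```python
class FreeListContexts:
    """Toy model: contexts kept on a singly linked free list (hypothetical)."""

    def __init__(self, count):
        # next_free[i] is the index of the free context after context i.
        self.next_free = list(range(1, count)) + [None]
        self.head = 0
        self.writes = 0  # memory writes spent on list bookkeeping

    def allocate(self):
        ctx = self.head
        if ctx is None:
            raise MemoryError("no free contexts")
        self.head = self.next_free[ctx]  # write: update the list head
        self.writes += 1
        return ctx

    def release(self, ctx):
        self.next_free[ctx] = self.head  # write: relink the context
        self.head = ctx                  # write: update the list head
        self.writes += 2


class CallStack:
    """Toy model: a call pushes one return address, a return only reads."""

    def __init__(self):
        self.slots = []
        self.writes = 0

    def call(self, return_address):
        self.slots.append(return_address)  # write: the return address
        self.writes += 1

    def ret(self):
        return self.slots.pop()  # read only; the stack pointer is a register


def simulate(sends):
    """Run `sends` send/return pairs through both models and count writes."""
    contexts, stack = FreeListContexts(4), CallStack()
    for i in range(sends):
        ctx = contexts.allocate()
        stack.call(i)
        stack.ret()
        contexts.release(ctx)
    return contexts.writes, stack.writes
```

Counting writes says nothing about how well a given processor hides them,
so a model like this only frames the question; the measurements suggested
above are what would settle it.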
> >
> >> It's even possible to avoid adjusting the size of the stack frame in
> >> most cases by allocating a frame big enough for most contexts. Exupery
> >> only uses the stack frame for spilling internal temporary registers so
> >> its use tends to be small. Stack variables spill directly into the
> >> context. (1)
> >
> > I don't understand the implication here.  With a stack the frame size
> > is essentially irrelevant.  With a context it certainly is not.  If all
> > contexts are large then one eats through memory faster and pays the
> > price.
> >
> >>  > >  > Aim higher :)  I'm hoping for 10x current Squeak performance for
> >>  > >  > Smalltalk-intensive benchmarks some time later this year.
> >>  > >
> >>  > > My original and current aim is to double VW's performance or be
> >>  > > roughly equivalent to C.
> >>  >
> >>  >
> >>  > And what's the state of this effort?  What metrics lead you to
> >>  > believe you can double VW's performance?  Where are you currently?
> >>  > What do you mean by "equivalent to C", unoptimized, -O1, -O4
> >>  > -funroll-loops, gcc, Intel C compiler?  What benchmarks have you
> >>  > focussed on?
> >>
> >> Exupery is fairly stable and providing modest gains but needs more
> >> tuning before it's performing as it should. Effort could now go into
> >> either tuning the current engine so it performs as it should or adding
> >> full method inlining. Until recently reliability was the main
> >> show-stopper to actual use. There are still plenty of areas to optimise
> >> to improve both compile time and run time.
> >>
> >> Here's the current benchmarks:
> >>
> >>  arithmaticLoopBenchmark 390 compiled 80 ratio: 4.875
> >>  bytecodeBenchmark 724 compiled 250 ratio: 2.895
> >>  sendBenchmark 663 compiled 385 ratio: 1.722
> >>  doLoopsBenchmark 381 compiled 235 ratio: 1.621
> >>  pointCreation 394 compiled 389 ratio: 1.013
> >>  largeExplorers 269 compiled 210 ratio: 1.280
> >>  compilerBenchmark 273 compiled 250 ratio: 1.092
> >>  Cumulative Time 413.408 compiled 232.706 ratio 1.777
> >>
> >>  ExuperyBenchmarks>>arithmeticLoop 103ms
> >>  SmallInteger>>benchmark 341ms
> >>  InstructionStream>>interpretExtension:in:for: 6069ms
> >>  Average 597.360
> >>
> >> largeExplorers and compilerBenchmark are both real code. They do vary
> >> depending on what the profiler decides to profile. The rest are micro
> >> benchmarks. The benchmark suite looks much worse since upgrading to
> >> a Core 2 which is very efficient at running the interpreter.
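
For anyone wanting to check the ratio column, here is a minimal sketch. It
assumes the first number on each line of the table above is the interpreted
time and the second the Exupery-compiled time; the units are not stated in
the post, and the figures appear to be rounded, so a recomputed ratio can
differ from the printed one in the last digit. The Cumulative Time line is
left out because the post does not say how it is derived.

```python
# Timings copied from the table in the post.
# Assumption: (interpreted, compiled), in whatever unit the suite reports.
timings = {
    "arithmaticLoopBenchmark": (390, 80),
    "bytecodeBenchmark": (724, 250),
    "sendBenchmark": (663, 385),
    "doLoopsBenchmark": (381, 235),
    "pointCreation": (394, 389),
    "largeExplorers": (269, 210),
    "compilerBenchmark": (273, 250),
}


def speedup(interpreted, compiled):
    """Ratio of interpreted time to compiled time; above 1 means faster."""
    return interpreted / compiled


ratios = {name: speedup(i, c) for name, (i, c) in timings.items()}

for name, ratio in ratios.items():
    print(f"{name:26s} {ratio:.3f}")
```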
> >
> > Which package version and repository are you taking the benchmarks from?
> > I'd like to run them in Cog.  Would you be interested in running the
> > compiler language shootout benchmarks I'm using?
> >
> >> Optimisation so far has been about removing gross inefficiencies. I
> >> avoid any optimisation that's likely to make other cases worse. e.g.
> >> the last two optimisations were calling all primitives from native
> >> code rather than just the few that Exupery compiles and adding
> >> indirect literal addressing used to access VM variables.
> >>
> >> What do I mean by equivalent to C? Basically being able to fully
> >> optimise predictable code and optimise unpredictable code enough that
> >> the remaining inefficiencies are hidden behind L2/L3 cache misses and
> >> branch mispredicts. C was executing around 1 instruction per clock
> >> measured from system code, that leaves some room to hide inefficiencies.
> >>
> >> Why do I believe it's possible? With dynamic type information we can
> >> optimise for the common case, and after inlining loops we can often
> >> pull the type checks out of the loops. Then it's just a case of
> >> competing optimiser to optimiser but most of the gains for most code
> >> are likely to come from a small set of optimisations.
> >>
> >> I've also played around with a few thought experiments including
> >> looking at what it would take to compile a dot product with zero
> >> overhead inside the loops.
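
The idea of pulling type checks out of a loop can be sketched in a few
lines. This is not Exupery's code and Python is not the target language; it
only illustrates the transformation. The function names are hypothetical,
and the standard-library array type stands in for "a receiver whose element
type is known from one check".

```python
from array import array


def dot_checked_per_element(xs, ys):
    """Baseline: re-check the operand types on every iteration."""
    total = 0.0
    for x, y in zip(xs, ys):
        if type(x) is not float or type(y) is not float:
            raise TypeError("non-float element")
        total += x * y
    return total


def dot_hoisted(xs, ys):
    """Check the common case once, outside the loop, then loop check-free.

    An array with typecode "d" can only hold C doubles, so one test on the
    container replaces the per-element tests.  Anything else falls back to
    the fully checked version.
    """
    common_case = (
        isinstance(xs, array) and isinstance(ys, array)
        and xs.typecode == "d" and ys.typecode == "d"
        and len(xs) == len(ys)
    )
    if not common_case:
        return dot_checked_per_element(xs, ys)
    total = 0.0
    for i in range(len(xs)):
        total += xs[i] * ys[i]  # no type test inside the loop
    return total
```

For example, dot_hoisted(array("d", [1, 2, 3]), array("d", [4, 5, 6]))
takes the check-free path, while two plain lists of floats take the
fallback.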
> >
> > I'll ask again ;)  "What metrics lead you to believe you can double VW's
> > performance?"
>
> Eliot, even if you see it as unrealistic, I think this is a good aim to
> try to achieve :)


I don't think it's unrealistic at all.  It's my aim too!  There are other
Smalltalk VMs out there that are faster than VW.  I just think Bryce is
making some strange decisions (not eliminating contexts early on) and I want
to try and understand why he is going a different way from me.  I might be
very wrong in my approach.  On the other hand, Bryce might be going the
wrong way, and my questions might help.

In any case, doubling VW's performance is a great idea.  But what's the
claim based on?  How is Bryce going to beat VW, and on which benchmarks?  If
it's not based on any analysis it's an empty claim.  If it's based on real
analysis then cool, and I'd like to understand the analysis.


> >>
> >> Bryce
> >>
> >> (1) I really should move temps and arguments into registers
> >> too. Currently only stack values are held in registers. Moving
> >> more variables into registers only made sense after spilling
> >> directly into the context and reducing interference in the
> >> register allocator.
> >
> > Cheers
> > Eliot
> >
>
>
>
> --
> Best regards,
> Igor Stasenko AKA sig.
>