[Vm-dev] Cog and Exupery

Thu Mar 5 21:19:37 UTC 2009

Eliot Miranda writes:
 >  On Wed, Mar 4, 2009 at 2:47 PM, <bryce at kampjes.demon.co.uk> wrote:
 > 
 > >
 > > Eliot Miranda writes:
 > >  >  On Sun, Mar 1, 2009 at 1:46 PM, <bryce at kampjes.demon.co.uk> wrote:
 > >  > I've thought about using a context stack several times over the years.
 > >  > > The key benefits are faster returns
 > >  >
 > >  > and much faster sends
 > >
 > > Well sends with few arguments could be optimised without a context
 > > stack. Pass the arguments in registers, getting a context from the
 > > list is quick, and the PIC uses an unconditional jump rather than a
 > > call.
 > 
 > 
 > Getting anything off a list is not quick compared with a call.  taking
 > things off a list involves a write.  Calls only write a return address to
 > the stack (and then only on CISCs) which is well optimized by current
 > processors.
 > 
 > PICs don't eliminate calls.  A send is a call whether through a PIC or not.
 >  One calls the PIC (if only to collect a return address) and the PIC jumps.
 >  There's still always a call.
 > 
 > Do some measurements on modern processors and see if there are significant
 > differences between call and unconditional jump.  A call is an unconditional
 > jump plus the pushing of a return address.  When I last looked at this
 > carefully I found there was no significant difference in performance unless
 > one was being deeply recursive.  i.e. attempting to use unconditional jumps
 > to save performance, e.g. by inlining accessors, gained nothing because the
 > processor was taking care to make calls and returns of the same cost as
 > unconditional jumps (this on x86 in the 2004 timeframe).

No arguments that unconditional jumps are faster than calls, I'd
guess they're the same without looking at reference materal or
timing. 

When I added adaptive inlining of primitives it doubled Exupery's
bytecode performance on the bytecode benchmark by inlining #at:
and #at:put: rather than accessing them via a PIC. 

The cost of sends are:
 1) copying arguments
 2) finding the method to call
 3) getting memory for a new context

Copying arguments can be often avoided by passing the arguments
in registers.

Finding the method to call is the same with or without a context
stack.

Getting the memory for a new context will take a handful of
instructions in both cases. 

The worst case sequence to get an item from a linked list is:
  1 - load (freeList) temp1
  2 - compare temp1 0
  3 - jumpEqual listEmptyBlock
  4 - load (temp1 nextOffset) temp2
  5 - store temp2 (freeList)

For the stack case:
  1 - load (overflowBlock) temp1
  2 - add newSize stackPointer
  3 - compare temp1 stackPointer
  4 - jumpGreater stackOverflowBlock

OK, the stack case wins by one instruction. Both could be optimised
by using registers. In the linked list case instructions 1 and 4 can
be removed. In the stack case just 1 instruction.

Writes are quick so long as the CPUs write bandwidth hasn't been
exceeded. Write performance has improved recently with write back
caches added for multi-cores.

Using a context cache will make a large improvement gain for returns.
There's no way I can see to avoid the checks that the returnee is good
when executing off contexts. With a stack the thunk can handle that
along with underflow.

 > It's even possible to avoid adjusting the size of the stack frame in
 > > most cases by allocating a frame big enough for most contexts. Exupery
 > > only uses the stack frame for spilling internal temporary registers so
 > > it's use tends to be small. Stack variables spill directly into the
 > > context. (1)
 > 
 > 
 > I don't understand the implication here.  With a stack the frame size is
 > essentially irrelevant.  With a context it certainly is not.  If all
 > contexts are large then one eats through memory faster and pays the price.

It removes the need for the two instruction to adjust the C stack pointer.

 > > Here's the current benchmarks:
 > >
 > >  arithmaticLoopBenchmark 390 compiled 80 ratio: 4.875
 > >  bytecodeBenchmark 724 compiled 250 ratio: 2.895
 > >  sendBenchmark 663 compiled 385 ratio: 1.722
 > >  doLoopsBenchmark 381 compiled 235 ratio: 1.621
 > >  pointCreation 394 compiled 389 ratio: 1.013
 > >  largeExplorers 269 compiled 210 ratio: 1.280
 > >  compilerBenchmark 273 compiled 250 ratio: 1.092
 > >  Cumulative Time 413.408 compiled 232.706 ratio 1.777
 > >
 > >  ExuperyBenchmarks>>arithmeticLoop 103ms
 > >  SmallInteger>>benchmark 341ms
 > >  InstructionStream>>interpretExtension:in:for: 6069ms
 > >  Average 597.360
 > >
 > > largeExplorers and compilerBenchmark are both real code. They do vary
 > > depending on what the profiler decides to profile. The rest are micro
 > > benchmarks. The benchmark suite looks much worse since upgrading to
 > > a Core 2 which is very efficient at running the interpreter.
 > 
 > 
 > Which package version and repository are you taking the benchmarks from?
 >  I'd like to run them in Cog.  Would you be interested in running the
 > compiler language shootout benchmarks I'm using?

I'd be interested in adding your benchmarks to my suite. Given I've
fairly much decided to implement adaptive method inlining next, I'm
unlikely to do much tuning based on them.

My benchmarks are in the Exupery package on SqueakSource here:

  http://www.squeaksource.com/Exupery.html

To run them use:

  ExuperyBenchmarks new run

 > > I've also played around with a few thought experiments including
 > > looking at what it would take to compile a dot product with zero
 > > overhead inside the loops.
 > 
 > 
 > I'll ask again ;)  "What metrics lead you to believe you can double VW's
 > performance?"

The best detailed analysis is in the appendix here:

   http://www.kampjes.demon.co.uk/articles/exuperyDesign.pdf

but it's a little dated now.

The following article is a bit more up to date and describes why
Exupery's high level design is the way it is.

    http://www.kampjes.demon.co.uk/articles/exuperyDirection.pdf

Bryce