Eliot Miranda writes:
On Wed, Mar 4, 2009 at 2:47 PM, bryce@kampjes.demon.co.uk wrote:
Eliot Miranda writes:
On Sun, Mar 1, 2009 at 1:46 PM, bryce@kampjes.demon.co.uk wrote:
I've thought about using a context stack several times over the years.
The key benefits are faster returns and much faster sends.
Well, sends with few arguments could be optimised without a context stack: pass the arguments in registers, getting a context from the free list is quick, and the PIC can use an unconditional jump rather than a call.
Getting anything off a list is not quick compared with a call. Taking things off a list involves a write; calls only write a return address to the stack (and then only on CISCs), which is well optimized by current processors.
PICs don't eliminate calls. A send is a call whether through a PIC or not. One calls the PIC (if only to collect a return address) and the PIC jumps. There's still always a call.
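To make that concrete, here's a minimal C sketch of a send site dispatching through a PIC. The structure and names (ClassTag, PicEntry, picStub) are illustrative assumptions, not Exupery's or Cog's actual machine-code PICs; the point is only that the send site executes a call into the stub, and the stub then jumps (a tail call here) to the method.

    #include <stdio.h>

    /* Illustrative sketch only: real PICs are generated machine code,
       not C. ClassTag, PicEntry and picStub are invented names. */
    typedef int ClassTag;
    typedef long (*Method)(long receiver);

    struct PicEntry { ClassTag tag; Method method; };

    static long pointX(long r) { return r + 1; }  /* stand-in methods */
    static long rectX(long r)  { return r + 2; }

    static struct PicEntry picEntries[] = { {1, pointX}, {2, rectX} };

    /* The PIC stub: compare the receiver's class against each cached
       case and jump (here, a tail call) to the matching method. */
    static long picStub(ClassTag tag, long receiver) {
        for (int i = 0; i < 2; i++)
            if (picEntries[i].tag == tag)
                return picEntries[i].method(receiver);
        return 0;  /* cache miss: would fall back to a full lookup */
    }

    int main(void) {
        /* The send site: still a plain call, pushing a return address;
           the jump happens inside the stub. */
        printf("%ld\n", picStub(1, 10));  /* 11 */
        printf("%ld\n", picStub(2, 10));  /* 12 */
        return 0;
    }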
Do some measurements on modern processors and see if there are significant differences between call and unconditional jump. A call is an unconditional jump plus the pushing of a return address. When I last looked at this carefully I found there was no significant difference in performance unless one was being deeply recursive; i.e. attempting to use unconditional jumps to save performance, e.g. by inlining accessors, gained nothing because the processor was taking care to make calls and returns cost the same as unconditional jumps (this on x86 in the 2004 timeframe).
No argument that unconditional jumps are faster than calls; I'd guess they're the same, without having checked reference material or timings.
When I added adaptive inlining of primitives it doubled Exupery's bytecode performance on the bytecode benchmark by inlining #at: and #at:put: rather than accessing them via a PIC.
The costs of a send are:
  1) copying arguments
  2) finding the method to call
  3) getting memory for a new context
Copying arguments can be often avoided by passing the arguments in registers.
Finding the method to call is the same with or without a context stack.
Getting the memory for a new context will take a handful of instructions in both cases.
The worst case sequence to get an item from a linked list is:
  1 - load (freeList) temp1
  2 - compare temp1 0
  3 - jumpEqual listEmptyBlock
  4 - load (temp1 nextOffset) temp2
  5 - store temp2 (freeList)
For the stack case:
  1 - load (overflowBlock) temp1
  2 - add newSize stackPointer
  3 - compare temp1 stackPointer
  4 - jumpGreater stackOverflowBlock
OK, the stack case wins by one instruction. Both could be optimised by using registers: in the linked-list case instructions 1 and 4 can be removed; in the stack case, just one.
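Rendered as a hedged C sketch (names like freeList, stackPointer and overflowLimit are illustrative, not Exupery's actual ones), the two allocation paths look like this:

    #include <stdio.h>
    #include <stddef.h>

    struct Context { struct Context *next; /* ...rest of context... */ };

    static struct Context *freeList;

    /* Linked-list case, mirroring instructions 1-5 above. */
    static struct Context *allocFromList(void) {
        struct Context *c = freeList;       /* 1: load freeList */
        if (c == NULL)                      /* 2-3: compare, jumpEqual */
            return NULL;                    /* listEmptyBlock */
        freeList = c->next;                 /* 4-5: load next, store head */
        return c;
    }

    static char stack[4096];
    static char *stackPointer = stack;
    static char *overflowLimit = stack + sizeof stack;

    /* Stack case, mirroring instructions 1-4 above. */
    static char *allocFromStack(size_t newSize) {
        char *frame = stackPointer;
        stackPointer += newSize;            /* 2: add */
        if (stackPointer > overflowLimit)   /* 1, 3-4: load, compare, jump */
            return NULL;                    /* stackOverflowBlock */
        return frame;
    }

    int main(void) {
        struct Context c1 = { NULL };
        freeList = &c1;
        printf("list:  %p\n", (void *)allocFromList());
        printf("stack: %p\n", (void *)allocFromStack(64));
        return 0;
    }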
Writes are quick so long as the CPU's write bandwidth hasn't been exceeded. Write performance has improved recently with the write-back caches added for multi-cores.
Using a context cache will provide a large gain for returns. There's no way I can see to avoid checking that the returnee is good when executing off contexts; with a stack, the thunk can handle that along with underflow.
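As a hedged illustration of why the stack helps here (all names are invented for the sketch): a sentinel frame at the stack base holds the address of an underflow thunk, so the common-case return is a plain pop-and-jump with no validity check, and the rare underflow falls into the thunk.

    #include <stdio.h>

    /* Sketch only: underflowThunk and the explicit return-address
       array stand in for real stack frames and generated code. */
    static void underflowThunk(void) {
        puts("underflow thunk: e.g. refill the stack from contexts");
    }

    int main(void) {
        void (*returnAddress[8])(void);
        int sp = 0;
        returnAddress[sp++] = underflowThunk;  /* sentinel at the base */
        /* ...real return addresses get pushed and popped above it... */
        returnAddress[--sp]();  /* a return past the base lands in the thunk */
        return 0;
    }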
It's even possible to avoid adjusting the size of the stack frame in most cases by allocating a frame big enough for most contexts. Exupery only uses the stack frame for spilling internal temporary registers, so its use tends to be small. Stack variables spill directly into the context.
I don't understand the implication here. With a stack the frame size is essentially irrelevant. With a context it certainly is not. If all contexts are large then one eats through memory faster and pays the price.
It removes the need for the two instructions to adjust the C stack pointer.
Here's the current benchmarks:
  arithmaticLoopBenchmark  390 compiled  80 ratio: 4.875
  bytecodeBenchmark        724 compiled 250 ratio: 2.895
  sendBenchmark            663 compiled 385 ratio: 1.722
  doLoopsBenchmark         381 compiled 235 ratio: 1.621
  pointCreation            394 compiled 389 ratio: 1.013
  largeExplorers           269 compiled 210 ratio: 1.280
  compilerBenchmark        273 compiled 250 ratio: 1.092
  Cumulative Time      413.408 compiled 232.706 ratio: 1.777
  ExuperyBenchmarks>>arithmeticLoop                103ms
  SmallInteger>>benchmark                          341ms
  InstructionStream>>interpretExtension:in:for:   6069ms
  Average 597.360
largeExplorers and compilerBenchmark are both real code; they do vary depending on what the profiler decides to profile. The rest are micro benchmarks. The benchmark suite looks much worse since upgrading to a Core 2, which is very efficient at running the interpreter.
Which package version and repository are you taking the benchmarks from? I'd like to run them in Cog. Would you be interested in running the computer language shootout benchmarks I'm using?
I'd be interested in adding your benchmarks to my suite. Given that I've pretty much decided to implement adaptive method inlining next, I'm unlikely to do much tuning based on them.
My benchmarks are in the Exupery package on SqueakSource here:
http://www.squeaksource.com/Exupery.html
To run them use:
    ExuperyBenchmarks new run
I've also played around with a few thought experiments, including looking at what it would take to compile a dot product with zero overhead inside the loops.
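As a hedged sketch of the goal (not Exupery's actual output): once #at: is inlined and the tag and bounds checks are hoisted out of the loop, the inner loop should reduce to plain indexed loads, multiplies and adds, as in this C rendering.

    #include <stdio.h>

    /* The target shape for the dot-product thought experiment:
       no sends, no tag checks, no bounds checks inside the loop. */
    static double dotProduct(const double *a, const double *b, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    int main(void) {
        double a[] = {1, 2, 3}, b[] = {4, 5, 6};
        printf("%f\n", dotProduct(a, b, 3));  /* 32.000000 */
        return 0;
    }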
I'll ask again ;) "What metrics lead you to believe you can double VW's performance?"
The best detailed analysis is in the appendix here:
http://www.kampjes.demon.co.uk/articles/exuperyDesign.pdf
but it's a little dated now.
The following article is a bit more up to date and describes why Exupery's high-level design is the way it is.
http://www.kampjes.demon.co.uk/articles/exuperyDirection.pdf
Bryce