Eliot Miranda writes:
On Wed, Mar 4, 2009 at 2:47 PM, bryce@kampjes.demon.co.uk wrote:
Eliot Miranda writes:
On Sun, Mar 1, 2009 at 1:46 PM, bryce@kampjes.demon.co.uk wrote:
I've thought about using a context stack several times over the years.
The key benefits are faster returns and much faster sends.
Well, sends with few arguments could be optimised without a context stack: pass the arguments in registers, getting a context from the free list is quick, and the PIC can use an unconditional jump rather than a call.
Getting anything off a list is not quick compared with a call. Taking things off a list involves a write; calls only write a return address to the stack (and then only on CISCs), which is well optimized by current processors.
PICs don't eliminate calls. A send is a call whether through a PIC or not. One calls the PIC (if only to collect a return address) and the PIC jumps. There's still always a call.
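To make that concrete, here's a minimal C sketch of a send site dispatching through a PIC. The structure and names (ClassTag, PicEntry, picStub) are illustrative assumptions, not Exupery's or Cog's actual machine-code PICs; the point is only that the send site executes a call into the stub, and the stub then jumps (a tail call here) to the method.

    #include <stdio.h>

    /* Illustrative sketch only: real PICs are generated machine code,
       not C. ClassTag, PicEntry and picStub are invented names. */
    typedef int ClassTag;
    typedef long (*Method)(long receiver);

    struct PicEntry { ClassTag tag; Method method; };

    static long pointX(long r) { return r + 1; }  /* stand-in methods */
    static long rectX(long r)  { return r + 2; }

    static struct PicEntry picEntries[] = { {1, pointX}, {2, rectX} };

    /* The PIC stub: compare the receiver's class against each cached
       case and jump (here, a tail call) to the matching method. */
    static long picStub(ClassTag tag, long receiver) {
        for (int i = 0; i < 2; i++)
            if (picEntries[i].tag == tag)
                return picEntries[i].method(receiver);
        return 0;  /* cache miss: would fall back to a full lookup */
    }

    int main(void) {
        /* The send site: still a plain call, pushing a return address;
           the jump happens inside the stub. */
        printf("%ld\n", picStub(1, 10));  /* 11 */
        printf("%ld\n", picStub(2, 10));  /* 12 */
        return 0;
    }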
Do some measurements on modern processors and see if there are significant differences between call and unconditional jump. A call is an unconditional jump plus the pushing of a return address. When I last looked at this carefully I found there was no significant difference in performance unless one was being deeply recursive; i.e. attempting to use unconditional jumps to save performance, e.g. by inlining accessors, gained nothing because the processor was taking care to make calls and returns cost the same as unconditional jumps (this on x86 in the 2004 timeframe).
No argument that unconditional jumps are faster than calls; I'd guess they're the same, without having checked reference material or timings.
When I added adaptive inlining of primitives it doubled Exupery's bytecode performance on the bytecode benchmark by inlining #at: and #at:put: rather than accessing them via a PIC.
The costs of a send are:
  1) copying arguments
  2) finding the method to call
  3) getting memory for a new context
Copying arguments can be often avoided by passing the arguments in registers.
Finding the method to call is the same with or without a context stack.
Getting the memory for a new context will take a handful of instructions in both cases.
The worst case sequence to get an item from a linked list is:
  1 - load (freeList) temp1
  2 - compare temp1 0
  3 - jumpEqual listEmptyBlock
  4 - load (temp1 nextOffset) temp2
  5 - store temp2 (freeList)
For the stack case:
  1 - load (overflowBlock) temp1
  2 - add newSize stackPointer
  3 - compare temp1 stackPointer
  4 - jumpGreater stackOverflowBlock
OK, the stack case wins by one instruction. Both could be optimised by using registers: in the linked-list case instructions 1 and 4 can be removed; in the stack case, just one.
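Rendered as a hedged C sketch (names like freeList, stackPointer and overflowLimit are illustrative, not Exupery's actual ones), the two allocation paths look like this:

    #include <stdio.h>
    #include <stddef.h>

    struct Context { struct Context *next; /* ...rest of context... */ };

    static struct Context *freeList;

    /* Linked-list case, mirroring instructions 1-5 above. */
    static struct Context *allocFromList(void) {
        struct Context *c = freeList;       /* 1: load freeList */
        if (c == NULL)                      /* 2-3: compare, jumpEqual */
            return NULL;                    /* listEmptyBlock */
        freeList = c->next;                 /* 4-5: load next, store head */
        return c;
    }

    static char stack[4096];
    static char *stackPointer = stack;
    static char *overflowLimit = stack + sizeof stack;

    /* Stack case, mirroring instructions 1-4 above. */
    static char *allocFromStack(size_t newSize) {
        char *frame = stackPointer;
        stackPointer += newSize;            /* 2: add */
        if (stackPointer > overflowLimit)   /* 1, 3-4: load, compare, jump */
            return NULL;                    /* stackOverflowBlock */
        return frame;
    }

    int main(void) {
        struct Context c1 = { NULL };
        freeList = &c1;
        printf("list:  %p\n", (void *)allocFromList());
        printf("stack: %p\n", (void *)allocFromStack(64));
        return 0;
    }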
Writes are quick so long as the CPU's write bandwidth hasn't been exceeded. Write performance has improved recently with the write-back caches added for multi-cores.
Using a context cache will provide a large gain for returns. There's no way I can see to avoid checking that the returnee is good when executing off contexts; with a stack, the thunk can handle that along with underflow.
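As a hedged illustration of why the stack helps here (all names are invented for the sketch): a sentinel frame at the stack base holds the address of an underflow thunk, so the common-case return is a plain pop-and-jump with no validity check, and the rare underflow falls into the thunk.

    #include <stdio.h>

    /* Sketch only: underflowThunk and the explicit return-address
       array stand in for real stack frames and generated code. */
    static void underflowThunk(void) {
        puts("underflow thunk: e.g. refill the stack from contexts");
    }

    int main(void) {
        void (*returnAddress[8])(void);
        int sp = 0;
        returnAddress[sp++] = underflowThunk;  /* sentinel at the base */
        /* ...real return addresses get pushed and popped above it... */
        returnAddress[--sp]();  /* a return past the base lands in the thunk */
        return 0;
    }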
It's even possible to avoid adjusting the size of the stack frame in most cases by allocating a frame big enough for most contexts. Exupery only uses the stack frame for spilling internal temporary registers, so its use tends to be small. Stack variables spill directly into the context.
I don't understand the implication here. With a stack the frame size is essentially irrelevant. With a context it certainly is not. If all contexts are large then one eats through memory faster and pays the price.
It removes the need for the two instructions to adjust the C stack pointer.
Here's the current benchmarks:
  arithmaticLoopBenchmark  390 compiled  80 ratio: 4.875
  bytecodeBenchmark        724 compiled 250 ratio: 2.895
  sendBenchmark            663 compiled 385 ratio: 1.722
  doLoopsBenchmark         381 compiled 235 ratio: 1.621
  pointCreation            394 compiled 389 ratio: 1.013
  largeExplorers           269 compiled 210 ratio: 1.280
  compilerBenchmark        273 compiled 250 ratio: 1.092
  Cumulative Time      413.408 compiled 232.706 ratio: 1.777
  ExuperyBenchmarks>>arithmeticLoop                103ms
  SmallInteger>>benchmark                          341ms
  InstructionStream>>interpretExtension:in:for:   6069ms
  Average 597.360
largeExplorers and compilerBenchmark are both real code; they do vary depending on what the profiler decides to profile. The rest are micro benchmarks. The benchmark suite looks much worse since upgrading to a Core 2, which is very efficient at running the interpreter.
Which package version and repository are you taking the benchmarks from? I'd like to run them in Cog. Would you be interested in running the computer language shootout benchmarks I'm using?
I'd be interested in adding your benchmarks to my suite. Given that I've pretty much decided to implement adaptive method inlining next, I'm unlikely to do much tuning based on them.
My benchmarks are in the Exupery package on SqueakSource here:
http://www.squeaksource.com/Exupery.html
To run them use:
    ExuperyBenchmarks new run
I've also played around with a few thought experiments, including looking at what it would take to compile a dot product with zero overhead inside the loops.
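As a hedged sketch of the goal (not Exupery's actual output): once #at: is inlined and the tag and bounds checks are hoisted out of the loop, the inner loop should reduce to plain indexed loads, multiplies and adds, as in this C rendering.

    #include <stdio.h>

    /* The target shape for the dot-product thought experiment:
       no sends, no tag checks, no bounds checks inside the loop. */
    static double dotProduct(const double *a, const double *b, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    int main(void) {
        double a[] = {1, 2, 3}, b[] = {4, 5, 6};
        printf("%f\n", dotProduct(a, b, 3));  /* 32.000000 */
        return 0;
    }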
I'll ask again ;) "What metrics lead you to believe you can double VW's performance?"
The best detailed analysis is in the appendix here:
http://www.kampjes.demon.co.uk/articles/exuperyDesign.pdf
but it's a little dated now.
The following article is a bit more up to date and describes why Exupery's high-level design is the way it is.
http://www.kampjes.demon.co.uk/articles/exuperyDirection.pdf
Bryce