<br><br><div class="gmail_quote">On Sun, Feb 22, 2009 at 12:54 PM, <span dir="ltr"><<a href="mailto:bryce@kampjes.demon.co.uk">bryce@kampjes.demon.co.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="Ih2E3d"><br>
Eliot Miranda writes:<br>
> On Sun, Feb 22, 2009 at 10:37 AM, <<a href="mailto:bryce@kampjes.demon.co.uk">bryce@kampjes.demon.co.uk</a>> wrote:<br>
><br>
> ><br>
> > Eliot Miranda writes:<br>
> > ><br>
> > > But what I really think is that this is too low a level to worry about.<br>
> > > Much more important to focus on<br>
> > > - context to stack mapping<br>
> > > - in-line caching via a JIT<br>
> > > - exploiting multicore via Hydra<br>
> > > and beyond (e.g. speculative inlining)<br>
> > > than worrying about tiny micro-optimizations like this :)<br>
> ><br>
> > If you're planning on adding speculative, I assume Self style dynamic,<br>
> > inlining won't that reduce the value of context to stack mapping?<br>
><br>
><br>
> Not at all; in fact quite the reverse. Context to stack mapping allows one<br>
> to retain contexts while having the VM execute efficient, stack-based code<br>
> (i.e. using hardware call instructions). This in turn enables the entire<br>
> adaptive optimizer, including the stack analyser and the<br>
> bytecode-to-bytecode compiler/method inliner to be written in Smalltalk.<br>
> The image level code can examine the run-time stack using contexts as their<br>
> interface without having to understand native stack formats or different<br>
> ISAs. The optimizer is therefore completely portable with all machine<br>
> specificities confined to the underlying VM which is much simpler by virtue<br>
> of not containing a sophisticated optimizer (which one would have to squeeze<br>
> through Slang etc).<br>
<br>
</div>All you need is the optimiser to run early in compilation for it to be<br>
portable.</blockquote><div><br></div><div>...and for it to be untimely. An adaptive optimizer by definition needs to be running intermittently all the time. It optimizes what is happening now, not what happened at start-up.</div>
<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">And we definitely agree on trying to keep complex logic out of the<br>
VM. Sounds like you're thinking of AoSTa.<br>
<div class="Ih2E3d"></div></blockquote><div><br></div><div>yes (AOStA).</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="Ih2E3d"> > So for me, context-to-stack mapping is fundamental to implementing<br>
> speculative inlining in Smalltalk.<br>
><br>
><br>
> My view with Exupery is context caches should be left until after<br>
> > dynamic inlining as their value will depend on how well dynamic<br>
> > inlining reduces the number of sends.<br>
> ><br>
><br>
> I know and I disagree. Dynamic inlining depends on collecting good type<br>
> information, something that inline caches do well. In-line caches are<br>
> efficiently implemented with native call instructions, either to method<br>
> entry-points or PIC jump tables. Native call instructions mesh well with<br>
> stacks. So context-to-stack mapping, for me, is a sensible enabling<br>
> optimization for speculative inlining because it meshes well with inline<br>
> caches.<br>
<br>
</div>PICs are a separate issue. Exupery has PICs, and has had them for<br>
years now. PICs are just as easily implemented as jumps.<br>
<div class="Ih2E3d"></div></blockquote><div><br></div><div>Yes, PICs are jump tables. But, at least in my implementation and in others I know of, they get called. Tey are composed of a jump table that then jumps into methods at a point past any entry-point dynamic-binding/type checking.</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="Ih2E3d"> > Further, context-to-stack mapping is such a huge win that it'll be of<br>
> benefit even if the VM is spending 90% of its time in inlined call-less<br>
> code. We see a speedup of very nearly 2x (48% sticks in my head) for one<br>
> non-micro tree walking benchmark from the computer language shootout. And<br>
> this is in a very slow VM. In a faster VM context-to-stack mapping would be<br>
> even more valuable, because it would save an even greater percentage of<br>
> overall execution time.<br>
<br>
</div>I see only one sixth of the time going into context creation for the<br>
send benchmark which is about as send heavy as you can get. That's<br>
running native code at about twice Squeak's speed. Also there's still<br>
plenty of inefficiency in Exupery's call return sequences.</blockquote><div><br></div><div>So you could get a 17% speedup if you could remove the context overhead. That's quite a tidy gain. I see a 26% increase in benchFib performance between base Squeak and the StackVM with no native code at all.</div>
<div><br></div><div>What are the inefficiencies in Exupery's call return sequences?</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="Ih2E3d">
> Further still using call & return instructions as conventionally as possible<br>
> meshes extremely well with current processor implementations which, because<br>
> of the extensive use thereon of conventional stack-oriented language<br>
> implementations, have done a great job optimizing call/return.<br>
<br>
</div>Unconditional jumps for sends also benefit from hardware<br>
optimisation. Returns turn into indirect jumps which are less<br>
efficient, but getting better with Core 2.</blockquote><div><br></div><div>and Power </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><br>
<br>
Cheers<br>
Bryce<br>
</blockquote></div><br>