Bryce,<div><br></div><div>    please forgive my earlier reply, which I realise was extraordinarily rude and unnecessarily critical.  Let me try to engage more constructively.  I think I&#39;ll be making the same points, but hopefully I&#39;ll do so while being less of an a***hole.  <br>

<br><div class="gmail_quote">On Mon, Feb 23, 2009 at 1:49 PM,  <span dir="ltr">&lt;<a href="mailto:bryce@kampjes.demon.co.uk">bryce@kampjes.demon.co.uk</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="Ih2E3d"><br>

Eliot Miranda writes:<br>

 &gt;  On Sun, Feb 22, 2009 at 12:54 PM, &lt;<a href="mailto:bryce@kampjes.demon.co.uk">bryce@kampjes.demon.co.uk</a>&gt; wrote:<br>

</div><div class="Ih2E3d"> &gt; &gt; All you need is the optimiser to run early in compilation for it to be<br>

 &gt; &gt; portable.<br>

 &gt;<br>

 &gt;<br>

 &gt; ...and for it to be untimely.  An adaptive optimizer by definition needs to<br>

 &gt; be running intermittently all the time.  It optimizes what is happening now,<br>

 &gt; not what happened at start-up.<br>

<br>

</div>Exupery runs as a Smalltalk background thread, it already uses dynamic<br>

feed back to inline some primitives including #at: and #at:put.</blockquote><div><br></div><div>The background-thread approach was used by Typed Smalltalk and is also used by some Java JITs.  But how does running in a background thread help it be portable?  Surely if it targets native code it needs to be split into a front end and a back end, of which only the front end will be portable, right?</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="Ih2E3d"> &gt; &gt; I see only one sixth of the time going into context creation for the<br>

 &gt; &gt; send benchmark which is about as send heavy as you can get. That&#39;s<br>

 &gt; &gt; running native code at about twice Squeak&#39;s speed. Also there&#39;s still<br>

 &gt; &gt; plenty of inefficiency in Exupery&#39;s call return sequences.</div></blockquote><div><br></div><div>As your VM gets faster, that 1/6th will loom ever larger.  If you triple the overall speed of your VM while keeping that same context creation scheme, total execution time falls to 1/3 of what it was, so that 1/6 will become 1/2 of the entire execution time.  So if you want truly high performance you&#39;re going to have to tackle context elimination at some stage.  Since it is so integral to the central issue of call/return design, I would encourage you to address it earlier rather than later.</div>
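<div><br></div><div>To make that arithmetic concrete, here is a small Python sketch.  It is not taken from Exupery or Cog, and the function names are made up for illustration.  It covers two readings of &quot;triple the speed&quot;: the whole VM ends up three times faster while context creation costs the same (which gives the 1/2 figure), or only the non-context work gets three times faster (which gives a smaller but still much larger share).</div>

```python
from fractions import Fraction


def share_if_whole_vm_speeds_up(share, speedup):
    """Context-creation share of run time when total run time drops to
    1/speedup of the original but context creation costs the same."""
    share = Fraction(share)
    new_total = Fraction(1) / speedup
    if share > new_total:
        raise ValueError("fixed cost alone exceeds the new total run time")
    return share / new_total


def share_if_only_other_work_speeds_up(share, speedup):
    """Context-creation share when only the remaining work runs
    `speedup` times faster."""
    share = Fraction(share)
    other = (1 - share) / speedup
    return share / (share + other)


one_sixth = Fraction(1, 6)
print(share_if_whole_vm_speeds_up(one_sixth, 3))         # 1/2
print(share_if_only_other_work_speeds_up(one_sixth, 3))  # 3/8
```

<div>Either way, a cost that is one sixth of run time today becomes somewhere between three eighths and one half of run time after a 3x improvement elsewhere.</div>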

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="Ih2E3d"><br>

 &gt;<br>

 &gt;<br>

 &gt; So you could get a 17% speedup if you could remove the context overhead.<br>

 &gt;  That&#39;s quite a tidy gain.  I see a 26% increase in benchFib performance<br>

 &gt; between base Squeak and the StackVM with no native code at all.<br>

 &gt;<br>

 &gt; What are the inefficiences in Exupery&#39;s call return sequences?<br>

<br>

</div>Exupery uses a C call sequence so it&#39;s easy to enter from the<br>

interpreter, that C call frame is torn down when exiting each<br>

compiled method then re-created when reentering native code. That&#39;s<br>

a complete waste when going from one native method to another.</blockquote><div><br></div><div>One sage piece of advice is to optimize for the common case.  Try to make the common case as fast as possible.  You know this, since you&#39;re also interested in adaptive optimization, which is fundamentally about optimizing the common case.  So design your calling convention around machine-code to machine-code calls, and make the uncommon interpreter/machine-code call do the necessary work to interface with the calling convention, not the other way around.  The dog should wag the tail (although try telling my father-in-law&#39;s labrador puppy that).</div>

<div><br></div><div>The way I&#39;ve done this in Cog is to generate a trivial piece of machine code, a thunk/trampoline etc., that I actually call an enilopmart because it jumps from the interpreter/run-time into machine code, whereas jumps in the other direction are via trampolines.  The interpreter uses this by pushing the pc of the first instruction past the in-line cache checking code in the method, followed by the values of the register(s) that need loading, and then calling the enilopmart.  The enilopmart sets the stack and frame pointers to those of the machine-code frame being called, pops all the register values that are live on entry to the method off the stack, and returns.  The return is effectively a jump to the start of the machine-code method.</div>
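<div><br></div><div>The push/pop protocol described above can be modelled in a few lines of Python.  This is only a toy simulation under loud assumptions: it is not Cog code, a Python list stands in for the machine stack, and <code>ToyMachine</code>, <code>enilopmart</code>, <code>interpreter_enter</code> and the register names are all hypothetical.</div>

```python
class ToyMachine:
    """Hypothetical toy CPU state: the active stack, some registers, a pc."""

    def __init__(self):
        self.stack = None      # list currently acting as the machine stack
        self.registers = {}
        self.pc = None


def enilopmart(machine, target_stack, live_register_names):
    """Switch to the machine-code frame's stack, load registers, 'return'."""
    machine.stack = target_stack              # stands in for setting sp and fp
    for name in reversed(live_register_names):
        machine.registers[name] = machine.stack.pop()
    machine.pc = machine.stack.pop()          # the 'ret': jump to the entry pc


def interpreter_enter(machine, target_stack, entry_pc, register_values):
    """Interpreter side: push entry pc, then register values, then jump in."""
    target_stack.append(entry_pc)
    names = []
    for name, value in register_values:
        target_stack.append(value)
        names.append(name)
    enilopmart(machine, target_stack, names)
```

<div>For example, calling <code>interpreter_enter(m, frame_stack, 0x1000, [(&quot;ReceiverReg&quot;, &quot;anObject&quot;)])</code> leaves <code>m.pc</code> at the entry address, the register loaded, and the frame&#39;s stack exactly as it was before the entry values were pushed.</div>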

<div><br></div><div>To arrange that returns from machine-code frames don&#39;t have to check for a return to the interpreter, the interpreter saves its own instruction pointer in a slot in its frame, and substitutes the address of a routine that handles returning to the interpreter as the return address.  So when the machine-code frame returns, it&#39;ll return to the trampoline, which retrieves the saved instruction pointer from the slot and longjmps back to the interpreter.  I use a longjmp to avoid stack growth in any dance between interpreter and machine code.</div>
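<div><br></div><div>A rough Python model of that return path follows.  It is a sketch, not Cog&#39;s implementation: Python has no longjmp, so an exception stands in for the non-local jump back to the interpreter, and every name here (<code>BackToInterpreter</code>, <code>InterpreterFrame</code>, <code>make_return_trampoline</code> and so on) is invented for illustration.  The point it tries to show is that the callee returns through whatever return address it was given, without testing who called it.</div>

```python
class BackToInterpreter(Exception):
    """Stand-in for longjmp: unwinds directly to the interpreter."""

    def __init__(self, resume_pc, result):
        super().__init__(resume_pc, result)
        self.resume_pc = resume_pc
        self.result = result


class InterpreterFrame:
    def __init__(self, pc):
        self.pc = pc
        self.saved_pc = None   # slot holding the interpreter pc during the call


def make_return_trampoline(frame):
    """Build the routine substituted as the return address."""
    def trampoline(result):
        # Fetch the saved pc from the frame's slot and jump back.
        raise BackToInterpreter(frame.saved_pc, result)
    return trampoline


def run_machine_code(method, return_address):
    # The callee never asks who called it; it returns via return_address.
    return return_address(method())


def interpreter_send(frame, method):
    """Interpreter calling a machine-code method."""
    frame.saved_pc = frame.pc
    try:
        run_machine_code(method, make_return_trampoline(frame))
    except BackToInterpreter as jump:
        frame.pc = jump.resume_pc
        return jump.result
```

<div>A machine-code caller would simply pass its own continuation as <code>return_address</code>, so the common machine-code to machine-code return pays nothing for the existence of the interpreter.</div>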

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Also the send/return sequence isn&#39;t yet that optimised, there&#39;s still<br>

plenty of inefficiencies due to lack of addressing modes etc and because<br>

it&#39;s fairly naive translation of the interpreters send code.</blockquote><div><br></div><div>I would sit down and draw what you want the calling convention to look like, and do it sooner rather than later.  This is crucial to overall performance until you solve the more difficult problem of doing significant adaptive optimization so that call/return is eliminated.  However, as mentioned previously, eliminating call/return isn&#39;t such a great idea per se, and so keeping a fast call/return sequence is probably a very good idea anyway.  It&#39;s not as if modern processors don&#39;t do call/return well.</div>

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">17% would be rather optimistic, some of the work required to set up a<br>

context will always be required. Temporaries will still need to be<br>

nilled out etc.</blockquote><div><br></div><div>Aim higher :)  I&#39;m hoping for 10x current Squeak performance for Smalltalk-intensive benchmarks some time later this year.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<br>

<br>

Bryce<br>

</blockquote></div><br></div><div>Cheers</div><div>and again apologies for having been so critical</div>