<br><br><div class="gmail_quote">On Sun, Feb 22, 2009 at 12:54 PM, <span dir="ltr"><<a href="mailto:bryce@kampjes.demon.co.uk">bryce@kampjes.demon.co.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="Ih2E3d"><br>
Eliot Miranda writes:<br>
> On Sun, Feb 22, 2009 at 10:37 AM, <<a href="mailto:bryce@kampjes.demon.co.uk">bryce@kampjes.demon.co.uk</a>> wrote:<br>
><br>
> ><br>
> > Eliot Miranda writes:<br>
> > ><br>
> > > But what I really think is that this is too low a level to worry about.<br>
> > > Much more important to focus on<br>
> > > - context to stack mapping<br>
> > > - in-line caching via a JIT<br>
> > > - exploiting multicore via Hydra<br>
> > > and beyond (e.g. speculative inlining)<br>
> > > than worrying about tiny micro-optimizations like this :)<br>
> ><br>
> > If you're planning on adding speculative, I assume Self style dynamic,<br>
> > inlining won't that reduce the value of context to stack mapping?<br>
><br>
><br>
> Not at all; in fact quite the reverse. Context to stack mapping allows one<br>
> to retain contexts while having the VM execute efficient, stack-based code<br>
> (i.e. using hardware call instructions). This in turn enables the entire<br>
> adaptive optimizer, including the stack analyser and the<br>
> bytecode-to-bytecode compiler/method inliner to be written in Smalltalk.<br>
> The image level code can examine the run-time stack using contexts as their<br>
> interface without having to understand native stack formats or different<br>
> ISAs. The optimizer is therefore completely portable with all machine<br>
> specificities confined to the underlying VM which is much simpler by virtue<br>
> of not containing a sophisticated optimizer (which one would have to squeeze<br>
> through Slang etc).<br>
<br>
</div>All you need is the optimiser to run early in compilation for it to be<br>
portable.</blockquote><div><br></div><div>...and for it to be untimely. An adaptive optimizer by definition needs to be running intermittently all the time. It optimizes what is happening now, not what happened at start-up.</div>
<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">And we definitely agree on trying to keep complex logic out of the<br>
VM. Sounds like you're thinking of AoSTa.<br>
<div class="Ih2E3d"></div></blockquote><div><br></div><div>yes (AOStA).</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="Ih2E3d"> > So for me, context-to-stack mapping is fundamental to implementing<br>
> speculative inlining in Smalltalk.<br>
><br>
><br>
> My view with Exupery is context caches should be left until after<br>
> > dynamic inlining as their value will depend on how well dynamic<br>
> > inlining reduces the number of sends.<br>
> ><br>
><br>
> I know and I disagree. Dynamic inlining depends on collecting good type<br>
> information, something that inline caches do well. In-line caches are<br>
> efficiently implemented with native call instructions, either to method<br>
> entry-points or PIC jump tables. Native call instructions mesh well with<br>
> stacks. So context-to-stack mapping, for me, is a sensible enabling<br>
> optimization for speculative inlining because it meshes well with inline<br>
> caches.<br>
<br>
</div>PICs are a separate issue. Exupery has PICs, and has had them for<br>
years now. PICs are just as easily implemented as jumps.<br>
<div class="Ih2E3d"></div></blockquote><div><br></div><div>Yes, PICs are jump tables. But, at least in my implementation and in others I know of, they get called. Tey are composed of a jump table that then jumps into methods at a point past any entry-point dynamic-binding/type checking.</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="Ih2E3d"> > Further, context-to-stack mapping is such a huge win that it'll be of<br>
> benefit even if the VM is spending 90% of its time in inlined call-less<br>
> code. We see a speedup of very nearly 2x (48% sticks in my head) for one<br>
> non-micro tree walking benchmark from the computer language shootout. And<br>
> this is in a very slow VM. In a faster VM context-to-stack mapping would be<br>
> even more valuable, because it would save an even greater percentage of<br>
> overall execution time.<br>
<br>
</div>I see only one sixth of the time going into context creation for the<br>
send benchmark which is about as send heavy as you can get. That's<br>
running native code at about twice Squeak's speed. Also there's still<br>
plenty of inefficiency in Exupery's call return sequences.</blockquote><div><br></div><div>So you could get a 17% speedup if you could remove the context overhead. That's quite a tidy gain. I see a 26% increase in benchFib performance between base Squeak and the StackVM with no native code at all.</div>
<div><br></div><div>What are the inefficiencies in Exupery's call return sequences?</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="Ih2E3d">
> Further still using call & return instructions as conventionally as possible<br>
> meshes extremely well with current processor implementations which, because<br>
> of the extensive use thereon of conventional stack-oriented language<br>
> implementations, have done a great job optimizing call/return.<br>
<br>
</div>Unconditional jumps for sends also benefit from hardware<br>
optimisation. Returns turn into indirect jumps which are less<br>
efficient, but getting better with Core 2.</blockquote><div><br></div><div>and Power </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><br>
<br>
Cheers<br>
Bryce<br>
</blockquote></div><br>