Hi Juan,

On Wed, Dec 17, 2008 at 5:31 AM, Juan Vuletich <juan@jvuletich.org> wrote:
> Hi Andreas, Eliot,
>
> Thank you very much for this effort, for funding it, and for making it
> available to the community!
>
>> ... it is possible (albeit unlikely at this point) that the focus might
>> shift towards FFI speed or float inlining....
>
> Can you tell a bit more about "float inlining"? I guess you're talking
> about immediate (unboxed) floats, right? That could mean no longer need
> to do plugins for numerical stuff. I would love to have that!

I tried to post a reply yesterday but hit the 100k limit, and the list
moderator refused to let my reply in anyway.

There are two things here. Yes, one is doing immediate floats in a 64-bit
VM, which can produce floating-point arithmetic that runs at half
SmallInteger speed, perhaps three times faster than boxed floats.
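
To make "immediate floats" a bit more concrete, here is a minimal C sketch
of one possible 64-bit tagged encoding. The 3-bit tag value, the exponent
re-bias constant and the reduced exponent range are assumptions for
illustration only, not a commitment to what Cog will actually do; doubles
whose exponent falls outside the range simply stay boxed.

#include <stdint.h>
#include <string.h>

#define TAG_BITS        3
#define IMMFLOAT_TAG    4u              /* assumed tag pattern, purely illustrative */
#define EXPONENT_OFFSET (896ull << 53)  /* re-bias so common magnitudes fit in 61 bits */

/* Try to encode a double as a tagged immediate.  Returns 0 (leaving *oop
   untouched) when the value needs a boxed Float instead. */
static int encodeImmediateFloat(double value, uint64_t *oop)
{
    uint64_t bits, rotated, payload;
    memcpy(&bits, &value, sizeof bits);        /* raw IEEE 754 bits */
    if (bits == 0) {                           /* +0.0 encodes as just the tag */
        *oop = IMMFLOAT_TAG;
        return 1;
    }
    rotated = (bits << 1) | (bits >> 63);      /* rotate the sign bit down low */
    payload = rotated - EXPONENT_OFFSET;
    if (payload == 0 || payload >= (1ull << 61))
        return 0;                              /* exponent outside the immediate range */
    *oop = (payload << TAG_BITS) | IMMFLOAT_TAG;
    return 1;
}

static double decodeImmediateFloat(uint64_t oop)
{
    uint64_t rotated, bits;
    double value;
    if (oop == IMMFLOAT_TAG)                   /* the +0.0 special case */
        return 0.0;
    rotated = (oop >> TAG_BITS) + EXPONENT_OFFSET;
    bits = (rotated >> 1) | (rotated << 63);   /* rotate the sign bit back up */
    memcpy(&value, &bits, sizeof value);
    return value;
}

The point of the rotation and re-bias is that a useful range of magnitudes
(very roughly 1e-38 to 1e38, plus +0.0) fits in the 61 bits left over after
the tag, so arithmetic on such values never has to allocate.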

But much more interesting is an adaptive optimization/speculative inlining
scheme which aims to map floating-point operations down to the processor's
floating-point registers. Here's an excerpt from my AOStA design sketch
(the one I tried to post yesterday but that was bounced) that describes how
this might be done. The basic idea for an adaptive optimizer is to use two
levels, one bytecode-to-bytecode and another bytecode-to-machine-code. The
bytecode-to-bytecode level is written in Smalltalk and is clever (it does
type analysis, inlining, etc.). The bytecode-to-machine-code level is not
clever, but is processor-specific.

The bytecode-to-bytecode optimizer targets bytecodes, a little like the
special selectors, that define optimized operations from which the
bytecode-to-machine-code compiler can generate fast code. Conceptually this
fast bytecode runs in OptimizedContexts, but the virtual machine and the
bytecode-to-machine-code compiler arrange that it actually runs on a native
stack in native machine registers, including the FPU. With that said, the
following might make sense:

3.5 Initial Floating-Point Unboxing Scheme

While it should be a goal to unbox floats in pointer instances, this sketch
ignores that possibility for now. Smalltalk imposes no restriction on the
type of object stored in a pointer instance variable. Therefore any
unboxing scheme needs to be per-instance, not just per-class (although one
could imagine a scheme that used anonymous behaviors to distinguish
instances of a class that contained unboxed data from instances of the same
class that did not). At least in the HPS memory manager such flexibility
poses a problem, and I would like to make immediate progress. So this
unboxing scheme only handles unboxing within an OptimizedContext, being
rather analogous to a floating-point co-processor unit.

An OptimizedContext has two stacks, one for normal objects and one for raw
data. The raw stack is organized as a number of slots, each large enough to
hold the largest floating-point format supported by the Smalltalk VM. The
size of an OptimizedContext's raw stack is zero by default. If non-zero,
its size is defined by information in the context's OptimizedMethod, e.g.
either a field in the header or some initial bytecode (analogous to
pushCopiedValues at the start of a copying block) that specifies the number
of slots. The raw stack can be implemented as a pair of instance variables
in OptimizedContext that are normally nil but otherwise contain a suitably
large ByteArray and a raw stack pointer. Whenever an OptimizedMethod that
specifies a non-empty raw stack is activated, the initial contents are
undefined and the stack pointer is 0 (1-relative), i.e. there is no support
for floating-point arguments. It is assumed that inlining will reduce the
demand for floating-point parameter passing enough that it can be done
without.
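
As a rough illustration of the shape this might take inside the VM, here is
a C sketch of an OptimizedContext carrying the optional raw stack. The
field names and the use of malloc in place of a ByteArray are mine, for
illustration only.

#include <stdint.h>
#include <stdlib.h>

typedef uint64_t oop;        /* a tagged object pointer */

/* Hypothetical in-VM view of an OptimizedContext with the optional raw stack. */
typedef struct {
    oop      method;         /* the OptimizedMethod being executed              */
    oop     *pointerStack;   /* the normal stack of tagged objects              */
    int      pointerSP;
    /* The raw stack is absent unless the OptimizedMethod's header (or an
       initial bytecode) asks for N slots, each wide enough for the largest
       supported float format. */
    double  *rawStack;       /* stands in for the ByteArray instance variable   */
    int      rawSP;          /* 0 on activation: no float argument passing      */
} OptimizedContext;

/* Activate a method whose header requests rawSlotCount raw-stack slots. */
static void activateWithRawStack(OptimizedContext *ctx, int rawSlotCount)
{
    ctx->rawStack = rawSlotCount > 0 ? malloc(rawSlotCount * sizeof(double)) : NULL;
    ctx->rawSP = 0;          /* contents are undefined on activation */
}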

A set of in-line primitives can access the raw stack as IEEE floating-point
data, moving values between the raw stack and the pointer stack or object
fields. The primitive set would be extended to support unboxed access to
fields in pointer instances if and when required. On Smalltalks with
different-sized floating-point classes (VisualWorks supports 32-bit Float
and 64-bit Double) the primitive set may provide access to each float size.
Here we sketch only a set for 64-bit Double floating-point values. If the
set handles multiple sizes of data, each slot still holds only a single
value, even one of the smaller size.
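
A C sketch of what three such primitives could look like follows; the names
and the Float-conversion helpers are invented for illustration, and the
object-field variants are omitted.

#include <stdint.h>

/* Assumed helpers: unbox a Float oop to a double and vice versa (the latter
   either producing an immediate or allocating a boxed Float). */
extern double   floatValueOf(uint64_t floatOop);
extern uint64_t floatObjectFor(double value);

/* pointer stack -> raw stack: pop a Float, push its unboxed value */
static void primPushRawFloat(uint64_t *ptrStack, int *ptrSP,
                             double *rawStack, int *rawSP)
{
    rawStack[(*rawSP)++] = floatValueOf(ptrStack[--*ptrSP]);
}

/* raw stack -> pointer stack: pop an unboxed value, push it as a Float */
static void primPopRawFloat(uint64_t *ptrStack, int *ptrSP,
                            double *rawStack, int *rawSP)
{
    ptrStack[(*ptrSP)++] = floatObjectFor(rawStack[--*rawSP]);
}

/* arithmetic stays entirely unboxed on the raw stack */
static void primRawAdd(double *rawStack, int *rawSP)
{
    double b = rawStack[--*rawSP];
    rawStack[*rawSP - 1] += b;     /* a := a + b */
}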

The set of raw stack primitives is stack-based because it is much easier to
map a stack-based addressing scheme with a finite-sized stack onto a
register set than it is to map a register-based scheme onto a stack, and
the infamous x86 floating-point processor, which is stack-based, is likely
to remain an important target for users of this system.
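
The pay-off of the stack-based scheme is that the raw stack's depth is
statically known at every bytecode, so the bytecode-to-machine-code
compiler can rename slots to registers with a trivial walk over the code.
A toy C illustration (the register names, opcodes and operands are
stand-ins, not a real code generator):

#include <stdio.h>

enum RawOp { RPUSH_CONST, RADD, RPOP_RESULT };

static const char *fpreg(int depth)   /* registers reserved for raw-stack slots */
{
    static const char *names[] = { "xmm0", "xmm1", "xmm2", "xmm3",
                                   "xmm4", "xmm5", "xmm6", "xmm7" };
    return names[depth];
}

int main(void)
{
    enum RawOp code[] = { RPUSH_CONST, RPUSH_CONST, RADD, RPOP_RESULT };
    int depth = 0;   /* raw-stack depth is known statically at each bytecode */
    for (size_t i = 0; i < sizeof code / sizeof *code; i++)
        switch (code[i]) {
        case RPUSH_CONST:
            printf("movsd <const>, %%%s\n", fpreg(depth++));
            break;
        case RADD:     /* top two slots combine; result lands one slot down */
            depth--;
            printf("addsd %%%s, %%%s\n", fpreg(depth), fpreg(depth - 1));
            break;
        case RPOP_RESULT:
            printf("movsd %%%s, <result>\n", fpreg(--depth));
            break;
        }
    return 0;
}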

The raw stack could also be used to optimize integer arithmetic, supporting
arithmetic on untagged 8, 16, 32 and 64-bit widths as in Java. Rather than
waste time specifying this, I'll leave open the possibility of adding a set
of bytecodes to allow 64-bit arithmetic and conversion to and from tagged
and boxed SmallIntegers and LargeIntegers on the normal stack.
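
For the 64-bit integer case the companion primitives might look like this
in C. The 3-bit SmallInteger tag and the names are again assumptions, and
overflow into LargeIntegers would need an explicit check that I omit here.

#include <stdint.h>

#define TAG_BITS      3
#define SMALLINT_TAG  1u   /* assumed tag pattern for SmallIntegers */

static int64_t untagSmallInteger(uint64_t oop)
{
    return (int64_t)oop >> TAG_BITS;          /* arithmetic shift keeps the sign */
}

static uint64_t tagSmallInteger(int64_t value)
{
    return ((uint64_t)value << TAG_BITS) | SMALLINT_TAG;
}

/* Untagged 64-bit add entirely on the raw stack (slots are reused for ints). */
static void primRawInt64Add(int64_t *rawStack, int *rawSP)
{
    int64_t b = rawStack[--*rawSP];
    rawStack[*rawSP - 1] += b;                /* wraps silently at 64 bits */
}

/* Move a SmallInteger between the pointer stack and the raw stack. */
static void primPushRawInt64(uint64_t *ptrStack, int *ptrSP,
                             int64_t *rawStack, int *rawSP)
{
    rawStack[(*rawSP)++] = untagSmallInteger(ptrStack[--*ptrSP]);
}

static void primPopRawInt64(uint64_t *ptrStack, int *ptrSP,
                            int64_t *rawStack, int *rawSP)
{
    /* a real VM would check the value fits before tagging it */
    ptrStack[(*ptrSP)++] = tagSmallInteger(rawStack[--*rawSP]);
}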

The paragraph about the x86 FPU is way out of date. So imagine that the
bytecode-to-machine-code compiler maps some portion of the raw data stack
onto the XMM (SSE2) registers. Either it or the bytecode-to-bytecode
compiler can do a usage-frequency analysis to arrange that the most
frequently used stack slots get mapped to the registers.
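
A crude sketch of that frequency analysis, just to show the shape of it:
count how often each raw-stack slot is touched in the optimized method and
hand the hottest slots the limited set of floating-point registers,
spilling the rest to memory. The register count and data layout are
assumptions.

#include <stdlib.h>

#define NUM_FP_REGS 8   /* e.g. eight XMM registers reserved for raw-stack slots */

typedef struct {
    int slot;           /* raw-stack slot index                        */
    int uses;           /* how often the optimized method touches it   */
} SlotUse;

static int byUsesDescending(const void *a, const void *b)
{
    return ((const SlotUse *)b)->uses - ((const SlotUse *)a)->uses;
}

/* After the call, regFor[slot] is a register index, or -1 if that slot
   spills to the in-memory raw stack. */
static void assignRawSlotRegisters(SlotUse *counts, int numSlots, int *regFor)
{
    qsort(counts, numSlots, sizeof *counts, byUsesDescending);
    for (int i = 0; i < numSlots; i++)
        regFor[counts[i].slot] = (i < NUM_FP_REGS) ? i : -1;
}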

If this can work then yes, plugins could become a thing of the past.
However, I doubt very much that Qwaq will fund me to do this. Right now,
with a Squeak VM that is 10 to 20 times slower than VisualWorks' VM, Qwaq
Forums spends roughly 2/3rds of its time executing Smalltalk. The bulk of
the rest of the time is in OpenGL. The Cog JIT I'm working on now should be
able to reach VisualWorks VM speeds, and hence that 66.6% should shrink to
no more than, say, 6% of the entire execution time, with say 90% of the
time in OpenGL.

A second-stage JIT doing adaptive optimization/speculative inlining could
probably improve performance by another factor of three. But that would
produce only about a 4% improvement in Qwaq Forums performance (the ~6%
spent in Smalltalk would shrink to roughly 2%). Yes, it might allow Qwaq to
rewrite all their C plugin code in Smalltalk and get the same performance
from the Smalltalk code, which would then be easier to maintain and
enhance, etc. But where is the return on investment (ROI)?

The system would not be measurably faster for Qwaq Forums. The
maintainability/extensibility benefits are intangible and hard to sell to
investors. Hence I don't see Qwaq funding this, and if it were my call and
my money I'd probably make the same decision. However, we haven't even
begun to discuss this inside Qwaq, so you never know.

best
Eliot