[squeak-dev] Cog VM status update

Thu Dec 18 19:25:46 UTC 2008

Hi Juan,

On Wed, Dec 17, 2008 at 5:31 AM, Juan Vuletich <juan at jvuletich.org> wrote:

> Hi Andreas, Eliot,
>
> Thank you very much for this effort, for funding it, and for making it
> available to the community!
>
>  ... it is possible (albeit unlikely at this point) that the focus might
>> shift towards FFI speed or float inlining....
>>
>
> Can you tell a bit more about "float inlining"? I guess you're talking
> about immediate (unboxed) floats, right? That could mean no longer need to
> do plugins for numerical stuff. I would love to have that!

 I tried to post a reply yesterday but hit the 100k limit and the list
moderator refused to let my reply in anyway.

There are two things here.  Yes, one is doing immediate floats in a 64-bit
VM, which can produce floating-point that runs at half SmallInteger speed,
perhaps three times faster than boxed floats.

But much more interesting is an adaptive optimization/speculative inlining
scheme which aims to map floating-point operations down to the processor's
floating-point registers.  Here's an abstract from my AOStA design sketch
(that I tried to post yesterday but was bounced) that describes how this
might be done.  The basic idea for an adaptive optimizer is to use two
levels, one bytecode to bytecode and another bytecode to machine code.  The
bytecode to bytecode level is written in Smalltalk and is clever (does type
analysis, inlining, etc).  The bytecode to machine code level is not clever,
but is processor-specific.

The bytecode to bytecode optimizer targets bytecodes a little like the
special selectors that define optimized operations from which the bytecode
to machine code compiler can generate fast code.  Conceptually this fast
bytecode runs in OptimizedContexts, but the virtual machine and bytecode to
machine code compiler arrange that it actually runs on a native stack in
native machine registers, including the FPU.  With that said the following
might make sense:

3.5 Initial Floating-Point Unboxing Scheme

While it should be a goal to unbox floats in pointer instances this sketch
ignores that possibility for now.  Smalltalk imposes no restriction on the
type of object stored in a pointer instance variable.  Therefore any
unboxing scheme needs to be per-instance, not just per-class (although one
could imagine a scheme that used anonymous behaviors to distinguish
instances of a class that contained unboxed data from instances of the same
class that did not).  At least in the HPS memory manager such flexibility
poses a problem and I would like to make immediate progress.  So this
unboxing scheme only handles unboxing within an OptimizedContext, being
rather analogous to a floating-point co-processor unit.

An OptimizedContext has two stacks, one for normal objects, and one for raw
data. The raw stack is organized as a number of slots large enough to hold
the largest floating-point format supported by the Smalltalk VM.  The size
of an OptimizedContext's raw stack is zero by default.  If non-zero its size is
defined by information in the context's OptimizedMethod, e.g. either a field
in the header, or some initial bytecode (analogous to pushCopiedValues at
the start of a copying block) that specifies the number of slots.  The stack
can be implemented as a pair of instance variables in OptimizedContext that are
normally nil, but otherwise contain a suitably large ByteArray and a raw
stack pointer.  Whenever an OptimizedMethod that specifies a non-empty raw
stack is activated the initial contents are undefined and the stack pointer
is 0 (1 relative), i.e. there is no support for floating-point arguments.
 It is assumed that in-lining will reduce the demand for floating-point
parameter passing enough for it to be lived without.
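
As a rough illustration only (not part of the design sketch itself), the raw
stack described above could be modelled as follows.  The names
OptimizedContext, raw_bytes and raw_sp come from or paraphrase the text; the
Python class, the 8-byte slot size and the method names are assumptions made
for this example.

```python
import struct

SLOT_SIZE = 8  # assumption: the largest supported format is a 64-bit IEEE double


class OptimizedContext:
    """Hypothetical model of a context with a pointer stack and a raw stack."""

    def __init__(self, raw_slots=0):
        self.stack = []  # the normal stack of object pointers
        if raw_slots == 0:
            # Default case: no raw stack, both variables stay nil (None here).
            self.raw_bytes = None
            self.raw_sp = None
        else:
            # The text says the initial contents are undefined; Python's
            # bytearray happens to zero-fill, which is one valid choice.
            self.raw_bytes = bytearray(raw_slots * SLOT_SIZE)
            self.raw_sp = 0  # 0 means empty when slots are numbered from 1

    def raw_push(self, value):
        """Store a double in the next free slot and bump the stack pointer."""
        if self.raw_bytes is None or (self.raw_sp + 1) * SLOT_SIZE > len(self.raw_bytes):
            raise OverflowError("raw stack is absent or full")
        struct.pack_into("<d", self.raw_bytes, self.raw_sp * SLOT_SIZE, value)
        self.raw_sp += 1

    def raw_pop(self):
        """Remove and return the double in the topmost occupied slot."""
        if not self.raw_sp:
            raise IndexError("raw stack is absent or empty")
        self.raw_sp -= 1
        return struct.unpack_from("<d", self.raw_bytes, self.raw_sp * SLOT_SIZE)[0]


ctx = OptimizedContext(raw_slots=2)  # size would come from the OptimizedMethod
ctx.raw_push(1.5)
ctx.raw_push(2.25)
top = ctx.raw_pop()
```

The slot count is passed to the constructor here; in the scheme above it would
be read from the method header or from an initial bytecode.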

A set of in-line primitives can access the raw stack as IEEE floating-point
data, moving values between the raw stack and the pointer stack or object
fields.  The primitive set would be extended to support unboxed access to
fields in pointer instances if and when required.  On Smalltalks with
different sized floating-point classes (VisualWorks supports 32-bit Float
and 64-bit Double) the primitive set may provide access to each float size.
 Here we sketch only a set for 64-bit Double floating-point values.  If the
set handles multiple sizes of data, each slot can hold only one instance of
a small value.
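
To make the idea of in-line primitives concrete, here is a minimal sketch of
three of them for 64-bit Doubles: one that unboxes from the pointer stack to
the raw stack, one that adds on the raw stack, and one that boxes the result
again.  The names Frame, BoxedDouble, prim_unbox, prim_raw_add and prim_box
are hypothetical and do not come from the design sketch.

```python
import struct


class BoxedDouble:
    """Stand-in for a heap-allocated (boxed) Double object."""

    def __init__(self, value):
        self.value = float(value)


class Frame:
    """Hypothetical frame: a pointer stack plus a raw stack of 8-byte slots."""

    def __init__(self, raw_slots):
        self.pointers = []
        self.raw = bytearray(8 * raw_slots)
        self.raw_sp = 0  # number of occupied raw slots


def prim_unbox(frame):
    """Pop a boxed Double from the pointer stack and push its bits raw."""
    boxed = frame.pointers.pop()
    struct.pack_into("<d", frame.raw, 8 * frame.raw_sp, boxed.value)
    frame.raw_sp += 1


def prim_raw_add(frame):
    """Replace the top two raw doubles with their sum; no box is created."""
    top = 8 * (frame.raw_sp - 1)
    a = struct.unpack_from("<d", frame.raw, top - 8)[0]
    b = struct.unpack_from("<d", frame.raw, top)[0]
    struct.pack_into("<d", frame.raw, top - 8, a + b)
    frame.raw_sp -= 1


def prim_box(frame):
    """Pop the top raw double and push a freshly boxed Double."""
    frame.raw_sp -= 1
    value = struct.unpack_from("<d", frame.raw, 8 * frame.raw_sp)[0]
    frame.pointers.append(BoxedDouble(value))


frame = Frame(raw_slots=2)
frame.pointers += [BoxedDouble(1.5), BoxedDouble(2.25)]
prim_unbox(frame)
prim_unbox(frame)
prim_raw_add(frame)
prim_box(frame)
result = frame.pointers[-1].value
```

Only the final prim_box allocates a boxed object; the intermediate sum lives
entirely in the raw stack, which is the point of the scheme.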

The set of raw stack primitives is stack-based because it is much easier to
map a stack-based addressing scheme with a finite sized stack onto a
register set than it is to map a register-based scheme onto a stack, and the
infamous x86 floating-point processor, which is stack-based, is likely to
remain an important target for users of this system.

The raw stack could also be used to optimize integer arithmetic, supporting
arithmetic on untagged 8, 16, 32 and 64-bit widths as in Java.  Rather than
waste time specifying this I'll leave open the possibility of adding a set
of bytecodes to allow 64-bit arithmetic and conversion to and from tagged
and boxed SmallIntegers and LargeIntegers on the normal stack.

The paragraph about the x86 FPU is way out of date.  So imagine that the
bytecode to machine code compiler maps some portion of the raw data stack
onto the mmx registers.  Either it or the bytecode to bytecode compiler can
do a usage frequency analysis to arrange that the most frequently used stack
slots get mapped to the registers.
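
A minimal sketch of such a usage frequency analysis, assuming the compiler
already has a list of the raw stack slot indices a method touches.  The
function assign_registers and the register names fp0, fp1, ... are made up
for this example; a real compiler would also have to weigh loop nesting and
slot lifetimes.

```python
from collections import Counter


def assign_registers(slot_accesses, num_registers):
    """Map the most frequently accessed raw stack slots to registers.

    slot_accesses is a sequence of slot indices, one per access.
    Slots that do not appear in the result stay in memory.
    """
    counts = Counter(slot_accesses)
    # most_common sorts by descending count; equal counts keep the order
    # in which the slots were first seen.
    hot_slots = [slot for slot, _ in counts.most_common(num_registers)]
    return {slot: f"fp{i}" for i, slot in enumerate(hot_slots)}


# Slot 1 is touched three times, slot 0 twice, slots 2 and 3 once each.
accesses = [0, 1, 1, 2, 1, 0, 3]
mapping = assign_registers(accesses, num_registers=2)
```

With two registers available, slots 1 and 0 are mapped to fp0 and fp1 and
slots 2 and 3 remain on the in-memory raw stack.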

If this can work then yes, plugins could become a thing of the past.
 However, I doubt very much that Qwaq will fund me to do this.  Right now
with a Squeak VM that is 10 to 20 times slower than VisualWorks' VM Qwaq
Forums spends roughly 2/3rds of its time executing Smalltalk.  The bulk of
the rest of the time is in OpenGL.  The Cog JIT I'm working on now should be
able to reach VisualWorks VM speeds and hence the 66.6% should become no
more than, say, 6% of entire execution time, with say 90% of the time in
OpenGL.

A second stage JIT doing adaptive optimization/speculative inlining could
probably improve performance by another factor of three.  But that would
produce only a 4% improvement in Qwaq Forums performance.  Yes, it might
allow Qwaq to rewrite all their C plugin code in Smalltalk and get the same
performance from the Smalltalk code that would then be easier to maintain
and enhance etc.  But where is the return on investment (ROI)?
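
The rough arithmetic behind the 6% and 4% figures can be checked with a few
lines.  This is a back-of-the-envelope sketch: all times are measured in units
of the original run time (100 units), the factor of 10 is the low end of the
10 to 20 range quoted above, and the variable names are invented for the
example.

```python
ORIGINAL_RUN = 100.0  # original total run time, in arbitrary units

# Roughly 2/3 of the original run is spent executing Smalltalk.
smalltalk_now = ORIGINAL_RUN * 2 / 3

# Assumption: the Cog JIT gives a 10x speedup on the Smalltalk portion.
smalltalk_after_cog = smalltalk_now / 10

# Assumption: a second-stage adaptive optimizer gives another factor of 3.
smalltalk_after_second_stage = smalltalk_after_cog / 3

# Time saved by the second stage, still in units of the original run.
saved_by_second_stage = smalltalk_after_cog - smalltalk_after_second_stage

print(round(smalltalk_after_cog, 2))           # about 6.67 units
print(round(smalltalk_after_second_stage, 2))  # about 2.22 units
print(round(saved_by_second_stage, 2))         # about 4.44 units
```

So the second stage would shave only about 4 of the original 100 units,
compared with about 60 units saved by the first-stage JIT.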

The system would not be measurably faster for Qwaq Forums.  The
maintainability/extensibility benefits are intangible and hard to sell to
investors.  Hence I don't see Qwaq funding this, and if it were my call and
my money I'd probably make the same decision.  However, we haven't even
begun to discuss this inside Qwaq so you never know.

best
Eliot