[Vm-dev] Interpreter>>isContextHeader: optimization

Mon Mar 2 22:51:44 UTC 2009

On Sun, Mar 1, 2009 at 2:21 PM, Eliot Miranda <eliot.miranda at gmail.com> wrote:

>
>
> On Sun, Mar 1, 2009 at 1:46 PM, <bryce at kampjes.demon.co.uk> wrote:
>
>>
>> Eliot Miranda writes:
>>  >  Bryce,
>>  >     please forgive my earlier reply which I realise was extraordinarily
>> rude
>>  > and unnecessarily critical.  Let me try and engage more constructively.
>>  I
>>  > think I'll be making the same points but hopefully I'll do so while
>> being
>>  > less of an a***hole.
>>  >
>>  > On Mon, Feb 23, 2009 at 1:49 PM, <bryce at kampjes.demon.co.uk> wrote:
>>  >
>>  > >
>>  > > Eliot Miranda writes:
>>  > >  >  On Sun, Feb 22, 2009 at 12:54 PM, <bryce at kampjes.demon.co.uk>
>> wrote:
>>  > >  > > All you need is the optimiser to run early in compilation for it
>> to be
>>  > >  > > portable.
>>  > >  >
>>  > >  >
>>  > >  > ...and for it to be untimely.  An adaptive optimizer by definition
>> needs
>>  > > to
>>  > >  > be running intermittently all the time.  It optimizes what is
>> happening
>>  > > now,
>>  > >  > not what happened at start-up.
>>  > >
>>  > > Exupery runs as a Smalltalk background thread, it already uses
>> dynamic
>>  > > feedback to inline some primitives including #at: and #at:put:.
>>  >
>>  >
>>  > The background thread was used by Typed Smalltalk and is also used by
>> some
>>  > Java jits.  But how does it running in a background thread help it be
>>  > portable?  Surely if it targets native code it needs to be split into a
>>  > front-end and a back-end of which only the front end will be portable
>> right?
>>
>> Most of the code in the back end is portable too. Exupery's back end
>> is split into three stages, instruction selection, register
>> allocation, then assembly. The register allocator is the biggest and
>> most complex part of Exupery so far and is portable.
>>
>> Running as a background thread doesn't make it portable but it does
>> make it more timely than a compile on load system.
>>
>>  >
>>  > >  > > I see only one sixth of the time going into context creation for
>> the
>>  > >  > > send benchmark which is about as send heavy as you can get.
>> That's
>>  > >  > > running native code at about twice Squeak's speed. Also there's
>> still
>>  > >  > > plenty of inefficiency in Exupery's call return sequences.
>>  > >
>>  >
>>  > As your VM gets faster so that 1/6th will loom ever larger.  If you
>> triple
>>  > the speed of your VM while keeping that same context creation scheme
>> then
>>  > that 1/6 will become 1/2 of entire execution time.  So if you want
>> truly
>>  > high performance you're going to have to tackle context elimination at
>> some
>>  > stage.  Since it is so integral to the central issue of call/return
>> design I
>>  > would encourage you to address it earlier rather than later.
>>  >
>>  >
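The arithmetic behind the 1/6 becoming 1/2 can be checked with a small sketch. It assumes the simplest possible model: total run time is normalised to 1, context creation costs a fixed 1/6 of that, and "tripling the speed" means the new total is 1/3. The function names are made up for illustration.

```python
from fractions import Fraction


def context_share_after_speedup(context_share, overall_speedup):
    """Share of run time spent creating contexts once the VM as a whole runs
    `overall_speedup` times faster while context creation costs the same."""
    new_total = Fraction(1) / overall_speedup  # old total run time is 1
    return context_share / new_total


def speedup_if_removed(context_share):
    """Speedup obtained by removing the context overhead entirely."""
    return 1 / (1 - context_share)


print(context_share_after_speedup(Fraction(1, 6), 3))  # 1/2
print(speedup_if_removed(Fraction(1, 6)))              # 6/5
```

Under this model, removing a 1/6 overhead outright saves about 17% of the run time, which is a 1.2x speedup.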
>>  > >
>>  > >  >
>>  > >  >
>>  > >  > So you could get a 17% speedup if you could remove the context
>> overhead.
>>  > >  >  That's quite a tidy gain.  I see a 26% increase in benchFib
>> performance
>>  > >  > between base Squeak and the StackVM with no native code at all.
>>  > >  >
>>  > >  > What are the inefficiencies in Exupery's call return sequences?
>>  > >
>>  > > Exupery uses a C call sequence so it's easy to enter from the
>>  > > interpreter, that C call frame is torn down when exiting each
>>  > > compiled method then re-created when reentering native code. That's
>>  > > a complete waste when going from one native method to another.
>>  >
>>  >
>>  > One sage piece of advice is to optimize for the common case.  Try and
>> make
>>  > the common case as fast as possible.  You know this since you're also
>>  > interested in adaptive optimization which is fundamentally to do with
>>  > optimizing the common case.  So design your calling convention around
>>  > machine-code to machine-code calls and make the uncommon
>>  > interpreter/machine-code call do the necessary work to interface with
>> the
>>  > calling-convention not the other way around.  The dog should wag the
>> tail
>>  > (although try telling my father-in-law's labrador puppy that).
>>  >
>>  > The way I've done this in Cog is to generate a trivial piece of machine
>>  > code, a thunk/trampoline etc, that I actually call an enilopmart
>> because it
>>  > jumps from the interpreter/run-time into machine-code whereas jumps in
>> the
>>  > other direction are via trampolines.  The interpreter uses this by
>> pushing
>>  > the pc of the first instruction past the in-line cache checking code in
>> the
>>  > method, followed by the values of the register(s) that need loading and
>> then
>>  > calls the enilopmart. The enilopmart loads the stack and frame pointers
>> with
>>  > those of the machine-code frame being called, pops all the register
>> values
>>  > that are live on entry to the method off the stack and returns.  The
>> return
>>  > is effectively a jump to the start of the machine-code method.
>>
>> Why does Cog still rely on the interpreter? Is this just a
>> bootstrapping phase?
>
>
>  It is by design, given my experience with HPS (VW's VM).
>
> First, being able to rely on the interpreter means the JIT doesn't have to
> waste time compiling infrequently used or large methods.  There are methods,
> such as the Unicode initializers, that are humongous, are very rarely run,
> and when run only run once (e.g. once on start-up).  These methods typically
> take longer to compile than to interpret and consume huge amounts of code
> space.  It makes more sense to leave these to the interpreter.  My hunch is
> that overall performance will improve because machine code working set size
> will be kept small (currently Cog starts up a Qwaq Croquet development image
> generating less than 640k of code).
>
> Second, if one is using a fixed size code cache (there are advantages in
> being able to traverse all generated code quickly, and in keeping it in one
> place) then one must be able to survive running out of space.  In VW this
> means a few points where things get very tricky, e.g. one is trying to link
> an inline cache to a newly compiled machine-code method target, but the
> compilation caused the code zone to be compacted, moving the current
> call-site one is trying to link into.  With the interpreter to fall back on
> the VM can always make progress even if it has run out of space.  Hence
> compacting the code zone can be deferred until the next event check, just as
> is done with GC.  This makes code compaction simpler and hence more reliable
> and quicker to implement.
>
> Third, keeping the interpreter around means maintaining the current levels
> of portability.  One can always fall back on the interpreter if one doesn't
> have the back-end for the current ISA or one is having to run in ROM.
>
> Fourth, I've long been curious about JIT/Interpreter hybrids and whether
> they're really hard to get to work well.  Turns out it is not too bad.  I of
> course had to iterate twice before I understood what I was doing, but the
> system seems to be quite comprehensible.  It is complicated, but not as
> complicated as HPS.
>
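To make the second point concrete, here is a toy sketch of a fixed-size code zone with the interpreter as a fallback. It is not Cog's code: the class, the method names and the sizes are all invented for illustration, and "compaction" is modelled simply as discarding the code of methods that are no longer in use.

```python
class CodeZone:
    """Toy fixed-size code cache; sizes are in arbitrary units (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.methods = {}  # selector -> size of generated code
        self.compaction_pending = False

    def used(self):
        return sum(self.methods.values())

    def try_compile(self, selector, size):
        """Return True if the method has machine code after this call."""
        if selector in self.methods:
            return True
        if self.used() + size > self.capacity:
            # Out of space: do not compact now, just remember to do it later.
            self.compaction_pending = True
            return False
        self.methods[selector] = size
        return True

    def event_check(self, still_hot):
        """Deferred compaction: discard code for methods not in `still_hot`."""
        if self.compaction_pending:
            self.methods = {s: n for s, n in self.methods.items() if s in still_hot}
            self.compaction_pending = False


def execute(zone, selector, size):
    """Run in machine code when possible, otherwise fall back to the interpreter."""
    return "machine code" if zone.try_compile(selector, size) else "interpreter"


zone = CodeZone(capacity=100)
print(execute(zone, "foo", 60))  # machine code
print(execute(zone, "bar", 60))  # interpreter (the zone is full)
zone.event_check(still_hot={"bar"})
print(execute(zone, "bar", 60))  # machine code
```

The send that found the zone full still makes progress in the interpreter, and nothing has to move while a call site is being linked.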

Oops!  I forgot to mention another important advantage.  With a pure JIT
then somehow the system has to cope with resuming execution at an arbitrary
bytecode.  In the Debugger one can step a context to a point that may not be
a resumption point in machine code.  In code generated by an optimizing JIT
that is doing "deferred code generation" (Ian Piumarta's term for
eliminating intermediate pushes and pops when generating machine code,
something HPS does too) there may only be resumption points for a small
subset of the bytecodes, e.g. the bytecode following a send, which maps to
the return from a call in machine code.  If you have a look at the VW
debugger you'll see that it has the responsibility of stepping a context up
to a bytecode pc that corresponds to a resumption point in machine code.
 This code is, um, opaque, and fragile, because it constitutes an unwritten
contract between the image and the VM.

 With an interpreter, however, the VM can simply resume execution in the
interpreter if a context is not at a point that corresponds to a resumption
point in machine code.  Nice.  (and this advantage would also apply in SIStA
where the interpreter would still be capable of interpreting optimized
bytecode, apologies if this doesn't make sense; there's a lot of background
I'm skipping here).
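A minimal sketch of that decision, assuming a per-method table that maps the bytecode pcs which are resumption points to machine-code pcs. The table layout, the function name and the pc values are hypothetical, not what Cog or HPS actually store.

```python
def resume(context_pc, resumption_points):
    """Decide where a context suspended at bytecode `context_pc` resumes.

    `resumption_points` maps bytecode pcs to machine-code pcs, and has
    entries only for bytecodes that are resumption points in machine code.
    """
    machine_pc = resumption_points.get(context_pc)
    if machine_pc is None:
        # No machine-code equivalent for this pc: carry on in the interpreter.
        return ("interpreter", context_pc)
    return ("machine code", machine_pc)


# Hypothetical method in which only the bytecodes following sends map.
resumption_points = {12: 0x40, 20: 0x58}
print(resume(12, resumption_points))  # ('machine code', 64)
print(resume(14, resumption_points))  # ('interpreter', 14)
```

The debugger can leave a context at any bytecode it likes; the image no longer has to know which pcs are safe.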

>>  > To arrange that returns from machine-code frames don't have to check for
>> a
>>  > return to the interpreter the interpreter saves its own instruction
>> pointer
>>  > in a slot in its frame, and substitutes the address of a routine that
>>  > handles returning to the interpreter as the return address.  So when
>> the
>>  > machine code frame returns it'll return to the trampoline that
>> retrieves the
>>  > saved instruction pointer from the slot and longjmps back to the
>>  > interpreter.  I use a longjmp to avoid stack growth in any dance between
>>  > interpreter and machine code.
>>  >
>>  >
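The effect of the longjmp on stack growth can be shown with an analogy. The sketch below is not how Cog does it: a Python exception stands in for longjmp, and two plain functions stand in for the interpreter and for machine code. All names are invented for illustration. Each switch unwinds to a fixed dispatch point instead of nesting a call, so the stack depth at every entry is the same however long the dance goes on.

```python
import sys


def stack_depth():
    """Count the Python frames currently on the stack (CPython-specific)."""
    depth, frame = 0, sys._getframe()
    while frame is not None:
        depth, frame = depth + 1, frame.f_back
    return depth


class EnterWorld(Exception):
    """Non-local exit, standing in for longjmp back to a fixed dispatch point."""

    def __init__(self, world, hops_left):
        super().__init__()
        self.world = world
        self.hops_left = hops_left


def interpreter(hops_left, depths):
    depths.append(stack_depth())
    if hops_left:
        raise EnterWorld(machine_code, hops_left - 1)


def machine_code(hops_left, depths):
    depths.append(stack_depth())
    if hops_left:
        raise EnterWorld(interpreter, hops_left - 1)


def run(hops):
    """Bounce between the two worlds `hops` times; return the depth at each entry."""
    depths, world, hops_left = [], interpreter, hops
    while world is not None:
        try:
            world(hops_left, depths)
            world = None  # finished without switching
        except EnterWorld as jump:
            world, hops_left = jump.world, jump.hops_left
    return depths


depths = run(1000)
print(len(depths), len(set(depths)))  # 1001 1
```

Had each world called the other directly, the depth would have grown by one frame per switch.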
>>  > > Also the send/return sequence isn't yet that optimised, there's still
>>  > > plenty of inefficiencies due to lack of addressing modes etc and
>> because
>>  > > it's a fairly naive translation of the interpreter's send code.
>>  >
>>  >
>>  > I would sit down and draw what you want the calling convention to look
>> like
>>  > and do it sooner rather than later.  This is crucial to overall
>> performance
>>  > until you solve the more difficult problem of doing significant
>> adaptive
>>  > optimization so that call/return is eliminated.  However, as mentioned
>>  > previously eliminating call/return isn't such a great idea per se and
>> so
>>  > keeping a fast call/return sequence is probably a very good idea
>> anyway.
>>  >  It's not as if modern processors don't do call/return well.
>>
>> For Exupery to make sense it needs significant adaptive optimisation
>> to expose enough code to allow heavy optimisation to justify the
>> slower compiler. Otherwise it would make more sense to move the
>> compiler into the VM and allow compilation for all execution like VW.
>>
>> The current system's primary goal is to enable the development of the
>> adaptive optimisation. I'll tune to provide decent performance now but
>> not if it makes it harder to add key features later.
>
>
> AOStA/SIStA is I think a quicker route.  I like your suggestion of looking
> at moving Exupery to the stack VM.  Even though I would put the back-end
> code generator in the VM I would love to use Exupery's front-end in
> AOStA/SIStA.
>
>> I've thought about using a context stack several times over the years.
>> The key benefits are faster returns
>
>
> and much faster sends
>
>
>> and possibly faster
>> de-optimisation of inlined contexts. After inlining it's possible that
>> a code change will break an optimisation. If an inlined method is
>> modified then it will continue to be entered until all contexts have
>> died or it is actively removed. Removing inlined contexts from object
>> memory requires a full memory scan to find them (allInstances).
>>
>
> There's another key advantage and that is the freedom to lay out an
> optimized stack.  A key goal of AOStA/SIStA is unboxing floating-point values
> and mapping them to floating-point registers.   I like the idea of having an
> abstract model, represented by an OptimizedContext, with two stacks, an
> object stack and a byte-data stack for raw data (unboxed floating-point,
> untagged integers, etc).  That would be good for the debugger and
> deoptimization.  But in the VM one would want to map these two different
> stacks to a single machine stack with info for the GC as to what are object
> references and what are not (a la Java stack frames), and have the active
> frame keep much of this unboxed state in machine registers.
>
>>  > > 17% would be rather optimistic, some of the work required to set up a
>>  > > context will always be required. Temporaries will still need to be
>>  > > nilled out etc.
>>  >
>>  >
>>  > Aim higher :)  I'm hoping for 10x current Squeak performance for
>>  > Smalltalk-intensive benchmarks some time later this year.
>>
>> My original and current aim is double VW's performance or be roughly
>> equivalent to C.
>
>
> And what's the state of this effort?  What metrics lead you to believe you
> can double VW's performance?  Where are you currently?  What do you mean by
> "equivalent to C", unoptimized, -O1, -O4 -funroll-loops, gcc, Intel C
> compiler?  What benchmarks have you focussed on?
>
> Bryce
>>
>
> Best
> Eliot
>