[Vm-dev] Re: [squeak-dev] Short-circuiting comparisons (booleanCheat) via special selectors [Was Using #= for integer comparison instead of #==]

Eliot Miranda eliot.miranda at gmail.com
Sat Nov 20 18:48:11 UTC 2010


On Sat, Nov 20, 2010 at 9:32 AM, Colin Putney <colin at wiresong.com> wrote:

>
> On Sat, Nov 20, 2010 at 8:19 AM, Eliot Miranda <eliot.miranda at gmail.com>
> wrote:
>
> > Yes, I agree. But I don't think that AST vs bytecode really has anything
> > to do with it; they can be easily transformed into each other (via
> > decompiler & compiler). The bytecode is a convenient form because it is
> > compact and can be efficiently interpreted. The issue is *when* and
> > *where* to spend the cycles trying to optimise aggressively. That's
> > where performance counters come in. If one decorates the jitted code
> > with e.g. a taken and untaken count at each conditional branch, then
> > when these counters trip one suspends execution, examines the current
> > call stack, collecting concrete type information from inline caches, and
> > optimises several nested activations into a single large method that is
> > worth optimising with traditional static techniques (good register
> > allocation etc.). If one tries to optimise everything the system becomes
> > unresponsive (see Craig Chambers' Self 2 compiler). If one defers
> > optimization until finding a "hot spot" things work much better (see Urs
> > Hölzle's Self 3 compiler, HotSpot et al).
>
> > So
> >
> > - keep bytecode and an interpreter for compactness, portability and the
> >   ability to always fall back on the interpreter (e.g. when the JIT runs
> >   out of memory during some tricky relinking operation)
> >
> > - use a simple JIT to optimize code run more than once, that does a
> >   reasonable job of stack-to-register mapping, implements PICs to
> >   collect type info and performance counters to collect block usage and
> >   invoke the aggressive optimizer
> >
> > - use a speculative inliner and an aggressive optimiser to inline code
> >   based on hot spots, basic block counts, and PIC info, and optimize it
> >   using traditional techniques.
> >
> > All of the above exists in various production VMs; AFAIA none all in the
> > same place. So the above is arguably a proven architecture. Hence it is
> > my direction (and Marcus', and I hope yours too). I can send you my
> > architectural sketch if you're interested.
>
> That would seem to describe Strongtalk pretty well, no? Or does it collect
> type and profile information directly from the interpreter?
>

We should check with Steve Rees but I think you're right.  As I understand
it the Strongtalk VM generated threaded code operations at startup and
jitted bytecode to threaded code.  PICs and method activation counters were
in the threaded code.  The optimizing compiler then examined threaded code.
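To make the counter-tripping idea concrete, here is a minimal sketch in Python of a per-branch taken/untaken counter that fires a one-shot callback into an optimizer once a threshold is crossed. All the names (BranchCounter, TRIP_THRESHOLD, on_trip) and the threshold value are illustrative, not taken from Strongtalk or any actual VM:

```python
# Sketch: each conditional branch in jitted code carries taken/untaken
# counts; when their sum crosses a threshold, the VM would suspend and
# hand the hot site to the optimizer (here, just a callback).

TRIP_THRESHOLD = 1000  # hypothetical trip point


class BranchCounter:
    def __init__(self, on_trip):
        self.taken = 0
        self.untaken = 0
        self.on_trip = on_trip   # callback into the (hypothetical) optimizer
        self.tripped = False

    def record(self, was_taken):
        if was_taken:
            self.taken += 1
        else:
            self.untaken += 1
        if not self.tripped and self.taken + self.untaken >= TRIP_THRESHOLD:
            self.tripped = True   # trip only once per site
            self.on_trip(self)    # e.g. walk the stack, consult PICs


hot_sites = []
c = BranchCounter(on_trip=hot_sites.append)
for i in range(1500):
    c.record(i % 3 == 0)  # simulate a branch that is mostly untaken
print(len(hot_sites), c.taken, c.untaken)  # -> 1 500 1000
```

The taken/untaken split matters because it doubles as profile data: a branch that trips with a lopsided ratio tells the optimizer which path through the method is actually hot.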

> Also, my impression was that Strongtalk, HotSpot et al did inlining *down*
> the stack. That is, it would find a method that was activated frequently,
> then inline whatever sends and block activations it performed to get a
> large, statically optimizable method.
>

Right. But it starts from the activation in which the counter trips.
 Finding out how far down to look is a heuristic.  In Self 3 if it tripped
within a block activation or a send within a block activation (or some small
number of activations down) it would choose the home context of that block.
 But I remember talking with David Griswold and IIRC he said in Strongtalk
it always compiles just one method out so that the inlining repeats
incrementally.  (I think).  IIRC, he said that this was the best heuristic,
better than Self 3's.
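The two root-selection heuristics can be contrasted in a toy sketch. Assume a call stack is a list of activation records and the counter tripped in the top one; the Self 3 style described above jumps to a block's home context, while the Strongtalk style (as recalled here) always steps just one frame out so that inlining proceeds incrementally. The frame representation and function names are invented for illustration:

```python
# Contrast of two recompilation-root heuristics when a counter trips in
# stack[-1]. Frames are dicts: "is_block" marks a block activation and
# "home" is the stack index of that block's home method's activation.

def self3_root(stack):
    # Self-3-like: a tripped block activation recompiles from its home context.
    top = stack[-1]
    return top["home"] if top["is_block"] else len(stack) - 1

def strongtalk_root(stack):
    # Strongtalk-like (per the recollection above): always one method out,
    # clamped at the base of the stack, so inlining repeats incrementally.
    return max(len(stack) - 2, 0)

# Frame 2 is a block activation whose home context is frame 0.
stack = [{"is_block": False, "home": None},
         {"is_block": False, "home": None},
         {"is_block": True,  "home": 0}]
print(self3_root(stack), strongtalk_root(stack))  # -> 0 1
```

The difference is visible even in this tiny example: the Self-3-style choice commits to a large recompilation unit at once, whereas the one-method-out choice grows the inlined region a frame at a time, retesting the counters between steps.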


> What you mention above is slightly different, IIUC, in that you're
> finding a hot spot
> based on counters in basic blocks, then looking up the stack to find a
> method
> that can be aggressively optimized. This sounds a bit like the
> trace-based inlining
> that Mozilla used for their Javascript, in that you're effectively
> choosing to optimize
> a particular path through the code rather than a particular method.
> Thoughts?
>

Not really.  It is the same with both the activation counter and the basic
block counter.  If you look one method down at the caller (i.e. try to
inline the method where the counter tripped into its caller) then that
caller may have other
sends in it that one could profitably inline (e.g. if the basic block
counters said those sends were often executed).  One can also try and inline
sends made in the method where the counter tripped.  Visualise it like a
tree (the call graph) and you've got a counter tripped at some node.  One
searches for a subgraph of the tree guided by counts and heuristics to
prevent one looking at too much code, repeatedly looking at unoptimizable
code etc, and chooses to optimize the entire subgraph, mapping the current
execution to the right point in the supermethod one produces.  You can
traverse the graph towards the root by following the call stack and towards
the leaves by following sends through PICs.  So one can indeed explore a
subgraph, not just a single path.
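A rough sketch of that subgraph selection: starting from the node where the counter tripped, walk a bounded number of frames up toward the root via caller links, and down toward the leaves via the callees a PIC would have recorded, keeping only well-executed sends and capping the total size. The node representation, thresholds, and field names are all made up for illustration:

```python
# Choose an inlining subgraph around the node where a counter tripped:
# expand toward the root (call stack) and toward the leaves (PIC-recorded
# callees), guided by counts and bounded by simple heuristics.

def choose_subgraph(trip, count_threshold=100, max_nodes=8, max_up=2):
    chosen = {trip["name"]}
    # Up: follow the call stack toward the root, a bounded number of hops.
    node, hops = trip, 0
    while node.get("caller") and hops < max_up:
        node = node["caller"]
        chosen.add(node["name"])
        hops += 1
    # Down: follow hot sends (as a PIC would reveal) toward the leaves.
    work = [trip]
    while work and len(chosen) < max_nodes:
        n = work.pop()
        for callee in n.get("callees", []):
            if callee["count"] >= count_threshold and callee["name"] not in chosen:
                chosen.add(callee["name"])
                work.append(callee)
    return chosen

leaf_hot  = {"name": "leafHot",  "count": 500, "callees": []}
leaf_cold = {"name": "leafCold", "count": 3,   "callees": []}
outer = {"name": "outer", "count": 0, "callees": []}
trip  = {"name": "trip", "caller": outer, "count": 0,
         "callees": [leaf_hot, leaf_cold]}
print(sorted(choose_subgraph(trip)))  # -> ['leafHot', 'outer', 'trip']
```

Note that the result is a subgraph, not a path: the cold leaf is excluded, while both the caller and the hot callee are pulled into the same recompilation unit.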

The trace-based thing looks like optimizing wearing blinkers to me.

best
Eliot

>
> Colin
>