[Vm-dev] Re: [squeak-dev] Short-circuiting comparisons (booleanCheat) via special selectors [Was Using #= for integer comparison instead of #==]

Eliot Miranda eliot.miranda at gmail.com
Sat Nov 20 16:19:54 UTC 2010


Hi Igor,

On Sat, Nov 20, 2010 at 1:51 AM, Igor Stasenko <siguctua at gmail.com> wrote:

>
> On 20 November 2010 01:43, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> >
> > Hi Igor,
> >
> > On Fri, Nov 19, 2010 at 2:50 PM, Igor Stasenko <siguctua at gmail.com>
> wrote:
> >>
> >> i think that with aggressive optimization, this loop can be turned into
> no-op.
> >> or more precisely to an instruction which sets i to 100000001 :)
> >
> >
> > Are you missing the point, which is to try and generate efficient loop
> code, not to optimize empty loops?  I stripped the example down to the
> smallest possible code (an empty loop) just to show how much better VW code
> is and how limited the effect of merely doing the booleanCheat:.  But loops
> that /can't/ be optimized away are very common in the system, and right now
> Cog isn't doing a great job at making these go fast.
> >
> >> But i wonder what kind of analysis should be applied to determine that
> >> the loop is bound
> >> to 100000000, and there are no side effects other than incrementing the
> counter.
> >> Btw, i think it is much easier to optimize the original
> >> 1 to: 100000 do:
> >> since you know the loop bounds beforehand, and need only to analyze if
> >> the loop body has any side effects.
> >
> > Imagine the loop had something meaningful in it such as Set>>do:.
> >>
> >> In that way, it is better to analyze the code at the AST level than at
> >> the bytecode level, since once you turn it into bytecode,
> >> you lose the precious information that the loop has bounds and have no
> >> choice but to strictly follow bytecode semantics.
> >> While of course, Cog has not much choice, since it can't operate on the
> >> AST, just the bytecode.
> >
> > Yes, but you're missing the point.  One can still generate faster jitted
> code than Cog is producing now.  My message was trying to show how much more
> there is to be gained by a better jit code generator, not about higher-level
> optimization.
>
> Yes, there is space for improvement.
> My point was that you have even more potential for optimization when
> working at the AST level rather than the bytecode level.
> And one of my intents is to create a compiler that transforms the
> AST directly into native code.
>
> For example, the loop bounds, since they are smallint constants, could
> be turned into machine integers, so code like the following:
>
> 32 <76> pushConstant: 1
> a1ee2: movl $0x00000003, %eax : B8 03 00 00 00
> a1ee7: pushl %eax : 50
> 33 <B0> send: +
> a1ee8: movl %ss:0x4(%esp), %edx : 8B 54 24 04
> a1eec: movl $0x0045ae64=#+, %ecx : B9 64 AE 45 00
> a1ef1: call .+0xfff5e59a (0x00000490=ceSend1Args) : E8 9A E5 F5 FF
> IsSendCall:
> a1ef6: pushl %edx : 52
> 34 <68> popIntoTemp: 0
> a1ef7: popl %eax : 58
>
> could be replaced just by
>
> inc %eax
>
> and if the loop counter is passed somewhere where it can't use the
> optimized value, then it can be turned back into a smallint, like:
>
> shl %eax,1
> inc %eax
>

Yes, I agree.  But I don't think that AST vs bytecode really has anything to
do with it; they can be easily transformed into each other (via decompiler &
compiler).  The bytecode is a convenient form because it is compact and can
efficiently be interpreted.  The issue is *when* and *where* to spend the
cycles trying to optimise aggressively.  That's where performance counters
come in.  If one decorates the jitted code with e.g. a taken and untaken
count at each conditional branch then when these counters trip one suspends
execution, examines the current call stack, collecting concrete type
information from inline caches, and optimises several nested activations
into a single large method that is worth optimising with traditional static
techniques (good register allocation etc).  If one tries to optimise
everything the system becomes unresponsive (see Craig Chambers' Self 2
compiler).  If one defers optimization until finding a "hot spot" things
work much better (see Urs Hölzle's Self 3 compiler, HotSpot et al).

So

- keep bytecode and an interpreter for compactness, portability and the
ability to always fall back on the interpreter (e.g. when the JIT runs out
of memory during some tricky relinking operation)

- use a simple JIT, for code run more than once, that does a reasonable job
of stack-to-register mapping, implements PICs to collect type info, and uses
performance counters to collect block usage and invoke the aggressive
optimizer

- use a speculative inliner and an aggressive optimiser to inline code based
on hot spots, basic block counts, and PIC info, and optimize it using
traditional techniques.

All of the above exists in various production VMs, but as far as I'm aware
none has it all in the same place.  So the above is arguably a proven
architecture.  Hence it is my direction (and Marcus', and I hope yours too).
I can send you my architectural sketch if you're interested.

best
Eliot

>
>
> > best
> > Eliot
>
>
>
> --
> Best regards,
> Igor Stasenko AKA sig.
>

