[Vm-dev] A BlockClosure optimization

Eliot Miranda eliot.miranda at gmail.com
Sun May 16 04:14:10 UTC 2010


On Sat, May 15, 2010 at 7:16 PM, Igor Stasenko <siguctua at gmail.com> wrote:

>
> On 16 May 2010 04:33, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> >
> >
> >
> > On Sat, May 15, 2010 at 11:02 AM, Igor Stasenko <siguctua at gmail.com>
> wrote:
> >>
> >> On 15 May 2010 20:12, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> >> >
> >> >
> >> >
> >> > On Sat, May 15, 2010 at 10:02 AM, Igor Stasenko <siguctua at gmail.com>
> wrote:
> >> >>
> >> >> On 15 May 2010 19:35, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> >> >> >
> >> >> > Hi Igor,
> >> >> > On Fri, May 14, 2010 at 10:14 PM, Igor Stasenko <
> siguctua at gmail.com> wrote:
> >> >> >>
> >> >> >> Hello,
> >> >> >>
> >> >> >> I just thought that we could optimize the closure-copy mechanism
> >> >> >> to reuse a closure (a BlockClosure instance) which was created
> before
> >> >> >> for the same context.
> >> >> >
> >> >> >     that's a good one.  Also good is precomputing closures for
> blocks that don't capture their dynamic environment (don't close over any
> variables and don't include an ^-return; VW parlance "clean blocks").
>  Another one, which requires a new bytecode set/VM, is to not reify the
> current context for blocks that don't contain ^-returns (VW parlance
> "copying blocks").  But these last two should be preferences since they
> affect debugging (within a block so optimized one can't discover its
> origin).
> >> >> > (VW parlance for normal blocks is "full blocks"; all blocks in my
> closure compiler are full, so the current context must be reified.  This is
> not an issue in the non-Cog VMs as it's already there, but it is an issue in
> a faster VM; it often means two allocations instead of one.)
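> >> >> >
> >> >> > To make the distinction concrete, a sketch (the method names and
> >> >> > the dict variable are made up):
> >> >> >
> >> >> >     lookup: key
> >> >> >         "clean block: captures no outer state and has no ^-return,
> >> >> >          so its closure could be precomputed once"
> >> >> >         ^dict at: key ifAbsent: [0]
> >> >> >
> >> >> >     lookupOrGiveUp: key
> >> >> >         "full block: the ^-return must return from this method's
> >> >> >          activation, so the closure must capture the home context"
> >> >> >         ^dict at: key ifAbsent: [^nil]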
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> The optimization mechanism can be illustrated by the following code.
> >> >> >>
> >> >> >> Suppose you have a method which uses a loop:
> >> >> >>
> >> >> >> myMethod
> >> >> >>
> >> >> >>  1 to: 100 do: [:i |
> >> >> >>   dict at: i ifAbsent: [ foo bar ] ]
> >> >> >>
> >> >> >> The idea is to copy the closure from the method's literal frame
> just once,
> >> >> >> store it into a temp,
> >> >> >> and then reuse it like the following:
> >> >> >>
> >> >> >> myMethod
> >> >> >> | closure |
> >> >> >>  1 to: 100 do: [:i |
> >> >> >>   dict at: i ifAbsent: (closure ifNil: [ closure := [ foo bar ] ]
> ) ]
> >> >> >>
> >> >> >> ----------
> >> >> >>
> >> >> >> A simple benchmark shows that we could gain from it:
> >> >> >>
> >> >> >> [ 1000000 timesRepeat: [ [ 1+1] value ] ] timeToRun
> >> >> >> 670
> >> >> >>
> >> >> >> [
> >> >> >> | closure |  closure := nil.
> >> >> >> 1000000 timesRepeat: [
> >> >> >>        (closure ifNil: [ closure := [ 1+1] ]) value ]
> >> >> >> ] timeToRun
> >> >> >> 518
> >> >> >>
> >> >> >> As you can see, even implemented in Smalltalk (without any changes
> to
> >> >> >> the VM) it shows
> >> >> >> a significant performance boost.
> >> >> >
> >> >> > That's what's nice about this optimization.  It doesn't require any
> VM modifications ;)
> >> >> >
> >> >>
> >> >> Yes, it doesn't.  The compiler can detect whether a closure-push
> bytecode is found
> >> >> in a loop (a bytecode block having a backwards jump)
> >> >> and then add a temp and emit the bytecodes to (re)use that temp.
> >> >>
> >> >> A potential VM modification could be to add a bytecode like
> >> >> "pushBlockClosure: litIndex orReuseTemp: tempIndex",
> >> >> which would be equivalent to the piece: closure ifNil: [ closure
> := [ 1+1] ].
> >> >>
> >> >> Too bad, this technique can't be used for nested blocks, since they
> >> >> have to use a different 'outerContext'
> >> >> for each activation of the outer closure. :(
> >> >
> >> > Is that true?  Isn't the real issue that if any niladic block contains
> an ^-return then that block, and any blocks it is nested within, cannot be
> shared?  It would be good to try to state the invariants in formal English
> before you implement.  The implementation will be much better as a result.
> >>
> >> My English is not good enough to explain what I mean :)
> >>
> >> I can show this with a code:
> >>
> >> 1 to: 1000 do: [ :i |    dict at: i ifAbsent: [  | x |  x := self foo:
> >> i. dict2 at: i ifAbsent: [ x ] ] ]
> >>
> >> Here, the outer closure "[  | x | ... ]" can use caching,
> >> while the inner "[ x ]" can't, because it uses a temp created in the
> >> outer block.
> >> This is because each activation of the outer block creates a new context.
> >>
> >> Sure thing, for blocks which don't refer to the state of the outer
> >> context, it should still be possible.
> >> And even if a nested block refers to the state of the method's context
> >> (such as self or its temps):
> >>
> >> 1 to: 1000 do: [ :i |    dict at: i ifAbsent: [ dict2 at: i ifAbsent:
> >> [ self foo: i ] ] ]
> >>
> >> could be written as:
> >>
> >> | {i} closureInner closureOuter |
> >>
> >> closureInner := [self foo: i].
> >> closureOuter := [ dict2 at: i ifAbsent: closureInner ].
> >> 1 to: 1000 do: [ :i |  dict at: i ifAbsent: closureOuter ].
> >>
> >>
> >> Yes, I know that the above is incorrect code,
> >> but take into account that the temp 'i' is actually moved to the
> >> method's scope, because the #to:do: loop is inlined by the compiler;
> >> hence I placed {i} to indicate that.
> >>
> >>
> >> >>
> >> >> >>
> >> >> >> Of course, if we put something which loads the processor with real
> work,
> >> >> >> instead of just [1+1],
> >> >> >> the difference will be less significant.
> >> >> >>
> >> >> >> But apart from this, there is the fact that copying the closure
> object each
> >> >> >> time means a memory allocation,
> >> >> >> and hence leads to more frequent GC.
> >> >> >
> >> >> > What real codes have you seen the costs in?  I think they're there
> (VisualWorks went to some effort to reduce costs using the two other
> optimizations I described) but how big?  In any case you should implement
> this and see whether any useful benchmarks (e.g. system recompilation) show
> measurable speedups.
> >> >>
> >> >> I never went exploring the bytecode emission code in
> >> >> Encoder / MethodNode / BytecodeAgnosticMethodNode.
> >> >> It is hellishly complex and it's easy to get lost there. :)
> >> >> But I could try.
> >> >
> >> > That's not what I meant.  I simply meant that you need to measure the
> effect of your optimization (caching blocks used in loops in a temp) on real
> use cases, e.g. recompiling the entire system, which is a decent
> benchmark.
> >>
> >> Please elaborate what you mean, then.  Do you mean manually rewriting
> >> the code in appropriate places
> >> in the Compiler toolchain, where it's possible?
> >
> > No.  I mean measure how long it takes to recompile the system *without*
> your optimization, then recompile the system using your optimization,
> then measure how long it takes to recompile the system *with* your
> optimization.  i.e. measure how much effect your optimization has on
> real benchmarks.  Then we can tell if it is worth using in general.
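> >
> > For example, following your timeToRun benchmarks (assuming Squeak's
> > Compiler class; adjust for your image):
> >
> >     "evaluate once on the stock compiler, then once with the
> >      optimization installed, and compare the two results"
> >     [ Compiler recompileAll ] timeToRun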
>
> Right, but in order to "recompile the system using my optimization" I
> need to implement it first :)
>
> And it sounds like you believe that it can be easily added; if so,
> then please give me some clues about where I should start.
> Maybe I am looking at the wrong place, or at the same place (as a whole)
> but at the wrong section.
>

Well, it needs to be part of the closure analysis which starts in
BytecodeAgnosticMethodNode>>generate: where it sends self
ensureClosureAnalysisDone.  To understand the closure analysis read my blog
post <http://www.mirandabanda.org/cogblog/2008/07/24/closures-part-iii-the-compiler/>
and the code.  The blog post is a little out of date now but things haven't
changed too much.  There's a long comment
in BlockNode>>analyseArguments:temporaries:rootNode:.  To avoid wrecking the
compiler while you work on it consider cloning it, using e.g.
Encoder>>globalSourceRanges (which is used
in Environment>>rewriteSourceForSelector:inClass:using:) to rename the
relevant globals.  You can see the use of fixTemp:
and BytecodeAgnosticMethodNode>>primitiveErrorVariableName for how to
allocate a temp (provided there are spares available).
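
To locate these starting points in the image, one can browse the
implementors, e.g. (assuming Squeak's SystemNavigation):

    SystemNavigation default browseAllImplementorsOf: #ensureClosureAnalysisDone.
    SystemNavigation default browseAllImplementorsOf: #globalSourceRanges.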

There are other options (the NewCompiler, OMeta etc.) but the above may be
the most direct.


>
> > best
> > Eliot
>
> --
> Best regards,
> Igor Stasenko AKA sig.
>