[Vm-dev] A BlockClosure optimization

Eliot Miranda eliot.miranda at gmail.com
Sun May 16 01:33:44 UTC 2010


On Sat, May 15, 2010 at 11:02 AM, Igor Stasenko <siguctua at gmail.com> wrote:

>
> On 15 May 2010 20:12, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> >
> >
> >
> > On Sat, May 15, 2010 at 10:02 AM, Igor Stasenko <siguctua at gmail.com>
> wrote:
> >>
> >> On 15 May 2010 19:35, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> >> >
> >> > Hi Igor,
> >> > On Fri, May 14, 2010 at 10:14 PM, Igor Stasenko <siguctua at gmail.com>
> wrote:
> >> >>
> >> >> Hello,
> >> >>
> >> >> I just thought that we could optimize the closure-copy mechanism
> >> >> to reuse a closure (a BlockClosure instance) that was created before
> >> >> for the same context.
> >> >
> >> >     that's a good one.  Also good is precomputing closures for blocks
> that don't capture their dynamic environment (don't close over any variables
> and don't include an ^-return; VW parlance "clean blocks").  Another one,
> which requires a new bytecode set/VM, is to not reify the current context
> for blocks that don't contain ^-returns (VW parlance "copying blocks").  But
> these last two should be preferences, since they affect debugging (within a
> block so optimized one can't discover its origin).
> >> > (VW parlance for normal blocks is "full blocks"; all blocks in my
> closure compiler are full, so the current context must be reified.  That's not
> an issue in the non-Cog VMs, as it's already there, but it is an issue in a
> faster VM, where it often means two allocations instead of one.)
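> >> > A minimal Smalltalk sketch of the three kinds described above (method
> >> > names invented for illustration):
> >> >
> >> > ```smalltalk
> >> > "Clean block: closes over nothing and has no ^-return; its closure
> >> >  could in principle be created once and cached in the literal frame."
> >> > cleanExample
> >> > 	^[:a :b | a + b]
> >> >
> >> > "Copying block: reads an outer temp (x) but has no ^-return; the
> >> >  value of x could be copied in without reifying the home context."
> >> > copyingExample
> >> > 	| x |
> >> > 	x := 42.
> >> > 	^[:a | a + x]
> >> >
> >> > "Full block: the ^-return forces reification of the home context."
> >> > fullExample: aCollection
> >> > 	aCollection do: [:each | each isNil ifTrue: [^nil]].
> >> > 	^aCollection
> >> > ```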
> >> >
> >> >
> >> >>
> >> >> The optimization can be illustrated by the following code.
> >> >>
> >> >> Suppose you have a method which uses a loop:
> >> >>
> >> >> myMethod
> >> >>
> >> >>  1 to: 100 do: [:i |
> >> >>   dict at: i ifAbsent: [ foo bar ] ]
> >> >>
> >> >> The idea is to copy the closure from the method's literal frame just
> >> >> once, store it in a temp,
> >> >> and then reuse it as follows:
> >> >>
> >> >> myMethod
> >> >> | closure |
> >> >>  1 to: 100 do: [:i |
> >> >>   dict at: i ifAbsent: (closure ifNil: [ closure := [ foo bar ] ] ) ]
> >> >>
> >> >> ----------
> >> >>
> >> >> A simple benchmark shows what we could gain from it:
> >> >>
> >> >> [ 1000000 timesRepeat: [ [ 1+1] value ] ] timeToRun
> >> >> 670
> >> >>
> >> >> [
> >> >> | closure |  closure := nil.
> >> >> 1000000 timesRepeat: [
> >> >>        (closure ifNil: [ closure := [ 1+1] ]) value ]
> >> >> ] timeToRun
> >> >> 518
> >> >>
> >> >> As you can see, even implemented in Smalltalk (without any changes to
> >> >> the VM) it shows
> >> >> a significant performance boost.
> >> >
> >> > That's what's nice about this optimization.  It doesn't require any VM
> modifications ;)
> >> >
> >>
> >> Yes, it doesn't.  A compiler can detect when a closure-push bytecode
> occurs
> >> in a loop (a bytecode block having a backward jump)
> >> and then add a temp and emit the bytecodes to (re)use that temp.
> >>
> >> A potential VM modification could be to add a bytecode like
> >> "pushBlockClosure: litIndex orReuseTemp: tempIndex",
> >> which would be equivalent to the piece: closure ifNil: [ closure := [
> 1+1] ].
> >>
> >> Too bad this technique can't be used for nested blocks, since they
> >> have to use a different 'outerContext'
> >> for each activation of the outer closure. :(
> >
> > Is that true?  Isn't the real issue that if any niladic block contains an
> ^-return, then that block and any blocks it is nested within cannot be
> shared?  It would be good to try to state the invariants in formal English
> before you implement.  The implementation will be much better as a result.
>
> My English is not good enough to explain what i mean :)
>
> I can show this with a code:
>
> 1 to: 1000 do: [:i |
>     dict at: i ifAbsent: [ | x |
>         x := self foo: i.
>         dict2 at: i ifAbsent: [ x ] ] ]
>
> here, the outer closure "[ | x | ... ]" can use caching,
> while the inner "[ x ]" can't, because it uses a temp created in the
> outer block.
> This is because each activation of the outer block creates a new context.
>
> Sure, for blocks which don't refer to the state of the outer
> context, it should still be possible.
> And it should even be possible when the nested block refers to the state
> of the method's context (such as self or its temps):
>
> 1 to: 1000 do: [ :i |    dict at: i ifAbsent: [ dict2 at: i ifAbsent:
> [ self foo: i ] ] ]
>
> could be written as:
>
> | {i} closureInner closureOuter |
>
> closureInner := [self foo: i].
> closureOuter := [ dict2 at: i ifAbsent: closureInner ].
> 1 to: 1000 do: [ :i |  dict at: i ifAbsent: closureOuter ].
>
>
> yes, I know that the above is incorrect code,
> but take into account that the temp 'i' actually moves to the method's
> scope, because the #to:do: loop is inlined by the compiler;
> hence I placed {i} to indicate that.
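> With that caveat, and since #to:do: is inlined so that 'i' really is a
> method-level temp, a legal hand-written version of the caching might look
> like this sketch (explicit loop instead of the inlined #to:do:):
>
> ```smalltalk
> myMethod
> 	| i closureInner closureOuter |
> 	closureInner := [self foo: i].
> 	closureOuter := [dict2 at: i ifAbsent: closureInner].
> 	i := 1.
> 	[i <= 1000] whileTrue:
> 		[dict at: i ifAbsent: closureOuter.
> 		 i := i + 1]
> ```
>
> Both cached blocks close over the method's context only, so one copy of
> each suffices for all iterations.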
>
>
> >>
> >> >>
> >> >> Of course, if we put something that loads the processor with real
> >> >> work, instead of just [1+1],
> >> >> the difference will be less significant.
> >> >>
> >> >> But apart from this, there is the fact that copying the closure object
> >> >> each time means a memory allocation,
> >> >> and hence leads to more frequent GC.
> >> >
> >> > What real code have you seen the costs in?  I think they're there
> (VisualWorks went to some effort to reduce costs using the two other
> optimizations I described), but how big are they?  In any case you should
> implement this and see whether any useful benchmarks (e.g. system
> recompilation) show measurable speedups.
> >>
> >> I never went exploring the bytecode-emission code in
> >> Encoder / MethodNode / BytecodeAgnosticMethodNode.
> >> It is hellishly complex and it's easy to get lost there. :)
> >> But I could try.
> >
> > That's not what I meant.  I simply meant that you need to measure the
> effect of your optimization (caching blocks used in loops in a temp) on real
> use cases, such as recompiling the entire system, which is a decent
> benchmark.
>
> Please elaborate on what you mean, then.  Do you mean manually rewriting
> the code in appropriate places
> in the Compiler toolchain, where possible?
>

No.  I mean: measure how long it takes to recompile the system *without* your
optimization, then recompile the system using your optimization,
then measure how long it takes to recompile the system *with* your
optimization.  I.e. measure how much effect your optimization has on
real benchmarks.  Then we can tell whether it is worth using in general.
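
For concreteness, a measurement along these lines would do (assuming the
image's Compiler recompileAll; the exact selectors may differ between image
versions):

```smalltalk
"Rough sketch of the before/after measurement."
| before after |
before := [Compiler recompileAll] timeToRun.  "stock compiler"
"... file in the closure-caching compiler change, then:"
after := [Compiler recompileAll] timeToRun.   "with caching"
Transcript show: 'saved ', (before - after) printString, ' ms'; cr
```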

best
Eliot


> >>
> >> >>
> >> >> --
> >> >> Best regards,
> >> >> Igor Stasenko AKA sig.

