[Vm-dev] Eliot's BlockClosure model questions

Fri Aug 2 11:04:17 UTC 2013

I tried a simple jumpTo: over the block closure bytecode (no push:
false, jumpFalse:) and it works fine with Cog. It does not give a
significant speed-up, but it saves 1 bytecode. It is true that this
implementation wastes 5 bytes of bytecode (4 for the pushClosure, 1 for the
jump) compared with the previous one I did, but it works without VM
modification. There are 6250 clean blocks, so it wastes 12kb at worst. I
guess that is fine then.

I will do a cleaner implementation and integrate it into Pharo 3 in the next
few weeks (I will try before 15th August).

2013/8/2 Clément Bera <bera.clement at gmail.com>

> Hi Eliot.
>
> So I changed the implementation according to what you've just said and it
> works with Cog. I added a jump and a pushClosure bytecode which is never
> executed but makes the method JIT-compatible.
>
> exampleCleanBlock
> ^ [ 1  + 2 ]
>
> 17 <20> pushConstant: [...]
> 18 <72> pushConstant: false
> 19 <9F> jumpFalse: 28
> 20 <8F 00 00 04> closureNumCopied: 0 numArgs: 0 bytes 24 to 27
> 24 <76> pushConstant: 1
> 25 <77> pushConstant: 2
> 26 <B0> send: +
> 27 <7D> blockReturn
> 28 <7C> returnTop
>
> Here the BlockClosure in the literals has a startpc of 24, therefore the
> pushClosure bytecode can never be executed.
>
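To make the layout above concrete, here is a minimal sketch in Python (a toy model, not VM code). The instruction table is transcribed from the listing; the `run` function and its tuple encoding are hypothetical and only meant to show that method execution starting at pc 17 never reaches pc 20, while activating the literal closure at its startpc of 24 evaluates 1 + 2.

```python
# Toy model of the bytecode listing above; not actual VM code.
# Each entry maps a pc to (operation, operand), transcribed from the listing.
METHOD = {
    17: ("pushLiteral", "aCleanBlock"),   # <20> pushConstant: [...]
    18: ("pushConstant", False),          # <72>
    19: ("jumpFalse", 28),                # <9F>
    20: ("pushClosure", None),            # <8F 00 00 04>, 4 bytes, never executed
    24: ("pushConstant", 1),              # <76>
    25: ("pushConstant", 2),              # <77>
    26: ("sendPlus", None),               # <B0>
    27: ("blockReturn", None),            # <7D>
    28: ("returnTop", None),              # <7C>
}

def run(start_pc):
    """Interpret METHOD from start_pc; return (result, list of executed pcs)."""
    pcs = sorted(METHOD)
    stack, visited, pc = [], [], start_pc
    while True:
        op, arg = METHOD[pc]
        visited.append(pc)
        # Default successor is the next instruction in pc order.
        next_pc = pcs[pcs.index(pc) + 1] if pc != pcs[-1] else None
        if op in ("pushLiteral", "pushConstant"):
            stack.append(arg)
        elif op == "jumpFalse":
            if stack.pop() is False:
                next_pc = arg
        elif op == "pushClosure":
            raise AssertionError("pushClosure should be unreachable")
        elif op == "sendPlus":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op in ("blockReturn", "returnTop"):
            return stack.pop(), visited
        pc = next_pc

method_result, method_pcs = run(17)   # activating the method
block_result, block_pcs = run(24)     # activating the literal closure at its startpc
print(method_result, method_pcs)
print(block_result, block_pcs)
```

In this model the method returns the preallocated block after visiting pcs 17, 18, 19 and 28, and the block activation visits 24 to 27 only.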
> I will try to replace the jumpFalse with a jump. I didn't do it because Opal
> then detects the block bytecode as unreachable and removes it. I will
> then check whether it still works with the JIT (I don't know if the JIT has
> this unreachable-bytecode removal feature). I may gain some speed by not
> having to push false.
>
> The clean block is already definitely faster, at first look:
> OCOpalExamples >>#exampleCleanBlock
> ^ [ 1  + 2 ]
> foo := OCOpalExamples new.
> [ foo exampleCleanBlock ] bench (5x faster)
> [ foo exampleCleanBlock value ] bench (3.5 times faster)
>
> I can prepare you an image so you can have a look, but
> - Pharo 3 requires NativeBoost plugin to find environment variables so it
> may not work on your Cog builds
> - Pharo 3 is in alpha state which currently implies that the debugger is
> not stable
> - I need to clean it up first
> ...
>
> Anyway I'm happy to have it working.
>
>
>
> 2013/8/1 Eliot Miranda <eliot.miranda at gmail.com>
>
>>
>>
>> On Thu, Aug 1, 2013 at 10:15 AM, Eliot Miranda <eliot.miranda at gmail.com>wrote:
>>
>>
>>>
>>> On Thu, Aug 1, 2013 at 1:21 AM, Clément Bera <bera.clement at gmail.com>wrote:
>>>
>>>>
>>>> Hello Eliot,
>>>>
>>>> So I implemented clean blocks with Opal in Pharo 3. I didn't know where
>>>> to put the byte code of the clean block, so I put it at the end of the
>>>> method.
>>>>
>>>>  ex:
>>>> exampleCleanBlock
>>>> ^ [ 1  + 2 ]
>>>>
>>>> 17 <20> pushConstant: [...]
>>>> 18 <7C> returnTop
>>>> 19 <76> pushConstant: 1
>>>> 20 <77> pushConstant: 2
>>>> 21 <B0> send: +
>>>> 22 <7D> blockReturn
>>>>
>>>> having in the literal Array:
>>>> [ 1 + 2 ]
>>>> #exampleCleanBlock
>>>> OCOpalExamples
>>>>
>>>> The startpc of the block is 19.
>>>> Its outerContext is a context with nil as receiver and the method
>>>> OCOpalExamples>>#exampleCleanBlock.
>>>> Its numArgs is 0 and it has no copiedValues.
>>>>
>>>> But it does not work with the JIT.
>>>>
>>>
>> Thinking about it I'm pretty sure the problem is that the JIT scans for
>> and counts pushClosure: bytecodes to know how many blocks a method
>> contains, but clean blocks don't need pushClosure: bytecodes.  So the JIT
>> needs to look for clean blocks, e.g. either by scanning a method's literals
>> or by looking at the arguments of pushLiteral: bytecodes.  In any case the
>> image will allow me to develop a fix.
>>
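A rough Python sketch of that diagnosis, using the first clean-block layout quoted further down (pcs 17 to 22). The tuple encoding, the `CleanBlock` class and both counting functions are made up for illustration; this is an assumption about the shape of the problem, not how the JIT is actually written.

```python
# Illustrative only: a method is modeled as (literals, instructions).
from dataclasses import dataclass

@dataclass
class CleanBlock:
    startpc: int   # pc of the block's first bytecode inside the method

# Literal frame and bytecodes of the first exampleCleanBlock layout.
literals = [CleanBlock(startpc=19), "exampleCleanBlock", "OCOpalExamples"]
instructions = [
    (17, "pushLiteral", 0),    # pushes the preallocated clean block
    (18, "returnTop", None),
    (19, "pushConstant", 1),   # block body starts here
    (20, "pushConstant", 2),
    (21, "send", "+"),
    (22, "blockReturn", None),
]

def count_blocks_by_push_closure(instructions):
    """Count blocks as described above: one per pushClosure bytecode."""
    return sum(1 for _, op, _ in instructions if op == "pushClosure")

def count_blocks_including_literals(literals, instructions):
    """Also count clean blocks referenced by pushLiteral bytecodes."""
    clean = {
        arg for _, op, arg in instructions
        if op == "pushLiteral" and isinstance(literals[arg], CleanBlock)
    }
    return count_blocks_by_push_closure(instructions) + len(clean)

print(count_blocks_by_push_closure(instructions))
print(count_blocks_including_literals(literals, instructions))
```

Counting pushClosure bytecodes alone finds no block in this method, while also looking at the arguments of pushLiteral finds the one clean block.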
>>
>>
>>>> If I run:
>>>> OCOpalExamples new exampleCleanBlock value
>>>> I get 3 every time, which is fine. Now
>>>> 1 to: 5 do: [ :i |
>>>> OCOpalExamples new exampleCleanBlock value ]
>>>> works on the Stack VM, but crashes the Cog VM. I don't know why (not enough
>>>> knowledge about the Cog JIT).
>>>>
>>>> Do you have any clue?
>>>>
>>>
>>> no.  send me an image?
>>>
>>>
>>>>
>>>>
>>>>
>>>> 2013/7/31 Eliot Miranda <eliot.miranda at gmail.com>
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jul 30, 2013 at 1:56 PM, Clément Bera <bera.clement at gmail.com>wrote:
>>>>>
>>>>>>
>>>>>> Thanks for the answer, it was very helpful. I get it now.
>>>>>>
>>>>>> I had a look at the first posts of your blog (Closures I & II) when I
>>>>>> was working on the Opal compiler. Today I was looking at Under Cover
>>>>>> Contexts and the Big Frame-Up<http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/> and
>>>>>> I think I should read your whole blog.
>>>>>>
>>>>>> It is really nice that you wrote this blog; it is the main
>>>>>> documentation about an efficient Smalltalk VM. I learnt mostly by looking
>>>>>> at Cog's source. The VW VM source is closed, so... I will have a look at the
>>>>>> Strongtalk implementation instead, since it seems to be open source.
>>>>>>
>>>>>> Why are the clean blocks of VW much faster? Are they activated like
>>>>>> methods? I didn't find it in your blog (probably because it is not in Cog).
>>>>>> Is it possible to implement clean blocks in Pharo/Squeak? (I think that
>>>>>> 53% of the blocks not optimized by the compiler are clean in Pharo 3.)
>>>>>> Would it be worth it?
>>>>>>
>>>>>
>>>>> Clean blocks are faster because they don't access their outer
>>>>> environment and hence their outer context does not have to be created.  So
>>>>> there is no allocation associated with a clean block.  It exists already as
>>>>> a literal and its outer context does not have to be reified.  Normal
>>>>> closures are created when the point at which they are defined in method
>>>>> execution is reached (the pushClosure bytecode) and if the current context
>>>>> does not yet exist that must be instantiated too, so creating a closure
>>>>> usually takes two allocations.
>>>>>
>>>>> Clean blocks are activated like blocks.  Block and method activation
>>>>> is different in the first phase (the send side) but quite similar in the
>>>>> second phase (frame building).  In VW for example, finding the machine code
>>>>> method associated with a block involves a cache lookup which can be slow.
>>>>> In Cog, it involves following a pointer in the method header (internally, the
>>>>> VM replaces the header of a method with a pointer to its machine code) and
>>>>> then jumping to a hard-coded binary search which jumps to the correct
>>>>> block's entry-point depending on the closure's startpc.  If a method
>>>>> contains a single block then this is a direct jump.  As a result, block
>>>>> dispatch in Cog is typically faster than in VW.
>>>>>
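A toy Python model of the dispatch described above, as a sketch only. The startpc values and entry-point labels are invented, and the real dispatch is machine code generated by the JIT, so this only shows the shape of the lookup: a direct jump when the method has one block, a binary search on startpc otherwise.

```python
from bisect import bisect_left

def make_block_dispatch(entry_points):
    """entry_points: dict mapping a block's startpc to its (hypothetical) entry label."""
    startpcs = sorted(entry_points)
    if len(startpcs) == 1:
        only_entry = entry_points[startpcs[0]]
        return lambda startpc: only_entry      # single block: direct jump
    def dispatch(startpc):
        i = bisect_left(startpcs, startpc)     # binary search on startpc
        if i == len(startpcs) or startpcs[i] != startpc:
            raise KeyError(f"no block starts at pc {startpc}")
        return entry_points[startpcs[i]]
    return dispatch

dispatch = make_block_dispatch({24: "entryA", 41: "entryB", 57: "entryC"})
print(dispatch(41))

single = make_block_dispatch({24: "entryA"})
print(single(24))
```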
>>>>> Yes, it is possible to implement clean blocks.  It is only an issue to
>>>>> do with the representation of closures.  Ideally they need a method inst
>>>>> var, making the outerContext inst var optional (or at least nil in a clean
>>>>> block).  But that would require a change to BlockClosure's class definition
>>>>> and a VM change.  To avoid having to change the class definition of
>>>>> BlockClosure and the VM, the compiler could create an empty context to hold
>>>>> onto the method, and that would work fine.  So to implement clean blocks
>>>>> the compiler would instantiate a BlockClosure literal for each clean block
>>>>> and a MethodContext whose receiver was nil shared between all the clean
>>>>> blocks in a method.  There are tricky issues such as setting breakpoints in
>>>>> methods (toggle break on entry), or copying methods, which would require
>>>>> scanning the literals for clean blocks and duplicating them and their
>>>>> outerContext too.  But that's just detail.  Some time I must try this for
>>>>> Squeak.  Let me know if you try it for Opal.  (and of course I'm very
>>>>> happy to help with advice).
>>>>>
>>>>> I expect that in certain cases the speedup would be noticeable, but it
>>>>> is a micro-optimization.  You'd of course only notice the difference in
>>>>> tight loops that used clean blocks.
>>>>>
>>>>>
>>>>> 2013/7/30 Eliot Miranda <eliot.miranda at gmail.com>
>>>>>>
>>>>>>>
>>>>>>> http://www.mirandabanda.org/cogblog/2008/06/07/closures-part-i/
>>>>>>> Hi Clément,
>>>>>>>
>>>>>>> On Mon, Jul 29, 2013 at 1:54 AM, Clément Bera <
>>>>>>> bera.clement at gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Hello guys,
>>>>>>>>
>>>>>>>> I was looking recently at the blockClosure model of Eliot in
>>>>>>>> Pharo/Squeak and the blockClosure model of VisualWorks and I have a few
>>>>>>>> questions.
>>>>>>>>
>>>>>>>> - Why does Pharo/Squeak not have compiled blocks as in VW, and instead
>>>>>>>> keeps the block bytecode in the enclosing method? Is it to save memory?
>>>>>>>> Would it be worth implementing CompiledBlock in terms of speed and memory
>>>>>>>> consumption?
>>>>>>>>
>>>>>>>
>>>>>>> Squeak derives directly from the "blue book" Smalltalk-80
>>>>>>> implementation in which CompiledMethod is a hybrid object, half pointers
>>>>>>> (method header and literals) and half bytes (bytecode and source pointer).
>>>>>>> This format was chosen to save space in the original 16-bit Smalltalk
>>>>>>> implementations on the Xerox D machines (Alto & Dorado).  VisualWorks has a
>>>>>>> few extra steps in between.  In ObjectWorks 2.4 and ObjectWorks 2.5 Peter
>>>>>>> Deutsch both introduced closures and eliminated the hybrid CompiledMethod
>>>>>>> format, introducing CompiledBlock.
>>>>>>>
>>>>>>> IMO adding CompiledBlock, while simplifying the VM a little, would
>>>>>>> not improve performance, especially in the interpreter, essentially because
>>>>>>> activating and returning from methods now requires an extra level of
>>>>>>> indirection to get from the CompiledMethod object to its bytecodes in its
>>>>>>> bytecode object.
>>>>>>>
>>>>>>> However, adding CompiledBlock (or rather eliminating the hybrid
>>>>>>> CompiledMethod format) would definitely *not* save space.  The hybrid
>>>>>>> format is more compact (one less object per method).  One can try and
>>>>>>> improve this as in VisualWorks by encoding the bytecodes of certain methods
>>>>>>> as SmallIntegers in the literal frame, but this is only feasible in a pure
>>>>>>> JIT VM.  Squeak still has an interpreter, and Cog is a hybrid JIT and
>>>>>>> Interpreter.  In an interpreter it is costly in performance to be able to
>>>>>>> interpret this additional form of bytecodes.
>>>>>>>
>>>>>>> So IMO while the hybrid CompiledMethod isn't ideal it is acceptable,
>>>>>>> having important advantages to go along with its disadvantages.
>>>>>>>
>>>>>>>> - Why do Pharo/Squeak contexts have this variable closureOrNil instead
>>>>>>>> of having the closure in the receiver field as in VW? Is it an
>>>>>>>> optimization because there are a lot of accesses to self and instance
>>>>>>>> variables in the blocks in Pharo/Squeak? Because if I'm correct it uses 1
>>>>>>>> more slot per stack frame to have this.
>>>>>>>>
>>>>>>>
>>>>>>> I did this because I think it's simpler and more direct.  I don't
>>>>>>> like VW's access to the receiver and inst vars having to use different
>>>>>>> bytecodes within a block than within a method.  There are lots of
>>>>>>> complexities resulting from this (e.g. in scanning code for inst var refs,
>>>>>>> the decompiler, etc).
>>>>>>>
>>>>>>> But in fact there isn't really an additional stack slot because the
>>>>>>> frame format in the VM does not use the stacked receiver (the 0'th
>>>>>>> argument) as accessing the receiver in this position requires knowing the
>>>>>>> method's argument count.  So in both methods and blocks the receiver is
>>>>>>> pushed on the stack immediately before allocating space for, and nilling,
>>>>>>> any temporaries.  This puts the receiver in a known place relative to the
>>>>>>> frame pointer, making it accessible to the bytecodes without having to know
>>>>>>> the method's argument count.  So the receiver always occurs twice on the
>>>>>>> stack in a method anyway.  In a block, the block is on the stack in the
>>>>>>> 0'th argument position.  The actual receiver is pushed after the temps.
>>>>>>>
>>>>>>>> - Lastly, does VW have the tempVector optimization for escaping
>>>>>>>> written temporaries in its block closures? It seems it does not (I don't
>>>>>>>> see any reference to it in VW 7). Did Pharo/Squeak blocks gain a lot of
>>>>>>>> speed or memory with this optimization?
>>>>>>>>
>>>>>>>
>>>>>>> Yes, VW has this same organization.  I implemented it in VisualWorks
>>>>>>> 5i in ~ 2000.  It resulted in a significant increase in performance (for
>>>>>>> example, factors of two improvement in block-intensive code such as
>>>>>>> exception handling).  This is because of details in the context-to-stack
>>>>>>> mapping machinery which mean that if an activation of a closure can update
>>>>>>> the temporaries of its outer contexts then keeping contexts and stack
>>>>>>> frames in sync is much more complex and costly.  The 5i/Cog organization
>>>>>>> (which in fact derives from some Lisp implementations) results in much
>>>>>>> simpler context-to-stack mapping such that no tests need be done when
>>>>>>> returning from a method to keep frames and contexts in sync.
>>>>>>>
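For readers unfamiliar with the idea, here is a small Python analogy of the temp vector (not Smalltalk or VM code; all function names are invented for the example). A closure copies values when it is created, so a temporary that is written after capture is instead kept in a shared heap array that both the method and the closure reference; neither side then has to reach into the other's frame.

```python
def make_counter_with_copied_value():
    count = 0
    copied = count                 # closure creation copies the current value
    def increment():
        return copied + 1          # works on its private copy only
    count = 10                     # later write in the "method" is not seen by the copy
    return increment, lambda: count

def make_counter_with_temp_vector():
    temp_vector = [0]              # one-slot heap array shared by both sides
    def increment():
        temp_vector[0] += 1        # write goes through the shared vector
        return temp_vector[0]
    def read_from_method():
        return temp_vector[0]      # the "method" sees the closure's writes
    return increment, read_from_method

inc, read = make_counter_with_temp_vector()
inc()
inc()
shared_value = read()
print(shared_value)

inc_copy, read_copy = make_counter_with_copied_value()
copied_result, method_value = inc_copy(), read_copy()
print(copied_result, method_value)
```

With the shared vector, two increments in the closure are visible to the enclosing code; with a copied value, the closure and the method each see their own version of the temporary.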
>>>>>>>
>>>>>>>
>>>>>>>> Thank you for any answer.
>>>>>>>>
>>>>>>>
>>>>>>> You're most welcome.  Have you read my blog post on the design?  It
>>>>>>> is "Under Cover Contexts and the Big Frame-Up<http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/>",
>>>>>>> with additional information in "Closures Part I" & "Closures Part
>>>>>>> II – the Bytecodes<http://www.mirandabanda.org/cogblog/2008/07/22/closures-part-ii-the-bytecodes/>
>>>>>>> ".
>>>>>>> --
>>>>>>> best,
>>>>>>> Eliot
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> best,
>>>>> Eliot
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> best,
>>> Eliot
>>>
>>
>>
>>
>> --
>> best,
>> Eliot
>>
>
>