[squeak-dev] Two new curious Context/primitive questions :-)

Wed Jan 6 02:05:20 UTC 2021

This was a very interesting read - thank you Eliot! That should go in some
book :)

Another overview is in the class comment for
MaybeContextInstanceVariableNode. Which I found by trying to figure out how
the heck the compiler knows to use the long form for inst var refs in
Context.

Vanessa

On Fri, Jan 1, 2021 at 1:23 PM Eliot Miranda <eliot.miranda at gmail.com>
wrote:

> Hi Christoph,
>
>     happy new year!!
>
> On Fri, Jan 1, 2021 at 11:28 AM Thiede, Christoph <
> Christoph.Thiede at student.hpi.uni-potsdam.de> wrote:
>
>> Hi all, hi Eliot,
>>
>>
>> new year, new simulation fun, and I have collected two new questions
>> about the Context implementation which I'd love to get answered here:
>>
>>
>> First, I was confused by the following:
>>
>> (BlockClosure >> #numArgs) primitive. "266"
>> (Context >> #pc) primitive. "0"
>> What makes Context so special that it cannot be compiled with quick
>> accessor methods?
>>
>
> The gory details are here:
> http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/
>
> But that's full of implementation detail.  Let me try and explain at a
> higher level.
>
> Contexts are wonderful but very expensive.  Any VM that actually creates
> contexts for every method or block activation will inevitably run slowly.
> There are two overheads here.  One is the basic overhead of allocating and
> reclaiming the contexts that are created.  The other is the cost of moving
> the outgoing receiver and arguments from the sending context to the new
> context.  A fast VM must avoid these overheads, and the key solution is to
> use a stack frame organization as is used in conventional language
> implementations.  The key here is that with stack frames the receiver and
> incoming arguments do not have to be moved; they stay where they are and a
> new frame is built underneath them which includes them as the incoming
> receiver and arguments; stack frames overlap outgoing and incoming
> arguments.
>
> But crucially, in addition the VM must still create contexts if and when
> they are needed.  If not, this would no longer be a Smalltalk-80
> implementation.  Now one could imagine a VM which converted stack frames to
> contexts whenever a method or block activation was accessed using
> thisContext (or thisContext sender etc which access contexts by following
> the sdender field).  But that would be very slow.  Instead, the fastest
> approach is to try and keep stack frames as stack frames for as long as
> possible.  For this to work contexts have to be able to function as "proxy
> objects" for stack frames.  The VM does indeed create contexts when one
> uses thisContext, or even when one creates a block, but internally the
> context is in a very special state, a state in which it points at a stack
> frame.
>
> Some terminology is useful.  Peter Deutsch used "volatile", "hybrid" and
> "stable".  I like "single", "married", "widowed" and "divorced" (Peter's
> scheme didn't have "widowed" which is the key to why my scheme is faster,
> and why the closure implementation is as it is).
>
> When a stack frame is created it is "single".  It has a slot within it to
> refer to a context object if it is needed, but the slot is nil.
> When a stack frame needs a context object one is created and the two are
> "married".
> When a method or block returns its stack frame is discarded; the stack
> frame has died and if it was married its context becomes "widowed". (*)
> When space is needed for new stack frames all reclaimed stack frames are
> "divorced" from their contexts.  The stack frame state is written to the
> context object and the context object no longer acts as a proxy; it is a
> fully fledged normal stack object. (*)
>
> (*) Stack frames exist in a small memory zone organized as stack pages,
> the stack zone.  The size of the stack zone is set at startup, and its size
> is either the VM's defaut oir set by a value that persists in the image
> header (see vmParameterAt: 43).  Since the system may run out of stack
> pages (a deep call stack, many active processes), the VM makes new pages
> available by purging the least recently used pages to the heap in the form
> of context objects.  All the frames in the least recently used stack page
> are divorced at the same time. A context must be created for each frame
> that is not yet married.
>
> (*) The reason why Peter's original scheme is slower is that at return
> time the VM has to check for being married and copy state to the hybrid
> context, making it stable.  This has to be done because a closure's outer
> temporary variables are stored in the lexically enclosing stack frame, and
> so have to be made to persist in the stable context or their values will be
> stale.  In my (lisp-derived) closure implementation there are no such
> references to outer temporary variables; values are either copied if they
> do not change, or stored in an Array (the "indirection vector"), which can
> then be copied.
>
> Note that divorce is a process.  Once divorced a context is single and
> free to function as a context does.  Indeed when the system starts up it
> loads an image which only contains single contexts.  All contexts are
> divorced on snapshot.
>
>
> So now, finally, we can answer your question.  Why is (Context>>#pc), a
> method that simply answers an instance variable, not compiled as a quick
> method with a primitive that answers that variable?  "married" and
> "widowed" contexts are proxies for stack frames.  Only the instance
> variables they contain which do not change can be fetched from the
> context object itself; these are the receiver, arguments, closureOrNil, and
> method. sender, pc, stackp and other stack contents may change as execution
> proceeds. These mutable values must be fetched from the stack frame itself,
> if it still exists.
>
> All accesses to sender, pc, stackp, and stack contents
> (Context>>at:[put:]) must be mediated.  A married or widowed context has
> its sender field set to the frame pointer of its stack frame, encoded as a
> SmallInteger.  You can never see this from the image (except via a special
> xRay primitive). By following the frame pointer encoded in the sender field
> the VM can detect if the frame is still live.  If so, the relevant
> values are synthesized or fetched from the stack frame itself.  If not, the
> context is widowed and "mourns"; it is changed into a context with a nil
> sender field, a nil pc, and a stackp that points at the last argument.
>
> How is this mediation done?  One design could be to require that all image
> level accesses to Context inst vars are through primitives.  This feels
> contrived to me.  Instead, inst var access is via inst var access
> bytcodes.  But if we do this naively we would have to check in all inst var
> access bytecodes whose index is the same as sender (0), pc (1) & stackp
> (2).  This would slow down inst var access enormously.  Instead we observe
> that inst var access bytecodes pushReceiverVariable:,
> popIntoReceiverVariable: have short forms for the most frequently used
> (indexes 0 to 15 for push, indexes 0 to 7 for popInto).  By arranging that
> we only use the long forms for accessing inst vars in Context, and
> superclasses and subclasses, we only have to check in the long forms and
> the check is cheap.  We check that the index is <= 2, because all other
> inst var accesses with these indices will use the short-form bytecodes.
>
> By arranging that Context's inst vars are *not* accessed by quick
> primitives we do not have to duplicate this machinery in quick primitive
> dispatch, which would also slow things down.  Hence Context>>pc is compiled
> as a vanilla method and its inst var access bytecode is the long form:
>
> 25 <E2 01> pushRcvr: 1
> 27 <5C> returnTop
>
> 'Nuff said
>
>> Second, Context >> #isPrimFailToken: attracted my attention multiple
>> times when looking at different #timeProfile results of simulation
>> sessions. In the expression [100000 factorial] it takes up more than 44% of
>> the whole execution time!
>> I have no idea why it could be so slow, but this message is sent really
>> often, which is at least 2 times per simulation of a special message send.
>> Would there be an easy change to resolve this bottleneck and speed
>> up the simulation by 40% or more?
>> Would it be possible to provide a primitive for this method? Or do you
>> see any other way to optimize it?
>>
>
> It's not slow, so much as it has to be used all the time to check whether
> a send has invoked a primitive which has failed.  Its function is to
> support primitive error codes.  Without it we can simply check for nil.
> But with it we need a container that both reliably distinguishes a
> primitive failure result from all other objects in the system, and holds
> onto the primitive failure code.  It may indeed be able to reduce the
> overhead by only testing when absolutely necessary.  It may be that
> reimplementing the scheme is faster.
>
> For example, one could imagine a class whose instances are the only
> objects to answer #isPrimFailToken, which would be faster than a one
> argument message send.  But beware, because if one was debugging the
> debugger one mustn't confuse a PrimFailToken created in the simulation for
> a PrimFailToken created by the simulation.
> Remember to distinguish between price and value.  Implementing the
> context-to-stack mapping scheme outlined above was very costly, but its
> value is very high.  It allows us to have contexts, with all their utility,
> while reducing their costs significantly.  Similarly, and closely related,
> being able to simulate execution is costly but hugely valuable.  If the
> price is 40% of simulated execution then it is a price worth paying.
>
> Best,
>> Christoph
>>
>
> Happy New Year!
>
> _,,,^..^,,,_
> best, Eliot
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20210105/c76c98d9/attachment.html>