[squeak-dev] Two new curious Context/primitive questions :-)

Fri Jan 1 21:22:39 UTC 2021

Hi Christoph,

    happy new year!!

On Fri, Jan 1, 2021 at 11:28 AM Thiede, Christoph <
Christoph.Thiede at student.hpi.uni-potsdam.de> wrote:

> Hi all, hi Eliot,
>
>
> new year, new simulation fun, and I have collected two new questions about
> the Context implementation which I'd love to get answered here:
>
>
> First, I was confused by the following:
>
> (BlockClosure >> #numArgs) primitive. "266"
> (Context >> #pc) primitive. "0"
> What makes Context so special that it cannot be compiled with quick
> accessor methods?
>

The gory details are here:
http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/

But that's full of implementation detail.  Let me try and explain at a
higher level.

Contexts are wonderful but very expensive.  Any VM that actually creates
contexts for every method or block activation will inevitably run slowly.
There are two overheads here.  One is the basic overhead of allocating and
reclaiming the contexts that are created.  The other is the cost of moving
the outgoing receiver and arguments from the sending context to the new
context.  A fast VM must avoid these overheads, and the key solution is to
use a stack frame organization as is used in conventional language
implementations.  The key here is that with stack frames the receiver and
incoming arguments do not have to be moved; they stay where they are and a
new frame is built underneath them which includes them as the incoming
receiver and arguments; stack frames overlap outgoing and incoming
arguments.

But crucially, in addition the VM must still create contexts if and when
they are needed.  If not, this would no longer be a Smalltalk-80
implementation.  Now one could imagine a VM which converted stack frames to
contexts whenever a method or block activation was accessed using
thisContext (or thisContext sender etc which access contexts by following
the sdender field).  But that would be very slow.  Instead, the fastest
approach is to try and keep stack frames as stack frames for as long as
possible.  For this to work contexts have to be able to function as "proxy
objects" for stack frames.  The VM does indeed create contexts when one
uses thisContext, or even when one creates a block, but internally the
context is in a very special state, a state in which it points at a stack
frame.

Some terminology is useful.  Peter Deutsch used "volatile", "hybrid" and
"stable".  I like "single", "married", "widowed" and "divorced" (Peter's
scheme didn't have "widowed" which is the key to why my scheme is faster,
and why the closure implementation is as it is).

When a stack frame is created it is "single".  It has a slot within it to
refer to a context object if it is needed, but the slot is nil.
When a stack frame needs a context object one is created and the two are
"married".
When a method or block returns its stack frame is discarded; the stack
frame has died and if it was married its context becomes "widowed". (*)
When space is needed for new stack frames all reclaimed stack frames are
"divorced" from their contexts.  The stack frame state is written to the
context object and the context object no longer acts as a proxy; it is a
fully fledged normal stack object. (*)

(*) Stack frames exist in a small memory zone organized as stack pages, the
stack zone.  The size of the stack zone is set at startup, and its size is
either the VM's defaut oir set by a value that persists in the image header
(see vmParameterAt: 43).  Since the system may run out of stack pages (a
deep call stack, many active processes), the VM makes new pages available
by purging the least recently used pages to the heap in the form of context
objects.  All the frames in the least recently used stack page are divorced
at the same time. A context must be created for each frame that is not yet
married.

(*) The reason why Peter's original scheme is slower is that at return time
the VM has to check for being married and copy state to the hybrid context,
making it stable.  This has to be done because a closure's outer temporary
variables are stored in the lexically enclosing stack frame, and so have to
be made to persist in the stable context or their values will be stale.  In
my (lisp-derived) closure implementation there are no such references to
outer temporary variables; values are either copied if they do not change,
or stored in an Array (the "indirection vector"), which can then be copied.

Note that divorce is a process.  Once divorced a context is single and free
to function as a context does.  Indeed when the system starts up it loads
an image which only contains single contexts.  All contexts are divorced on
snapshot.

So now, finally, we can answer your question.  Why is (Context>>#pc), a
method that simply answers an instance variable, not compiled as a quick
method with a primitive that answers that variable?  "married" and
"widowed" contexts are proxies for stack frames.  Only the instance
variables they contain which do not change can be fetched from the
context object itself; these are the receiver, arguments, closureOrNil, and
method. sender, pc, stackp and other stack contents may change as execution
proceeds. These mutable values must be fetched from the stack frame itself,
if it still exists.

All accesses to sender, pc, stackp, and stack contents (Context>>at:[put:])
must be mediated.  A married or widowed context has its sender field set to
the frame pointer of its stack frame, encoded as a SmallInteger.  You can
never see this from the image (except via a special xRay primitive). By
following the frame pointer encoded in the sender field the VM can detect
if the frame is still live.  If so, the relevant values are synthesized or
fetched from the stack frame itself.  If not, the context is widowed and
"mourns"; it is changed into a context with a nil sender field, a nil pc,
and a stackp that points at the last argument.

How is this mediation done?  One design could be to require that all image
level accesses to Context inst vars are through primitives.  This feels
contrived to me.  Instead, inst var access is via inst var access
bytcodes.  But if we do this naively we would have to check in all inst var
access bytecodes whose index is the same as sender (0), pc (1) & stackp
(2).  This would slow down inst var access enormously.  Instead we observe
that inst var access bytecodes pushReceiverVariable:,
popIntoReceiverVariable: have short forms for the most frequently used
(indexes 0 to 15 for push, indexes 0 to 7 for popInto).  By arranging that
we only use the long forms for accessing inst vars in Context, and
superclasses and subclasses, we only have to check in the long forms and
the check is cheap.  We check that the index is <= 2, because all other
inst var accesses with these indices will use the short-form bytecodes.

By arranging that Context's inst vars are *not* accessed by quick
primitives we do not have to duplicate this machinery in quick primitive
dispatch, which would also slow things down.  Hence Context>>pc is compiled
as a vanilla method and its inst var access bytecode is the long form:

25 <E2 01> pushRcvr: 1
27 <5C> returnTop

'Nuff said

> Second, Context >> #isPrimFailToken: attracted my attention multiple
> times when looking at different #timeProfile results of simulation
> sessions. In the expression [100000 factorial] it takes up more than 44% of
> the whole execution time!
> I have no idea why it could be so slow, but this message is sent really
> often, which is at least 2 times per simulation of a special message send.
> Would there be an easy change to resolve this bottleneck and speed
> up the simulation by 40% or more?
> Would it be possible to provide a primitive for this method? Or do you see
> any other way to optimize it?
>

It's not slow, so much as it has to be used all the time to check whether a
send has invoked a primitive which has failed.  Its function is to support
primitive error codes.  Without it we can simply check for nil.  But with
it we need a container that both reliably distinguishes a primitive failure
result from all other objects in the system, and holds onto the primitive
failure code.  It may indeed be able to reduce the overhead by only testing
when absolutely necessary.  It may be that reimplementing the scheme is
faster.

For example, one could imagine a class whose instances are the only objects
to answer #isPrimFailToken, which would be faster than a one argument
message send.  But beware, because if one was debugging the debugger one
mustn't confuse a PrimFailToken created in the simulation for a
PrimFailToken created by the simulation.
Remember to distinguish between price and value.  Implementing the
context-to-stack mapping scheme outlined above was very costly, but its
value is very high.  It allows us to have contexts, with all their utility,
while reducing their costs significantly.  Similarly, and closely related,
being able to simulate execution is costly but hugely valuable.  If the
price is 40% of simulated execution then it is a price worth paying.

Best,
> Christoph
>

Happy New Year!

_,,,^..^,,,_
best, Eliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20210101/36a2eb66/attachment.html>