[Vm-dev] Solving termination of critical sections in the context of priority inversion was: [squeak-dev] SemaphoreTest fails in trunk, is a fix needed for the 5.2 release?

Eliot Miranda eliot.miranda at gmail.com
Fri Jul 27 04:31:52 UTC 2018


Hi David, Hi Bert, Clément, Juan, Levente and Marcus, Hi Anyone else with
strong experience in the VM with processes,

On Mon, Jul 23, 2018 at 7:38 PM, David T. Lewis <lewis at mail.msen.com> wrote:

> Semaphore seems like a rather basic thing that should work correctly in
> any Squeak image. The tests do not pass in trunk any more.
>
> Specifically, SemaphoreTest>>testSemaInCriticalWait fails in trunk, but
> passes
> in the earlier Squeak 4.6 / 5.0 images.
>
> Is this a real problem? Does it need to be fixed for the 5.2 release?
>

Yes.  Yes.  And it needs to be fixed in Pharo too.  I know this message
will strike you as TL;DR, but please, especially if you're Bert, Clément,
Juan, Levente or Marcus, read this carefully.  It's quite important.  And
below I'll present the Squeak code but will work with Clément and Marcus to
implement semantically equivalent Pharo code asap.

And apologies in advance for the repetitious nature of this message.  It is
better that I am precise than I am brief and anyone miss anything.  This is
an old problem and it will be nice if I've fixed it, but I could easily
have missed something; this problem having been around for decades.  OK...


This is an old problem which boiled down to there being no way to determine
by looking at a process's suspendedContext whether a process is either
waiting on a Semaphore or Mutex or is no longer waiting, but has made no
progress because it is at a lower priority than runnable processes and so
has not got a chance to run yet.

So in
    | s |
    s := Semaphore new.
    ...
    s wait
    ...

if we look at the context containing the wait its pc will be the same
whether the process is blocked, waiting on the semaphore, or whether the
semaphore has been signalled but the process has not been able to proceed
because it is of lower priority than runnable processes and so can make no
progress.  This caused problems for code such as this:

Semaphore>>critical: mutuallyExcludedBlock
self wait.
^mutuallyExcludedBlock ensure: [self signal]

because the ensure: won't be entered if higher priority runnable processes
are preventing it from running.

And for code such as this:

Semaphore>>critical: mutuallyExcludedBlock
^[self wait.
mutuallyExcludedBlock value]
ensure: [self signal]

because if the process is terminated when the semaphore has not been
signalled (i.e. the process is blocked in the wait), Process>>terminate
will run the ensure: block anyway, resulting in the Semaphore getting an
extra signal.

This occupied Andreas and I at Qwaq, and we didn't solve it.  We developed
Mutex as a more efficient version of Monitor, but this is also subject so
the same problem.  We did change the definition of ensure: so that it is
not a suspension point, by adding valueNoContextSwitch[:]

BlockClosure>>ensure: aBlock
"Evaluate a termination block after evaluating the receiver, regardless of
whether the receiver's evaluation completes.  N.B.  This method is *not*
implemented as a primitive.  Primitive 198 always fails.  The VM uses prim
198 in a context's method as the mark for an ensure:/ifCurtailed:
activation."

| complete returnValue |
<primitive: 198>
returnValue := self valueNoContextSwitch.
complete ifNil:[
complete := true.
aBlock value.
].
^ returnValue

This means that we don't have to deal with suspensions here (marked with
!!!)

I now understand how to distinguish between the two cases, between blocking
and not blocked but no progress.  Process>>suspend answers the list the
Process was on when it was suspended.  If the process is already suspended
Process>>suspend answers nil.  If the process is waiting on a Semaphore or
a Mutex, Process>>suspend answers the Semaphore or Mutex. And if the
process is runnable then Process>>suspend answers the process's run list (a
LinkedList in ProcessorScheduler's quiescentProcessLists array
corresponding to the process's priority).

So Process>>#terminate can distinguish between #wait or
#primitiveEnterCriticalSection or
#primitiveTestAndSetOwnershipOfCriticalSection being blocked, or being
unblocked but having made no progress due to too low a priority.  We do so
by testing the class of the result of suspending the process.  If it is a
LinkedList, the process has past the #wait or
#primitiveEnterCriticalSection but has made no progress due to too low a
priority.

The version of Process>>#terminate I'm about to commit deals with several
cases.  Let me present the cases first.  There are three versions of
Semaphore>>#critical: to handle, and one version of Mutex>>critical: and
Mutex>>#critical:ifLocked:.

The two basic versions of Semaphore>>critical: are

V1
critical: mutuallyExcludedBlock
"Evaluate mutuallyExcludedBlock only if the receiver is not currently in
the process of running the critical: message. If the receiver is, evaluate
mutuallyExcludedBlock after the other critical: message is finished."
<criticalSection>
self wait.
^mutuallyExcludedBlock ensure: [self signal]

V2
critical: mutuallyExcludedBlock
"Evaluate mutuallyExcludedBlock only if the receiver is not currently in
the process of running the critical: message. If the receiver is, evaluate
mutuallyExcludedBlock after the other critical: message is finished."
<criticalSection>
^[self wait.
  mutuallyExcludedBlock value]
ensure: [self signal]

and Juan's safer version is (after I added the criticalSection pragma)

V3
critical: mutuallyExcludedBlock
"Evaluate mutuallyExcludedBlock only if the receiver is not currently in
the process of running the critical: message. If the receiver is, evaluate
mutuallyExcludedBlock after the other critical: message is finished."
<criticalSection>
| caught |
"We need to catch eventual interruptions very carefully.
The naive approach of just doing, e.g.,:
self wait.
aBlock ensure:[self signal].
will fail if the active process gets terminated while in the wait.
However, the equally naive:
[self wait.
aBlock value] ensure:[self signal].
will fail too, since the active process may get interrupted while
entering the ensured block and leave the semaphore signaled twice.
To avoid both problems we make use of the fact that interrupts only
occur on sends (or backward jumps) and use an assignment (bytecode)
right before we go into the wait primitive (which is not a real send and
therefore not interruptable either)."

caught := false.
^[
caught := true.
self wait.
mutuallyExcludedBlock value
] ensure: [ caught ifTrue: [self signal] ]

and the Mutex>>critical:'s are

critical: aBlock
"Evaluate aBlock protected by the receiver."
<criticalSection>
^self primitiveEnterCriticalSection
ifTrue: [aBlock value]
ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]

critical: aBlock ifLocked: lockedBlock
"Answer the evaluation of aBlock protected by the receiver.  If it is
already in a critical
section on behalf of some other process answer the evaluation of
lockedBlock."
<criticalSection>
^self primitiveTestAndSetOwnershipOfCriticalSection
ifNil: [lockedBlock value]
ifNotNil:
[:alreadyOwner|
alreadyOwner
ifTrue: [aBlock value]
ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]]

primitiveEnterCriticalSection answers false if the Mutex was unowned, and
true if it was already owned by the active process.  It blocks otherwise.
 primitiveTestAndSetOwnershipOfCriticalSection answers false if the Mutex
was unowned, true if it was already owned by the active process, and nil if
owned by some other process.

So we want Process>>#terminate to correctly release the semaphores and
mutexes no matter where in these methods they are.  We don't have to worry
if the process is within the block argument to a critical: itself, only if
it is actually within the critical: method or a block within it. If it is
already within the block argument to critical: then Process>>#terminate's
unwind handling will unwind things correctly.  Taking Juan's version of
Semaphore>>#critical: above, the key issue is whether the process being
terminated is blocked on the wait, not blocked but still stuck at the wait,
or at the start of the block argument to ensure:.

I have extracted the processing into Process>>releaseCriticalSection:, so
now Process>>terminate reads

terminate
"Stop the process that the receiver represents forever.
Unwind to execute pending ensure:/ifCurtailed: blocks before terminating.
If the process is in the middle of a critical: critical section, release it
properly."

| ctxt unwindBlock oldList |
self isActiveProcess ifTrue: [
ctxt := thisContext.
[ ctxt := ctxt findNextUnwindContextUpTo: nil.
ctxt isNil
] whileFalse: [
(ctxt tempAt: 2) ifNil:[
ctxt tempAt: 2 put: nil.
unwindBlock := ctxt tempAt: 1.
thisContext terminateTo: ctxt.
unwindBlock value].
].
thisContext terminateTo: nil.
self suspend.
] ifFalse:[
"Always suspend the process first so it doesn't accidentally get woken up.
N.B. If oldList is a LinkedList then the process is runnable. If it is a
Semaphore/Mutex et al
then the process is blocked, and if it is nil then the process is already
suspended."
oldList := self suspend.
suspendedContext ifNotNil:
["Release any method marked with the <criticalSection> pragma.
 The argument is whether the process is runnable."
self releaseCriticalSection: (oldList isNil or: [oldList class ==
LinkedList]).

"If terminating a process halfways through an unwind, try to complete that
unwind block first."
(suspendedContext findNextUnwindContextUpTo: nil) ifNotNil:
[:outer|
(suspendedContext findContextSuchThat:[:c| c closure == (outer tempAt: 1)])
ifNotNil:
[:inner| "This is an unwind block currently under evaluation"
suspendedContext runUntilErrorOrReturnFrom: inner]].

ctxt := self popTo: suspendedContext bottomContext.
ctxt == suspendedContext bottomContext ifFalse:
[self debug: ctxt title: 'Unwind error during termination'].
"Set the context to its endPC for the benefit of isTerminated."
ctxt pc: ctxt endPC]]

In implementing releaseCriticalSection: we need to know which selector a
context has just sent.  selectorJustSentOrSelf is implemented in Squeak as

InstructionStream>>selectorJustSentOrSelf
"If this instruction follows a send, answer the send's selector, otherwise
answer self."

| method |
method := self method.
^method encoderClass selectorToSendOrItselfFor: self in: method at: self
previousPc

c.f.

InstructionStream>>selectorToSendOrSelf
"If this instruction is a send, answer the selector, otherwise answer self."

| method |
method := self method.
^method encoderClass selectorToSendOrItselfFor: self in: method at: pc

Now we can implement Process>>#releaseCriticalSection:

releaseCriticalSection: runnable
"Figure out if we are terminating a process that is in the ensure: block of
a critical section.
In this case, if the block has made progress, pop the suspendedContext so
that we leave the
ensure: block inside the critical: without signaling the semaphore/exiting
the primitive section,
since presumably this has already happened.  But if it hasn't made progress
but is beyond the
wait (which we can tell my the oldList being one of the runnable lists,
i.e. a LinkedList, not a
Semaphore or Mutex, et al), then the ensure: block needs to be run."
| selectorJustSent |
(suspendedContext method pragmaAt: #criticalSection) ifNil: [^self].
selectorJustSent := suspendedContext selectorJustSentOrSelf.

"Receiver and/or argument blocks of ensure: in Semaphore>>critical: or
Mutex>>#critical:"
suspendedContext isClosureContext ifTrue:
[suspendedContext sender selector == #ensure: ifTrue:
[| notWaitingButMadeNoProgress |
"Avoid running the ensure: block twice, popping it if it has already been
run. If runnable
but at the wait, leave it in place. N.B. No need to check if the block
receiver of ensure: has
not started to run (via suspendedContext pc = suspendedContext startpc)
because ensure:
uses valueNoContextSwitch, and so there is no suspension point before the
wait."
notWaitingButMadeNoProgress :=
runnable
and: [selectorJustSent == #wait
and: [suspendedContext sender selectorJustSentOrSelf ==
#valueNoContextSwitch]].
notWaitingButMadeNoProgress ifFalse:
[suspendedContext := suspendedContext home]].
^self].

"Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still
blocked?  If so, nothing further to do."
runnable ifFalse: [^self].

"If still at the wait the ensure: block has not been activated, so signal
to restore."
selectorJustSent == #wait ifTrue:
[suspendedContext receiver signal].

"If still at the lock primitive and the lock primitive just acquired
ownership (indicated by it answering false)
then the ensure block has not been activated, so explicitly
primitiveExitCriticalSection to unlock."
(selectorJustSent == #primitiveEnterCriticalSection
or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection])
ifTrue:
[(suspendedContext stackPtr > 0
 and: [suspendedContext top == false]) ifTrue:
[suspendedContext receiver primitiveExitCriticalSection]]


Let's go through it line by line.  First, runnable is an argument,
determined in Process>>#terminate.  One could invoke it with

    self releaseCriticalSection: oldList class == LinkedList

but this means that an already suspended process is assumed to be not
runnable, which makes it tricky to debug the Process>>#terminate method.
One has to assign to oldList while stepping though the method.  I've chosen
safety, assuming that the process is still runnable if suspend answers nil,
its simply being debugged.

Then we're only interested in <criticalSection> marked methods se we return
if there's no such pragma.

Then we deal with blocks in these methods.  One issue here is to avoid
running the ensure: block twice if it is already being evaluated.  The
other is to run it if it is stalled and has yet to be run.

So if

suspendedContext isClosureContext ifTrue:

we're in the ensure: receiver or argument blocks in any <criticalSection>
marked method, i.e. Semaphore>>critical: and Mutex>>critical:[ifLocked:].
If wait was just sent then we're in the ensure: receiver block of
Semaphore>>critical: (V2 & V3 above) and the issue is whether the process
is blocked or is unblocked and has made no progress. If blocked then
nothing needs to be done; the ensure: block is discarded and the stack cut
back to the critical: activation.  If progress has been made then nothing
needs to be done (in fact we can't be in this state; the ensure: receiver
will have started evaluating the critical: block argument).  If unblocked,
but no progress has been made, do /not/ discard the unwind block and it
will be run in Process>>#terminate when this method returns.  Hence...

[suspendedContext sender selector == #ensure: ifTrue:
[| notWaitingButMadeNoProgress |
"Avoid running the ensure: block twice, popping it if it has already been
run. If runnable
but at the wait, leave it in place. N.B. No need to check if the block
receiver of ensure: has
not started to run (via suspendedContext pc = suspendedContext startpc)
because ensure:
uses valueNoContextSwitch, and so there is no suspension point before the
wait."
notWaitingButMadeNoProgress :=
runnable
and: [selectorJustSent == #wait
and: [suspendedContext sender selectorJustSentOrSelf ==
#valueNoContextSwitch]].
notWaitingButMadeNoProgress ifFalse:
[suspendedContext := suspendedContext home]].
^self].

Now we're left with the simpler version of Semaphore>>critical: (V1 above)
and the two Mutex methods Mutex>>#critical:[ifLocked:].  Here the only
state we have to worry about is that the process is unblocked but has made
no progress.  If not runnable the process is still blocked and we can
simply return.

"Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still
blocked?  If so, nothing further to do."
runnable ifFalse: [^self].

If #wait was just sent the process is in Semaphore>>#critical: and, because
ensure: has not been sent we signal explicitly to restore the signal count:

"If still at the wait the ensure: block has not been activated, so signal
to restore."
selectorJustSent == #wait ifTrue:
[suspendedContext receiver signal].

If either of primitiveEnterCriticalSection or
primitiveTestAndSetOwnershipOfCriticalSection have just been sent then
either the Mutex is already owned, in which case the ensure block is
elsewhere in the colder part of the stack, or has just been owned, and
because ensure: has not been sent we unlock explicitly to release the Mutex:

"If still at the lock primitive and the lock primitive just acquired
ownership (indicated by it answering false)
then the ensure block has not been activated, so explicitly
primitiveExitCriticalSection to unlock."
(selectorJustSent == #primitiveEnterCriticalSection
or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection])
ifTrue:
[(suspendedContext stackPtr > 0
 and: [suspendedContext top == false]) ifTrue:
[suspendedContext receiver primitiveExitCriticalSection]]

So Pharoers can you read and say whether you think this is sane or not?  If
so, then we can kibbutz to write the Pharo version.

Squeakers can you review Kernel-eem.1183 & Kernel-eem.1184 in the inbox?
Kernel-eem.1183 includes the fix as described above.  Kernel-eem.1184
reverts Semaphore>>#critical: to V1 above.


P.S. Looking at V1 above it seems to me that there is an issue if the
process is preempted in ensure: before sending valueNoContextSwitch:.  I'll
try and write a test that advances a process to that precise point.  If
that test fails I think we have to use V2 or V3, and V2 is clearly
preferable.

_,,,^..^,,,_
best, Eliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20180726/00d09a7e/attachment-0001.html>


More information about the Vm-dev mailing list