[squeak-dev] Solving termination of critical sections in the context of priority inversion was: SemaphoreTest fails in trunk, is a fix needed for the 5.2 release?

Eliot Miranda eliot.miranda at gmail.com
Fri Jul 27 05:40:33 UTC 2018


Ha!

On Thu, Jul 26, 2018 at 9:31 PM, Eliot Miranda <eliot.miranda at gmail.com>
wrote:

> Hi David, Hi Bert, Clément, Juan, Levente and Marcus, Hi Anyone else with
> strong experience in the VM with processes,
>
> On Mon, Jul 23, 2018 at 7:38 PM, David T. Lewis <lewis at mail.msen.com>
> wrote:
>
>> Semaphore seems like a rather basic thing that should work correctly in
>> any Squeak image. The tests do not pass in trunk any more.
>>
>> Specifically, SemaphoreTest>>testSemaInCriticalWait fails in trunk, but
>> passes
>> in the earlier Squeak 4.6 / 5.0 images.
>>
>> Is this a real problem? Does it need to be fixed for the 5.2 release?
>>
>
> Yes.  Yes.  And it needs to be fixed in Pharo too.  I know this message
> will strike you as TL;DR, but please, especially if you're Bert, Clément,
> Juan, Levente or Marcus, read this carefully.  It's quite important.  And
> below I'll present the Squeak code but will work with Clément and Marcus to
> implement semantically equivalent Pharo code asap.
>
> And apologies in advance for the repetitious nature of this message.  It
> is better that I am precise than I am brief and anyone miss anything.  This
> is an old problem and it will be nice if I've fixed it, but I could easily
> have missed something; this problem having been around for decades.  OK...
>
>
> This is an old problem which boiled down to there being no way to
> determine by looking at a process's suspendedContext whether a process is
> either waiting on a Semaphore or Mutex or is no longer waiting, but has
> made no progress because it is at a lower priority than runnable processes
> and so has not got a chance to run yet.
>
> So in
>     | s |
>     s := Semaphore new.
>     ...
>     s wait
>     ...
>
> if we look at the context containing the wait its pc will be the same
> whether the process is blocked, waiting on the semaphore, or whether the
> semaphore has been signalled but the process has not been able to proceed
> because it is of lower priority than runnable processes and so can make no
> progress.  This caused problems for code such as this:
>
> Semaphore>>critical: mutuallyExcludedBlock
> self wait.
> ^mutuallyExcludedBlock ensure: [self signal]
>
> because the ensure: won't be entered if higher priority runnable processes
> are preventing it from running.
>
> And for code such as this:
>
> Semaphore>>critical: mutuallyExcludedBlock
> ^[self wait.
> mutuallyExcludedBlock value]
> ensure: [self signal]
>
> because if the process is terminated when the semaphore has not been
> signalled (i.e. the process is blocked in the wait), Process>>terminate
> will run the ensure: block anyway, resulting in the Semaphore getting an
> extra signal.
>
> This occupied Andreas and I at Qwaq, and we didn't solve it.  We developed
> Mutex as a more efficient version of Monitor, but this is also subject so
> the same problem.  We did change the definition of ensure: so that it is
> not a suspension point, by adding valueNoContextSwitch[:]
>
> BlockClosure>>ensure: aBlock
> "Evaluate a termination block after evaluating the receiver, regardless of
> whether the receiver's evaluation completes.  N.B.  This method is *not*
> implemented as a primitive.  Primitive 198 always fails.  The VM uses prim
> 198 in a context's method as the mark for an ensure:/ifCurtailed:
> activation."
>
> | complete returnValue |
> <primitive: 198>
> returnValue := self valueNoContextSwitch.
> complete ifNil:[
> complete := true.
> aBlock value.
> ].
> ^ returnValue
>
> This means that we don't have to deal with suspensions here (marked with
> !!!)
>
> I now understand how to distinguish between the two cases, between
> blocking and not blocked but no progress.  Process>>suspend answers the
> list the Process was on when it was suspended.  If the process is already
> suspended Process>>suspend answers nil.  If the process is waiting on a
> Semaphore or a Mutex, Process>>suspend answers the Semaphore or Mutex. And
> if the process is runnable then Process>>suspend answers the process's run
> list (a LinkedList in ProcessorScheduler's quiescentProcessLists array
> corresponding to the process's priority).
>
> So Process>>#terminate can distinguish between #wait or
> #primitiveEnterCriticalSection or #primitiveTestAndSetOwnershipOfCriticalSection
> being blocked, or being unblocked but having made no progress due to too
> low a priority.  We do so by testing the class of the result of suspending
> the process.  If it is a LinkedList, the process has past the #wait or #
> primitiveEnterCriticalSection but has made no progress due to too low a
> priority.
>
> The version of Process>>#terminate I'm about to commit deals with several
> cases.  Let me present the cases first.  There are three versions of
> Semaphore>>#critical: to handle, and one version of Mutex>>critical: and
> Mutex>>#critical:ifLocked:.
>
> The two basic versions of Semaphore>>critical: are
>
> V1
> critical: mutuallyExcludedBlock
> "Evaluate mutuallyExcludedBlock only if the receiver is not currently in
> the process of running the critical: message. If the receiver is, evaluate
> mutuallyExcludedBlock after the other critical: message is finished."
> <criticalSection>
> self wait.
> ^mutuallyExcludedBlock ensure: [self signal]
>
> V2
> critical: mutuallyExcludedBlock
> "Evaluate mutuallyExcludedBlock only if the receiver is not currently in
> the process of running the critical: message. If the receiver is, evaluate
> mutuallyExcludedBlock after the other critical: message is finished."
> <criticalSection>
> ^[self wait.
>   mutuallyExcludedBlock value]
> ensure: [self signal]
>
> and Juan's safer version is (after I added the criticalSection pragma)
>
> V3
> critical: mutuallyExcludedBlock
> "Evaluate mutuallyExcludedBlock only if the receiver is not currently in
> the process of running the critical: message. If the receiver is, evaluate
> mutuallyExcludedBlock after the other critical: message is finished."
> <criticalSection>
> | caught |
> "We need to catch eventual interruptions very carefully.
> The naive approach of just doing, e.g.,:
> self wait.
> aBlock ensure:[self signal].
> will fail if the active process gets terminated while in the wait.
> However, the equally naive:
> [self wait.
> aBlock value] ensure:[self signal].
> will fail too, since the active process may get interrupted while
> entering the ensured block and leave the semaphore signaled twice.
> To avoid both problems we make use of the fact that interrupts only
> occur on sends (or backward jumps) and use an assignment (bytecode)
> right before we go into the wait primitive (which is not a real send and
> therefore not interruptable either)."
>
> caught := false.
> ^[
> caught := true.
> self wait.
> mutuallyExcludedBlock value
> ] ensure: [ caught ifTrue: [self signal] ]
>
> and the Mutex>>critical:'s are
>
> critical: aBlock
> "Evaluate aBlock protected by the receiver."
> <criticalSection>
> ^self primitiveEnterCriticalSection
> ifTrue: [aBlock value]
> ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]
>
> critical: aBlock ifLocked: lockedBlock
> "Answer the evaluation of aBlock protected by the receiver.  If it is
> already in a critical
> section on behalf of some other process answer the evaluation of
> lockedBlock."
> <criticalSection>
> ^self primitiveTestAndSetOwnershipOfCriticalSection
> ifNil: [lockedBlock value]
> ifNotNil:
> [:alreadyOwner|
> alreadyOwner
> ifTrue: [aBlock value]
> ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]]
>
> primitiveEnterCriticalSection answers false if the Mutex was unowned, and
> true if it was already owned by the active process.  It blocks otherwise.
> primitiveTestAndSetOwnershipOfCriticalSection answers false if the Mutex
> was unowned, true if it was already owned by the active process, and nil if
> owned by some other process.
>
> So we want Process>>#terminate to correctly release the semaphores and
> mutexes no matter where in these methods they are.  We don't have to worry
> if the process is within the block argument to a critical: itself, only if
> it is actually within the critical: method or a block within it. If it is
> already within the block argument to critical: then Process>>#terminate's
> unwind handling will unwind things correctly.  Taking Juan's version of
> Semaphore>>#critical: above, the key issue is whether the process being
> terminated is blocked on the wait, not blocked but still stuck at the wait,
> or at the start of the block argument to ensure:.
>
> I have extracted the processing into Process>>releaseCriticalSection:, so
> now Process>>terminate reads
>
> terminate
> "Stop the process that the receiver represents forever.
> Unwind to execute pending ensure:/ifCurtailed: blocks before terminating.
> If the process is in the middle of a critical: critical section, release
> it properly."
>
> | ctxt unwindBlock oldList |
> self isActiveProcess ifTrue: [
> ctxt := thisContext.
> [ ctxt := ctxt findNextUnwindContextUpTo: nil.
> ctxt isNil
> ] whileFalse: [
> (ctxt tempAt: 2) ifNil:[
> ctxt tempAt: 2 put: nil.
> unwindBlock := ctxt tempAt: 1.
> thisContext terminateTo: ctxt.
> unwindBlock value].
> ].
> thisContext terminateTo: nil.
> self suspend.
> ] ifFalse:[
> "Always suspend the process first so it doesn't accidentally get woken up.
> N.B. If oldList is a LinkedList then the process is runnable. If it is a
> Semaphore/Mutex et al
> then the process is blocked, and if it is nil then the process is already
> suspended."
> oldList := self suspend.
> suspendedContext ifNotNil:
> ["Release any method marked with the <criticalSection> pragma.
>  The argument is whether the process is runnable."
> self releaseCriticalSection: (oldList isNil or: [oldList class ==
> LinkedList]).
>
> "If terminating a process halfways through an unwind, try to complete that
> unwind block first."
> (suspendedContext findNextUnwindContextUpTo: nil) ifNotNil:
> [:outer|
> (suspendedContext findContextSuchThat:[:c| c closure == (outer tempAt:
> 1)]) ifNotNil:
> [:inner| "This is an unwind block currently under evaluation"
> suspendedContext runUntilErrorOrReturnFrom: inner]].
>
> ctxt := self popTo: suspendedContext bottomContext.
> ctxt == suspendedContext bottomContext ifFalse:
> [self debug: ctxt title: 'Unwind error during termination'].
> "Set the context to its endPC for the benefit of isTerminated."
> ctxt pc: ctxt endPC]]
>
> In implementing releaseCriticalSection: we need to know which selector a
> context has just sent.  selectorJustSentOrSelf is implemented in Squeak as
>
> InstructionStream>>selectorJustSentOrSelf
> "If this instruction follows a send, answer the send's selector, otherwise
> answer self."
>
> | method |
> method := self method.
> ^method encoderClass selectorToSendOrItselfFor: self in: method at: self
> previousPc
>
> c.f.
>
> InstructionStream>>selectorToSendOrSelf
> "If this instruction is a send, answer the selector, otherwise answer
> self."
>
> | method |
> method := self method.
> ^method encoderClass selectorToSendOrItselfFor: self in: method at: pc
>
> Now we can implement Process>>#releaseCriticalSection:
>
> releaseCriticalSection: runnable
> "Figure out if we are terminating a process that is in the ensure: block
> of a critical section.
> In this case, if the block has made progress, pop the suspendedContext so
> that we leave the
> ensure: block inside the critical: without signaling the semaphore/exiting
> the primitive section,
> since presumably this has already happened.  But if it hasn't made
> progress but is beyond the
> wait (which we can tell my the oldList being one of the runnable lists,
> i.e. a LinkedList, not a
> Semaphore or Mutex, et al), then the ensure: block needs to be run."
> | selectorJustSent |
> (suspendedContext method pragmaAt: #criticalSection) ifNil: [^self].
> selectorJustSent := suspendedContext selectorJustSentOrSelf.
>
> "Receiver and/or argument blocks of ensure: in Semaphore>>critical: or
> Mutex>>#critical:"
> suspendedContext isClosureContext ifTrue:
> [suspendedContext sender selector == #ensure: ifTrue:
> [| notWaitingButMadeNoProgress |
> "Avoid running the ensure: block twice, popping it if it has already been
> run. If runnable
> but at the wait, leave it in place. N.B. No need to check if the block
> receiver of ensure: has
> not started to run (via suspendedContext pc = suspendedContext startpc)
> because ensure:
> uses valueNoContextSwitch, and so there is no suspension point before the
> wait."
> notWaitingButMadeNoProgress :=
> runnable
> and: [selectorJustSent == #wait
> and: [suspendedContext sender selectorJustSentOrSelf ==
> #valueNoContextSwitch]].
> notWaitingButMadeNoProgress ifFalse:
> [suspendedContext := suspendedContext home]].
> ^self].
>
> "Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still
> blocked?  If so, nothing further to do."
> runnable ifFalse: [^self].
>
> "If still at the wait the ensure: block has not been activated, so signal
> to restore."
> selectorJustSent == #wait ifTrue:
> [suspendedContext receiver signal].
>
> "If still at the lock primitive and the lock primitive just acquired
> ownership (indicated by it answering false)
> then the ensure block has not been activated, so explicitly
> primitiveExitCriticalSection to unlock."
> (selectorJustSent == #primitiveEnterCriticalSection
> or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection])
> ifTrue:
> [(suspendedContext stackPtr > 0
>  and: [suspendedContext top == false]) ifTrue:
> [suspendedContext receiver primitiveExitCriticalSection]]
>
>
> Let's go through it line by line.  First, runnable is an argument,
> determined in Process>>#terminate.  One could invoke it with
>
>     self releaseCriticalSection: oldList class == LinkedList
>
> but this means that an already suspended process is assumed to be not
> runnable, which makes it tricky to debug the Process>>#terminate method.
> One has to assign to oldList while stepping though the method.  I've chosen
> safety, assuming that the process is still runnable if suspend answers nil,
> its simply being debugged.
>
> Then we're only interested in <criticalSection> marked methods se we
> return if there's no such pragma.
>
> Then we deal with blocks in these methods.  One issue here is to avoid
> running the ensure: block twice if it is already being evaluated.  The
> other is to run it if it is stalled and has yet to be run.
>
> So if
>
> suspendedContext isClosureContext ifTrue:
>
> we're in the ensure: receiver or argument blocks in any <criticalSection>
> marked method, i.e. Semaphore>>critical: and Mutex>>critical:[ifLocked:].
> If wait was just sent then we're in the ensure: receiver block of
> Semaphore>>critical: (V2 & V3 above) and the issue is whether the process
> is blocked or is unblocked and has made no progress. If blocked then
> nothing needs to be done; the ensure: block is discarded and the stack cut
> back to the critical: activation.  If progress has been made then nothing
> needs to be done (in fact we can't be in this state; the ensure: receiver
> will have started evaluating the critical: block argument).  If unblocked,
> but no progress has been made, do /not/ discard the unwind block and it
> will be run in Process>>#terminate when this method returns.  Hence...
>
> [suspendedContext sender selector == #ensure: ifTrue:
> [| notWaitingButMadeNoProgress |
> "Avoid running the ensure: block twice, popping it if it has already been
> run. If runnable
> but at the wait, leave it in place. N.B. No need to check if the block
> receiver of ensure: has
> not started to run (via suspendedContext pc = suspendedContext startpc)
> because ensure:
> uses valueNoContextSwitch, and so there is no suspension point before the
> wait."
> notWaitingButMadeNoProgress :=
> runnable
> and: [selectorJustSent == #wait
> and: [suspendedContext sender selectorJustSentOrSelf ==
> #valueNoContextSwitch]].
> notWaitingButMadeNoProgress ifFalse:
> [suspendedContext := suspendedContext home]].
> ^self].
>
> Now we're left with the simpler version of Semaphore>>critical: (V1 above)
> and the two Mutex methods Mutex>>#critical:[ifLocked:].  Here the only
> state we have to worry about is that the process is unblocked but has made
> no progress.  If not runnable the process is still blocked and we can
> simply return.
>
> "Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still
> blocked?  If so, nothing further to do."
> runnable ifFalse: [^self].
>
> If #wait was just sent the process is in Semaphore>>#critical: and,
> because ensure: has not been sent we signal explicitly to restore the
> signal count:
>
> "If still at the wait the ensure: block has not been activated, so signal
> to restore."
> selectorJustSent == #wait ifTrue:
> [suspendedContext receiver signal].
>
> If either of primitiveEnterCriticalSection or
> primitiveTestAndSetOwnershipOfCriticalSection have just been sent then
> either the Mutex is already owned, in which case the ensure block is
> elsewhere in the colder part of the stack, or has just been owned, and
> because ensure: has not been sent we unlock explicitly to release the Mutex:
>
> "If still at the lock primitive and the lock primitive just acquired
> ownership (indicated by it answering false)
> then the ensure block has not been activated, so explicitly
> primitiveExitCriticalSection to unlock."
> (selectorJustSent == #primitiveEnterCriticalSection
> or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection])
> ifTrue:
> [(suspendedContext stackPtr > 0
>  and: [suspendedContext top == false]) ifTrue:
> [suspendedContext receiver primitiveExitCriticalSection]]
>
> So Pharoers can you read and say whether you think this is sane or not?
> If so, then we can kibbutz to write the Pharo version.
>
> Squeakers can you review Kernel-eem.1183 & Kernel-eem.1184 in the inbox?
> Kernel-eem.1183 includes the fix as described above.  Kernel-eem.1184
> reverts Semaphore>>#critical: to V1 above.
>
>
> P.S. Looking at V1 above it seems to me that there is an issue if the
> process is preempted in ensure: before sending valueNoContextSwitch:.  I'll
> try and write a test that advances a process to that precise point.  If
> that test fails I think we have to use V2 or V3, and V2 is clearly
> preferable.
>

Lovely.  I added the test (testSemaCriticalWaitInEnsure
& testMutexCriticalBlockedInEnsure) and it is V1 that works and V2 that
does not.  I think it best not to try and be too clever and fix terminate
and/or releaseCriticalSection: for this case.  We can simply stick with V1
for now.

_,,,^..^,,,_
best, Eliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20180726/afb5a2c6/attachment.html>


More information about the Squeak-dev mailing list