[squeak-dev] Process #suspend / #resume semantics

Eliot Miranda eliot.miranda at gmail.com
Tue Dec 28 21:41:47 UTC 2021


Hi Jaromir, Hi Levente, Hi Vanessa, Hi All,

On Tue, Dec 28, 2021 at 11:55 AM <mail at jaromir.net> wrote:

> Hi Eliot,
>
> Thanks! Please see my comments below, it seems to me there may be a bug in
> the Mutex.
>

Executive summary: AFAICT it is due to a bug in the suspend primitive.
Specifics below, including two proposed fixes that need community review
[look for "Should we change primitiveSuspend to the above?" and "So let's
look at the alternative"].

~~~
> ^[^    Jaromir
>
> Sent from Squeak Inbox Talk
>
> On 2021-12-27T14:55:22-08:00, eliot.miranda at gmail.com wrote:
>
> > Hi Jaromir,
> >
> > On Mon, Dec 27, 2021 at 2:52 AM <mail at jaromir.net> wrote:
> >
> > > Hi all,
> > >
> > > What is the desirable semantics of resuming a previously suspended
> process?
> > >
> >
> > That a process continues exactly as it would have if it had not been
> > suspended in the first place.  In this regard our suspend is hopelessly
> > broken for processes that are waiting on condition variables. See below.
> >
> >
> > >
> > > #resume's comment says: "Allow the process that the receiver represents
> > > to continue. Put the receiver in *line to become the activeProcess*."
> > >
> > > The side-effect of this is that a terminating process can get resumed
> > > (unless suspendedContext is set to nil - see test KernelTests-jar.417 /
> > > Inbox - which has the unfortunate side-effect of #isTerminated answering
> > > true during termination).
> > >
> >
> > But a process that is terminating should not be resumable.  This should be
> > a non-issue.  If a process is terminating itself then it is the active
> > process, it has nil as its suspendedContext, and Processor
> > activeProcess resume always produces an error.  Any process that is not
> > terminating itself can be made to fail by having the machinery set the
> > suspendedContext to nil.
> >
>
> Yes agreed, but unfortunately that's precisely what is not happening in
> the current and previous #terminate and what I'm proposing in
> Kernel-jar.1437 - to set the suspendedContext to nil during termination,
> even before calling #releaseCriticalSection.
>
> >
> > > A similar side-effect: a process originally waiting on a semaphore and
> > > then suspended can be resumed into the runnable state and get scheduled,
> > > effectively escaping the semaphore wait.
> > >
> >
> > Right.  This is the bug.  So for example
> >     | s p |
> >     s := Semaphore new.
> >     p := [s wait] newProcess.
> >     p resume.
> >     Processor yield.
> >     { p. p suspend }
> >
> > answers an Array of process p that is past the wait, and the semaphore, s.
> > And
> >
> >     | s p |
> >     s := Semaphore new.
> >     p := [s wait] newProcess.
> >     p resume.
> >     Processor yield.
> >     p suspend; resume.
> >     Processor yield.
> >     p isTerminated
> >
> > answers true, whereas in both cases the process should remain waiting on
> > the semaphore.
> >
> > >
> > > Is this an expected behavior or a bug?
> > >
> >
> > IMO it is a dreadful bug.
> >
> > > If a bug, should a suspended process somehow remember its previous state
> > > and/or queue and return to the same one if resumed?
> > >
> >
> > IMO the primitive should back up the process to the
> > wait/primitiveEnterCriticalSection. This is trivial to implement in the
> > image, but is potentially non-atomic.  It is perhaps tricky to implement
> > in the VM, but will be atomic.
> >
> > Sorry if I'm missing something :)
> > >
> >
> > You're not missing anything :-)  Here's another example that answers two
> > processes which should both block but if resumed both make progress.
> >
> >     | s p1 p2 m |
> >     s := Semaphore new.
> >     m := Mutex new.
> >     p1 := [m critical: [s wait]] newProcess.
> >     p1 resume.
> >     p2 := [m critical: [s wait]] newProcess.
> >     p2 resume.
> >     Processor yield.
> >     { p1. p1 suspend. p2. p2 suspend }
> >
> > p1 enters the mutex's critical section, becoming the mutex's owner. p2 then
> > blocks attempting to enter m's critical section.  Let's resume these two,
> > and examine the semaphore and mutex:
> >
> >     | s p1 p2 m |
> >     s := Semaphore new.
> >     m := Mutex new.
> >     p1 := [m critical: [s wait]] newProcess.
> >     p1 resume.
> >     p2 := [m critical: [s wait]] newProcess.
> >     p2 resume.
> >     Processor yield.
> >     { p1. p1 suspend. p2. p2 suspend }.
> >     p1 resume. p2 resume.
> >     Processor yield.
> >     { s. m. p1. p1 isTerminated. p2. p2 isTerminated }
> >
> > In this case the end result for p2 is accidentally correct. It ends up
> > waiting on s within m's critical section. But p1 ends up terminated.  IMO
> > the correct result is that p1 remains waiting on s, and is still the owner
> > of m, and p2 remains blocked trying to take ownership of m.
> >
>
> Perfect example! My naive expectation was that when a process inside a
> critical section gets suspended the Mutex gets unlocked, but that's
> apparently wrong :)
>

suspend merely stops the receiver from running, making it unrunnable.  Here's
the StackInterpreter's version of the primitive:

primitiveSuspend
    "Primitive. Suspend the receiver, aProcess such that it can be executed again
     by sending #resume. If the given process is not currently running, take it off
     its corresponding list. The primitive returns the list the receiver was
     previously on."
    | process myList |
    process := self stackTop.
    process = self activeProcess ifTrue:
        [self pop: 1 thenPush: objectMemory nilObject.
         ^self transferTo: self wakeHighestPriority].
    myList := objectMemory fetchPointer: MyListIndex ofObject: process.
    "XXXX Fixme. We should really check whether myList is a kind of LinkedList or not
     but we can't easily so just do a quick check for nil which is the most common case."
    myList = objectMemory nilObject ifTrue:
        [^self primitiveFailFor: PrimErrBadReceiver].
    "Alas in Spur we need a read barrier"
    (objectMemory isForwarded: myList) ifTrue:
        [myList := objectMemory followForwarded: myList.
         objectMemory storePointer: MyListIndex ofObject: process withValue: myList].
    self removeProcess: process fromList: myList.
    self successful ifTrue:
        [objectMemory storePointerUnchecked: MyListIndex ofObject: process withValue: objectMemory nilObject.
         self pop: 1 thenPush: myList]

There are two interpretations possible here.  One is that suspend should
fail if attempted on a process that is waiting on a condition variable.
Another is that if the process is waiting on a condition variable, then
suspend should back up the process so that on resumption the wait is
retried.  The first is simple.  The second is horribly complicated (backing
up execution is difficult; the process could have been added to the
condition variable's list, and must be removed).  I'm interested in which
of these people think we should implement.  If just failure, things are
easy :-)  Anyway, let's understand the bug.

The big issue is with the lines
    myList := objectMemory fetchPointer: MyListIndex ofObject: process.
    "XXXX Fixme. We should really check whether myList is a kind of LinkedList or not
     but we can't easily so just do a quick check for nil which is the most common case."
    myList = objectMemory nilObject ifTrue:
        [^self primitiveFailFor: PrimErrBadReceiver].

This should really check for the process's myList being its run queue, the
linked list in the processor scheduler's runnableProcesses array. But it
doesn't.  It only checks for nil.  In our case p2 is waiting on m, so its
list is m, and not nil. Hence the primitive succeeds. It then continues to
remove p2 from m, and set its list to nil:

    self removeProcess: process fromList: myList.
    self successful ifTrue:
        [objectMemory storePointerUnchecked: MyListIndex ofObject: process withValue: objectMemory nilObject.
         self pop: 1 thenPush: myList]

So now when p1 is terminated and m releases its critical section there is
no p2 waiting on it that m can schedule to take ownership.  Hence p2 does
not take ownership of m.
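
Here's a small workspace demonstration of that, along the lines of the
examples above (assuming Process>>suspendingList answers the myList inst var,
as in current Squeak images; the results shown are those of the current,
buggy primitive):

    | s m p1 p2 |
    s := Semaphore new.
    m := Mutex new.
    p1 := [m critical: [s wait]] newProcess.
    p1 resume.
    p2 := [m critical: [s wait]] newProcess.
    p2 resume.
    Processor yield.
    p2 suspendingList.    "m; p2 is blocked trying to enter m's critical section"
    p2 suspend.
    { p2 suspendingList. m isEmpty. m instVarNamed: 'owner' }
        "{nil. true. p1}: p2 has been unlinked from m's waiting list, so when m
         is released there is nobody left to schedule as the new owner"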

Now, if we're going to fix this, one easy way is for the suspend primitive
to insist that a process's myList is its run queue for suspend to succeed.
If primitiveSuspend can rely on the process's priority inst var then the
check is easy, something like:

primitiveSuspend
    "Primitive. Suspend the receiver, aProcess such that it can be executed again
     by sending #resume. If the given process is not currently running, take it off
     its corresponding list. The primitive returns the list the receiver was
     previously on."
    | process myList myRunQueue |
    process := self stackTop.
    process = self activeProcess ifTrue:
        [self pop: 1 thenPush: objectMemory nilObject.
         ^self transferTo: self wakeHighestPriority].
    myList := objectMemory fetchPointer: MyListIndex ofObject: process.
    myRunQueue := objectMemory
                      fetchPointer: (objectMemory fetchInteger: PriorityIndex ofObject: process) - 1
                      ofObject: (objectMemory fetchPointer: ProcessListsIndex ofObject: self schedulerPointer).
    myList ~= myRunQueue ifTrue:
        [^self primitiveFailFor: PrimErrBadReceiver].
    "Alas in Spur we need a read barrier"
    (objectMemory isForwarded: myList) ifTrue:
        [myList *:=* objectMemory followForwarded: myList.
         objectMemory storePointer: MyListIndex ofObject: process withValue:
 myList].
    self removeProcess: process fromList: myList.
    self successful ifTrue:
        [objectMemory storePointerUnchecked: MyListIndex ofObject: process
withValue: objectMemory nilObject.
         self pop: 1 thenPush: myList]

We were talking about manipulating priority in another thread. For the
above to work we have to keep priority and myList in sync when
manipulating the priority of a runnable process.  For me this is not a big
issue; manipulating priority directly counts as shooting oneself in the
foot. So with that said, the important questions are

Should we change primitiveSuspend to the above? i.e. have primitiveSuspend
fail unless it is suspending a runnable process.
If so, should this be optional behaviour? i.e. should the VM maintain a flag,
alongside preemptionYields, which if set (as it is by default) obtains the
old behaviour, and if clear obtains the new (correct) behaviour of failing
if the process is waiting on a condition variable.
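
For what it's worth, the effect of the first proposal can be prototyped
image-side today, albeit non-atomically (which is why the VM is the right
place for it).  Here's a minimal sketch, assuming Process>>suspendingList and
ProcessorScheduler>>waitingProcessesAt: behave as in current Squeak images;
the selector suspendIfRunnable is hypothetical:

    suspendIfRunnable
        "In class Process.  Suspend the receiver only if it is genuinely runnable
         (or already suspended); otherwise signal an error rather than silently
         unlinking the receiver from the condition variable it is waiting on.
         Non-atomic: the receiver could be resumed or pre-empted between the
         check and the suspend."
        self == Processor activeProcess ifTrue: [^self suspend].
        (self suspendingList isNil
         or: [self suspendingList == (Processor waitingProcessesAt: self priority)])
            ifFalse: [^self error: 'process is waiting on a condition variable'].
        ^self suspend

The modified primitive above performs essentially the same check, but
atomically inside the VM.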


So let's look at the alternative: having suspend succeed if the process is
waiting on a condition variable, but arranging somehow that the process
remains in the wait state.  This is preferable (because it doesn't
introduce a new error into the system), but is apparently more complex.
So...

The primitive could alternatively succeed, but
- not remove the process from myList unless myList is its run queue, i.e.
leave it on the list if it is waiting on a condition variable
- have primitiveResume *not* resume a process whose myList is not its
run queue.

I think this works.  The requirement is that primitiveResume check the
process's myList.  If myList is a run queue then resume functions as
normal.  If myList is not a run queue (presumably a condition variable)
then resume does nothing.  That's simple.


So how do we determine if a process's myList is a run queue?  One way is to
have the VM know the class LinkedList.  LinkedList is not in the
specialObjectsArray; Semaphore is.  One could assume Semaphore always
inherits directly from LinkedList (a safe assumption) and have the VM derive
LinkedList from Semaphore's superclass (objectMemory fetchPointer:
SuperclassIndex ofObject: (objectMemory splObj: ClassSemaphoreIndex)).  But
this is not a particularly strong test.  One might legitimately construct a
list of suspended processes on a LinkedList that was not a run queue.  So
safest is simply to compare myList against the list at the process's
priority:

    myList := objectMemory fetchPointer: MyListIndex ofObject: process.
    myRunQueue := objectMemory
                      fetchPointer: (objectMemory fetchInteger: PriorityIndex ofObject: process) - 1
                      ofObject: (objectMemory fetchPointer: ProcessListsIndex ofObject: self schedulerPointer).
    myList = myRunQueue ifTrue: ...

I expect this adds negligible overhead to suspend and resume.
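
Putting that together, the guard at the top of primitiveResume might look
something like the following.  This is a sketch only: the rest of the real
primitive (validating suspendedContext, actually making the process runnable,
following a forwarded myList as in primitiveSuspend) is elided.  Note that a
process suspended off its run queue has had its myList nilled by
primitiveSuspend, so nil has to be accepted alongside the run queue:

primitiveResume
    "Sketch of the proposed guard only; the remainder of the primitive is unchanged."
    | process myList myRunQueue |
    process := self stackTop.
    myList := objectMemory fetchPointer: MyListIndex ofObject: process.
    myRunQueue := objectMemory
                      fetchPointer: (objectMemory fetchInteger: PriorityIndex ofObject: process) - 1
                      ofObject: (objectMemory fetchPointer: ProcessListsIndex ofObject: self schedulerPointer).
    (myList = objectMemory nilObject or: [myList = myRunQueue]) ifFalse:
        ["The process is still parked on a condition variable; leave it there and
          answer the receiver without making the process runnable."
         ^self].
    "... existing resume logic continues here ..."

Under this scheme, in the p1/p2 example above, suspend would leave p1 on s
and p2 on m, and the subsequent resumes would leave them there, which is the
result described above as correct.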

So, sorry about the level of detail, but could people carefully review
these two proposals?


> But still, there's something wrong with the example: If p1 resumes it
> releases m's ownership and terminates, then p2 takes over and proceeds
> inside the critical section and gets blocked at the semaphore. I'd expect
> p2 would become the owner of the Mutex m BUT it's not! There's no owner
> while p2 is sitting at the semaphore. Try:
>
>     | s p1 p2 m |
>     s := Semaphore new.
>     m := Mutex new.
>     p1 := [m critical: [s wait]] newProcess.
>     p1 resume.
>     p2 := [m critical: [s wait]] newProcess.
>     p2 resume.
>     Processor yield.
>     { p1. p1 suspend. p2. p2 suspend }.
>     p1 resume. p2 resume.
>     Processor yield.
>     { s. m. p1. p1 isTerminated. p2. p2 isTerminated. m isOwned. m
> instVarNamed: 'owner' }
>
> It seems to me that when p2 gets suspended it is stopped somewhere inside
> #primitiveEnterCriticalSection before the owner is set and when it gets
> resumed it is placed into the runnable queue with the pc pointing right
> behind the primitive and so when it runs it just continues inside #critical
> and gets blocked at the semaphore, all without having the ownership.
>
> Is this interpretation right? It would mean Mutex's critical section can
> be entered twice via this mechanism...
>
> Cuis does set the ownership to p2 in this example.
>
> Thanks again,
>
> Jaromir
> >
> > >
> > > Best,
> > > ~~~
> > > ^[^    Jaromir
> > >
> > > Sent from Squeak Inbox Talk
> > >
> >
> > _,,,^..^,,,_
> > best, Eliot
>


-- 
_,,,^..^,,,_
best, Eliot