Replacing Semaphores with Mesa Monitors

Wed May 12 22:36:27 UTC 2004

I would like to propose a change to the core synchronization 
construct used in Squeak.  I believe that we should switch from the 
Semaphore to the Mesa-style monitor with condition variables.

Note that I'm not just proposing that we _add_ Mesa-style monitors to 
the existing bevy of synchronization constructs: I could do that 
without getting any consensus from the community, and indeed we have 
implemented Mesa monitors in terms of Semaphores.  I'm proposing that 
we remove Semaphores entirely, and use Mesa-style monitors instead.

The basic reason for this proposal is that Semaphores are hard to 
use, and likely to be a source of synchronization bugs.  The idea of 
a Semaphore with a time-out is not very well defined: what is 
supposed to be true after one has been released from such a 
Semaphore?  Mesa-style monitors were developed by the folks at PARC 
and later at Digital SRC who developed Mesa, Modula-2+ and finally 
Modula-3, and have stood the test of time.  They are more expressive 
than semaphores, since they tie together the protected variables and 
the conditions for which processes are waiting, and they open the way 
for both compile-time and run-time checks.

What is a Mesa-style monitor?  Most of us are probably familiar with 
a Hoare-style monitor, in which a group of variables are protected 
from concurrent access by a monitor lock.  This is like a Java object 
in which all of the public methods are declared to be synchronized. 
I'm not at this point proposing any language changes, so it would be 
up to the programmer to put the bodies of the synchronized methods 
inside a monitor critical: [ block ], as is presently done with 
semaphores.

Once inside a monitor, what does a process do when it finds that it 
needs some resource that is not presently available?  It waits on a 
so-called condition variable, which is logically associated with the 
boolean condition for which it is waiting, such as data being present 
on a socket, a file being free, or a queue being non-empty or 
non-full. The difference between Hoare and Mesa-style monitors is in 
the way that condition variables are handled.

In both styles of Monitor, waiting on a condition variable entails 
_atomically_ releasing the monitor lock and suspending the current 
process.  (This can be easily implemented by a new VM primitive, or 
by using a counter and a semaphore) .  The main difference between 
the two styles lies in the semantics of a signal on a condition 
variable.   In the Hoare formulation, a signal can only be performed 
by a process that itself has the monitor lock, and which can 
therefore guarantee that the condition for which the waiting process 
is waiting is now true.  If there is no process waiting, the signal 
does nothing, but if there is a waiter (or several), the semantics of 
signal is to suspend the signaller and transfer control to the 
waiter(or one of the waiters).    In the Mesa formulation, signal is 
a _hint_: the signaller says that something may have changed, and the 
responsibility falls on the waiting process to check whether the 
condition for which it is waiting is satisfied, and if not, to wait 
again.  This may seem like a step backwards, but in fact it 
significantly simplifies the implementation, because it is always 
correct to signal a condition - it's a bit like a "self changed" 
message in the Observer pattern.  This means that the signaller does 
not have to hold the monitor lock, and that  interrupt handlers and 
hardware can therefore signal condition variables directly.  It is 
this property that makes it possible to use Mesa monitors as the 
_only_ synchronization construct.

The tradeoff is that the condition must always be re-checked after a 
wait.  The correct usage pattern is

	[ condition ] whileFalse: [ conditionVariable wait ]

and not (condition) ifFalse: [ conditionVariable wait ].

Chuan-Kai Lin (a student here at OGI) and I have developed an 
implementation of Monitors in terms of semaphores (which is sometimes 
a bit convoluted, to avoid race conditions).  We have some simple 
tests, but the implementation needs further testing and, more 
importantly, careful scrutiny by many eyes, before I would be 
confident that we have avoided lost signals and race conditions. 
However, I would be much happier with an implementation that bypasses 
Semaphores altogether, and has the kernel primitives operate directly 
on condition variables.  This would not only be more efficient (not 
an insignificant consideration when one is talking about interrupt 
handling), but also simpler and therefore less likely to contain 
hidden bugs.

My first attempts to achieve this involved subclassing Semaphore to 
create classes for monitor condition variables and monitor locks, but 
carefully leaving the layout of the instance variables the same so 
that the existing Seamphore primitives would still work. 
Nevertheless, the primitives failed.  Tim Rowledge tells me that this 
is because there is an explicit test of the class of the object (= 
Semaphore) in the implementation of <primitive: 86>.

So, in order to make progress on this, we need to either modify the 
implementation of the Semaphore primitives to remove this check, or 
to create some new (unchecked) primitives.  I've never played with 
kernel primitives before, and have no idea even how to start looking 
at doing this.

OK, I'm almost finished.  The final fly in the ointment is what to do 
when a waiting thread shouldn't wait any more.  Perhaps it is waiting 
for data on a socket, and the socket was closed.  Or perhaps it was 
waiting for user input, and the user just hit the interrupt key.  How 
does one deal with this?

The "pure" answer is: in the usual way.  You have to write your 
conditions so that they say "data available OR socket closed", etc. 
This is obviously correct, and often quite impractical.

The answer adopted by current Squeak code is to use timeouts on the 
wait.  The main problem with this (apart from implementing it 
correctly and efficiently; see Julian Fitzell's 
SemaphoreWaitAndWaitTimeoutFix-jf from 11 Sept 2003) is that the 
semantics of the wait message are no longer clear.  Even with the 
Hoare monitor, the meaning of being awoken is no longer that the 
condition for which you were waiting is satisfied: it might just be 
that a timer has gone off.    So, the awoken process has to check 
what has happened explicitly - and presumably going back to wait 
again is not the desired behavior if a timeout has occurred and the 
condition is still false.

The Modula-3 libraries solve this problem by introducing alerts, 
which are sent to processes.  The receiving process sees an "alerted" 
exception, or a flag set.  If a process is waiting on a condition 
that might be false for a long time (like user input), the wait 
should be an "alertWait", which means that if the process is alerted, 
the wait will be terminated.  Once the process is running again, 
inside the monitor, it will experience an alerted exception: it will 
not execute the normal code path.    I'm not a great fan of 
exceptions, so in Squeak I would instead suggest a method like:

	aCondition waitIfAlerted: [ aBlock ]

which will terminate normally if the condition is signalled, and will 
evaluate aBlock if it is alerted.

So the question for discussion on the list is: is this cleanup worth 
doing?  Is there a volunteer to work on the kernel primitives?

To help the discussion, I've assembled some resources on the Swiki: 
see http://minnow.cc.gatech.edu/squeak/Monitors.  The prototype 
implementation is there, as are links to various documents that 
provide more background than there is room for here.

	Andrew