I believe the problem here is that flushAllSuchThat: is now waiting on readSynch, but the signal it needs is waiting on accessProtect (pending nextPut:) before doing the readSynch signal.
Classic deadlock situation.
What a bear. It required an extra cup of coffee to figure out how SharedQueue is even trying to work, much less how to fix it. This message has my analysis of the situation, but the punchline is very simple: let's replace SharedQueue with a new class as described in issue #1375. Semaphore is just too hard to use correctly.
The problem John's detective work has found is that flushAllSuchThat: is grabbing accessProtect, but other methods such as #next remain free to decrement readSynch even though they don't hold accessProtect. This leads to deadlocks.
That's the short of it. I have written up a longer description of the current implementation's strategy down below. It might make a good addition to the class comment for SharedQueue, if we decide to keep the current implementation around. One thing that gave me pause, by the way, is that the "flush" methods do not do what I think of as "flushing" -- instead, they are "removing" items. "Flush" normally means to push data downstream, not to destroy data completely. It may be better to name these methods with "remove" instead of "flush".
The problem John has found applies more widely than just flushAllSuchThat:--it also impacts nextOrNilSuchThat: and nextOrNil. I'll call these methods "the Bad 3". All of the Bad 3 share the habit of checking readPointer vs. writePointer for available items *before* waiting on readSynch. This is bad because some other process may have done the right thing and waited first on readSynch. Thus, one of the items that the Bad 3 *think* is available based on readPointer vs. writePointer, may have already been reserved by another process by waiting on readSynch.
This observation leads to a simple exhibition of the problem that I have appended below. It uses only next, nextPut:, and nextOrNil. It takes advantage of the fact that nextOrNil does not decrease readSynch at all unless it clears it completely to 0; this can be used to cause readSynch to have more signals on it than there are actual items in the queue.
Here's a solution strategy based on the above analysis. Whenever one of the Bad 3 wants to consider removing an element, it should first try to decrease readSynch. If the attempt succeeds, then it has full rights to the next item in the queue and can procede. If the attempt fails, then there are no more unreserved items in the queue and the method should stop its work. After doing their normal work, the Bad 3 methods should all increase readSynch by the number if items they reserved but did not actually remove, so that other callers may access those items.
Unfortunately, class Semaphore does not sufficiently support "try to wait, but it's okay if there aren't any signals available". It has waitTimeoutSeconds:, which presumably you could use with a timeout of 0, but that is not quite enough because you cannot tell whether it timed out or ate a signal. This functionality is key to the solution strategy I proposed, and so I am giving up on this solution for now.
Given all this analysis, I see three possible ways forward:
1. Make a new SharedQueue based on Nathanael's Monitor implementation.
2. Improve class Semaphore as described above, so that the all-semaphores approach can be made to work.
3. Come up with another solution strategy that can use the current Semaphore class. For example, it may work to have all the reading methods (the Bad 3 and also #next) poll whenever they successfully wait on readSynch but then discover that no items are available.
My inclination is to try #1..... And 30 minutes later, I have done so. Check out Mantis bug #1375. I'll post this message anyway for the record, in case someone wants to perservere with #2 for some reason.
-Lex
CURRENT IMPLEMENTATION STRATEGY
SharedQueue works like this:
contentsArray: holds the actual items. nil entries are unused (and so, nil cannot be sent across a SharedQueue). the buffer is NOT circular.
readPointer: the pointer into contentsArray to start reading the next element. nil's are skipped.
writePointer: the pointer where the next element should be placed.
readSynch: a semaphore holding, roughly, a count of the number of items available to read. To be precise, it holds a number less than or equal to the number of available items; if it holds a lower number, then the delta by which it is lower is the number of calls to #next (and similar methods) that have registered to remove an item but have not yet removed it.
accessProtect: a semaphore used for mutual exclusion. it must be held by code using or modifying contentsArray, readPointer, or writePointer.
==== EXAMPLE CODE ==== "exhibits a problem where nextOrNil does not properly decrease readSynch"
q := SharedQueue new. q nextPut: 1. q nextPut: 2. "readSynch has 2 signals now"
q nextOrNil. "readSynch *still* has 2 signals" [ r1 _ q next ] fork. [ r2 _ q next ] fork. "the second #next will think there is an item available. BOOM." (Delay forSeconds: 1) wait.
"the below code finishes the exercise, but the code has already crashed by this point..." q nextPut: 3. {r1.r2}
squeak-dev@lists.squeakfoundation.org