[cc: vm-dev which I accidentally left out earlier]
Attached my proposed fixes for myList manipulation. The first CS (PrimSuspend-ar) changes primitiveSuspend to enable atomic removal from myList for the non-active process. The second (SuspendFixes-ar) has the main modifications for the in-image part (this may not work for all Squeak versions - I used a Croquet image as the basis, YMMV).
Feedback is welcome, in particular from the usual suspects on vm-dev.
Cheers, - Andreas
Andreas Raab wrote:
I had an eventful (which is euphemistic for @!^# up) morning caused by Process>>terminate. In our last round of delay and semaphore discussions I had noticed that there is a possibility of a race condition in Process>>terminate but dismissed it as an application problem (e.g., if you send #terminate, make sure you have only one place where you send it).
This morning proved conclusively that this is a race condition which can affect *every* user of the system. It is caused by Process>>terminate which says:
myList remove: self ifAbsent: [].
The reason this is so problematic is that the modification of myList is not atomic: while the image is partway through it, the VM can manipulate the very same list because of an external event (like a network interrupt). When this happens in "just the right way", any number of processes at the same priority will "fall off" the scheduled list. In the image that I was looking at earlier we had the following situation:
- ~40 processes were not running
- The processes' myList was an empty linked list
- The processes were internally linked (via nextLink)
- The processes were all at the same priority
Given that most of the processes were unrelated other than having the same priority I think the evidence is pretty clear.
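To make the failure mode concrete, here is a toy model in Python. It is neither the VM nor the image code; it is only a sketch of one possible interleaving that ends in the state described above. All names (Process, LinkedList, remove_first_non_atomic, simulate_race) are made up for illustration, and the point at which the interrupt lands is an assumption.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.next_link = None   # plays the role of nextLink
        self.my_list = None     # plays the role of myList


class LinkedList:
    def __init__(self):
        self.first = None
        self.last = None

    def is_empty(self):
        # Assumption: emptiness is judged by the first link alone.
        return self.first is None

    def add_last(self, proc):
        # What the "VM" does when an external event makes proc runnable.
        if self.is_empty():
            self.first = proc
        else:
            self.last.next_link = proc
        self.last = proc
        proc.my_list = self

    def remove_first_non_atomic(self, proc, interrupt):
        # Image-side removal of the head, split where an interrupt may land.
        successor = proc.next_link   # step 1: read the successor
        interrupt()                  # the "VM" runs here and edits the list
        self.first = successor       # step 2: store a value that is now stale
        if self.last is proc:
            self.last = None
        proc.next_link = None
        proc.my_list = None


def simulate_race():
    run_queue = LinkedList()
    a, b, c = Process("A"), Process("B"), Process("C")
    run_queue.add_last(a)

    def external_event():
        # Two more processes become runnable at the same priority.
        run_queue.add_last(b)
        run_queue.add_last(c)

    run_queue.remove_first_non_atomic(a, external_event)
    return run_queue, b, c
```

In this model B and C end up still pointing at the list and still linked to each other, while the list itself looks empty, so nothing will ever schedule them again.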
The question is now: How can we fix it? My proposal would be to simply change primitiveSuspend such that for a non-active process it will primitively take the process off its suspendingList. This makes suspend a little more general and (by returning the previous suspendingList) it will also guard us against any following cleanup (like the Semaphore situations earlier).
Unfortunately, this *will* require VM changes but I don't think it can be helped at this point since the VM will be manipulating these lists atomically anyway. The good news, though, is that we can have reasonable fallback code for primitiveSuspend which does exactly what we do today.
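The intended semantics of the proposal above can be sketched in the same toy style (Python, not the real primitive; every name here is hypothetical). The assumption is that the real primitiveSuspend performs this whole step without interruption; a Python function cannot provide that guarantee and only models the result: the process is taken off its list and the previous list is returned to the caller.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.next_link = None
        self.my_list = None


class LinkedList:
    def __init__(self):
        self.first = None
        self.last = None

    def add_last(self, proc):
        if self.first is None:
            self.first = proc
        else:
            self.last.next_link = proc
        self.last = proc
        proc.my_list = self

    def remove(self, proc):
        # Unlink proc wherever it sits; do nothing if it is absent.
        prev, cur = None, self.first
        while cur is not None and cur is not proc:
            prev, cur = cur, cur.next_link
        if cur is None:
            return
        if prev is None:
            self.first = proc.next_link
        else:
            prev.next_link = proc.next_link
        if self.last is proc:
            self.last = prev
        proc.next_link = None

    def names(self):
        out, cur = [], self.first
        while cur is not None:
            out.append(cur.name)
            cur = cur.next_link
        return out


def suspend(proc):
    """Take proc off its list and return the list it was on, or None."""
    old_list = proc.my_list
    if old_list is None:
        return None          # already suspended; safe to call again
    proc.my_list = None
    old_list.remove(proc)
    return old_list
```

A caller such as terminate would then use the returned list for any follow-up cleanup instead of touching myList itself, and suspending an already suspended process is harmless.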
Any comments? Alternatives? Suggestions?