Below a stack trace we've been seeing a couple of times lately in our wxSqueak project (which is mostly 3.8). It seems that SuspendedDelays has a nil element somewhere. However, the code in Delay seems to be safe against that and I can't see any code elsewhere in the image that messes around with delays...
Has anyone ever seen this?
UndefinedObject(Object)>>doesNotUnderstand: #resumptionTime [] in Delay class>>DoIt {[:t2 :t3 | t2 resumptionTime <= t3 resumptionTime]} SortedCollection>>indexForInserting: SortedCollection>>add: [] in Delay>>schedule {[beingWaitedOn := true. resumptionTime := Time millisecondClockValue + dela...]} [] in Semaphore>>critical: {[t2 := t1 value]} BlockContext>>ensure: Semaphore>>critical: Delay>>schedule Delay>>wait TUDispatchingDatagramChannel(TUDatagramChannel)>>retransmissionRun [] in TUDispatchingDatagramChannel(TUDatagramChannel)>>start {[self retransmissionRun]} BlockContext>>repeat [] in TUDispatchingDatagramChannel(TUDatagramChannel)>>start {[[self retransmissionRun] repeat]} [] in BlockContext>>newProcess {[self value. Processor terminateActive]}
Some more digging revealed a pending VW fix that might be related - it purports to fix a hole in Semaphore>>critical: and I wonder whether it might apply to Squeak as well.
Current Squeak implementation: critical: mutuallyExcludedBlock | blockValue | self wait. [blockValue _ mutuallyExcludedBlock value] ensure: [self signal]. ^blockValue
Current VisualWorks implementation: critical: mutuallyExcludedBlock self waitIfCurtailedSignal. ^mutuallyExcludedBlock ensure: [self signal]
Proposed fix (see http://tinyurl.com/9ed25): critical: mutuallyExcludedBlock | recover | recover := false. ^[ | result | self waitIfCurtailedSignal. "The dynamic translator creates code that can't be interrupted here (I think)." recover := true. result := mutuallyExcludedBlock value. recover := false. "The dynamic translator creates code that can't be interrupted here (I think)." self signal. result ] ifCurtailed: [ recover ifTrue: [ recover := false. self signal]].
The fix has some test code in its comment (again, see the tinyurl) that I'm going to run now, but I was wondering whether people could comment on the applicability to Squeak of this fix. In VW, it seems to make a difference...
On 1/5/06, Cees De Groot cdegroot@gmail.com wrote:
Below a stack trace we've been seeing a couple of times lately in our wxSqueak project (which is mostly 3.8). It seems that SuspendedDelays has a nil element somewhere. However, the code in Delay seems to be safe against that and I can't see any code elsewhere in the image that messes around with delays...
Has anyone ever seen this?
UndefinedObject(Object)>>doesNotUnderstand: #resumptionTime [] in Delay class>>DoIt {[:t2 :t3 | t2 resumptionTime <= t3 resumptionTime]} SortedCollection>>indexForInserting: SortedCollection>>add: [] in Delay>>schedule {[beingWaitedOn := true. resumptionTime := Time millisecondClockValue + dela...]} [] in Semaphore>>critical: {[t2 := t1 value]} BlockContext>>ensure: Semaphore>>critical: Delay>>schedule Delay>>wait TUDispatchingDatagramChannel(TUDatagramChannel)>>retransmissionRun [] in TUDispatchingDatagramChannel(TUDatagramChannel)>>start {[self retransmissionRun]} BlockContext>>repeat [] in TUDispatchingDatagramChannel(TUDatagramChannel)>>start {[[self retransmissionRun] repeat]} [] in BlockContext>>newProcess {[self value. Processor terminateActive]}
Well, the test code which I reproduce below throws an error very quickly on Squeak. When I patch #critical:, the test runs out just fine.
Should this be adopted?
100 timesRepeat: [ | s procs | s := Semaphore forMutualExclusion. procs := (1 to: 50) collect: [:i | [Processor yield. [s critical: []] repeat] forkAt: Processor activeProcess priority - 1]. procs do: [:p | (Delay forMilliseconds: 2) wait. p terminate]. (Delay forMilliseconds: 100) wait. s isSignaled ifFalse: [self error: 'Failure to release critical semaphore']].
On 1/23/06, Cees De Groot cdegroot@gmail.com wrote:
Some more digging revealed a pending VW fix that might be related - it purports to fix a hole in Semaphore>>critical: and I wonder whether it might apply to Squeak as well.
And here is another funny one. Two colleagues of mine where looking whether our trouble was related to SortedCollection in combination with Semaphore (these are the primary objects involved with the original issue, and we have another couple of bug reports which shows sorted collections exhibiting funny behavior even though the are protected by Semaphores).
They whipped this up:
10 timesRepeat: [ | collection threadCount semaphore safeguard | collection := SortedCollection new.
safeguard := Semaphore forMutualExclusion.
threadCount := 50. semaphore := Semaphore new. semaphore signal.
(1 to: threadCount) do: [:each | Transcript show: each asString;cr. semaphore signal. [1 to: 250000 do: [:x | safeguard critical: [collection add: x; remove: x]]] fork. semaphore wait].
semaphore wait.
self assert: collection isEmpty]
This code reliably crashes in the "remove: x" mentioning that the object is not present in the collection. Which, looking at the code, is impossible.
Is Semaphore broken? Or are we overlooking something in our test?
By the way - you may need to run this one more than once. It doesn't happen on every run.
On 1/23/06, Cees De Groot cdegroot@gmail.com wrote:
And here is another funny one.
Discussing this with Radoslav Hodničák on the Parcplace.net #smalltalk channel turned up that this could be due to blocks not being real closures in Squeak. Sending #fixTemps to the block that has [:x |...] seems to prevent the crash.
"<rh> process 1 is invoked with x=1, runs into critical, adds 1 to collection, get preempted. process 2 is invokes with x=2 (x is now 2 for all processes), runs into critical block, stops, processes 1 now tries to remove 2 and fails"
which makes sense.
Urk! I thought Smalltalk had evolved to the point where blocks were proper closures... this will definitely bite me as I have many years of functional programming language intuition, and the subtle distinction here will certainly lead me astray. #fixTemps doesn't do what I expect either.
Is this just Squeak? Do other Smalltalks get it right? Are there plans to fix this in Squeak?
../Dave
On 1/23/06, Dave Mason dmason@sarg.ryerson.ca wrote:
Is this just Squeak?
There are Smalltalks that have real closures. But I can't give an exhaustive list so I cannot answer this in the positive.
to fix this in Squeak?
Yup, in 3.9a it already has been fixed IIRC.
On 23.01.2006, at 16:06, Cees De Groot wrote:
On 1/23/06, Dave Mason dmason@sarg.ryerson.ca wrote:
Is this just Squeak?
There are Smalltalks that have real closures. But I can't give an exhaustive list so I cannot answer this in the positive.
to fix this in Squeak?
Yup, in 3.9a it already has been fixed IIRC.
More like one step towards it.... you can install an optional compiler from SqueakMap to have full closures... so there will be runtime support in 3.9, but you need to install the additional compiler package to make use of them.
Marcus
On 23 janv. 06, at 16:06, Cees De Groot wrote:
On 1/23/06, Dave Mason dmason@sarg.ryerson.ca wrote:
Is this just Squeak?
There are Smalltalks that have real closures. But I can't give an exhaustive list so I cannot answer this in the positive.
VisualWorks VisualAge Dolphin Smalltalk/X GnuSmalltalk should have full blockclosures
to fix this in Squeak?
Yup, in 3.9a it already has been fixed IIRC.
Is this just Squeak? Do other Smalltalks get it right?
VW of course gets it right. I don't know about others.
Are there plans to fix this in Squeak?
#compileUseNewCompiler #compileBlocksAsColsures The problem is that this makes all blocks (which are not inlined like #ifTrue:) much slower.
This is a real issue with Seaside as well. In general as soon as you loop and use and html render block, you are in need of #fixTemps.
Cees De Groot wrote:
Should this be adopted?
The VW fix does seem overly complex for Squeak - the main issue is that in Semaphore>>critical: we can be interrupted between the following two lines:
self wait. [blockValue := mutuallyExcludedBlock value] ensure:[self signal].
Simply moving #wait inside the ensure'd block will cure the problem (for very, very specific and nitpicky reasons that I'm not going to explain in detail unless someone actually wants to know ;-)
Cheers, - Andreas
Er, if we're messing with this, and with rescheduling processes when you change priorities, do we want to consider how Processor>>yield behaves? Should it allow a lower priority process to run if there are no processes runable at the same priority?
On 23-Jan-06, at 7:54 PM, Andreas Raab wrote:
Cees De Groot wrote:
Should this be adopted?
The VW fix does seem overly complex for Squeak - the main issue is that in Semaphore>>critical: we can be interrupted between the following two lines:
self wait. [blockValue := mutuallyExcludedBlock value] ensure:[self signal].
Simply moving #wait inside the ensure'd block will cure the problem (for very, very specific and nitpicky reasons that I'm not going to explain in detail unless someone actually wants to know ;-)
Cheers,
- Andreas
-- ======================================================================== === John M. McIntosh johnmci@smalltalkconsulting.com 1-800-477-2659 Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com ======================================================================== ===
On 23-Jan-06, at 8:19 PM, John M McIntosh wrote:
Er, if we're messing with this, and with rescheduling processes when you change priorities, do we want to consider how Processor>>yield behaves? Should it allow a lower priority process to run if there are no processes runable at the same priority?
If you mean 'no >other< processes' then it would be almost completely pointless since at the very next opportunity the higher priority process would preempt the low priority one. In the case of 'no processes' then obviously a lower priority process would get a turn anyway.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Useful random insult:- Cackles a lot, but I ain't seen no eggs yet.
Well right now it only checks for processes in the same priority group, doesn't consider ones higher, mind they should run at some point. Also doesn't consider any lower, so they'll never run, which is different from will run *a bit*. Howwcwe I will accept that run *a bit* might be a worthless exercise? Still if the intent is to yield it seems that is a poor choice of words if it never yields to lower priority processes.
On 23-Jan-06, at 9:36 PM, tim Rowledge wrote:
On 23-Jan-06, at 8:19 PM, John M McIntosh wrote:
Er, if we're messing with this, and with rescheduling processes when you change priorities, do we want to consider how Processor>>yield behaves? Should it allow a lower priority process to run if there are no processes runable at the same priority?
If you mean 'no >other< processes' then it would be almost completely pointless since at the very next opportunity the higher priority process would preempt the low priority one. In the case of 'no processes' then obviously a lower priority process would get a turn anyway.
tim
tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Useful random insult:- Cackles a lot, but I ain't seen no eggs yet.
-- ======================================================================== === John M. McIntosh johnmci@smalltalkconsulting.com 1-800-477-2659 Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com ======================================================================== ===
On 23-Jan-06, at 10:59 PM, John M McIntosh wrote:
Well right now it only checks for processes in the same priority group, doesn't consider ones higher, mind they should run at some point.
Exactly - you don't need to worry about the higher priority processes since they get a chance to take over at every checkInterrupts.
Fiddling with yield won't achieve anything interesting. If a cpu sharing scheduler is wanted it would take a quite different chunk of code.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Strange OpCodes: CM: Circulate Memory
On 1/24/06, Andreas Raab andreas.raab@gmx.de wrote:
Simply moving #wait inside the ensure'd block will cure the problem (for very, very specific and nitpicky reasons that I'm not going to explain in detail unless someone actually wants to know ;-)
I do. I have a customer who is seeing error reports because of funny values turning up in collections that are supposed to be protected by critical sections. And the funny values turning up are very much the sort of stuff you get when that protection fails, so, yes, I desperately want to know about every possibility, no matter how remote
(apart from that, I'm a curious guy ;-))
Ah, well, all right. The trick isn't in fixing the old problem - the trick is in avoiding the new problem. Moving the call to #wait inside the ensured block certainly will fix the old problem of the semaphore not being signaled when the process gets terminated at the wrong point but how do you avoid the semaphore being signaled erranously because the process gets aborted after entering the ensured block and *before* we end up in #wait? (the same problem in reverse and just as bad)
The main trick here is that we really check for interrupts *after* executing primitives (and not before or during) which means that because (and only because!) wait is a primitive its call will not be interrupted (this btw, is one of the reasons why message tally sometimes associates a lot of time with a "wrong" primitive call - it cannot interrupt it before the primitive is executed and therefore sees the "wrong" method being run). Which means that if you were to change #wait to call a method #primitiveWait it would no longer be safe (possible interruption after the primitive check in #wait). It might be fun to try this out!
Hope this gets you the gist of it (it's really late here).
Cheers, - Andreas
Cees De Groot wrote:
On 1/24/06, Andreas Raab andreas.raab@gmx.de wrote:
Simply moving #wait inside the ensure'd block will cure the problem (for very, very specific and nitpicky reasons that I'm not going to explain in detail unless someone actually wants to know ;-)
I do. I have a customer who is seeing error reports because of funny values turning up in collections that are supposed to be protected by critical sections. And the funny values turning up are very much the sort of stuff you get when that protection fails, so, yes, I desperately want to know about every possibility, no matter how remote
(apart from that, I'm a curious guy ;-))
On 1/24/06, Andreas Raab andreas.raab@gmx.de wrote:
Hope this gets you the gist of it (it's really late here).
Yeah, I think I do. Thanks for the explanation.
Well, I have "proof" now: http://www.cdegroot.com/blog/wp-content/uploads/screenshot.png shows a screenshot with two debuggers that seem to be inside the same critical section...
The VM's dump stack results in:
SortedCollection>>add: [] in Delay>>schedule {[beingWaitedOn := true. resumptionTime := Time millisecondClockValue + dela...]} [] in Semaphore>>critical: {[[self wait] ifCurtailed: [self signal]. recover := true. result := mutu...]} BlockContext>>ifCurtailed: Semaphore>>critical: Delay>>schedule Delay>>wait WorldState>>interCyclePause: WorldState>>doOneCycleFor: PasteUpMorph>>doOneCycle [] in Project class>>spawnNewProcess {[[World doOneCycle. Processor yield. false] whileFalse. nil]} [] in BlockContext>>newProcess {[self value. Processor terminateActive]}
It *seems* that this only happens in wxSqueak. I have SysInternals' process explorer running, and the Squeak VM is running a bunch of threads but I'm not sure what these threads do. But apart from the guess that something like this could happen when multiple native threads start executing Squeak code, I'm at a loss what other cause could be behind this...
HEEELP!!! ;-)
I'll just keep spitting out info until someone gets a brilliant idea ;-)
http://elec.tric.nl/~cg/delay2.dump.txt (Unix LF terminated format)
I created a little bit of code that dumps all processes when something goes wrong, and this dump has two processes inside the critical section of Delay>>schedule.
One starts on line 453 ('a Process in Process>>suspend'). One starts on line 90 ('a Process in nil').
If you look at it, there's one funny thing. I added an #on:do: handler around the actual addition to the "SuspendedDelays add:..." bit in Delay>>schedule:
.... [SuspendedDelays add: ActiveDelay] on: Error do: [:ex | Processor dumpOnFile: 'delay2.dump'. self halt]
If you follow the stack frames from line 90, you'll encounter this #on:do:. However, if you do the same with the other process, there's no stack frame for the exception handler. the Delay>>schedule frame directly calls the SortedCollection>>add: frame.
I checked, double checked, and checked again - this happens every time I get this bug in my image (available on request ;-)), it happens currently quite soon after I fire up the wx-based program, and always one of the processes - which can be different processes - misses this stack frame.
Could it be that I have somehow two copies of Delay in my image? Which share the ScheduledDelays collection but not the Semaphore?
This is sooo crazy, it must have an easy solution :-)
On 2/1/06, Cees De Groot cdegroot@gmail.com wrote:
If you look at it, there's one funny thing. I added an #on:do: handler around the actual addition to the "SuspendedDelays add:..." bit in Delay>>schedule:
Not funny at all. Just my stupidity - there are two calls to #add: in that method, I protected only one.
On IRC, Jecel Assumpção remarks that the Semaphore in both cases is 'in rest' - excessSignals is 1, where 0 would be expected when the critical section is active. And -1 when two critical sections are active :-)
On 2/1/06, Cees De Groot cdegroot@gmail.com wrote:
On IRC, Jecel Assumpção remarks that the Semaphore in both cases is 'in rest' - excessSignals is 1, where 0 would be expected when the critical section is active. And -1 when two critical sections are active :-)
More debugging, it turns out that AccessProtect's excessSignals indeed can become > 1 in the application. Whether that is due to the many times that we terminate processes inside delays or not, I don't know. Fact is that in my current image, the application starts up, does a lot of stuff exercising Delay, and then hits this condition. It must be some sort of race condition.
On IRC, Craig suggested I move over to the VM level for debugging. However, I'm not sure how to proceed - I could presumably trap excessSignals becoming '2' on all semaphores, but only some of them are used as Mutex so that breakpoint would be called spuriously (unless someone knows how to formulate a bp that only hits on semaphores setup as mutex...). And then, the culprit could very well be the first time the semaphore is signalled, but I can only set a breakpoint on the second time. Manually checking all the times the BP is hit is undoable, hundreds of Delays are scheduled before the bug hits.
Any suggestions on how we could proceed?
Thanks,
Cees
Try making an Array to put them (the sems you know are going to be used as mutex) in so GC keeps everything up to date and 'register' this array. Easiest way to do that would be to stick it in the specialOops array somewhere, there must be an empty slot. Hell, nobody uses 40/41 so far as I know, abuse them for a bit. Once you have this anchor you can follow the oops in the vm easily.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Strange OpCodes: RDL: Rotate Disk Left
Well if you recall a couple of weeks back I had a changeset that turns on printing of the stack in message send, so you can see the stack on each message send by process.
Mind you'll need to build a VM for this, plus write some code to read the log file and determine how the semaphore signalling is done.
Or of course change the VM to note the excessive semaphore signalling and print the stack when it happens, and as Tim points out use the Special objects array to track the semaphore of interest.
On 1-Feb-06, at 3:42 PM, Cees De Groot wrote:
On 2/1/06, Cees De Groot cdegroot@gmail.com wrote:
On IRC, Jecel Assumpção remarks that the Semaphore in both cases is 'in rest' - excessSignals is 1, where 0 would be expected when the critical section is active. And -1 when two critical sections are active :-)
Any suggestions on how we could proceed?
Thanks,
Cees
-- ======================================================================== === John M. McIntosh johnmci@smalltalkconsulting.com 1-800-477-2659 Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com ======================================================================== ===
Thanks John & Tim, for these suggestions. I'll continue debugging tonight (have to work in VisualAge today, work hard, earn money) and post as I make progress (or not).
Regards,
Cees
On 2/2/06, John M McIntosh johnmci@smalltalkconsulting.com wrote:
Well if you recall a couple of weeks back I had a changeset that turns on printing of the stack in message send, so you can see the stack on each message send by process.
Mind you'll need to build a VM for this, plus write some code to read the log file and determine how the semaphore signalling is done.
Or of course change the VM to note the excessive semaphore signalling and print the stack when it happens, and as Tim points out use the Special objects array to track the semaphore of interest.
On 1-Feb-06, at 3:42 PM, Cees De Groot wrote:
On 2/1/06, Cees De Groot cdegroot@gmail.com wrote:
On IRC, Jecel Assumpção remarks that the Semaphore in both cases is 'in rest' - excessSignals is 1, where 0 would be expected when the critical section is active. And -1 when two critical sections are active :-)
Any suggestions on how we could proceed?
Thanks,
Cees
--
=== John M. McIntosh johnmci@smalltalkconsulting.com 1-800-477-2659 Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com ======================================================================== ===
BTW, you wouldn't kill a process *while* it is trying to execute SortedCollection>>add: would you? Because this might get you a nil in SuspendedDelays faster than you could say "oops". And I agree with your analysis that something must be signaling the mutex twice but I will point out that one of the processes was the wx event loop which may very well be invoked via unlikely places (if I remember the discussion with Rob then he had serious problems getting wx to work asynchronously).
Cheers, - Andreas
Cees De Groot wrote:
Thanks John & Tim, for these suggestions. I'll continue debugging tonight (have to work in VisualAge today, work hard, earn money) and post as I make progress (or not).
Regards,
Cees
On 2/2/06, John M McIntosh johnmci@smalltalkconsulting.com wrote:
Well if you recall a couple of weeks back I had a changeset that turns on printing of the stack in message send, so you can see the stack on each message send by process.
Mind you'll need to build a VM for this, plus write some code to read the log file and determine how the semaphore signalling is done.
Or of course change the VM to note the excessive semaphore signalling and print the stack when it happens, and as Tim points out use the Special objects array to track the semaphore of interest.
On 1-Feb-06, at 3:42 PM, Cees De Groot wrote:
On 2/1/06, Cees De Groot cdegroot@gmail.com wrote:
On IRC, Jecel Assumpção remarks that the Semaphore in both cases is 'in rest' - excessSignals is 1, where 0 would be expected when the critical section is active. And -1 when two critical sections are active :-)
Any suggestions on how we could proceed?
Thanks,
Cees
--
=== John M. McIntosh johnmci@smalltalkconsulting.com 1-800-477-2659 Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com ======================================================================== ===
On 2/2/06, Andreas Raab andreas.raab@gmx.de wrote:
BTW, you wouldn't kill a process *while* it is trying to execute SortedCollection>>add: would you?
I wouldn't know, of course, but yes, it could happen. Thanks for pointing it out.
We do lots of network searches, which are sent into the network, and then a process sleeps for a while and shuts down the search. If the search gives results, the process is terminated. Conceivably, this could happen while the process is being scheduled. Unlikely, but possible.
Shouldn't this whole Delay>>schedule thing maybe be done in an un-preemteable block? Or is this something we would need to solve on our application level?
And I agree with your analysis that something must be signaling the mutex twice but I will point out that one of the processes was the wx event loop which may very well be invoked via unlikely places (if I remember the discussion with Rob then he had serious problems getting wx to work asynchronously).
Yup. I'm still not 100% convinced that we can't have a situation like two native threads going into the VM at the same time, but that is probably mostly ignorance.
squeak-dev@lists.squeakfoundation.org