Delay and Server reliability

stephane ducasse stephane.ducasse at free.fr
Tue Jul 24 16:32:36 UTC 2007


Yes.
I imagine the pain you got to chase it....
This kind of bug huge pain.

Stef

On 24 juil. 07, at 10:17, Andreas Raab wrote:

> You can apply the fix yourself; it works in all Squeak versions  
> that I'm aware of (and if not, you'll find out really quickly ;-)  
> This is just the kind of thing for which I wanted to see some sort  
> of "standard package" for so that people across various Squeak  
> versions can benefit from it.
>
> Cheers,
>   - Andreas
>
> Janko Mivšek wrote:
>> Hi Andreas,
>> That's very important patch and very interesting to me too,  
>> because I'm just deciding to put some of my public Aida/Web  
>> websites from VW to Squeak and I was afraid of such issues as one  
>> you just solved.
>> Is there any chance that this patch goes to 3.10?
>> Best regards
>> Janko
>> Andreas Raab wrote:
>>> Hi -
>>>
>>> We recently had some "fun" chasing server lockups (with truly  
>>> awful uptimes of about a day or less before things went downhill)  
>>> and were finally able to track a huge portion of it down to  
>>> problems with Delay. The effect we were seeing on our servers was  
>>> that the system would randomly lock up and either go down to 0%  
>>> CPU or 100% CPU.
>>>
>>> After poking it with a USR1 signal (which, in our VMs is hooked  
>>> up such that it prints all the call stacks in the image; it's a  
>>> life-safer if you need to debug these issues) we usually found  
>>> that all processes were waiting on Delay's AccessProtect (0%) or  
>>> alternatively found that a particular process (the event tickler)  
>>> would sit in a tight loop swallowing repeated errors complaining  
>>> that "this delay is already scheduled".
>>>
>>> After hours and hours of testing, debugging, and a little stroke  
>>> of luck we finally found out that all of these issues were caused  
>>> by the fact that Delay's internal structures are updated by the  
>>> calling process (insertion into and removal from SuspendedDelays)  
>>> which renders the process susceptible to being terminated in the  
>>> midst of updating these structures.
>>>
>>> If you look at the code, this is obviously an issue because if  
>>> (for example) the calling process gets terminated while it's  
>>> resorting SuspendedDelays the result is unpredictable. This is in  
>>> particular an issue because the calling process is often running  
>>> at a relatively low priority so interruption by other, high- 
>>> priority processes is a common case. And if any of these higher  
>>> priority processes kills the one that just happens to execute  
>>> SortedCollection>>remove: anything can happen - from leaving a  
>>> later delay in front of an earlier one (one of the cases we had  
>>> indicated that this was just what had happened) to errors when  
>>> doing the next insert/remove ("trying to evaluate a block that is  
>>> already evaluated") to many more weirdnesses. Unfortunately, it  
>>> is basically impossible to recreate this problem under any kind  
>>> of controlled circumstances, mostly because you need a source of  
>>> events that is truly independent from your time source.
>>>
>>> As a consequence of our findings we rewrote Delay to deal with  
>>> these issues properly and, having deployed the changes about ten  
>>> days ago on our servers, all of these sources of problems simply  
>>> vanished. We haven't had a single server problem which we  
>>> couldn't attribute to our own stupidity (such as running out of  
>>> disk space ;-)
>>>
>>> The changes will in particular be helpful to you if you:
>>> * run network servers
>>> * fork processes to handle network requests
>>> * terminate these processes explicitly (on error conditions for  
>>> example)
>>> * use Semaphore>>waitTimeoutMsecs: (all socket functions use this)
>>>
>>> If you have seen random, unexplained lockups of your server (0%  
>>> CPU load while being locked up is a dead giveaway[*]) I'd  
>>> recommend using the attached changes (which work best on top of a  
>>> VM with David Lewis' 64bit fixes applied) and see if that helps.  
>>> For us, they made the difference between running the server in  
>>> Squeak and rewriting it in Java.
>>>
>>> I've also filed this as http://bugs.squeak.org/view.php?id=6576
>>>
>>> [*] The 0% CPU lockups have sometimes been attributed to issues  
>>> with Linux wait functions. After having seen the havoc that Delay  
>>> wrecks on the system I don't buy these explanations any longer. A  
>>> much simpler (and more likely) explanation is that Delay went wild.
>>>
>>> Cheers,
>>>   - Andreas
>>>
>>>
>>> -------------------------------------------------------------------- 
>>> ----
>>>
>>> 'From Croquet1.0beta of 11 April 2006 [latest update: #1] on 23  
>>> July 2007 at 11:53:23 pm'!
>>> "Change Set:        SafeDelay
>>> Date:            23 July 2007
>>> Author:            Andreas Raab
>>>
>>> This change set fixes a set of severe problems with concurrent  
>>> use of Delay. Previously, many of the delay-internal structures  
>>> were modified by the calling process which made it susceptible to  
>>> being terminated in the middle of manipulating these structures  
>>> and leave Delay (and consequently the entire system) in an  
>>> inconsistent state.
>>>
>>> This change set fixes this problem by moving *all* manipulation  
>>> of Delay's internal structures out of the calling process. As a  
>>> side-effect it also removes the requirement of Delays being  
>>> limited to SmallInteger range; the new code has no limitation on  
>>> the duration of a delay.
>>>
>>> No tests are provided since outside of true asynchronous  
>>> environments (networks) it is basically impossible to recreate  
>>> the situation reliably."!
>>>
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:24'!
>>> activate
>>>     "Private!! Make the receiver the Delay to be awoken when the  
>>> next timer interrupt occurs. This method should only be called  
>>> from a block protected by the AccessProtect semaphore."
>>>     TimerEventLoop ifNotNil:[^nil].
>>>     ActiveDelay := self.
>>>     ActiveDelayStartTime := Time millisecondClockValue.
>>>     ActiveDelayStartTime > resumptionTime ifTrue:[
>>>         ActiveDelay signalWaitingProcess.
>>>         SuspendedDelays isEmpty ifTrue:[
>>>             ActiveDelay := nil.
>>>             ActiveDelayStartTime := nil.
>>>         ] ifFalse:[SuspendedDelays removeFirst activate].
>>>     ] ifFalse:[
>>>         TimingSemaphore initSignals.
>>>         Delay primSignal: TimingSemaphore atMilliseconds:  
>>> resumptionTime.
>>>     ].! !
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
>>> schedule
>>>     "Private!! Schedule this Delay, but return immediately rather  
>>> than waiting. The receiver's semaphore will be signalled when its  
>>> delay duration has elapsed."
>>>
>>>     beingWaitedOn ifTrue: [self error: 'This Delay has already  
>>> been scheduled.'].
>>>
>>>     TimerEventLoop ifNotNil:[^self scheduleEvent].
>>>     AccessProtect critical: [
>>>         beingWaitedOn := true.
>>>         resumptionTime := Time millisecondClockValue +  
>>> delayDuration.
>>>         ActiveDelay == nil
>>>             ifTrue: [self activate]
>>>             ifFalse: [
>>>                 resumptionTime < ActiveDelay resumptionTime
>>>                     ifTrue: [
>>>                         SuspendedDelays add: ActiveDelay.
>>>                         self activate]
>>>                     ifFalse: [SuspendedDelays add: self]]].
>>> ! !
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 22:33'!
>>> scheduleEvent
>>>     "Schedule this delay"
>>>     resumptionTime := Time millisecondClockValue + delayDuration.
>>>     AccessProtect critical:[
>>>         ScheduledDelay := self.
>>>         TimingSemaphore signal.
>>>     ].! !
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
>>> unschedule
>>>     "Unschedule this Delay. Do nothing if it wasn't scheduled."
>>>
>>>     | done |
>>>     TimerEventLoop ifNotNil:[^self unscheduleEvent].
>>>     AccessProtect critical: [
>>>         done := false.
>>>         [done] whileFalse:
>>>             [SuspendedDelays remove: self ifAbsent: [done := true]].
>>>         ActiveDelay == self ifTrue: [
>>>             SuspendedDelays isEmpty
>>>                 ifTrue: [
>>>                     ActiveDelay := nil.
>>>                     ActiveDelayStartTime := nil]
>>>                 ifFalse: [
>>>                     SuspendedDelays removeFirst activate]]].
>>> ! !
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:56'!
>>> unscheduleEvent
>>>     AccessProtect critical:[
>>>         FinishedDelay := self.
>>>         TimingSemaphore signal.
>>>     ].! !
>>>
>>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
>>> beingWaitedOn
>>>     "Answer whether this delay is currently scheduled, e.g.,  
>>> being waited on"
>>>     ^beingWaitedOn! !
>>>
>>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
>>> beingWaitedOn: aBool
>>>     "Indicate whether this delay is currently scheduled, e.g.,  
>>> being waited on"
>>>     beingWaitedOn := aBool! !
>>>
>>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 20:56'!
>>> delayDuration
>>>     ^delayDuration! !
>>>
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007  
>>> 10:35'!
>>> handleTimerEvent
>>>     "Handle a timer event; which can be either:
>>>         - a schedule request (ScheduledDelay notNil)
>>>         - an unschedule request (FinishedDelay notNil)
>>>         - a timer signal (not explicitly specified)
>>>     We check for timer expiry every time we get a signal."
>>>     | nextTick |
>>>     "Wait until there is work to do."
>>>     TimingSemaphore wait.
>>>
>>>     "Process any schedule requests"
>>>     ScheduledDelay ifNotNil:[
>>>         "Schedule the given delay"
>>>         self scheduleDelay: ScheduledDelay.
>>>         ScheduledDelay := nil.
>>>     ].
>>>
>>>     "Process any unschedule requests"
>>>     FinishedDelay ifNotNil:[
>>>         self unscheduleDelay: FinishedDelay.
>>>         FinishedDelay := nil.
>>>     ].
>>>
>>>     "Check for clock wrap-around."
>>>     nextTick := Time millisecondClockValue.
>>>     nextTick < ActiveDelayStartTime ifTrue: [
>>>         "clock wrapped"
>>>         self saveResumptionTimes.
>>>         self restoreResumptionTimes.
>>>     ].
>>>     ActiveDelayStartTime := nextTick.
>>>
>>>     "Signal any expired delays"
>>>     [ActiveDelay notNil and:[
>>>         Time millisecondClockValue >= ActiveDelay  
>>> resumptionTime]] whileTrue:[
>>>             ActiveDelay signalWaitingProcess.
>>>             SuspendedDelays isEmpty                 ifTrue:  
>>> [ActiveDelay := nil]                 ifFalse:[ActiveDelay :=  
>>> SuspendedDelays removeFirst].
>>>         ].
>>>
>>>     "And signal when the next request is due. We sleep at most  
>>> 1sec here
>>>     as a soft busy-loop so that we don't accidentally miss signals."
>>>     nextTick := Time millisecondClockValue + 1000.
>>>     ActiveDelay ifNotNil:[nextTick := nextTick min: ActiveDelay  
>>> resumptionTime].
>>>     nextTick := nextTick min: SmallInteger maxVal.
>>>
>>>     "Since we have processed all outstanding requests, reset the  
>>> timing semaphore so
>>>     that only new work will wake us up again. Do this RIGHT  
>>> BEFORE setting the next
>>>     wakeup call from the VM because it is only signaled once so  
>>> we mustn't miss it."
>>>     TimingSemaphore initSignals.
>>>     Delay primSignal: TimingSemaphore atMilliseconds: nextTick.
>>> ! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007  
>>> 09:04'!
>>> runTimerEventLoop
>>>     "Run the timer event loop."
>>>     [
>>>         [RunTimerEventLoop] whileTrue: [self handleTimerEvent]
>>>     ] on: Error do:[:ex|
>>>         "Clear out the process so it does't get killed"
>>>         TimerEventLoop := nil.
>>>         "Launch the old-style interrupt watcher"
>>>         self startTimerInterruptWatcher.
>>>         "And pass the exception on"
>>>         ex pass.
>>>     ].! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007  
>>> 22:32'!
>>> scheduleDelay: aDelay
>>>     "Private. Schedule this Delay."
>>>     aDelay beingWaitedOn: true.
>>>     ActiveDelay ifNil:[
>>>         ActiveDelay := aDelay
>>>     ] ifNotNil:[
>>>         aDelay resumptionTime < ActiveDelay resumptionTime ifTrue:[
>>>             SuspendedDelays add: ActiveDelay.
>>>             ActiveDelay := aDelay.
>>>         ] ifFalse: [SuspendedDelays add: aDelay].
>>>     ].
>>> ! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007  
>>> 10:18'!
>>> startTimerEventLoop
>>>     "Start the timer event loop"
>>>     "Delay startTimerEventLoop"
>>>     self stopTimerEventLoop.
>>>     self stopTimerInterruptWatcher.
>>>     AccessProtect := Semaphore forMutualExclusion.
>>>     ActiveDelayStartTime := Time millisecondClockValue.
>>>     SuspendedDelays :=         Heap withAll: (SuspendedDelays  
>>> ifNil:[#()])
>>>             sortBlock: [:d1 :d2 | d1 resumptionTime <= d2  
>>> resumptionTime].
>>>     TimingSemaphore := Semaphore new.
>>>     RunTimerEventLoop := true.
>>>     TimerEventLoop := [self runTimerEventLoop] newProcess.
>>>     TimerEventLoop priority: Processor timingPriority.
>>>     TimerEventLoop resume.
>>>     TimingSemaphore signal. "get going"
>>> ! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007  
>>> 22:32'!
>>> startTimerInterruptWatcher
>>>     "Reset the class variables that keep track of active Delays  
>>> and re-start the timer interrupt watcher process. Any currently  
>>> scheduled delays are forgotten."
>>>     "Delay startTimerInterruptWatcher"
>>>     | p |
>>>     self stopTimerEventLoop.
>>>     self stopTimerInterruptWatcher.
>>>     TimingSemaphore := Semaphore new.
>>>     AccessProtect := Semaphore forMutualExclusion.
>>>     SuspendedDelays :=         SortedCollection  
>>> sortBlock:             [:d1 :d2 | d1 resumptionTime <= d2  
>>> resumptionTime].
>>>     ActiveDelay := nil.
>>>     p := [self timerInterruptWatcher] newProcess.
>>>     p priority: Processor timingPriority.
>>>     p resume.
>>> ! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007  
>>> 21:26'!
>>> stopTimerEventLoop
>>>     "Stop the timer event loop"
>>>     RunTimerEventLoop := false.
>>>     TimingSemaphore signal.
>>>     TimerEventLoop := nil.! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007  
>>> 21:32'!
>>> stopTimerInterruptWatcher
>>>     "Reset the class variables that keep track of active Delays  
>>> and re-start the timer interrupt watcher process. Any currently  
>>> scheduled delays are forgotten."
>>>     "Delay startTimerInterruptWatcher"
>>>     self primSignal: nil atMilliseconds: 0.
>>>     TimingSemaphore ifNotNil:[TimingSemaphore terminateProcess].! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007  
>>> 22:33'!
>>> unscheduleDelay: aDelay
>>>     "Private. Unschedule this Delay."
>>>     ActiveDelay == aDelay ifTrue: [
>>>         SuspendedDelays isEmpty ifTrue:[
>>>             ActiveDelay := nil.
>>>         ] ifFalse: [
>>>             ActiveDelay := SuspendedDelays removeFirst.
>>>         ]
>>>     ] ifFalse:[
>>>         SuspendedDelays remove: aDelay ifAbsent: [].
>>>     ].
>>>     aDelay beingWaitedOn: false.! !
>>>
>>> !Delay class methodsFor: 'class initialization' stamp: 'ar  
>>> 7/11/2007 18:16'!
>>> initialize
>>>     "Delay initialize"
>>>     self startTimerEventLoop.! !
>>>
>>> Delay initialize!
>>>
>>>
>>> -------------------------------------------------------------------- 
>>> ----
>>>
>>>
>
>
>




More information about the Squeak-dev mailing list