Delay and Server reliability

Andreas Raab andreas.raab at gmx.de
Tue Jul 24 08:17:59 UTC 2007


You can apply the fix yourself; it works in all Squeak versions that I'm 
aware of (and if not, you'll find out really quickly ;-) This is just 
the kind of thing for which I wanted to see some sort of "standard 
package" for so that people across various Squeak versions can benefit 
from it.

Cheers,
   - Andreas

Janko Mivšek wrote:
> Hi Andreas,
> 
> That's very important patch and very interesting to me too, because I'm 
> just deciding to put some of my public Aida/Web websites from VW to 
> Squeak and I was afraid of such issues as one you just solved.
> 
> Is there any chance that this patch goes to 3.10?
> 
> Best regards
> Janko
> 
> Andreas Raab wrote:
>> Hi -
>>
>> We recently had some "fun" chasing server lockups (with truly awful 
>> uptimes of about a day or less before things went downhill) and were 
>> finally able to track a huge portion of it down to problems with 
>> Delay. The effect we were seeing on our servers was that the system 
>> would randomly lock up and either go down to 0% CPU or 100% CPU.
>>
>> After poking it with a USR1 signal (which, in our VMs is hooked up 
>> such that it prints all the call stacks in the image; it's a 
>> life-safer if you need to debug these issues) we usually found that 
>> all processes were waiting on Delay's AccessProtect (0%) or 
>> alternatively found that a particular process (the event tickler) 
>> would sit in a tight loop swallowing repeated errors complaining that 
>> "this delay is already scheduled".
>>
>> After hours and hours of testing, debugging, and a little stroke of 
>> luck we finally found out that all of these issues were caused by the 
>> fact that Delay's internal structures are updated by the calling 
>> process (insertion into and removal from SuspendedDelays) which 
>> renders the process susceptible to being terminated in the midst of 
>> updating these structures.
>>
>> If you look at the code, this is obviously an issue because if (for 
>> example) the calling process gets terminated while it's resorting 
>> SuspendedDelays the result is unpredictable. This is in particular an 
>> issue because the calling process is often running at a relatively low 
>> priority so interruption by other, high-priority processes is a common 
>> case. And if any of these higher priority processes kills the one that 
>> just happens to execute SortedCollection>>remove: anything can happen 
>> - from leaving a later delay in front of an earlier one (one of the 
>> cases we had indicated that this was just what had happened) to errors 
>> when doing the next insert/remove ("trying to evaluate a block that is 
>> already evaluated") to many more weirdnesses. Unfortunately, it is 
>> basically impossible to recreate this problem under any kind of 
>> controlled circumstances, mostly because you need a source of events 
>> that is truly independent from your time source.
>>
>> As a consequence of our findings we rewrote Delay to deal with these 
>> issues properly and, having deployed the changes about ten days ago on 
>> our servers, all of these sources of problems simply vanished. We 
>> haven't had a single server problem which we couldn't attribute to our 
>> own stupidity (such as running out of disk space ;-)
>>
>> The changes will in particular be helpful to you if you:
>> * run network servers
>> * fork processes to handle network requests
>> * terminate these processes explicitly (on error conditions for example)
>> * use Semaphore>>waitTimeoutMsecs: (all socket functions use this)
>>
>> If you have seen random, unexplained lockups of your server (0% CPU 
>> load while being locked up is a dead giveaway[*]) I'd recommend using 
>> the attached changes (which work best on top of a VM with David Lewis' 
>> 64bit fixes applied) and see if that helps. For us, they made the 
>> difference between running the server in Squeak and rewriting it in Java.
>>
>> I've also filed this as http://bugs.squeak.org/view.php?id=6576
>>
>> [*] The 0% CPU lockups have sometimes been attributed to issues with 
>> Linux wait functions. After having seen the havoc that Delay wrecks on 
>> the system I don't buy these explanations any longer. A much simpler 
>> (and more likely) explanation is that Delay went wild.
>>
>> Cheers,
>>   - Andreas
>>
>>
>> ------------------------------------------------------------------------
>>
>> 'From Croquet1.0beta of 11 April 2006 [latest update: #1] on 23 July 
>> 2007 at 11:53:23 pm'!
>> "Change Set:        SafeDelay
>> Date:            23 July 2007
>> Author:            Andreas Raab
>>
>> This change set fixes a set of severe problems with concurrent use of 
>> Delay. Previously, many of the delay-internal structures were modified 
>> by the calling process which made it susceptible to being terminated 
>> in the middle of manipulating these structures and leave Delay (and 
>> consequently the entire system) in an inconsistent state.
>>
>> This change set fixes this problem by moving *all* manipulation of 
>> Delay's internal structures out of the calling process. As a 
>> side-effect it also removes the requirement of Delays being limited to 
>> SmallInteger range; the new code has no limitation on the duration of 
>> a delay.
>>
>> No tests are provided since outside of true asynchronous environments 
>> (networks) it is basically impossible to recreate the situation 
>> reliably."!
>>
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:24'!
>> activate
>>     "Private!! Make the receiver the Delay to be awoken when the next 
>> timer interrupt occurs. This method should only be called from a block 
>> protected by the AccessProtect semaphore."
>>     TimerEventLoop ifNotNil:[^nil].
>>     ActiveDelay := self.
>>     ActiveDelayStartTime := Time millisecondClockValue.
>>     ActiveDelayStartTime > resumptionTime ifTrue:[
>>         ActiveDelay signalWaitingProcess.
>>         SuspendedDelays isEmpty ifTrue:[
>>             ActiveDelay := nil.
>>             ActiveDelayStartTime := nil.
>>         ] ifFalse:[SuspendedDelays removeFirst activate].
>>     ] ifFalse:[
>>         TimingSemaphore initSignals.
>>         Delay primSignal: TimingSemaphore atMilliseconds: resumptionTime.
>>     ].! !
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
>> schedule
>>     "Private!! Schedule this Delay, but return immediately rather than 
>> waiting. The receiver's semaphore will be signalled when its delay 
>> duration has elapsed."
>>
>>     beingWaitedOn ifTrue: [self error: 'This Delay has already been 
>> scheduled.'].
>>
>>     TimerEventLoop ifNotNil:[^self scheduleEvent].
>>     AccessProtect critical: [
>>         beingWaitedOn := true.
>>         resumptionTime := Time millisecondClockValue + delayDuration.
>>         ActiveDelay == nil
>>             ifTrue: [self activate]
>>             ifFalse: [
>>                 resumptionTime < ActiveDelay resumptionTime
>>                     ifTrue: [
>>                         SuspendedDelays add: ActiveDelay.
>>                         self activate]
>>                     ifFalse: [SuspendedDelays add: self]]].
>> ! !
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 22:33'!
>> scheduleEvent
>>     "Schedule this delay"
>>     resumptionTime := Time millisecondClockValue + delayDuration.
>>     AccessProtect critical:[
>>         ScheduledDelay := self.
>>         TimingSemaphore signal.
>>     ].! !
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
>> unschedule
>>     "Unschedule this Delay. Do nothing if it wasn't scheduled."
>>
>>     | done |
>>     TimerEventLoop ifNotNil:[^self unscheduleEvent].
>>     AccessProtect critical: [
>>         done := false.
>>         [done] whileFalse:
>>             [SuspendedDelays remove: self ifAbsent: [done := true]].
>>         ActiveDelay == self ifTrue: [
>>             SuspendedDelays isEmpty
>>                 ifTrue: [
>>                     ActiveDelay := nil.
>>                     ActiveDelayStartTime := nil]
>>                 ifFalse: [
>>                     SuspendedDelays removeFirst activate]]].
>> ! !
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:56'!
>> unscheduleEvent
>>     AccessProtect critical:[
>>         FinishedDelay := self.
>>         TimingSemaphore signal.
>>     ].! !
>>
>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
>> beingWaitedOn
>>     "Answer whether this delay is currently scheduled, e.g., being 
>> waited on"
>>     ^beingWaitedOn! !
>>
>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
>> beingWaitedOn: aBool
>>     "Indicate whether this delay is currently scheduled, e.g., being 
>> waited on"
>>     beingWaitedOn := aBool! !
>>
>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 20:56'!
>> delayDuration
>>     ^delayDuration! !
>>
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:35'!
>> handleTimerEvent
>>     "Handle a timer event; which can be either:
>>         - a schedule request (ScheduledDelay notNil)
>>         - an unschedule request (FinishedDelay notNil)
>>         - a timer signal (not explicitly specified)
>>     We check for timer expiry every time we get a signal."
>>     | nextTick |
>>     "Wait until there is work to do."
>>     TimingSemaphore wait.
>>
>>     "Process any schedule requests"
>>     ScheduledDelay ifNotNil:[
>>         "Schedule the given delay"
>>         self scheduleDelay: ScheduledDelay.
>>         ScheduledDelay := nil.
>>     ].
>>
>>     "Process any unschedule requests"
>>     FinishedDelay ifNotNil:[
>>         self unscheduleDelay: FinishedDelay.
>>         FinishedDelay := nil.
>>     ].
>>
>>     "Check for clock wrap-around."
>>     nextTick := Time millisecondClockValue.
>>     nextTick < ActiveDelayStartTime ifTrue: [
>>         "clock wrapped"
>>         self saveResumptionTimes.
>>         self restoreResumptionTimes.
>>     ].
>>     ActiveDelayStartTime := nextTick.
>>
>>     "Signal any expired delays"
>>     [ActiveDelay notNil and:[
>>         Time millisecondClockValue >= ActiveDelay resumptionTime]] 
>> whileTrue:[
>>             ActiveDelay signalWaitingProcess.
>>             SuspendedDelays isEmpty                 ifTrue: 
>> [ActiveDelay := nil]                 ifFalse:[ActiveDelay := 
>> SuspendedDelays removeFirst].
>>         ].
>>
>>     "And signal when the next request is due. We sleep at most 1sec here
>>     as a soft busy-loop so that we don't accidentally miss signals."
>>     nextTick := Time millisecondClockValue + 1000.
>>     ActiveDelay ifNotNil:[nextTick := nextTick min: ActiveDelay 
>> resumptionTime].
>>     nextTick := nextTick min: SmallInteger maxVal.
>>
>>     "Since we have processed all outstanding requests, reset the 
>> timing semaphore so
>>     that only new work will wake us up again. Do this RIGHT BEFORE 
>> setting the next
>>     wakeup call from the VM because it is only signaled once so we 
>> mustn't miss it."
>>     TimingSemaphore initSignals.
>>     Delay primSignal: TimingSemaphore atMilliseconds: nextTick.
>> ! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 09:04'!
>> runTimerEventLoop
>>     "Run the timer event loop."
>>     [
>>         [RunTimerEventLoop] whileTrue: [self handleTimerEvent]
>>     ] on: Error do:[:ex|
>>         "Clear out the process so it does't get killed"
>>         TimerEventLoop := nil.
>>         "Launch the old-style interrupt watcher"
>>         self startTimerInterruptWatcher.
>>         "And pass the exception on"
>>         ex pass.
>>     ].! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'!
>> scheduleDelay: aDelay
>>     "Private. Schedule this Delay."
>>     aDelay beingWaitedOn: true.
>>     ActiveDelay ifNil:[
>>         ActiveDelay := aDelay
>>     ] ifNotNil:[
>>         aDelay resumptionTime < ActiveDelay resumptionTime ifTrue:[
>>             SuspendedDelays add: ActiveDelay.
>>             ActiveDelay := aDelay.
>>         ] ifFalse: [SuspendedDelays add: aDelay].
>>     ].
>> ! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:18'!
>> startTimerEventLoop
>>     "Start the timer event loop"
>>     "Delay startTimerEventLoop"
>>     self stopTimerEventLoop.
>>     self stopTimerInterruptWatcher.
>>     AccessProtect := Semaphore forMutualExclusion.
>>     ActiveDelayStartTime := Time millisecondClockValue.
>>     SuspendedDelays :=         Heap withAll: (SuspendedDelays 
>> ifNil:[#()])
>>             sortBlock: [:d1 :d2 | d1 resumptionTime <= d2 
>> resumptionTime].
>>     TimingSemaphore := Semaphore new.
>>     RunTimerEventLoop := true.
>>     TimerEventLoop := [self runTimerEventLoop] newProcess.
>>     TimerEventLoop priority: Processor timingPriority.
>>     TimerEventLoop resume.
>>     TimingSemaphore signal. "get going"
>> ! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'!
>> startTimerInterruptWatcher
>>     "Reset the class variables that keep track of active Delays and 
>> re-start the timer interrupt watcher process. Any currently scheduled 
>> delays are forgotten."
>>     "Delay startTimerInterruptWatcher"
>>     | p |
>>     self stopTimerEventLoop.
>>     self stopTimerInterruptWatcher.
>>     TimingSemaphore := Semaphore new.
>>     AccessProtect := Semaphore forMutualExclusion.
>>     SuspendedDelays :=         SortedCollection sortBlock:             
>> [:d1 :d2 | d1 resumptionTime <= d2 resumptionTime].
>>     ActiveDelay := nil.
>>     p := [self timerInterruptWatcher] newProcess.
>>     p priority: Processor timingPriority.
>>     p resume.
>> ! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:26'!
>> stopTimerEventLoop
>>     "Stop the timer event loop"
>>     RunTimerEventLoop := false.
>>     TimingSemaphore signal.
>>     TimerEventLoop := nil.! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:32'!
>> stopTimerInterruptWatcher
>>     "Reset the class variables that keep track of active Delays and 
>> re-start the timer interrupt watcher process. Any currently scheduled 
>> delays are forgotten."
>>     "Delay startTimerInterruptWatcher"
>>     self primSignal: nil atMilliseconds: 0.
>>     TimingSemaphore ifNotNil:[TimingSemaphore terminateProcess].! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:33'!
>> unscheduleDelay: aDelay
>>     "Private. Unschedule this Delay."
>>     ActiveDelay == aDelay ifTrue: [
>>         SuspendedDelays isEmpty ifTrue:[
>>             ActiveDelay := nil.
>>         ] ifFalse: [
>>             ActiveDelay := SuspendedDelays removeFirst.
>>         ]
>>     ] ifFalse:[
>>         SuspendedDelays remove: aDelay ifAbsent: [].
>>     ].
>>     aDelay beingWaitedOn: false.! !
>>
>> !Delay class methodsFor: 'class initialization' stamp: 'ar 7/11/2007 
>> 18:16'!
>> initialize
>>     "Delay initialize"
>>     self startTimerEventLoop.! !
>>
>> Delay initialize!
>>
>>
>> ------------------------------------------------------------------------
>>
>>
> 




More information about the Squeak-dev mailing list