Delay and Server reliability
Andreas Raab
andreas.raab at gmx.de
Tue Jul 24 08:17:59 UTC 2007
You can apply the fix yourself; it works in all Squeak versions that I'm
aware of (and if not, you'll find out really quickly ;-) This is just
the kind of thing for which I wanted to see some sort of "standard
package" for so that people across various Squeak versions can benefit
from it.
Cheers,
- Andreas
Janko Mivšek wrote:
> Hi Andreas,
>
> That's very important patch and very interesting to me too, because I'm
> just deciding to put some of my public Aida/Web websites from VW to
> Squeak and I was afraid of such issues as one you just solved.
>
> Is there any chance that this patch goes to 3.10?
>
> Best regards
> Janko
>
> Andreas Raab wrote:
>> Hi -
>>
>> We recently had some "fun" chasing server lockups (with truly awful
>> uptimes of about a day or less before things went downhill) and were
>> finally able to track a huge portion of it down to problems with
>> Delay. The effect we were seeing on our servers was that the system
>> would randomly lock up and either go down to 0% CPU or 100% CPU.
>>
>> After poking it with a USR1 signal (which, in our VMs is hooked up
>> such that it prints all the call stacks in the image; it's a
>> life-safer if you need to debug these issues) we usually found that
>> all processes were waiting on Delay's AccessProtect (0%) or
>> alternatively found that a particular process (the event tickler)
>> would sit in a tight loop swallowing repeated errors complaining that
>> "this delay is already scheduled".
>>
>> After hours and hours of testing, debugging, and a little stroke of
>> luck we finally found out that all of these issues were caused by the
>> fact that Delay's internal structures are updated by the calling
>> process (insertion into and removal from SuspendedDelays) which
>> renders the process susceptible to being terminated in the midst of
>> updating these structures.
>>
>> If you look at the code, this is obviously an issue because if (for
>> example) the calling process gets terminated while it's resorting
>> SuspendedDelays the result is unpredictable. This is in particular an
>> issue because the calling process is often running at a relatively low
>> priority so interruption by other, high-priority processes is a common
>> case. And if any of these higher priority processes kills the one that
>> just happens to execute SortedCollection>>remove: anything can happen
>> - from leaving a later delay in front of an earlier one (one of the
>> cases we had indicated that this was just what had happened) to errors
>> when doing the next insert/remove ("trying to evaluate a block that is
>> already evaluated") to many more weirdnesses. Unfortunately, it is
>> basically impossible to recreate this problem under any kind of
>> controlled circumstances, mostly because you need a source of events
>> that is truly independent from your time source.
>>
>> As a consequence of our findings we rewrote Delay to deal with these
>> issues properly and, having deployed the changes about ten days ago on
>> our servers, all of these sources of problems simply vanished. We
>> haven't had a single server problem which we couldn't attribute to our
>> own stupidity (such as running out of disk space ;-)
>>
>> The changes will in particular be helpful to you if you:
>> * run network servers
>> * fork processes to handle network requests
>> * terminate these processes explicitly (on error conditions for example)
>> * use Semaphore>>waitTimeoutMsecs: (all socket functions use this)
>>
>> If you have seen random, unexplained lockups of your server (0% CPU
>> load while being locked up is a dead giveaway[*]) I'd recommend using
>> the attached changes (which work best on top of a VM with David Lewis'
>> 64bit fixes applied) and see if that helps. For us, they made the
>> difference between running the server in Squeak and rewriting it in Java.
>>
>> I've also filed this as http://bugs.squeak.org/view.php?id=6576
>>
>> [*] The 0% CPU lockups have sometimes been attributed to issues with
>> Linux wait functions. After having seen the havoc that Delay wrecks on
>> the system I don't buy these explanations any longer. A much simpler
>> (and more likely) explanation is that Delay went wild.
>>
>> Cheers,
>> - Andreas
>>
>>
>> ------------------------------------------------------------------------
>>
>> 'From Croquet1.0beta of 11 April 2006 [latest update: #1] on 23 July
>> 2007 at 11:53:23 pm'!
>> "Change Set: SafeDelay
>> Date: 23 July 2007
>> Author: Andreas Raab
>>
>> This change set fixes a set of severe problems with concurrent use of
>> Delay. Previously, many of the delay-internal structures were modified
>> by the calling process which made it susceptible to being terminated
>> in the middle of manipulating these structures and leave Delay (and
>> consequently the entire system) in an inconsistent state.
>>
>> This change set fixes this problem by moving *all* manipulation of
>> Delay's internal structures out of the calling process. As a
>> side-effect it also removes the requirement of Delays being limited to
>> SmallInteger range; the new code has no limitation on the duration of
>> a delay.
>>
>> No tests are provided since outside of true asynchronous environments
>> (networks) it is basically impossible to recreate the situation
>> reliably."!
>>
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:24'!
>> activate
>> "Private!! Make the receiver the Delay to be awoken when the next
>> timer interrupt occurs. This method should only be called from a block
>> protected by the AccessProtect semaphore."
>> TimerEventLoop ifNotNil:[^nil].
>> ActiveDelay := self.
>> ActiveDelayStartTime := Time millisecondClockValue.
>> ActiveDelayStartTime > resumptionTime ifTrue:[
>> ActiveDelay signalWaitingProcess.
>> SuspendedDelays isEmpty ifTrue:[
>> ActiveDelay := nil.
>> ActiveDelayStartTime := nil.
>> ] ifFalse:[SuspendedDelays removeFirst activate].
>> ] ifFalse:[
>> TimingSemaphore initSignals.
>> Delay primSignal: TimingSemaphore atMilliseconds: resumptionTime.
>> ].! !
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
>> schedule
>> "Private!! Schedule this Delay, but return immediately rather than
>> waiting. The receiver's semaphore will be signalled when its delay
>> duration has elapsed."
>>
>> beingWaitedOn ifTrue: [self error: 'This Delay has already been
>> scheduled.'].
>>
>> TimerEventLoop ifNotNil:[^self scheduleEvent].
>> AccessProtect critical: [
>> beingWaitedOn := true.
>> resumptionTime := Time millisecondClockValue + delayDuration.
>> ActiveDelay == nil
>> ifTrue: [self activate]
>> ifFalse: [
>> resumptionTime < ActiveDelay resumptionTime
>> ifTrue: [
>> SuspendedDelays add: ActiveDelay.
>> self activate]
>> ifFalse: [SuspendedDelays add: self]]].
>> ! !
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 22:33'!
>> scheduleEvent
>> "Schedule this delay"
>> resumptionTime := Time millisecondClockValue + delayDuration.
>> AccessProtect critical:[
>> ScheduledDelay := self.
>> TimingSemaphore signal.
>> ].! !
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
>> unschedule
>> "Unschedule this Delay. Do nothing if it wasn't scheduled."
>>
>> | done |
>> TimerEventLoop ifNotNil:[^self unscheduleEvent].
>> AccessProtect critical: [
>> done := false.
>> [done] whileFalse:
>> [SuspendedDelays remove: self ifAbsent: [done := true]].
>> ActiveDelay == self ifTrue: [
>> SuspendedDelays isEmpty
>> ifTrue: [
>> ActiveDelay := nil.
>> ActiveDelayStartTime := nil]
>> ifFalse: [
>> SuspendedDelays removeFirst activate]]].
>> ! !
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:56'!
>> unscheduleEvent
>> AccessProtect critical:[
>> FinishedDelay := self.
>> TimingSemaphore signal.
>> ].! !
>>
>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
>> beingWaitedOn
>> "Answer whether this delay is currently scheduled, e.g., being
>> waited on"
>> ^beingWaitedOn! !
>>
>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
>> beingWaitedOn: aBool
>> "Indicate whether this delay is currently scheduled, e.g., being
>> waited on"
>> beingWaitedOn := aBool! !
>>
>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 20:56'!
>> delayDuration
>> ^delayDuration! !
>>
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:35'!
>> handleTimerEvent
>> "Handle a timer event; which can be either:
>> - a schedule request (ScheduledDelay notNil)
>> - an unschedule request (FinishedDelay notNil)
>> - a timer signal (not explicitly specified)
>> We check for timer expiry every time we get a signal."
>> | nextTick |
>> "Wait until there is work to do."
>> TimingSemaphore wait.
>>
>> "Process any schedule requests"
>> ScheduledDelay ifNotNil:[
>> "Schedule the given delay"
>> self scheduleDelay: ScheduledDelay.
>> ScheduledDelay := nil.
>> ].
>>
>> "Process any unschedule requests"
>> FinishedDelay ifNotNil:[
>> self unscheduleDelay: FinishedDelay.
>> FinishedDelay := nil.
>> ].
>>
>> "Check for clock wrap-around."
>> nextTick := Time millisecondClockValue.
>> nextTick < ActiveDelayStartTime ifTrue: [
>> "clock wrapped"
>> self saveResumptionTimes.
>> self restoreResumptionTimes.
>> ].
>> ActiveDelayStartTime := nextTick.
>>
>> "Signal any expired delays"
>> [ActiveDelay notNil and:[
>> Time millisecondClockValue >= ActiveDelay resumptionTime]]
>> whileTrue:[
>> ActiveDelay signalWaitingProcess.
>> SuspendedDelays isEmpty ifTrue:
>> [ActiveDelay := nil] ifFalse:[ActiveDelay :=
>> SuspendedDelays removeFirst].
>> ].
>>
>> "And signal when the next request is due. We sleep at most 1sec here
>> as a soft busy-loop so that we don't accidentally miss signals."
>> nextTick := Time millisecondClockValue + 1000.
>> ActiveDelay ifNotNil:[nextTick := nextTick min: ActiveDelay
>> resumptionTime].
>> nextTick := nextTick min: SmallInteger maxVal.
>>
>> "Since we have processed all outstanding requests, reset the
>> timing semaphore so
>> that only new work will wake us up again. Do this RIGHT BEFORE
>> setting the next
>> wakeup call from the VM because it is only signaled once so we
>> mustn't miss it."
>> TimingSemaphore initSignals.
>> Delay primSignal: TimingSemaphore atMilliseconds: nextTick.
>> ! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 09:04'!
>> runTimerEventLoop
>> "Run the timer event loop."
>> [
>> [RunTimerEventLoop] whileTrue: [self handleTimerEvent]
>> ] on: Error do:[:ex|
>> "Clear out the process so it does't get killed"
>> TimerEventLoop := nil.
>> "Launch the old-style interrupt watcher"
>> self startTimerInterruptWatcher.
>> "And pass the exception on"
>> ex pass.
>> ].! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'!
>> scheduleDelay: aDelay
>> "Private. Schedule this Delay."
>> aDelay beingWaitedOn: true.
>> ActiveDelay ifNil:[
>> ActiveDelay := aDelay
>> ] ifNotNil:[
>> aDelay resumptionTime < ActiveDelay resumptionTime ifTrue:[
>> SuspendedDelays add: ActiveDelay.
>> ActiveDelay := aDelay.
>> ] ifFalse: [SuspendedDelays add: aDelay].
>> ].
>> ! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:18'!
>> startTimerEventLoop
>> "Start the timer event loop"
>> "Delay startTimerEventLoop"
>> self stopTimerEventLoop.
>> self stopTimerInterruptWatcher.
>> AccessProtect := Semaphore forMutualExclusion.
>> ActiveDelayStartTime := Time millisecondClockValue.
>> SuspendedDelays := Heap withAll: (SuspendedDelays
>> ifNil:[#()])
>> sortBlock: [:d1 :d2 | d1 resumptionTime <= d2
>> resumptionTime].
>> TimingSemaphore := Semaphore new.
>> RunTimerEventLoop := true.
>> TimerEventLoop := [self runTimerEventLoop] newProcess.
>> TimerEventLoop priority: Processor timingPriority.
>> TimerEventLoop resume.
>> TimingSemaphore signal. "get going"
>> ! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'!
>> startTimerInterruptWatcher
>> "Reset the class variables that keep track of active Delays and
>> re-start the timer interrupt watcher process. Any currently scheduled
>> delays are forgotten."
>> "Delay startTimerInterruptWatcher"
>> | p |
>> self stopTimerEventLoop.
>> self stopTimerInterruptWatcher.
>> TimingSemaphore := Semaphore new.
>> AccessProtect := Semaphore forMutualExclusion.
>> SuspendedDelays := SortedCollection sortBlock:
>> [:d1 :d2 | d1 resumptionTime <= d2 resumptionTime].
>> ActiveDelay := nil.
>> p := [self timerInterruptWatcher] newProcess.
>> p priority: Processor timingPriority.
>> p resume.
>> ! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:26'!
>> stopTimerEventLoop
>> "Stop the timer event loop"
>> RunTimerEventLoop := false.
>> TimingSemaphore signal.
>> TimerEventLoop := nil.! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:32'!
>> stopTimerInterruptWatcher
>> "Reset the class variables that keep track of active Delays and
>> re-start the timer interrupt watcher process. Any currently scheduled
>> delays are forgotten."
>> "Delay startTimerInterruptWatcher"
>> self primSignal: nil atMilliseconds: 0.
>> TimingSemaphore ifNotNil:[TimingSemaphore terminateProcess].! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:33'!
>> unscheduleDelay: aDelay
>> "Private. Unschedule this Delay."
>> ActiveDelay == aDelay ifTrue: [
>> SuspendedDelays isEmpty ifTrue:[
>> ActiveDelay := nil.
>> ] ifFalse: [
>> ActiveDelay := SuspendedDelays removeFirst.
>> ]
>> ] ifFalse:[
>> SuspendedDelays remove: aDelay ifAbsent: [].
>> ].
>> aDelay beingWaitedOn: false.! !
>>
>> !Delay class methodsFor: 'class initialization' stamp: 'ar 7/11/2007
>> 18:16'!
>> initialize
>> "Delay initialize"
>> self startTimerEventLoop.! !
>>
>> Delay initialize!
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>
More information about the Squeak-dev
mailing list
|