Delay and Server reliability
stephane ducasse
stephane.ducasse at free.fr
Tue Jul 24 16:32:36 UTC 2007
Yes.
I imagine the pain you got to chase it....
This kind of bug huge pain.
Stef
On 24 juil. 07, at 10:17, Andreas Raab wrote:
> You can apply the fix yourself; it works in all Squeak versions
> that I'm aware of (and if not, you'll find out really quickly ;-)
> This is just the kind of thing for which I wanted to see some sort
> of "standard package" for so that people across various Squeak
> versions can benefit from it.
>
> Cheers,
> - Andreas
>
> Janko Mivšek wrote:
>> Hi Andreas,
>> That's very important patch and very interesting to me too,
>> because I'm just deciding to put some of my public Aida/Web
>> websites from VW to Squeak and I was afraid of such issues as one
>> you just solved.
>> Is there any chance that this patch goes to 3.10?
>> Best regards
>> Janko
>> Andreas Raab wrote:
>>> Hi -
>>>
>>> We recently had some "fun" chasing server lockups (with truly
>>> awful uptimes of about a day or less before things went downhill)
>>> and were finally able to track a huge portion of it down to
>>> problems with Delay. The effect we were seeing on our servers was
>>> that the system would randomly lock up and either go down to 0%
>>> CPU or 100% CPU.
>>>
>>> After poking it with a USR1 signal (which, in our VMs is hooked
>>> up such that it prints all the call stacks in the image; it's a
>>> life-safer if you need to debug these issues) we usually found
>>> that all processes were waiting on Delay's AccessProtect (0%) or
>>> alternatively found that a particular process (the event tickler)
>>> would sit in a tight loop swallowing repeated errors complaining
>>> that "this delay is already scheduled".
>>>
>>> After hours and hours of testing, debugging, and a little stroke
>>> of luck we finally found out that all of these issues were caused
>>> by the fact that Delay's internal structures are updated by the
>>> calling process (insertion into and removal from SuspendedDelays)
>>> which renders the process susceptible to being terminated in the
>>> midst of updating these structures.
>>>
>>> If you look at the code, this is obviously an issue because if
>>> (for example) the calling process gets terminated while it's
>>> resorting SuspendedDelays the result is unpredictable. This is in
>>> particular an issue because the calling process is often running
>>> at a relatively low priority so interruption by other, high-
>>> priority processes is a common case. And if any of these higher
>>> priority processes kills the one that just happens to execute
>>> SortedCollection>>remove: anything can happen - from leaving a
>>> later delay in front of an earlier one (one of the cases we had
>>> indicated that this was just what had happened) to errors when
>>> doing the next insert/remove ("trying to evaluate a block that is
>>> already evaluated") to many more weirdnesses. Unfortunately, it
>>> is basically impossible to recreate this problem under any kind
>>> of controlled circumstances, mostly because you need a source of
>>> events that is truly independent from your time source.
>>>
>>> As a consequence of our findings we rewrote Delay to deal with
>>> these issues properly and, having deployed the changes about ten
>>> days ago on our servers, all of these sources of problems simply
>>> vanished. We haven't had a single server problem which we
>>> couldn't attribute to our own stupidity (such as running out of
>>> disk space ;-)
>>>
>>> The changes will in particular be helpful to you if you:
>>> * run network servers
>>> * fork processes to handle network requests
>>> * terminate these processes explicitly (on error conditions for
>>> example)
>>> * use Semaphore>>waitTimeoutMsecs: (all socket functions use this)
>>>
>>> If you have seen random, unexplained lockups of your server (0%
>>> CPU load while being locked up is a dead giveaway[*]) I'd
>>> recommend using the attached changes (which work best on top of a
>>> VM with David Lewis' 64bit fixes applied) and see if that helps.
>>> For us, they made the difference between running the server in
>>> Squeak and rewriting it in Java.
>>>
>>> I've also filed this as http://bugs.squeak.org/view.php?id=6576
>>>
>>> [*] The 0% CPU lockups have sometimes been attributed to issues
>>> with Linux wait functions. After having seen the havoc that Delay
>>> wrecks on the system I don't buy these explanations any longer. A
>>> much simpler (and more likely) explanation is that Delay went wild.
>>>
>>> Cheers,
>>> - Andreas
>>>
>>>
>>> --------------------------------------------------------------------
>>> ----
>>>
>>> 'From Croquet1.0beta of 11 April 2006 [latest update: #1] on 23
>>> July 2007 at 11:53:23 pm'!
>>> "Change Set: SafeDelay
>>> Date: 23 July 2007
>>> Author: Andreas Raab
>>>
>>> This change set fixes a set of severe problems with concurrent
>>> use of Delay. Previously, many of the delay-internal structures
>>> were modified by the calling process which made it susceptible to
>>> being terminated in the middle of manipulating these structures
>>> and leave Delay (and consequently the entire system) in an
>>> inconsistent state.
>>>
>>> This change set fixes this problem by moving *all* manipulation
>>> of Delay's internal structures out of the calling process. As a
>>> side-effect it also removes the requirement of Delays being
>>> limited to SmallInteger range; the new code has no limitation on
>>> the duration of a delay.
>>>
>>> No tests are provided since outside of true asynchronous
>>> environments (networks) it is basically impossible to recreate
>>> the situation reliably."!
>>>
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:24'!
>>> activate
>>> "Private!! Make the receiver the Delay to be awoken when the
>>> next timer interrupt occurs. This method should only be called
>>> from a block protected by the AccessProtect semaphore."
>>> TimerEventLoop ifNotNil:[^nil].
>>> ActiveDelay := self.
>>> ActiveDelayStartTime := Time millisecondClockValue.
>>> ActiveDelayStartTime > resumptionTime ifTrue:[
>>> ActiveDelay signalWaitingProcess.
>>> SuspendedDelays isEmpty ifTrue:[
>>> ActiveDelay := nil.
>>> ActiveDelayStartTime := nil.
>>> ] ifFalse:[SuspendedDelays removeFirst activate].
>>> ] ifFalse:[
>>> TimingSemaphore initSignals.
>>> Delay primSignal: TimingSemaphore atMilliseconds:
>>> resumptionTime.
>>> ].! !
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
>>> schedule
>>> "Private!! Schedule this Delay, but return immediately rather
>>> than waiting. The receiver's semaphore will be signalled when its
>>> delay duration has elapsed."
>>>
>>> beingWaitedOn ifTrue: [self error: 'This Delay has already
>>> been scheduled.'].
>>>
>>> TimerEventLoop ifNotNil:[^self scheduleEvent].
>>> AccessProtect critical: [
>>> beingWaitedOn := true.
>>> resumptionTime := Time millisecondClockValue +
>>> delayDuration.
>>> ActiveDelay == nil
>>> ifTrue: [self activate]
>>> ifFalse: [
>>> resumptionTime < ActiveDelay resumptionTime
>>> ifTrue: [
>>> SuspendedDelays add: ActiveDelay.
>>> self activate]
>>> ifFalse: [SuspendedDelays add: self]]].
>>> ! !
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 22:33'!
>>> scheduleEvent
>>> "Schedule this delay"
>>> resumptionTime := Time millisecondClockValue + delayDuration.
>>> AccessProtect critical:[
>>> ScheduledDelay := self.
>>> TimingSemaphore signal.
>>> ].! !
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
>>> unschedule
>>> "Unschedule this Delay. Do nothing if it wasn't scheduled."
>>>
>>> | done |
>>> TimerEventLoop ifNotNil:[^self unscheduleEvent].
>>> AccessProtect critical: [
>>> done := false.
>>> [done] whileFalse:
>>> [SuspendedDelays remove: self ifAbsent: [done := true]].
>>> ActiveDelay == self ifTrue: [
>>> SuspendedDelays isEmpty
>>> ifTrue: [
>>> ActiveDelay := nil.
>>> ActiveDelayStartTime := nil]
>>> ifFalse: [
>>> SuspendedDelays removeFirst activate]]].
>>> ! !
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:56'!
>>> unscheduleEvent
>>> AccessProtect critical:[
>>> FinishedDelay := self.
>>> TimingSemaphore signal.
>>> ].! !
>>>
>>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
>>> beingWaitedOn
>>> "Answer whether this delay is currently scheduled, e.g.,
>>> being waited on"
>>> ^beingWaitedOn! !
>>>
>>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
>>> beingWaitedOn: aBool
>>> "Indicate whether this delay is currently scheduled, e.g.,
>>> being waited on"
>>> beingWaitedOn := aBool! !
>>>
>>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 20:56'!
>>> delayDuration
>>> ^delayDuration! !
>>>
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007
>>> 10:35'!
>>> handleTimerEvent
>>> "Handle a timer event; which can be either:
>>> - a schedule request (ScheduledDelay notNil)
>>> - an unschedule request (FinishedDelay notNil)
>>> - a timer signal (not explicitly specified)
>>> We check for timer expiry every time we get a signal."
>>> | nextTick |
>>> "Wait until there is work to do."
>>> TimingSemaphore wait.
>>>
>>> "Process any schedule requests"
>>> ScheduledDelay ifNotNil:[
>>> "Schedule the given delay"
>>> self scheduleDelay: ScheduledDelay.
>>> ScheduledDelay := nil.
>>> ].
>>>
>>> "Process any unschedule requests"
>>> FinishedDelay ifNotNil:[
>>> self unscheduleDelay: FinishedDelay.
>>> FinishedDelay := nil.
>>> ].
>>>
>>> "Check for clock wrap-around."
>>> nextTick := Time millisecondClockValue.
>>> nextTick < ActiveDelayStartTime ifTrue: [
>>> "clock wrapped"
>>> self saveResumptionTimes.
>>> self restoreResumptionTimes.
>>> ].
>>> ActiveDelayStartTime := nextTick.
>>>
>>> "Signal any expired delays"
>>> [ActiveDelay notNil and:[
>>> Time millisecondClockValue >= ActiveDelay
>>> resumptionTime]] whileTrue:[
>>> ActiveDelay signalWaitingProcess.
>>> SuspendedDelays isEmpty ifTrue:
>>> [ActiveDelay := nil] ifFalse:[ActiveDelay :=
>>> SuspendedDelays removeFirst].
>>> ].
>>>
>>> "And signal when the next request is due. We sleep at most
>>> 1sec here
>>> as a soft busy-loop so that we don't accidentally miss signals."
>>> nextTick := Time millisecondClockValue + 1000.
>>> ActiveDelay ifNotNil:[nextTick := nextTick min: ActiveDelay
>>> resumptionTime].
>>> nextTick := nextTick min: SmallInteger maxVal.
>>>
>>> "Since we have processed all outstanding requests, reset the
>>> timing semaphore so
>>> that only new work will wake us up again. Do this RIGHT
>>> BEFORE setting the next
>>> wakeup call from the VM because it is only signaled once so
>>> we mustn't miss it."
>>> TimingSemaphore initSignals.
>>> Delay primSignal: TimingSemaphore atMilliseconds: nextTick.
>>> ! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007
>>> 09:04'!
>>> runTimerEventLoop
>>> "Run the timer event loop."
>>> [
>>> [RunTimerEventLoop] whileTrue: [self handleTimerEvent]
>>> ] on: Error do:[:ex|
>>> "Clear out the process so it does't get killed"
>>> TimerEventLoop := nil.
>>> "Launch the old-style interrupt watcher"
>>> self startTimerInterruptWatcher.
>>> "And pass the exception on"
>>> ex pass.
>>> ].! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007
>>> 22:32'!
>>> scheduleDelay: aDelay
>>> "Private. Schedule this Delay."
>>> aDelay beingWaitedOn: true.
>>> ActiveDelay ifNil:[
>>> ActiveDelay := aDelay
>>> ] ifNotNil:[
>>> aDelay resumptionTime < ActiveDelay resumptionTime ifTrue:[
>>> SuspendedDelays add: ActiveDelay.
>>> ActiveDelay := aDelay.
>>> ] ifFalse: [SuspendedDelays add: aDelay].
>>> ].
>>> ! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007
>>> 10:18'!
>>> startTimerEventLoop
>>> "Start the timer event loop"
>>> "Delay startTimerEventLoop"
>>> self stopTimerEventLoop.
>>> self stopTimerInterruptWatcher.
>>> AccessProtect := Semaphore forMutualExclusion.
>>> ActiveDelayStartTime := Time millisecondClockValue.
>>> SuspendedDelays := Heap withAll: (SuspendedDelays
>>> ifNil:[#()])
>>> sortBlock: [:d1 :d2 | d1 resumptionTime <= d2
>>> resumptionTime].
>>> TimingSemaphore := Semaphore new.
>>> RunTimerEventLoop := true.
>>> TimerEventLoop := [self runTimerEventLoop] newProcess.
>>> TimerEventLoop priority: Processor timingPriority.
>>> TimerEventLoop resume.
>>> TimingSemaphore signal. "get going"
>>> ! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007
>>> 22:32'!
>>> startTimerInterruptWatcher
>>> "Reset the class variables that keep track of active Delays
>>> and re-start the timer interrupt watcher process. Any currently
>>> scheduled delays are forgotten."
>>> "Delay startTimerInterruptWatcher"
>>> | p |
>>> self stopTimerEventLoop.
>>> self stopTimerInterruptWatcher.
>>> TimingSemaphore := Semaphore new.
>>> AccessProtect := Semaphore forMutualExclusion.
>>> SuspendedDelays := SortedCollection
>>> sortBlock: [:d1 :d2 | d1 resumptionTime <= d2
>>> resumptionTime].
>>> ActiveDelay := nil.
>>> p := [self timerInterruptWatcher] newProcess.
>>> p priority: Processor timingPriority.
>>> p resume.
>>> ! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007
>>> 21:26'!
>>> stopTimerEventLoop
>>> "Stop the timer event loop"
>>> RunTimerEventLoop := false.
>>> TimingSemaphore signal.
>>> TimerEventLoop := nil.! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007
>>> 21:32'!
>>> stopTimerInterruptWatcher
>>> "Reset the class variables that keep track of active Delays
>>> and re-start the timer interrupt watcher process. Any currently
>>> scheduled delays are forgotten."
>>> "Delay startTimerInterruptWatcher"
>>> self primSignal: nil atMilliseconds: 0.
>>> TimingSemaphore ifNotNil:[TimingSemaphore terminateProcess].! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007
>>> 22:33'!
>>> unscheduleDelay: aDelay
>>> "Private. Unschedule this Delay."
>>> ActiveDelay == aDelay ifTrue: [
>>> SuspendedDelays isEmpty ifTrue:[
>>> ActiveDelay := nil.
>>> ] ifFalse: [
>>> ActiveDelay := SuspendedDelays removeFirst.
>>> ]
>>> ] ifFalse:[
>>> SuspendedDelays remove: aDelay ifAbsent: [].
>>> ].
>>> aDelay beingWaitedOn: false.! !
>>>
>>> !Delay class methodsFor: 'class initialization' stamp: 'ar
>>> 7/11/2007 18:16'!
>>> initialize
>>> "Delay initialize"
>>> self startTimerEventLoop.! !
>>>
>>> Delay initialize!
>>>
>>>
>>> --------------------------------------------------------------------
>>> ----
>>>
>>>
>
>
>
More information about the Squeak-dev
mailing list
|