Delay and Server reliability

Janko Mivšek janko.mivsek at eranova.si
Tue Jul 24 08:07:06 UTC 2007


Hi Andreas,

That's very important patch and very interesting to me too, because I'm 
just deciding to put some of my public Aida/Web websites from VW to 
Squeak and I was afraid of such issues as one you just solved.

Is there any chance that this patch goes to 3.10?

Best regards
Janko

Andreas Raab wrote:
> Hi -
> 
> We recently had some "fun" chasing server lockups (with truly awful 
> uptimes of about a day or less before things went downhill) and were 
> finally able to track a huge portion of it down to problems with Delay. 
> The effect we were seeing on our servers was that the system would 
> randomly lock up and either go down to 0% CPU or 100% CPU.
> 
> After poking it with a USR1 signal (which, in our VMs is hooked up such 
> that it prints all the call stacks in the image; it's a life-safer if 
> you need to debug these issues) we usually found that all processes were 
> waiting on Delay's AccessProtect (0%) or alternatively found that a 
> particular process (the event tickler) would sit in a tight loop 
> swallowing repeated errors complaining that "this delay is already 
> scheduled".
> 
> After hours and hours of testing, debugging, and a little stroke of luck 
> we finally found out that all of these issues were caused by the fact 
> that Delay's internal structures are updated by the calling process 
> (insertion into and removal from SuspendedDelays) which renders the 
> process susceptible to being terminated in the midst of updating these 
> structures.
> 
> If you look at the code, this is obviously an issue because if (for 
> example) the calling process gets terminated while it's resorting 
> SuspendedDelays the result is unpredictable. This is in particular an 
> issue because the calling process is often running at a relatively low 
> priority so interruption by other, high-priority processes is a common 
> case. And if any of these higher priority processes kills the one that 
> just happens to execute SortedCollection>>remove: anything can happen - 
> from leaving a later delay in front of an earlier one (one of the cases 
> we had indicated that this was just what had happened) to errors when 
> doing the next insert/remove ("trying to evaluate a block that is 
> already evaluated") to many more weirdnesses. Unfortunately, it is 
> basically impossible to recreate this problem under any kind of 
> controlled circumstances, mostly because you need a source of events 
> that is truly independent from your time source.
> 
> As a consequence of our findings we rewrote Delay to deal with these 
> issues properly and, having deployed the changes about ten days ago on 
> our servers, all of these sources of problems simply vanished. We 
> haven't had a single server problem which we couldn't attribute to our 
> own stupidity (such as running out of disk space ;-)
> 
> The changes will in particular be helpful to you if you:
> * run network servers
> * fork processes to handle network requests
> * terminate these processes explicitly (on error conditions for example)
> * use Semaphore>>waitTimeoutMsecs: (all socket functions use this)
> 
> If you have seen random, unexplained lockups of your server (0% CPU load 
> while being locked up is a dead giveaway[*]) I'd recommend using the 
> attached changes (which work best on top of a VM with David Lewis' 64bit 
> fixes applied) and see if that helps. For us, they made the difference 
> between running the server in Squeak and rewriting it in Java.
> 
> I've also filed this as http://bugs.squeak.org/view.php?id=6576
> 
> [*] The 0% CPU lockups have sometimes been attributed to issues with 
> Linux wait functions. After having seen the havoc that Delay wrecks on 
> the system I don't buy these explanations any longer. A much simpler 
> (and more likely) explanation is that Delay went wild.
> 
> Cheers,
>   - Andreas
> 
> 
> ------------------------------------------------------------------------
> 
> 'From Croquet1.0beta of 11 April 2006 [latest update: #1] on 23 July 2007 at 11:53:23 pm'!
> "Change Set:		SafeDelay
> Date:			23 July 2007
> Author:			Andreas Raab
> 
> This change set fixes a set of severe problems with concurrent use of Delay. Previously, many of the delay-internal structures were modified by the calling process which made it susceptible to being terminated in the middle of manipulating these structures and leave Delay (and consequently the entire system) in an inconsistent state.
> 
> This change set fixes this problem by moving *all* manipulation of Delay's internal structures out of the calling process. As a side-effect it also removes the requirement of Delays being limited to SmallInteger range; the new code has no limitation on the duration of a delay.
> 
> No tests are provided since outside of true asynchronous environments (networks) it is basically impossible to recreate the situation reliably."!
> 
> 
> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:24'!
> activate
> 	"Private!! Make the receiver the Delay to be awoken when the next timer interrupt occurs. This method should only be called from a block protected by the AccessProtect semaphore."
> 	TimerEventLoop ifNotNil:[^nil].
> 	ActiveDelay := self.
> 	ActiveDelayStartTime := Time millisecondClockValue.
> 	ActiveDelayStartTime > resumptionTime ifTrue:[
> 		ActiveDelay signalWaitingProcess.
> 		SuspendedDelays isEmpty ifTrue:[
> 			ActiveDelay := nil.
> 			ActiveDelayStartTime := nil.
> 		] ifFalse:[SuspendedDelays removeFirst activate].
> 	] ifFalse:[
> 		TimingSemaphore initSignals.
> 		Delay primSignal: TimingSemaphore atMilliseconds: resumptionTime.
> 	].! !
> 
> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
> schedule
> 	"Private!! Schedule this Delay, but return immediately rather than waiting. The receiver's semaphore will be signalled when its delay duration has elapsed."
> 
> 	beingWaitedOn ifTrue: [self error: 'This Delay has already been scheduled.'].
> 
> 	TimerEventLoop ifNotNil:[^self scheduleEvent].
> 	AccessProtect critical: [
> 		beingWaitedOn := true.
> 		resumptionTime := Time millisecondClockValue + delayDuration.
> 		ActiveDelay == nil
> 			ifTrue: [self activate]
> 			ifFalse: [
> 				resumptionTime < ActiveDelay resumptionTime
> 					ifTrue: [
> 						SuspendedDelays add: ActiveDelay.
> 						self activate]
> 					ifFalse: [SuspendedDelays add: self]]].
> ! !
> 
> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 22:33'!
> scheduleEvent
> 	"Schedule this delay"
> 	resumptionTime := Time millisecondClockValue + delayDuration.
> 	AccessProtect critical:[
> 		ScheduledDelay := self.
> 		TimingSemaphore signal.
> 	].! !
> 
> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
> unschedule
> 	"Unschedule this Delay. Do nothing if it wasn't scheduled."
> 
> 	| done |
> 	TimerEventLoop ifNotNil:[^self unscheduleEvent].
> 	AccessProtect critical: [
> 		done := false.
> 		[done] whileFalse:
> 			[SuspendedDelays remove: self ifAbsent: [done := true]].
> 		ActiveDelay == self ifTrue: [
> 			SuspendedDelays isEmpty
> 				ifTrue: [
> 					ActiveDelay := nil.
> 					ActiveDelayStartTime := nil]
> 				ifFalse: [
> 					SuspendedDelays removeFirst activate]]].
> ! !
> 
> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:56'!
> unscheduleEvent
> 	AccessProtect critical:[
> 		FinishedDelay := self.
> 		TimingSemaphore signal.
> 	].! !
> 
> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
> beingWaitedOn
> 	"Answer whether this delay is currently scheduled, e.g., being waited on"
> 	^beingWaitedOn! !
> 
> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
> beingWaitedOn: aBool
> 	"Indicate whether this delay is currently scheduled, e.g., being waited on"
> 	beingWaitedOn := aBool! !
> 
> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 20:56'!
> delayDuration
> 	^delayDuration! !
> 
> 
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:35'!
> handleTimerEvent
> 	"Handle a timer event; which can be either:
> 		- a schedule request (ScheduledDelay notNil)
> 		- an unschedule request (FinishedDelay notNil)
> 		- a timer signal (not explicitly specified)
> 	We check for timer expiry every time we get a signal."
> 	| nextTick |
> 	"Wait until there is work to do."
> 	TimingSemaphore wait.
> 
> 	"Process any schedule requests"
> 	ScheduledDelay ifNotNil:[
> 		"Schedule the given delay"
> 		self scheduleDelay: ScheduledDelay.
> 		ScheduledDelay := nil.
> 	].
> 
> 	"Process any unschedule requests"
> 	FinishedDelay ifNotNil:[
> 		self unscheduleDelay: FinishedDelay.
> 		FinishedDelay := nil.
> 	].
> 
> 	"Check for clock wrap-around."
> 	nextTick := Time millisecondClockValue.
> 	nextTick < ActiveDelayStartTime ifTrue: [
> 		"clock wrapped"
> 		self saveResumptionTimes.
> 		self restoreResumptionTimes.
> 	].
> 	ActiveDelayStartTime := nextTick.
> 
> 	"Signal any expired delays"
> 	[ActiveDelay notNil and:[
> 		Time millisecondClockValue >= ActiveDelay resumptionTime]] whileTrue:[
> 			ActiveDelay signalWaitingProcess.
> 			SuspendedDelays isEmpty 
> 				ifTrue: [ActiveDelay := nil] 
> 				ifFalse:[ActiveDelay := SuspendedDelays removeFirst].
> 		].
> 
> 	"And signal when the next request is due. We sleep at most 1sec here
> 	as a soft busy-loop so that we don't accidentally miss signals."
> 	nextTick := Time millisecondClockValue + 1000.
> 	ActiveDelay ifNotNil:[nextTick := nextTick min: ActiveDelay resumptionTime].
> 	nextTick := nextTick min: SmallInteger maxVal.
> 
> 	"Since we have processed all outstanding requests, reset the timing semaphore so
> 	that only new work will wake us up again. Do this RIGHT BEFORE setting the next
> 	wakeup call from the VM because it is only signaled once so we mustn't miss it."
> 	TimingSemaphore initSignals.
> 	Delay primSignal: TimingSemaphore atMilliseconds: nextTick.
> ! !
> 
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 09:04'!
> runTimerEventLoop
> 	"Run the timer event loop."
> 	[
> 		[RunTimerEventLoop] whileTrue: [self handleTimerEvent]
> 	] on: Error do:[:ex|
> 		"Clear out the process so it does't get killed"
> 		TimerEventLoop := nil.
> 		"Launch the old-style interrupt watcher"
> 		self startTimerInterruptWatcher.
> 		"And pass the exception on"
> 		ex pass.
> 	].! !
> 
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'!
> scheduleDelay: aDelay
> 	"Private. Schedule this Delay."
> 	aDelay beingWaitedOn: true.
> 	ActiveDelay ifNil:[
> 		ActiveDelay := aDelay
> 	] ifNotNil:[
> 		aDelay resumptionTime < ActiveDelay resumptionTime ifTrue:[
> 			SuspendedDelays add: ActiveDelay.
> 			ActiveDelay := aDelay.
> 		] ifFalse: [SuspendedDelays add: aDelay].
> 	].
> ! !
> 
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:18'!
> startTimerEventLoop
> 	"Start the timer event loop"
> 	"Delay startTimerEventLoop"
> 	self stopTimerEventLoop.
> 	self stopTimerInterruptWatcher.
> 	AccessProtect := Semaphore forMutualExclusion.
> 	ActiveDelayStartTime := Time millisecondClockValue.
> 	SuspendedDelays := 
> 		Heap withAll: (SuspendedDelays ifNil:[#()])
> 			sortBlock: [:d1 :d2 | d1 resumptionTime <= d2 resumptionTime].
> 	TimingSemaphore := Semaphore new.
> 	RunTimerEventLoop := true.
> 	TimerEventLoop := [self runTimerEventLoop] newProcess.
> 	TimerEventLoop priority: Processor timingPriority.
> 	TimerEventLoop resume.
> 	TimingSemaphore signal. "get going"
> ! !
> 
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'!
> startTimerInterruptWatcher
> 	"Reset the class variables that keep track of active Delays and re-start the timer interrupt watcher process. Any currently scheduled delays are forgotten."
> 	"Delay startTimerInterruptWatcher"
> 	| p |
> 	self stopTimerEventLoop.
> 	self stopTimerInterruptWatcher.
> 	TimingSemaphore := Semaphore new.
> 	AccessProtect := Semaphore forMutualExclusion.
> 	SuspendedDelays := 
> 		SortedCollection sortBlock: 
> 			[:d1 :d2 | d1 resumptionTime <= d2 resumptionTime].
> 	ActiveDelay := nil.
> 	p := [self timerInterruptWatcher] newProcess.
> 	p priority: Processor timingPriority.
> 	p resume.
> ! !
> 
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:26'!
> stopTimerEventLoop
> 	"Stop the timer event loop"
> 	RunTimerEventLoop := false.
> 	TimingSemaphore signal.
> 	TimerEventLoop := nil.! !
> 
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:32'!
> stopTimerInterruptWatcher
> 	"Reset the class variables that keep track of active Delays and re-start the timer interrupt watcher process. Any currently scheduled delays are forgotten."
> 	"Delay startTimerInterruptWatcher"
> 	self primSignal: nil atMilliseconds: 0.
> 	TimingSemaphore ifNotNil:[TimingSemaphore terminateProcess].! !
> 
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:33'!
> unscheduleDelay: aDelay
> 	"Private. Unschedule this Delay."
> 	ActiveDelay == aDelay ifTrue: [
> 		SuspendedDelays isEmpty ifTrue:[
> 			ActiveDelay := nil.
> 		] ifFalse: [
> 			ActiveDelay := SuspendedDelays removeFirst.
> 		]
> 	] ifFalse:[
> 		SuspendedDelays remove: aDelay ifAbsent: [].
> 	].
> 	aDelay beingWaitedOn: false.! !
> 
> !Delay class methodsFor: 'class initialization' stamp: 'ar 7/11/2007 18:16'!
> initialize
> 	"Delay initialize"
> 	self startTimerEventLoop.! !
> 
> Delay initialize!
> 
> 
> ------------------------------------------------------------------------
> 
> 

-- 
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si



More information about the Squeak-dev mailing list