Delay and Server reliability

Ron Teitelbaum Ron at USMedRec.com
Tue Jul 24 13:46:21 UTC 2007


Hi Andreas,

This is Terrific!!  Thank you for doing it!!

Ron

> -----Original Message-----
> From: squeak-dev-bounces at lists.squeakfoundation.org [mailto:squeak-dev-
> bounces at lists.squeakfoundation.org] On Behalf Of Andreas Raab
> Sent: Tuesday, July 24, 2007 3:46 AM
> To: The general-purpose Squeak developers list
> Subject: Delay and Server reliability
> 
> Hi -
> 
> We recently had some "fun" chasing server lockups (with truly awful
> uptimes of about a day or less before things went downhill) and were
> finally able to track a huge portion of it down to problems with Delay.
> The effect we were seeing on our servers was that the system would
> randomly lock up and either go down to 0% CPU or 100% CPU.
> 
> After poking it with a USR1 signal (which, in our VMs, is hooked up such
> that it prints all the call stacks in the image; it's a life-saver if
> you need to debug these issues) we usually found that all processes were
> waiting on Delay's AccessProtect (0%) or alternatively found that a
> particular process (the event tickler) would sit in a tight loop
> swallowing repeated errors complaining that "this delay is already
> scheduled".
> 
> After hours and hours of testing, debugging, and a little stroke of luck
> we finally found out that all of these issues were caused by the fact
> that Delay's internal structures are updated by the calling process
> (insertion into and removal from SuspendedDelays) which renders the
> process susceptible to being terminated in the midst of updating these
> structures.
> 
> If you look at the code, this is obviously an issue because if (for
> example) the calling process gets terminated while it's re-sorting
> SuspendedDelays the result is unpredictable. This is in particular an
> issue because the calling process is often running at a relatively low
> priority so interruption by other, high-priority processes is a common
> case. And if any of these higher priority processes kills the one that
> just happens to execute SortedCollection>>remove: anything can happen -
> from leaving a later delay in front of an earlier one (one of the cases
> we had indicated that this was just what had happened) to errors when
> doing the next insert/remove ("trying to evaluate a block that is
> already evaluated") to many more weirdnesses. Unfortunately, it is
> basically impossible to recreate this problem under any kind of
> controlled circumstances, mostly because you need a source of events
> that is truly independent from your time source.
> 
> As a consequence of our findings we rewrote Delay to deal with these
> issues properly and, having deployed the changes about ten days ago on
> our servers, all of these sources of problems simply vanished. We
> haven't had a single server problem which we couldn't attribute to our
> own stupidity (such as running out of disk space ;-)
> 
> The changes will in particular be helpful to you if you:
> * run network servers
> * fork processes to handle network requests
> * terminate these processes explicitly (on error conditions for example)
> * use Semaphore>>waitTimeoutMsecs: (all socket functions use this)
> 
> If you have seen random, unexplained lockups of your server (0% CPU load
> while being locked up is a dead giveaway[*]) I'd recommend using the
> attached changes (which work best on top of a VM with David Lewis' 64bit
> fixes applied) and see if that helps. For us, they made the difference
> between running the server in Squeak and rewriting it in Java.
> 
> I've also filed this as http://bugs.squeak.org/view.php?id=6576
> 
> [*] The 0% CPU lockups have sometimes been attributed to issues with
> Linux wait functions. After having seen the havoc that Delay wreaks on
> the system I don't buy these explanations any longer. A much simpler
> (and more likely) explanation is that Delay went wild.
> 
> Cheers,
>    - Andreas
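
The hazard described in the quoted message (a caller being terminated halfway through updating a shared sorted structure) can be avoided in general by letting one dedicated process own that structure, so that callers only hand over requests and never touch it themselves. The Python sketch below illustrates that general pattern only. It is not the attached Squeak change set and makes no claim about how the rewritten Delay works; the names DelayScheduler, schedule and _run are hypothetical.

```python
import heapq
import queue
import threading
import time


class DelayScheduler:
    """Illustrative only: a single owner thread mutates the pending-delay heap."""

    def __init__(self):
        self._requests = queue.Queue()  # the only object callers touch
        self._pending = []              # heap of (wake_time, seq, event); owner thread only
        self._seq = 0                   # tie-breaker so Event objects are never compared
        threading.Thread(target=self._run, daemon=True).start()

    def schedule(self, wake_time):
        """Request a wake-up at wake_time (on the time.monotonic() scale)."""
        done = threading.Event()
        self._requests.put((wake_time, done))
        return done

    def _run(self):
        while True:
            # Sleep until the earliest pending delay expires or a request arrives.
            timeout = None
            if self._pending:
                timeout = max(0.0, self._pending[0][0] - time.monotonic())
            try:
                wake_time, done = self._requests.get(timeout=timeout)
                self._seq += 1
                heapq.heappush(self._pending, (wake_time, self._seq, done))
            except queue.Empty:
                pass
            # Fire every delay whose time has come.
            now = time.monotonic()
            while self._pending and self._pending[0][0] <= now:
                heapq.heappop(self._pending)[2].set()


# Usage: the caller only enqueues a request and waits on the returned event.
scheduler = DelayScheduler()
event = scheduler.schedule(time.monotonic() + 0.01)
event.wait(timeout=2.0)
```

Because a caller never inserts into or removes from the heap, abandoning a caller at any point cannot leave the heap half-updated; at worst an event fires that nobody is waiting on.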
