[Seaside] Delay and Server reliability

Tue Jul 24 07:45:43 UTC 2007

Hi -

We recently had some "fun" chasing server lockups (with truly awful  
uptimes of about a day or less before things went downhill) and were  
finally able to track a huge portion of it down to problems with  
Delay. The effect we were seeing on our servers was that the system  
would randomly lock up and either go down to 0% CPU or 100% CPU.

After poking it with a USR1 signal (which, in our VMs, is hooked up  
such that it prints all the call stacks in the image; it's a  
lifesaver if you need to debug these issues) we usually found either  
that all processes were waiting on Delay's AccessProtect (the 0% CPU  
case) or that a particular process (the event tickler) was sitting in  
a tight loop swallowing repeated errors complaining that "this delay  
is already scheduled" (the 100% CPU case).

After hours and hours of testing, debugging, and a little stroke of  
luck we finally found out that all of these issues were caused by the  
fact that Delay's internal structures are updated by the calling  
process (insertion into and removal from SuspendedDelays) which  
renders the process susceptible to being terminated in the midst of  
updating these structures.

If you look at the code, this is obviously an issue because if (for  
example) the calling process gets terminated while it's re-sorting  
SuspendedDelays the result is unpredictable. This is in particular an  
issue because the calling process is often running at a relatively  
low priority so interruption by other, high-priority processes is a  
common case. And if any of these higher-priority processes kills the  
one that just happens to be executing SortedCollection>>remove:, anything  
can happen - from leaving a later delay in front of an earlier one  
(one of the cases we had indicated that this was just what had  
happened) to errors when doing the next insert/remove ("trying to  
evaluate a block that is already evaluated") to many more  
weirdnesses. Unfortunately, it is basically impossible to recreate  
this problem under any kind of controlled circumstances, mostly  
because you need a source of events that is truly independent from  
your time source.
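
The failure mode described above can be illustrated with a small Python sketch. This is not Squeak code and is not taken from Delay or SortedCollection; `insert_sorted`, `Terminated`, and the `terminate_midway` flag are hypothetical names, and the exception merely stands in for a process being killed by another process partway through an update.

```python
class Terminated(Exception):
    """Stands in for a process being terminated by another process."""


def insert_sorted(wake_times, new_time, terminate_midway=False):
    """Insert new_time by appending it and swapping it down into place.

    The list is only in order again once the loop has finished, so an
    interruption inside the loop leaves it out of order.
    """
    wake_times.append(new_time)
    i = len(wake_times) - 1
    while i > 0 and wake_times[i - 1] > wake_times[i]:
        wake_times[i - 1], wake_times[i] = wake_times[i], wake_times[i - 1]
        i -= 1
        if terminate_midway:
            raise Terminated()  # "killed" after a single step of the update


def is_sorted(wake_times):
    return all(a <= b for a, b in zip(wake_times, wake_times[1:]))


schedule = [100, 200, 300]
try:
    insert_sorted(schedule, 50, terminate_midway=True)
except Terminated:
    pass  # the caller is gone; nobody finishes the update

print(schedule)             # [100, 200, 50, 300]
print(is_sorted(schedule))  # False
```

After the simulated termination, the entry for 200 sits in front of the entry for 50, which is the "later delay in front of an earlier one" situation, and every later insert or remove starts from a structure that is already inconsistent.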

As a consequence of our findings we rewrote Delay to deal with these  
issues properly and, having deployed the changes about ten days ago  
on our servers, all of these sources of problems simply vanished. We  
haven't had a single server problem which we couldn't attribute to  
our own stupidity (such as running out of disk space ;-)
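
This message does not spell out how the rewrite works. Purely as an illustration of one general way to avoid the problem, and not as a description of the attached change set, the Python sketch below lets a single dedicated thread own the schedule, so that callers never modify it themselves. `DelayScheduler` and its methods are hypothetical names.

```python
import heapq
import queue
import threading
import time


class DelayScheduler:
    """Hypothetical sketch: one thread owns the heap of pending wake-ups."""

    def __init__(self):
        self._requests = queue.Queue()  # hand-off point for callers
        self._pending = []              # heap; touched by the owner thread only
        self._seq = 0                   # tie-breaker so events are never compared
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def schedule(self, seconds):
        """Callable from any thread; returns an Event set after `seconds`."""
        event = threading.Event()
        self._requests.put((time.monotonic() + seconds, event))
        return event

    def _run(self):
        while True:
            # Sleep until the next wake-up is due or a new request arrives.
            timeout = None
            if self._pending:
                timeout = max(0.0, self._pending[0][0] - time.monotonic())
            try:
                wake_time, event = self._requests.get(timeout=timeout)
                self._seq += 1
                heapq.heappush(self._pending, (wake_time, self._seq, event))
            except queue.Empty:
                pass
            # Wake everything whose time has come.
            now = time.monotonic()
            while self._pending and self._pending[0][0] <= now:
                heapq.heappop(self._pending)[2].set()


scheduler = DelayScheduler()
done = scheduler.schedule(0.05)
woken = done.wait(timeout=2.0)  # blocks until the scheduler sets the event
```

The design point is that a caller only hands over a request and then waits. If the caller goes away while waiting, the heap is untouched, because the only code that ever inserts into or removes from it runs in the owner thread.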

The changes will in particular be helpful to you if you:
* run network servers
* fork processes to handle network requests
* terminate these processes explicitly (on error conditions for example)
* use Semaphore>>waitTimeoutMsecs: (all socket functions use this)

If you have seen random, unexplained lockups of your server (0% CPU  
load while being locked up is a dead giveaway[*]) I'd recommend trying  
the attached changes (which work best on top of a VM with David  
Lewis' 64bit fixes applied) to see if that helps. For us, they made  
the difference between running the server in Squeak and rewriting it  
in Java.

I've also filed this as http://bugs.squeak.org/view.php?id=6576

[*] The 0% CPU lockups have sometimes been attributed to issues with  
Linux wait functions. After having seen the havoc that Delay wreaks  
on the system I don't buy these explanations any longer. A much  
simpler (and more likely) explanation is that Delay went wild.

Cheers,
   - Andreas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SafeDelay.cs
Type: text/x-csharp
Size: 8835 bytes
Desc: not available
Url : http://lists.squeakfoundation.org/pipermail/seaside/attachments/20070724/49c046a8/SafeDelay-0001.bin