Delay and Server reliability
Andreas Raab
andreas.raab at gmx.de
Tue Jul 24 07:45:43 UTC 2007
Hi -
We recently had some "fun" chasing server lockups (with truly awful
uptimes of about a day or less before things went downhill) and were
finally able to track a huge portion of them down to problems with Delay.
The effect we were seeing on our servers was that the system would
randomly lock up and either go down to 0% CPU or 100% CPU.
After poking it with a USR1 signal (which, in our VMs, is hooked up such
that it prints all the call stacks in the image; it's a lifesaver if
you need to debug these issues) we usually found that all processes were
waiting on Delay's AccessProtect (0%) or alternatively found that a
particular process (the event tickler) would sit in a tight loop
swallowing repeated errors complaining that "this delay is already
scheduled".
After hours and hours of testing, debugging, and a little stroke of luck
we finally found out that all of these issues were caused by the fact
that Delay's internal structures are updated by the calling process
(insertion into and removal from SuspendedDelays) which renders the
process susceptible to being terminated in the midst of updating these
structures.
If you look at the code, this is obviously an issue because if (for
example) the calling process gets terminated while it's re-sorting
SuspendedDelays the result is unpredictable. This is in particular an
issue because the calling process is often running at a relatively low
priority so interruption by other, high-priority processes is a common
case. And if any of these higher priority processes kills the one that
just happens to execute SortedCollection>>remove: anything can happen -
from leaving a later delay in front of an earlier one (one of the cases
we had indicated that this was just what had happened) to errors when
doing the next insert/remove ("trying to evaluate a block that is
already evaluated") to many more weirdnesses. Unfortunately, it is
basically impossible to recreate this problem under any kind of
controlled circumstances, mostly because you need a source of events
that is truly independent from your time source.
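One general way to avoid this class of problem is to let a single
dedicated process own the sorted structure and have callers only hand it
requests, so that killing a caller can never leave the structure half
updated. The Python sketch below illustrates that ownership pattern only;
it is not taken from the attached change set, and DelayScheduler,
schedule, unschedule and snapshot are hypothetical names. Python threads
cannot be terminated the way Squeak processes can, so this shows the
shape of the idea rather than reproducing the bug, and it assumes the
hand-off to the owner is itself safe against interruption.

```python
import heapq
import itertools
import queue
import threading


class DelayScheduler:
    """Hypothetical sketch: only the owner thread ever touches the heap."""

    def __init__(self):
        self._requests = queue.Queue()  # thread-safe hand-off from callers
        self._heap = []                 # (wake_time, seq, name); owner thread only
        self._seq = itertools.count()   # tie-breaker for equal wake times
        self._owner = threading.Thread(target=self._run, daemon=True)
        self._owner.start()

    def schedule(self, name, wake_time):
        # The caller only posts a request; it never mutates the heap.
        self._requests.put(("schedule", name, wake_time))

    def unschedule(self, name):
        self._requests.put(("unschedule", name, None))

    def snapshot(self):
        # Ask the owner for the pending names in wake-up order.
        reply = queue.Queue(maxsize=1)
        self._requests.put(("snapshot", reply, None))
        return reply.get()

    def stop(self):
        self._requests.put(("stop", None, None))
        self._owner.join()

    def _run(self):
        # All inserts and removals happen here, one request at a time.
        while True:
            op, arg, wake_time = self._requests.get()
            if op == "schedule":
                heapq.heappush(self._heap, (wake_time, next(self._seq), arg))
            elif op == "unschedule":
                self._heap = [e for e in self._heap if e[2] != arg]
                heapq.heapify(self._heap)
            elif op == "snapshot":
                arg.put([name for _, _, name in sorted(self._heap)])
            elif op == "stop":
                return


if __name__ == "__main__":
    scheduler = DelayScheduler()
    scheduler.schedule("late", 300)
    scheduler.schedule("early", 100)
    scheduler.schedule("middle", 200)
    scheduler.unschedule("middle")
    print(scheduler.snapshot())  # prints ['early', 'late']
    scheduler.stop()
```

Because requests are processed strictly in arrival order by one thread,
a caller that disappears after posting a request leaves behind at most a
complete, well-formed request, never a partially sorted collection.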
As a consequence of our findings we rewrote Delay to deal with these
issues properly and, having deployed the changes about ten days ago on
our servers, all of these sources of problems simply vanished. We
haven't had a single server problem which we couldn't attribute to our
own stupidity (such as running out of disk space ;-)
The changes will in particular be helpful to you if you:
* run network servers
* fork processes to handle network requests
* terminate these processes explicitly (on error conditions for example)
* use Semaphore>>waitTimeoutMsecs: (all socket functions use this)
If you have seen random, unexplained lockups of your server (0% CPU load
while being locked up is a dead giveaway[*]) I'd recommend using the
attached changes (which work best on top of a VM with David Lewis' 64-bit
fixes applied) and seeing if that helps. For us, they made the difference
between running the server in Squeak and rewriting it in Java.
I've also filed this as http://bugs.squeak.org/view.php?id=6576
[*] The 0% CPU lockups have sometimes been attributed to issues with
Linux wait functions. After having seen the havoc that Delay wreaks on
the system I don't buy these explanations any longer. A much simpler
(and more likely) explanation is that Delay went wild.
Cheers,
- Andreas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SafeDelay.cs
Type: text/x-csharp
Size: 8835 bytes
Desc: not available
Url : http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20070724/cd6572d4/SafeDelay.bin