An analysis of interrupt behavior (was: Re: 6293LowSpaceWatcherFix-dtl considered harmful)
Andreas Raab
andreas.raab at gmx.de
Fri Apr 1 05:30:09 UTC 2005
Hi David -
Well, let's see if I can reverse engineer the problem from the supposed
solution ;-)
But first, we need an understanding of what the code in
Project>>interruptName: was trying to achieve originally: When we
encounter a user interrupt or a low-space condition, a semaphore in the
VM gets signaled and a higher-priority process reacts to the semaphore[*].
[*] I will ignore the subtle difference between the VM signaling the
user interrupt semaphore and it being signaled by the event tickler
since I don't think this matters (but we'll see - I'm writing this as I
go ;-)
This higher priority process (indirectly) invokes
Project>>interruptName: which then needs to figure out which process to
actually suspend. This is not at all trivial since the VM does not
retain sufficient information about the process that was active when the
semaphore got signaled - and therefore Project>>interruptName: tries to
infer which process got interrupted (by using
Processor>>preemptedProcess) and then put this process to sleep for
further debugging.
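In rough Squeak terms, the dispatch boils down to something like the
sketch below. This is a paraphrase from memory, not the verbatim method -
in particular the exact selector on the Debugger side may differ:

    interruptName: labelString
        "Sketch: suspend the process we believe was interrupted
        and open a debugger on it"
        | preempted |
        preempted := Processor preemptedProcess.
        preempted suspend.
        ^Debugger openInterrupt: labelString onProcess: preempted

The crucial (and fragile) step is the Processor preemptedProcess call:
it is only a guess about which process was running when the semaphore
fired.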
The change in 6293LowSpaceWatcherFix-dtl made it so that it is *always*
the main morphic process which gets interrupted (instead of Processor
preemptedProcess). And while this might be a reasonable first-order
guess, it completely misses the point that the main morphic process
isn't always the runaway process.
[Sidenote: I was just really surprised that "[[true] whileTrue] fork"
didn't totally lock up 3.8 as I would have expected with the change -
but it turns out that the event tickler has the interesting side effect
that it allows the main morphic process to run once every 500ms,
therefore giving one a barely usable system. But of course hitting
alt-period will never interrupt that runaway process either, so changing
this to "[[true] whileTrue] forkAt: Processor activeProcess priority+1"
still kills 3.8 completely.]
But to get us back on track: if we assume that the change actually had
an effect (I cannot comment on this since I have no Linux system at
hand to try, but I will assume it did), then it must be the case that
some process other than the runaway process was preempted when the low
space signal occurred. Could this possibly be the case?
Yes. Easily so. The "major safeguard" that Tim referred to is actually a
myth to a large extent. Here are two *trivial* examples which will both
lock up your system and where no safeguard that I am aware of could
possibly do much:
    [[true] whileTrue]
        forkAt: Processor highestPriority.
    [ | oc |
        oc := OrderedCollection new.
        [true] whileTrue: [oc add: (Array new: 1000000)]
    ] forkAt: Processor highestPriority.
The first example merely locks up your image while the second will crash
it with a low-space condition. The point is that both of these run at
maximum priority so they will *never* be interrupted (and therefore you
are toast).
Note that these examples merely serve to destroy the myth, nothing more.
But similar to these examples, there is a certain probability that some
other process gets hit when we look for the preempted process.
Statistically speaking, it should be a rare occasion that we hit the
"wrong" process but it is possible - before the change I had occasional
(and annoying!) hits in the finalization process which just happened to
be the "last active process" when I hit the interrupt key. But this is
typically no problem since if we hit the wrong process we can just hit
alt-period again until we have it (and chances grow exponentially that
you hit the "right" guy[**]).
[**] The not-so-seasoned Squeaker should not assume that this always
solves the problem. For example a runaway loop in a drawing method might
be invoked immediately again when we try to display the debug notifier.
This would look like you hadn't hit the right process but actually you
did and it's the morphic drawing process itself which keeps recreating
the problem. (and when this happens you are toast too)
But the low space semaphore is different. If you miss once, you are
toast. And I think this might just be what you are seeing. What is
interesting in this regard is that when you have a low-space condition,
one of the things that happens *before* we signal the condition is a
full garbage collection, which takes a significant amount of time -
often more than 500ms. That just happens to be the threshold after which the
event tickler would get activated and might explain why we would see
this problem only since the event tickler arrived and why you wouldn't
see it in MVC. An alternative would be that the finalization process is
running after the full GC (good chance that some weak reference just got
lost) and this guy might be hit too (though I think it wouldn't explain
the MVC/Morphic difference).
One way to validate that theory would be to run an MVC process like:
    dummyTickler := [
        [true] whileTrue: [(Delay forMilliseconds: 500) wait]
    ] forkAt: Processor lowIOPriority.
and see if this now makes Squeak fail in the same way that Morphic does
upon a low space condition. If it does, I am almost certain that's the
problem (and I am almost certain it is - it fits all the observed data).
If this is really the problem, then the only true solution I can see is
to modify the VM so that we have enough information available to figure
out what the preempted process was at the time that particular signal
(user interrupt or low space) happened[***]. Note that the information
might be indirect - if, for example, we could measure how long a process
was running *without* giving up control to a lower priority process
(e.g., actively yielding or waiting on a semaphore) this should already
be enough to identify the "right" guy to interrupt.
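To make the idea concrete, here is a purely hypothetical image-side
sketch. It assumes a lastYieldTime stamp (which does not exist today
and would need VM or scheduler support to maintain) recording when a
process last gave up control voluntarily; the process with the oldest
stamp is then the best candidate for the runaway:

    runawayProcessCandidate
        "Hypothetical: pick the runnable process that has gone longest
        without voluntarily giving up control"
        | candidates |
        candidates := Process allSubInstances
            reject: [:p | p isSuspended].
        ^candidates inject: nil into: [:best :p |
            (best isNil or: [p lastYieldTime < best lastYieldTime])
                ifTrue: [p] ifFalse: [best]]

Whether this bookkeeping is better done in the VM itself or in the
scheduler's primitive wrappers is an open question.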
[***] I do not consider the current state acceptable so I discount that
as an alternative solution.
Cheers,
- Andreas