An analysis of interrupt behavior (was: Re: 6293LowSpaceWatcherFix-dtl considered harmful)

Andreas Raab andreas.raab at gmx.de
Fri Apr 1 05:30:09 UTC 2005


Hi David -

Well, let's see if I can reverse engineer the problem from the supposed 
solution ;-)

But first, we need an understanding of what the code in 
Project>>interruptName: was trying to achieve originally: When we 
encounter a user interrupt or a low-space condition, a semaphore in the 
VM gets signaled and a (higher priority) process reacts to the semaphore[*].

[*] I will ignore the subtle difference between the VM signaling the 
user interrupt semaphore and it being signaled by the event tickler 
since I don't think this matters (but we'll see - I'm writing this as I 
go ;-)

This higher priority process (indirectly) invokes 
Project>>interruptName: which then needs to figure out which process to 
actually suspend. This is not at all trivial since the VM does not 
retain sufficient information about the process that was active when the 
semaphore got signaled - and therefore Project>>interruptName: tries to 
infer which process got interrupted (by using 
Processor>>preemptedProcess) and then put this process to sleep for 
further debugging.
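
As a rough illustration (a sketch only, not the actual code of 
Project>>interruptName:), one can ask the scheduler for its guess from a 
higher priority process. The choice of priority and the Transcript 
output are assumptions made for this example:

	"Sketch: report which process the scheduler believes was preempted.
	 The answer may be nil if no other process is runnable."
	[ | suspect |
	  suspect := Processor preemptedProcess.
	  Transcript show: 'preempted: ', suspect printString; cr
	] forkAt: Processor userInterruptPriority.

Whatever this prints is only the scheduler's best guess at that moment, 
which is exactly the weakness discussed below.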

The change in 6293LowSpaceWatcherFix-dtl made it so that it was *always* 
the main morphic process which gets interrupted (instead of Processor 
preemptedProcess). And while this might be a reasonable first order 
guess, it is completely missing the point that the main morphic process 
isn't always the runaway process.

[Sidenote: I was just really surprised that "[[true] whileTrue] fork" 
didn't totally lock up 3.8 as I would have expected with the change - 
but it turns out that the event tickler has the interesting side effect 
that it allows the main morphic process to run once every 500ms, 
therefore giving one a barely usable system. But of course, hitting 
alt-period will never interrupt that runaway process either so changing 
this to "[[true] whileTrue] forkAt: Processor activeProcess priority+1" 
still kills 3.8 completely]

But to get us back on track, if we assume that the change actually had 
an effect (I cannot comment on this since I have no Linux system to try 
at hand but I will assume it did) then it must be the case that some 
process other than the runaway process was preempted when the low space 
signal occurred. Could this possibly be the case?

Yes. Easily so. The "major safeguard" that Tim referred to is actually a 
myth to a large extent. Here are two *trivial* examples which will both 
lock up your system and where no safeguard that I am aware of could 
possibly do much:

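	"Example 1: a busy loop at the highest priority."
	[[true] whileTrue]
		forkAt: Processor highestPriority.

	"Example 2: allocates without bound at the highest priority."
	[ | oc |
	  oc := OrderedCollection new.
	  [true] whileTrue: [oc add: (Array new: 1000000)]
	] forkAt: Processor highestPriority.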

The first example merely locks up your image while the second will crash 
it with a low-space condition. The point is that both of these run at 
maximum priority so they will *never* be interrupted (and therefore you 
are toast).

Note that these examples merely serve to destroy the myth, nothing more. 
But similar to these examples, there is a certain probability that some 
other process gets hit when we look for the preempted process. 
Statistically speaking, it should be a rare occasion that we hit the 
"wrong" process but it is possible - before the change I had occasional 
(and annoying!) hits in the finalization process which just happened to 
be the "last active process" when I hit the interrupt key. But this is 
typically no problem since if we hit the wrong process we can just hit 
alt-period again until we have it (and chances grow exponentially that 
you hit the "right" guy[**]).

[**] The not-so-seasoned Squeaker should not assume that this always 
solves the problem. For example a runaway loop in a drawing method might 
be invoked immediately again when we try to display the debug notifier. 
This would look like you hadn't hit the right process but actually you 
did and it's the morphic drawing process itself which keeps recreating 
the problem. (and when this happens you are toast too)

But the low space semaphore is different. If you miss once, you are 
toast. And I think this might just be what you are seeing. What is 
interesting in this regard is that when you have a low-space condition, 
one of the things that happens *before* we signal the condition is a 
full garbage collection, which takes a significant amount of time, often 
more than 500ms. That just happens to be the threshold after which the 
event tickler would get activated, which might explain why we would see 
this problem only since the event tickler arrived and why you wouldn't 
see it in MVC. An alternative would be that the finalization process is 
running after the full GC (good chance that some weak reference just got 
lost) and this guy might be hit too (though I think it wouldn't explain 
the MVC/Morphic difference).

One way to validate that theory would be to run an MVC process like:

	dummyTickler := [
		[true] whileTrue:[(Delay forMilliseconds: 500) wait]
	] forkAt: Processor lowIOPriority.

and see if this now makes Squeak fail in the same way that Morphic does 
upon a low space condition. If it does, I am almost certain that's the 
problem (and I am almost certain it is - it fits all the observed data).
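
A minimal sketch of the rest of the experiment, assuming it is evaluated 
in an MVC workspace with dummyTickler already running. The allocation 
loop is the one from the earlier example, this time run at normal user 
priority. It deliberately exhausts memory, so it should only be tried in 
a throwaway image:

	"Sketch: provoke a low-space condition from the active process."
	| oc |
	oc := OrderedCollection new.
	[true] whileTrue: [oc add: (Array new: 1000000)].

If the low-space notifier does come up on the right process, evaluating 
"dummyTickler terminate" afterwards removes the extra process again.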

If this is really the problem, then the only true solution I can see is 
to modify the VM so that we have enough information available to figure 
out what the preempted process was at the time that particular signal 
(user interrupt or low space) happened[***]. Note that the information 
might be indirect - if, for example, we could measure how long a process 
was running *without* giving up control to a lower priority process 
(e.g., actively yielding or waiting on a semaphore) this should already 
be enough to identify the "right" guy to interrupt.

[***] I do not consider the current state acceptable so I discount that 
as an alternative solution.

Cheers,
   - Andreas


