Hi Folks -
Trying to track a problem in Tweak I ran into the above mentioned update. I won't comment on the process of how it got into the image (I think you all know my opinion about eyeballing system-critical changes and I am almost certain that the approval went something like "oh, what possible harm could a one-line change do") but I am rather interested in what problem this is trying to fix.
The comment of the change set states: "The low space watcher is interrupted in the context of the wrong process when the eventTickler process (or other high priority process) is running. This prevents low space detection from functioning properly."
But this makes no sense whatsoever. The low space watcher can only be interrupted if it is running (non-running processes are never interrupted), but all the running low space watcher does is to throw up an interrupt (which we all agree is the right thing to do) on the currently active process. If the user presses Alt-period when this happens, why would it be wrong to interrupt the low space watcher?
It is precisely the right thing to do (in fact, it is the only thing that can be done if we are to honor the user interrupt). If we choose to *ignore* the actual running process (such as is done in the above update) and just arbitrarily interrupt the "main" project process we will never be able to interrupt any run-away process which is different from the main process. As an example, take:
[[true] whileTrue: []] forkAt: Processor userSchedulingPriority + 1.
With the original version of the method you *were* able to interrupt that process (try it: revert to its previous version, run the above, and hit alt-period). With the version of the method in the update you will not be able to interrupt a runaway process like the above.
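For anyone who wants to try this without risking the image, here is a minimal sketch of a time-limited variant. It is an assumption of mine rather than part of the update under discussion: the process busy-loops above the UI priority for about five seconds and then ends on its own, so a failed Alt-period costs only a short freeze. The variable name `deadline` is made up for illustration, and the sketch ignores millisecond clock wrap-around.

```smalltalk
"Hypothetical test helper: a runaway loop that gives up after roughly 5 seconds."
| deadline |
deadline := Time millisecondClockValue + 5000.
"Spin one priority level above the UI process until the deadline passes."
[[Time millisecondClockValue < deadline] whileTrue: []]
	forkAt: Processor userSchedulingPriority + 1.
```

Evaluate it in a workspace and press Alt-period while it spins. Which process ends up in the debugger should depend on which version of the method is installed.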
Whatever this update is trying to address, it cannot possibly be what is being claimed in the preamble. So what problem is this update trying to solve?
Cheers, - Andreas
Hi Andreas,
Indeed, it looks that way. I checked all three versions of the method (see below).
If I'm correct, this was a fix against the image getting completely trashed (at least on Mac). At least, people suggested that I load that change set into my image when I could not even halt the system anymore. This is a REAL problem for the environment that will support my new book: kids playing with conditional loops will write endless loops, and if ctrl-. does not work, I'm in real trouble, because they will throw the whole thing out. So I would love to have a system where you can ALWAYS get a debugger or stop a crazy computation.
I have posted several anxious messages about this KEY behavior in the past. If I'm correct, David is the one doing the ProcessBrowser, so I'm sure the harvester trusted him. I do not know who pushed this into the stream, but I would not blame him; see my previous sentence. And yes, the review process could be improved, but as usual nobody really does anything beyond his own most urgent task, so nothing good can come of a process like that.
Stef
Andreas, in my image I have three versions; the latest is a fix by Ned.
interruptName: labelString
    "Create a Notifier on the active scheduling process with the given label."
    | preemptedProcess projectProcess suspendingList |
    Smalltalk isMorphic ifFalse: [^ ScheduledControllers interruptName: labelString].
    ActiveHand ifNotNil:[ActiveHand interrupted].
    ActiveWorld := World. "reinstall active globals"
    ActiveHand := World primaryHand.
    ActiveWorld _ World. "reinstall active globals"
    ActiveHand _ World primaryHand.
    ActiveHand interrupted. "make sure this one's interrupted too"
    ActiveEvent := nil.
    BotProcess interrupted.
    ActiveEvent _ nil.

    projectProcess := self uiProcess. "we still need the accessor for a while"
    preemptedProcess := Processor preemptedProcess.
    projectProcess _ self uiProcess. "we still need the accessor for a while"
    preemptedProcess _ Processor preemptedProcess.
    "Only debug preempted process if its priority is >= projectProcess' priority"
    preemptedProcess priority < projectProcess priority ifTrue:[
        (suspendingList := projectProcess suspendingList) == nil
        (suspendingList _ projectProcess suspendingList) == nil
            ifTrue: [projectProcess == Processor activeProcess ifTrue: [projectProcess suspend]]
            ifFalse: [suspendingList remove: projectProcess ifAbsent: []. projectProcess offList].
        preemptedProcess := projectProcess.
        preemptedProcess _ projectProcess.
    ] ifFalse:[
        preemptedProcess suspend offList.
        preemptedProcess _ projectProcess suspend offList.
    ].
    Debugger openInterrupt: labelString onProcess: preemptedProcess
I tried reverting to Dan's code: I could stop the process and open a debugger, and it worked. I also tried Ned's version and could stop it too. With dtl's version I could stop it, but I could not get a debugger and the system more or less froze.
So I hope this helps.
> Whatever this update is trying to address, it cannot possibly be what is being claimed in the preamble. So what problem is this update trying to solve?
We should really solve this, because it is important. I ran into strange problems when I had a bug in the bounds methods of certain morphs: I could not get a debugger, and the system got trashed every time. I reported that too.
Stef
Hi Andreas!
I think this [1] is the root of the thread. I tested the fix in [2] by checking whether the symptoms would vanish, and came to the conclusion that the fix had no positive effect on Windows. I did not look deeper into the problem, but was hoping my analysis would sound vague enough to be instantly corrected by any expert in the domain! :)
Beers, Alex
[1] http://lists.squeakfoundation.org/pipermail/squeak-dev/2004-May/078325.html
[2] http://lists.squeakfoundation.org/pipermail/squeak-dev/2004-July/079486.html
Hi Andreas,
I am the source of this patch.
If you have a copy of BFAV available, the reviews of this issue are under the "update" tab with the subject line: "[bug][FIX] stack overflow crashes Squeak. (one-line fix attached)"
In case you do not have a working BFAV handy, I am attaching a copy of the change set that I used originally to debug the problem (runs on Unix, sorry, but you'll get the idea). I'm also copying some of the relevant discussion at the end of this message.
Note that the symptoms of this problem appear differently on different VMs and different memory allocation settings. It also did not occur until after the introduction of the event tickler process into the image, so this appeared to be a pre-existing bug that had been masked until the event tickler process was added.
I'm short of time right now, but I'll take another look at this as soon as I get some free time. Having said that, I'm no expert on the process scheduler or memory manager, so a better informed explanation and/or fix would be welcome.
On Thu, Mar 31, 2005 at 12:48:38AM -0800, Andreas Raab wrote:
> Hi Folks -
> Trying to track a problem in Tweak I ran into the above mentioned update. I won't comment on the process of how it got into the image (I think you all know my opinion about eyeballing system-critical changes and I am almost certain that the approval went something like "oh, what possible harm could a one-line change do") but I am rather interested in what problem this is trying to fix.
The low space interrupt failed to get through to the image. When running with fixed memory allocation, the result was a VM crash.
> The comment of the change set states: "The low space watcher is interrupted in the context of the wrong process when the eventTickler process (or other high priority process) is running. This prevents low space detection from functioning properly."
> But this makes no sense whatsoever.
Poor wording and/or misunderstanding on my part. If I remember correctly, the low space interrupt was "appearing" in the wrong process context after the semaphore was signaled.
> Whatever this update is trying to address, it cannot possibly be what is being claimed in the preamble. So what problem is this update trying to solve?
Uninterruptable recursion followed by a hard VM crash on some VMs and memory settings. Not a good thing for naive users who would be most likely to make mistakes of this kind.
HTH, Dave
------------------------------------- Snippings from BFAV:
Background on how and when the bug first became evident:
Hi Doug,
I have not reconstructed from old images, but my best guess is that the bug entered the image with the Project class>>interruptName: method, which is time stamped 9/5/2001. The bug was present at that point, but was not manifested until someone else added a high priority background process into the system, which just happened to be the otherwise blameless EventTickler process (EventSensor>>eventTickler), which runs all the time at lowIOPriority. This was introduced in update 5000 in April 2004, so I would expect that people started noticing the problem after that time.
The current thread in BFAV dates back to September 2003, which suggests that the symptoms of this problem were being seen before the EventTickler was introduced (or maybe I just did not follow the trail all the way back). So I'm not entirely sure how long people have been seeing symptoms of the problem. I'm reasonably sure (call it about 90% confidence level, gut feel) that the bug/fix that I posted addresses the underlying issue, although I would not be surprised if it turns out that there are other lurking buglets that might lead to similar symptoms.
Important: This one is timing-dependent, and you may see different symptoms depending on the VM and any memory settings you may have used on the command line. On my Linux system, if I force a limit on the amount of memory used (with "squeak -memory 10m"), I end up with a real VM crash, stack dump and all. If I don't limit the memory (which would be the normal mode of use), the image just becomes unresponsive when it gets into an infinite recursion, and cannot be interrupted. Presumably the VM is busy trying to allocate more memory from the OS, but does not actually crash while it's chewing away on this problem.
I thought that I had read an earlier report that (John's) Mac OS X does not exhibit the problem, apparently due to its use of a threaded VM that is more responsive to UI events. However, your description of the behavior on your OS X system sounds quite similar to what I see on Linux.
RiscOS seems to behave similarly to Linux, and I don't know what Windows does.
How to reproduce the bug:
Attached is a change set that I used to debug the stack overflow problem and confirm the fix. This only runs on a Unix VM with OSProcess loaded, but the overflow problem is a bit tricky to debug so I'm posting this in case someone wants to reproduce what I did.
Basically this just writes debug trace messages to standard output so I can keep track of what process is running what method in what order. Just some good ol' fashioned Fortran debugging, but what the heck, it worked.
From the preamble:
This is what I used to debug the stack overflow problem. Load OSProcess first, then load this change set.
Intended for use on Unix/Linux. Run the Squeak vm with a fixed memory allocation (squeak -memory 30m) in order to force the out-of-memory condition.
Open a ProcessBrowser, then evaluate 'Smalltalk createStackOverflow'. You should see messages on stdout that confirm that the runaway recursion keeps going even after the low space semaphore has been signaled.
Now apply the LowSpaceWatcherFix change set, and evaluate 'Smalltalk createStackOverflow'. The low space watcher should catch the runaway method right away.
Newer Linux VMs grow memory dynamically, and do not start with any explicit memory limit.
Unix VM memory settings that affect whether or not the bug results in a hard crash:
There are two command-line options (with equivalent environment variables) to control how memory is allocated on Unix:
If no options are given then memory is allocated dynamically with the limit set at 75% of the available virtual memory.
If -memory N{mk} is given then memory is allocated statically; the argument to the option defines a hard upper limit.
If -mmap N{mk} is given then memory is allocated dynamically, with an explicit upper limit to the amount of memory that will be allocated (but the "75% of available virtual memory" limit still applies).
Ian
One part of this issue was that (at least for me, on a no-grow-OM VM) the recursion would use up all memory in less time than it took to move my hands from the cmd-d to the interrupt. The apparent non-responsiveness was _not_ because the interrupt was being ignored but because the system had already crashed and was spending a long time trying to write out the low space debug log.
I can't find much trace of emails on the subject, but I think the low space signal was being whacked and, for whatever reason, the scheduler wasn't getting around to dealing with it in a way that actually did anything helpful. It seems that things went wrong somewhat differently on different OSs as well. Dave's change stopped the problem from killing the system.
My brief testing was pretty much the only review it got and if nobody else with more time and other hardware could find time to look at it, well that is the cost of volunteer development. Quite possibly it shouldn't have been harvested without more input but nobody can come along and twist your arm (for example) to look at things.
tim -- Tim Rowledge, tim@sumeru.stanford.edu, http://sumeru.stanford.edu/tim The computing field is always in need of new cliches. - Alan Perlis
Hi David -
Thanks for the history. Is there any post with a thorough analysis of the sequence of events that made things go wrong, and consequently, why changing the code the way you did would address the problem? I'm asking because I am still not clear about why a change in this particular place would make any difference to the low space watcher behavior whatsoever (nor did the debugging facilities you sent along seem to be related to Project>>interruptName:).
Cheers, - Andreas
On Thu, Mar 31, 2005 at 06:27:02PM -0800, Andreas Raab wrote:
> Hi David -
> Thanks for the history. Is there any post with a thorough analysis of the sequence of events that made things go wrong, and consequently, why changing the code the way you did would address the problem? I'm asking, because I am still not clear about why a change in this particular place would make any difference on the low space watcher behavior whatsoever (neither did the debugging facilities you sent along seem to be related to Project>>interruptName:).
Hi Andreas,
I'm afraid that my memory is not too clear on the details at this point, other than what I pulled out of the BFAV postings. There were a few other messages, but the heart of it was what I sent to you.
If you can wait until this weekend, I will spend some time figuring out where this "fix" went wrong, and hopefully send you a more helpful response (I have been looking at it this evening, but I am out of time and energy until the weekend).
<speculation> I think there is some difference in the way the low space semaphore is being handled, versus the way the user interrupt semaphore is being handled. Obviously I mis-diagnosed it, and I also did not consider the impact on the user interrupt handler. Perhaps the real problem is related to "Debugger openInterrupt: labelString onProcess: preemptedProcess" not being able to open the debugger in time to stop the runaway recursion, but that's just a guess at this point (and obviously my guesswork has been less than satisfactory). </speculation>
I believe it is true that the VM crash will happen only if:
- VM is running with fixed memory allocation (hence you will not see it on the Windows VM).
- Current project is Morphic (MVC is OK).
- Background processes (e.g. event tickler) are running at higher priority than the UI process.
With respect to the debugging facilities that I sent, the output on stdout (console) for Squeak running a Unix VM with fixed memory allocation looks like this (note, I added one additional debugging line here to indicate when the debugger is about to be opened, hence my speculation above):
lewis@dtlewis:~/squeak/squeak3.7> squeak -memory 10m Dev
3048:2445:Project class>>interruptName::about to open debugger with label Space is low
2561:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 0
2561:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 1
2561:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 2
2561:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 3
2561:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 4
2561:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 5
2561:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 6
2561:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 7
2561:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 8
2561:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 9
2561:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 10
2561:3864:SystemDictionary>>createStackOverflow::terminate recursion at depth: 10
More later,
Dave
Hi David -
Well, let's see if I can reverse engineer the problem from the supposed solution ;-)
But first, we need an understanding of what the code in Project>>interruptName: was trying to achieve originally: When we encounter a user interrupt or a low-space condition, a semaphore in the VM gets signaled and a (higher priority) process reacts to the semaphore[*].
[*] I will ignore the subtle difference between the VM signaling the user interrupt semaphore and it being signaled by the event tickler since I don't think this matters (but we'll see - I'm writing this as I go ;-)
This higher priority process (indirectly) invokes Project>>interruptName: which then needs to figure out which process to actually suspend. This is not at all trivial since the VM does not retain sufficient information about the process that was active when the semaphore got signaled - and therefore Project>>interruptName: tries to infer which process got interrupted (by using Processor>>preemptedProcess) and then put this process to sleep for further debugging.
The change in 6293LowSpaceWatcherFix-dtl made it so that it was *always* the main morphic process which gets interrupted (instead of Processor preemptedProcess). And while this might be a reasonable first order guess, it is completely missing the point that the main morphic process isn't always the runaway process.
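To make the difference concrete, here is a rough workspace sketch of the two selection policies. It is only a model, based on the method comment quoted earlier in this thread ("Only debug preempted process if its priority is >= projectProcess' priority") and on the description of the change above. The block names `pickOriginal` and `pickPatched` are hypothetical, and a "process" is stood in for by a name -> priority Association with made-up priorities, not a real Process.

```smalltalk
"Hypothetical model: each 'process' is a name -> priority Association."
| ui runaway pickOriginal pickPatched |
ui := #uiProcess -> 40.
runaway := #runaway -> 41.

"Original policy: debug the preempted process,
 unless its priority is below that of the UI process."
pickOriginal := [:preempted :project |
	preempted value < project value
		ifTrue: [project]
		ifFalse: [preempted]].

"Patched policy, as described above: always pick the UI process."
pickPatched := [:preempted :project | project].

Transcript
	show: (pickOriginal value: runaway value: ui) key printString; cr;
	show: (pickPatched value: runaway value: ui) key printString; cr.
```

In this model the original policy selects the runaway process whenever it runs at or above the UI priority, while the patched policy never can.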
[Sidenote: I was just really surprised that "[[true] whileTrue] fork" didn't totally lock up 3.8 as I would have expected with the change - but it turns out that the event tickler has the interesting side effect that it allows the main morphic process to run once every 500ms, therefore giving one a barely usable system. But of course, hitting alt-period will never interrupt that runaway process either so changing this to "[[true] whileTrue] forkAt: Processor activeProcess priority+1" still kills 3.8 completely]
But to get us back on track, if we assume that the change actually had an effect (I cannot comment on this since I have no Linux system to try at hand but I will assume it did) then it must be the case that some process other than the runaway process was preempted when the low space signal occurred. Could this possibly be the case?
Yes. Easily so. The "major safeguard" that Tim referred to is actually a myth to a large extent. Here are two *trivial* examples which will both lock up your system and where no safeguard that I am aware of could possibly do much:
[[true] whileTrue] forkAt: Processor highestPriority.
[ | oc | oc := OrderedCollection new. [true] whileTrue:[oc add: (Array new: 1000000)] ] forkAt: Processor highestPriority.
The first example merely locks up your image while the second will crash it with a low-space condition. The point is that both of these run at maximum priority so they will *never* be interrupted (and therefore you are toast).
Note that these examples merely serve to destroy the myth, nothing more. But similar to these examples, there is a certain probability that some other process gets hit when we look for the preempted process. Statistically speaking, it should be a rare occasion that we hit the "wrong" process but it is possible - before the change I had occasional (and annoying!) hits in the finalization process which just happened to be the "last active process" when I hit the interrupt key. But this is typically no problem since if we hit the wrong process we can just hit alt-period again until we have it (and chances grow exponentially that you hit the "right" guy[**]).
[**] The not-so-seasoned Squeaker should not assume that this always solves the problem. For example a runaway loop in a drawing method might be invoked immediately again when we try to display the debug notifier. This would look like you hadn't hit the right process but actually you did and it's the morphic drawing process itself which keeps recreating the problem. (and when this happens you are toast too)
But the low space semaphore is different. If you miss once, you are toast. And I think this might just be what you are seeing. What is interesting in this regard is that when you have a low-space condition, one of the things that happens *before* we signal the condition is a full garbage collection, which takes a significant amount of time, often more than 500ms. That just happens to be the threshold after which the event tickler would get activated, which might explain why we see this problem only since the event tickler arrived and why you wouldn't see it in MVC. An alternative would be that the finalization process is running after the full GC (good chance that some weak reference just got lost) and this guy might be hit too (though I think it wouldn't explain the MVC/Morphic difference).
One way to validate that theory would be to run an MVC process like:
dummyTickler := [ [true] whileTrue:[(Delay forMilliseconds: 500) wait] ] forkAt: Processor lowIOPriority.
and see if this now makes Squeak fail in the same way that Morphic does upon a low space condition. If it does, I am almost certain that's the problem (it fits all the observed data).
If this is really the problem, then the only true solution I can see is to modify the VM so that we have enough information available to figure out what the preempted process was at the time that particular signal (user interrupt or low space) happened[***]. Note that the information might be indirect - if, for example, we could measure how long a process was running *without* giving up control to a lower priority process (e.g., actively yielding or waiting on a semaphore) this should already be enough to identify the "right" guy to interrupt.
[***] I do not consider the current state acceptable so I discount that as an alternative solution.
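To make the "indirect information" idea above concrete, here is a toy sketch in Python rather than Squeak. It assumes the VM could report, per process, how many milliseconds that process has run since it last yielded or waited. No such field is described in this thread, and the names `ProcInfo`, `ms_since_yield` and `pick_interrupt_target`, as well as all the numbers, are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProcInfo:
    """Hypothetical per-process record; not an actual Squeak or VM structure."""
    name: str
    priority: int
    ms_since_yield: int  # assumed: time run without yielding or waiting


def pick_interrupt_target(procs: List[ProcInfo],
                          threshold_ms: int = 500) -> Optional[ProcInfo]:
    """Return the process that has held the CPU longest, if past the threshold."""
    hogs = [p for p in procs if p.ms_since_yield >= threshold_ms]
    if not hogs:
        return None
    return max(hogs, key=lambda p: p.ms_since_yield)


# Illustrative snapshot taken at the moment the signal arrives.
procs = [
    ProcInfo("event tickler", 60, 0),     # just woke from its Delay
    ProcInfo("finalization", 50, 2),      # ran briefly after the GC
    ProcInfo("runaway loop", 41, 7300),   # never gave up control
]

target = pick_interrupt_target(procs)
print(target.name if target else "nothing to interrupt")
```

In this made-up snapshot the runaway loop is selected even though two higher-priority processes are runnable, which is the property the heuristic is after.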
Cheers, - Andreas
Andreas Raab andreas.raab@gmx.de wrote:
> Yes. Easily so. The "major safeguard" that Tim referred to is actually a
Not me so far as I can tell...
> If this is really the problem, then the only true solution I can see is to modify the VM so that we have enough information available to figure out what the preempted process was at the time that particular signal (user interrupt or low space) happened[***]. Note that the information might be indirect - if, for example, we could measure how long a process was running *without* giving up control to a lower priority process (e.g., actively yielding or waiting on a semaphore) this should already be enough to identify the "right" guy to interrupt.
Well we know the active context of course and for a little more work we know the active process, so the info is there. The amusing part is that the semaphore used for low space has the lowspace process on its list and we remove it as part of signalling. Maybe if we stuck the active process on the end as some 'helpful context' it would make life simpler. Or a subclass of Semaphore with a more explicit ivar(s) if we want to reduce confusion potential.
tim -- Tim Rowledge, tim@sumeru.stanford.edu, http://sumeru.stanford.edu/tim The colder the X-ray table, the more of your body is required on it.
Tim Rowledge wrote:
> Andreas Raab andreas.raab@gmx.de wrote:
>> Yes. Easily so. The "major safeguard" that Tim referred to is actually a
> Not me so far as I can tell...
Then this must be a clear case of identity theft ;-))
http://lists.squeakfoundation.org/pipermail/squeak-dev/2004-May/078325.html
> The amusing part is that the semaphore used for low space has the lowspace process on its list and we remove it as part of signalling. Maybe if we stuck the active process on the end as some 'helpful context' it would make life simpler. Or a subclass of Semaphore with a more explicit ivar(s) if we want to reduce confusion potential.
I think it's a little more complex than that. It would be nice if we could define a few reliable test cases to spec out what it actually is that we think ought to work. I'll throw in the following:
* "[true] whileTrue" - we must be able to interrupt this
* "[[true] whileTrue] forkAt: Processor userSchedulingPriority + 1" - again we must be able to interrupt that
* "Smalltalk stackOverflow" - we must be able to recover from that
* "[Smalltalk stackOverflow] forkAt: Processor userSchedulingPriority + 1" - again we must be able to recover from that
If there are more, please add.
Cheers, - Andreas
Andreas Raab andreas.raab@gmx.de wrote:
> Tim Rowledge wrote:
>> Andreas Raab andreas.raab@gmx.de wrote:
>>> Yes. Easily so. The "major safeguard" that Tim referred to is actually a
>> Not me so far as I can tell...
> Then this must be a clear case of identity theft ;-))
> http://lists.squeakfoundation.org/pipermail/squeak-dev/2004-May/078325.html
Golly. That _was_ a long time ago. But yes, I said it and I think in context I stand by it.
>> The amusing part is that the semaphore used for low space has the lowspace process on its list and we remove it as part of signalling. Maybe if we stuck the active process on the end as some 'helpful context' it would make life simpler. Or a subclass of Semaphore with a more explicit ivar(s) if we want to reduce confusion potential.
> I think it's a little more complex than that.
Of course, no argument there. I was merely musing on a plausible trivial VM mechanism for passing up some helpful stuff to avoid trying to infer the appropriate process to beat up. When we are signalling a very specific semaphore like lowspace we can be quite certain that the active process when the allocation code gets upset is the one we want to deal with, right now, no excuses.
The user interrupt is a little different in that it is easy to say "stop the active process" but that isn't always what one really wants to do. As you mentioned earlier, sometimes one will get the weak finalization process, sometimes the event tickler, or the main event loop etc. Generally in my experience I use the interrupt to try to stop the main user process to try to see WTF it is up to and why it hasn't done anything visible in a while. It would be nice to have some way to pop up a list of current processes and choose one to play with. I realise the process browser does some of that. I've noticed in Tweak that it is quite hard to get to interrupt anything except the event loop which rarely seems to lead me anywhere interesting when trying to debug TK4 stuff.
tim -- Tim Rowledge, tim@sumeru.stanford.edu, http://sumeru.stanford.edu/tim Useful Latin Phrases:- Mihi ignosce. Cum homine de cane debeo congredi = Excuse me. I've got to see a man about a dog.
On Fri, Apr 01, 2005 at 12:34:55AM -0800, Andreas Raab wrote:
> Tim Rowledge wrote:
>> The amusing part is that the semaphore used for low space has the lowspace process on its list and we remove it as part of signalling. Maybe if we stuck the active process on the end as some 'helpful context' it would make life simpler. Or a subclass of Semaphore with a more explicit ivar(s) if we want to reduce confusion potential.
> I think it's a little more complex than that. It would be nice if we could define a few reliable test cases to spec out what it actually is that we think ought to work. I'll throw in the following:
> - "[true] whileTrue" - we must be able to interrupt this
> - "[[true] whileTrue] forkAt: Processor userSchedulingPriority + 1" - again we must be able to interrupt that
> - "Smalltalk stackOverflow" - we must be able to recover from that
> - "[Smalltalk stackOverflow] forkAt: Processor userSchedulingPriority + 1" - again we must be able to recover from that
I posted "[VM][ENH] LowSpaceAndInterruptHandler-dtl" that demonstrates one way to handle this. All four of the above test cases pass in Morphic with these changes applied (using a Unix VM both with and without fixed memory allocation).
Unfortunately this does require a VM change, so it would be better if someone can think of a way to handle it without touching the VM. I could not see any good way to do that.
Dave
On Fri, Apr 01, 2005 at 12:34:55AM -0800, Andreas Raab wrote:
> I think it's a little more complex than that. It would be nice if we could define a few reliable test cases to spec out what it actually is that we think ought to work. I'll throw in the following:
> - "[true] whileTrue" - we must be able to interrupt this
> - "[[true] whileTrue] forkAt: Processor userSchedulingPriority + 1" - again we must be able to interrupt that
> - "Smalltalk stackOverflow" - we must be able to recover from that
> - "[Smalltalk stackOverflow] forkAt: Processor userSchedulingPriority + 1" - again we must be able to recover from that
> If there are more, please add.
Some variations to add to the above list:
1) Each case must be interruptable in Morphic and in MVC.
2) Each case must be interruptable either with or without the event tickler running (a high priority process that may become schedulable while a low space condition is being detected in the VM).
3) Each case must be interruptable with a fixed size object memory, or with a dynamically expanding object memory (varies by platform, "-memory 10m" flag for a Unix VM).
4) Each case must be interruptable if the user hits the interrupt key multiple times in succession.
Variation #4 fails in Morphic for the "[true] whileTrue" tests going back to at least Squeak 3.6.
The "[[true] whileTrue] forkAt: Processor userSchedulingPriority + 1" fails in MVC going back to at least Squeak 3.6.
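For bookkeeping, the four test cases and the four variations multiply out into a fairly large matrix. The following Python sketch (not Squeak code) just enumerates it; the axis labels such as `event_tickler` and `interrupt_key` are shorthand invented here, and nothing in it actually runs a test.

```python
from itertools import product

# The four expressions from the list of test cases.
cases = [
    "[true] whileTrue",
    "[[true] whileTrue] forkAt: Processor userSchedulingPriority + 1",
    "Smalltalk stackOverflow",
    "[Smalltalk stackOverflow] forkAt: Processor userSchedulingPriority + 1",
]

# One axis per variation; the labels are shorthand for this sketch only.
variations = {
    "ui": ["Morphic", "MVC"],
    "event_tickler": ["running", "stopped"],
    "object_memory": ["fixed", "growable"],
    "interrupt_key": ["once", "repeated"],
}

# Every case combined with every setting of every variation.
matrix = [
    dict(zip(variations, combo), case=case)
    for case in cases
    for combo in product(*variations.values())
]

print(len(matrix), "combinations")
for row in matrix[:3]:
    print(row)
```

That comes to 4 cases times 2 x 2 x 2 x 2 settings, i.e. 64 combinations, so some way of running them semi-automatically would probably be worth having.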
Dave
Thanks Andreas,
I will run some tests this weekend and see if I can confirm which process is being interrupted and why. One additional data point I can add is that the condition is 100% repeatable. That is, with the combination of Unix VM + fixed memory allocation + Morphic + original #interruptName method + runaway recursion test, the result will always be a VM crash. This should make it much easier to debug.
One additional hypothesis: If you were to disable the "grow object memory" capability within a Win32 VM, I would expect to see the same VM crash behavior.
I'll post any results by this Sunday.
Dave
On Fri, Apr 01, 2005 at 06:44:35AM -0500, David T. Lewis wrote:
> Thanks Andreas,
> I will run some tests this weekend and see if I can confirm which process is being interrupted and why. One additional data point I can add is that the condition is 100% repeatable. That is, with the combination of Unix VM + fixed memory allocation + Morphic + original #interruptName method + runaway recursion test, the result will always be a VM crash. This should make it much easier to debug.
> One additional hypothesis: If you were to disable the "grow object memory" capability within a Win32 VM, I would expect to see the same VM crash behavior.
> I'll post any results by this Sunday.
As promised, here are some test results for the low memory handler problem. All tests were run on a Squeak 3.7 image with Unix VM set for fixed size object memory to force the out-of-memory condition. I used OSProcess and output tracing to the console in order to see what is happening in the interrupt handlers. The bad fix for #interruptName: is *not* applied in any of these tests, so this reflects behavior of the system prior to applying the bad fix. Some of the results are rather self-evident, but are included for completeness. For reference, I've also attached a change set with the hacks that I used for debugging.
Bottom line for the impatient: Andreas' analysis is correct, modulo a few details pertaining to MVC.
In Morphic, if I terminate the event tickler and start a dummyTickler process instead, I get a runaway overflow (VM crash) just as if the event tickler were running. If there is no event tickler, then no crash occurs.
dummyTickler _ [ [true] whileTrue: [(Delay forMilliseconds: 500) wait] ] forkAt: Processor lowIOPriority.
Smalltalk createStackOverflow
If I do the same thing, but give the dummyTickler a longer time delay, then there is no failure. Presumably the dummyTickler was not ready to be scheduled, hence did not interfere with low space handling. This further supports the hypothesis that the event tickler is being incorrectly treated as the runaway process.
dummyTickler _ [ [true] whileTrue: [(Delay forMilliseconds: 50000) wait] ] forkAt: Processor lowIOPriority.
Smalltalk createStackOverflow
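The contrast between the 500 ms and 50000 ms runs can be pictured with a small timing model. This is a Python sketch, not Squeak code, and it rests on assumptions that are not measured anywhere in this thread: the tickler wakes at exact multiples of its delay, the full GC starts 3000 ms into the run, and it lasts 600 ms. The function name `tickler_wakes_during_gc` is invented for the illustration.

```python
def tickler_wakes_during_gc(delay_ms: int, gc_start_ms: int, gc_ms: int) -> bool:
    """Return True if a periodic wake-up falls inside the GC window.

    Assumes wake-ups at exact multiples of delay_ms, counted from time 0.
    """
    gc_end_ms = gc_start_ms + gc_ms
    # First wake-up strictly after the GC has started.
    next_wake_ms = (gc_start_ms // delay_ms + 1) * delay_ms
    return next_wake_ms <= gc_end_ms


# Assumed numbers, for illustration only.
GC_START_MS = 3000
GC_MS = 600

for delay in (500, 50000):
    pending = tickler_wakes_during_gc(delay, GC_START_MS, GC_MS)
    state = "runnable when the semaphore is signalled" if pending else "still waiting"
    print(f"delay {delay:>5} ms: tickler is {state}")
```

Under these assumptions, any delay no longer than the GC pause always lands inside the window, which is consistent with the 500 ms tickler interfering and the 50000 ms one not.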
In MVC, the following does *not* result in a VM crash. The low space watcher works fine despite having this process running. Also, if the event tickler process is terminated, this still works properly in MVC. In other words, in MVC we *always* handle the low space semaphore correctly (but see below), regardless of whether the event tickler or the dummyTickler (or both) is running.
dummyTickler _ [ [true] whileTrue: [(Delay forMilliseconds: 500) wait] ] forkAt: Processor lowIOPriority.
Smalltalk createStackOverflow
Note however, in MVC if the stack overflow occurs in a background process, things go horribly wrong in ways I will not even try to describe.
[Smalltalk createStackOverflow] fork
I used OSProcess output tracing to identify the preempted process at the time of low space interrupt handling. In *both* Morphic and MVC, the event tickler is the preempted process (when the event tickler is running of course). See console trace logs below for the actual output.
This seems to bring us right back to the implementation of #interruptName:. There is an MVC implementation in ControlManager>>interruptName:, and a Morphic implementation in Project class>>interruptName:. The MVC implementation apparently handles the low space interrupt correctly, and the Morphic implementation does not.
What is different? The MVC implementation does not make any reference to the preempted process when figuring out which process to suspend. Instead it always suspends the ScheduledControllers activeControllerProcess process (see below, possibly this is why in MVC we cannot interrupt a high priority background process).
All of this supports Andreas' hypothesis that, when one process triggers the low memory semaphore, the higher priority event tickler process has become runnable by the time garbage collection completes and the semaphore has been signalled.
So this still leaves us with the problem of how to figure out which process actually was active at the time the low space interrupt was generated.
Just for completeness, in Morphic, the following *does not* hang the image, but in MVC it *does* hang the image.
[[true] whileTrue:[]] forkAt: Processor userSchedulingPriority+1.
So MVC apparently does not know how to handle either the low memory semaphore or the user interrupt semaphore if the offending process is running in the background, separate from the main UI controller scheduling.
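The trade-off between the two approaches can be summarized with a toy comparison. This is a Python sketch, not the actual ControlManager>>interruptName: or Project class>>interruptName: code. It assumes, as a simplification, that the Morphic side ends up suspending the highest-priority runnable process it finds and that the MVC side always suspends the UI process. The function names, process names and priority numbers are illustrative only.

```python
from typing import Dict, List, Tuple

Runnable = List[Tuple[str, int]]  # (process name, priority)


def always_ui_process(runnable: Runnable, ui: str = "UI process") -> str:
    """MVC-like policy: ignore the preempted process, suspend the UI process."""
    return ui


def preempted_process(runnable: Runnable, ui: str = "UI process") -> str:
    """Morphic-like policy (simplified): suspend the highest-priority runnable process."""
    return max(runnable, key=lambda p: p[1])[0]


# scenario -> (runnable processes when the watcher runs, actual culprit)
scenarios: Dict[str, Tuple[Runnable, str]] = {
    "overflow in UI, tickler runnable": (
        [("event tickler", 60), ("UI process", 40)], "UI process"),
    "overflow in UI, tickler waiting": (
        [("UI process", 40)], "UI process"),
    "runaway background loop": (
        [("background loop", 41), ("UI process", 40)], "background loop"),
}

for name, (runnable, culprit) in scenarios.items():
    for policy in (always_ui_process, preempted_process):
        hit = policy(runnable)
        verdict = "ok" if hit == culprit else "WRONG"
        print(f"{name:34} {policy.__name__:18} -> {hit:16} {verdict}")
```

In this model each policy gets exactly one scenario wrong, and they are different scenarios, which lines up with the Morphic and MVC failures described above.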
Following are copies of the console output obtained when running #createStackOverflow in both Morphic and MVC. In both cases, the #createStackOverflow method has been modified to produce console output, and to force termination if the method has been called more than 10 times after the low space semaphore was signalled. For each line in the trace output, the first number identifies the active process (the second number identifies the message receiver).
In Morphic the console output is:
lewis@dtlewis:~/squeak/squeak3.7> squeak -memory 7m squeak.7
1319:3840:UndefinedObject>>DoIt:about to run createStackOverflow from a workspace
430:3864:SystemDictionary>>lowSpaceWatcher:low space semaphore signal received, preempted process is [] in EventSensor>>eventTickler {[delay wait. delta := Time millisecondClockValue - lastEventPoll. (delta <...]}
BlockContext>>on:do:
[] in EventSensor>>installEventTickler {[self eventTickler]}
[] in BlockContext>>newProcess {[self value. Processor terminateActive]}

430:3864:SystemDictionary>>lowSpaceWatcher:about to display low space notifier
430:2445:Project class>>interruptName::entering #interruptName, preempted process is [] in EventSensor>>eventTickler {[delay wait. delta := Time millisecondClockValue - lastEventPoll. (delta <...]}
BlockContext>>on:do:
[] in EventSensor>>installEventTickler {[self eventTickler]}
[] in BlockContext>>newProcess {[self value. Processor terminateActive]}

1319:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 0
1319:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 1
1319:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 2
1319:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 3
1319:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 4
1319:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 5
1319:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 6
1319:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 7
1319:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 8
1319:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 9
1319:3864:SystemDictionary>>createStackOverflow::low space signal already received, continuing with recursion depth: 10
1319:3864:SystemDictionary>>createStackOverflow::terminate recursion at depth: 10
3886:3864:SystemDictionary>>lowSpaceWatcher:restarted low space watcher
3886:3864:SystemDictionary>>lowSpaceWatcher:install low space semaphore
3886:3864:SystemDictionary>>lowSpaceWatcher:enable low space interrupts
3886:3864:SystemDictionary>>lowSpaceWatcher:wait on low space semaphore
2954:3864:SystemDictionary>>lowSpaceWatcher:restarted low space watcher
2954:3864:SystemDictionary>>lowSpaceWatcher:install low space semaphore
2954:3864:SystemDictionary>>lowSpaceWatcher:enable low space interrupts
2954:3864:SystemDictionary>>lowSpaceWatcher:wait on low space semaphore
In MVC the console output is:
lewis@dtlewis:~/squeak/squeak3.7> squeak -memory 7m squeak.7
1898:3840:UndefinedObject>>DoIt:about to run createStackOverflow from a workspace
2069:3864:SystemDictionary>>lowSpaceWatcher:low space semaphore signal received, preempted process is [] in EventSensor>>eventTickler {[delay wait. delta := Time millisecondClockValue - lastEventPoll. (delta <...]}
BlockContext>>on:do:
[] in EventSensor>>installEventTickler {[self eventTickler]}
[] in BlockContext>>newProcess {[self value. Processor terminateActive]}

2069:3864:SystemDictionary>>lowSpaceWatcher:about to display low space notifier
2069:347:ControlManager>>interruptName::entering #interruptName, preempted process is [] in EventSensor>>eventTickler {[delay wait. delta := Time millisecondClockValue - lastEventPoll. (delta <...]}
BlockContext>>on:do:
[] in EventSensor>>installEventTickler {[self eventTickler]}
[] in BlockContext>>newProcess {[self value. Processor terminateActive]}

1794:3864:SystemDictionary>>lowSpaceWatcher:restarted low space watcher
1794:3864:SystemDictionary>>lowSpaceWatcher:install low space semaphore
1794:3864:SystemDictionary>>lowSpaceWatcher:enable low space interrupts
1794:3864:SystemDictionary>>lowSpaceWatcher:wait on low space semaphore
Dave
squeak-dev@lists.squeakfoundation.org