6293LowSpaceWatcherFix-dtl considered harmful

Andreas Raab andreas.raab at gmx.de
Fri Apr 1 02:27:02 UTC 2005


Hi David -

Thanks for the history. Is there any post with a thorough analysis of 
the sequence of events that made things go wrong, and consequently, why 
changing the code the way you did would address the problem? I'm 
asking, because I am still not clear about why a change in this 
particular place would make any difference to the low space watcher 
behavior whatsoever (nor did the debugging facilities you sent along 
seem to be related to Project>>interruptName:).

Cheers,
   - Andreas

David T. Lewis wrote:
> Hi Andreas,
> 
> I am the source of this patch.
> 
> If you have a copy of BFAV available, the reviews of this issue
> are under the "update" tab with the subject line:
>  "[bug][FIX] stack overflow crashes Squeak. (one-line fix attached)"
> 
> In case you do not have a working BFAV handy, I am attaching a
> copy of the change set that I used originally to debug the problem
> (runs on Unix, sorry, but you'll get the idea). I'm also copying
> some of the relevant discussion at the end of this message.
> 
> Note that the symptoms of this problem appear differently on
> different VMs and different memory allocation settings. It also
> did not occur until after the introduction of the event tickler
> process into the image, so this appeared to be a pre-existing
> bug that had been masked until the event tickler process was
> added.
> 
> I'm short of time right now, but I'll take another look at this
> as soon as I get some free time. Having said that, I'm no expert
> on the process scheduler or memory manager, so a better informed
> explanation and/or fix would be welcome.
> 
> On Thu, Mar 31, 2005 at 12:48:38AM -0800, Andreas Raab wrote:
> 
>>Hi Folks -
>>
>>Trying to track a problem in Tweak I ran into the above mentioned 
>>update. I won't comment on the process of how it got into the image (I 
>>think you all know my opinion about eyeballing system-critical changes 
>>and I am almost certain that the approval went something like "oh, what 
>>possible harm could a one-line change do") but I am rather interested in 
>>what problem this is trying to fix.
> 
> 
> The low space interrupt failed to get through to the image. When running
> with fixed memory allocation, the result was a VM crash.
> 
> 
>>The comment of the change set states: "The low space watcher is 
>>interrupted in the context of the wrong process when the eventTickler 
>>process (or other high priority process) is running. This prevents low 
>>space detection from functioning properly."
>>
>>But this makes no sense whatsoever.
> 
> 
> Poor wording and/or misunderstanding on my part. If I remember correctly,
> the low space interrupt was "appearing" in the wrong process context
> after the semaphore was signaled.
>  
> 
>>Whatever this update is trying to address, it cannot possibly be what is 
>>being claimed in the preamble. So what problem is this update trying to 
>>solve?
> 
> 
> Uninterruptible recursion followed by a hard VM crash on some VMs and
> memory settings. Not a good thing for naive users who would be most
> likely to make mistakes of this kind.
> 
> HTH,
> Dave
> 
> -------------------------------------
> Snippings from BFAV:
> 
> Background on how and when the bug first became evident:
> 
> 
>>Hi Doug,
>>
>>I have not reconstructed from old images, but my best guess is that
>>the bug entered the image with the Project class>>interruptName:
>>method, which is time stamped 9/5/2001. The bug was present at that
>>point, but was not manifested until someone else added a high priority
>>background process into the system, which just happened to be the
>>otherwise blameless EventTickler process (EventSensor>>eventTickler),
>>which runs all the time at lowIOPriority. This was introduced in
>>update 5000 in April 2004, so I would expect that people started
>>noticing the problem after that time.
>>
>>The current thread in BFAV dates back to September 2003, which
>>suggests that the symptoms of this problem were being seen before
>>the EventTickler was introduced (or maybe I just did not follow
>>the trail all the way back). So I'm not entirely sure how long
>>people have been seeing symptoms of the problem. I'm reasonably
>>sure (call it about 90% confidence level, gut feel) that the
>>bug/fix that I posted addresses the underlying issue, although
>>I would not be surprised if it turns out that there are other
>>lurking buglets that might lead to similar symptoms.
>>
>>Important: This one is timing-dependent, and you may see different
>>symptoms depending on the VM and any memory settings you may have
>>used on the command line. On my Linux system, if I force a limit
>>on the amount of memory used (with "squeak -memory 10m"), I end
>>up with a real VM crash, stack dump and all. If I don't limit the
>>memory (which would be the normal mode of use), the image just
>>becomes unresponsive when it gets into an infinite recursion, and
>>cannot be interrupted. Presumably the VM is busy trying to allocate
>>more memory from the OS, but does not actually crash while it's
>>chewing away on this problem.
>>
>>I thought that I had read an earlier report that (John's) Mac
>>OS X does not exhibit the problem, apparently due to its use of
>>a threaded VM that is more responsive to UI events. However, your
>>description of the behavior on your OS X system sounds quite similar
>>to what I see on Linux. 
>>
>>RiscOS seems to behave similarly to Linux, and I don't know what
>>Windows does.
> 
> 
> 
> How to reproduce the bug:
> 
> 
>>Attached is a change set that I used to debug the stack overflow problem
>>and confirm the fix. This only runs on a Unix VM with OSProcess loaded,
>>but the overflow problem is a bit tricky to debug so I'm posting this in
>>case someone wants to reproduce what I did.
>>
>>Basically this just writes debug trace messages to standard output so I
>>can keep track of what process is running what method in what order.
>>Just some good ol' fashioned Fortran debugging, but what the heck, it
>>worked.
>>
>>From the preamble:
>>
>>This is what I used to debug the stack overflow problem. Load OSProcess
>>first, then load this change set.
>>
>>Intended for use on Unix/Linux. Run the Squeak vm with a fixed memory
>>allocation (squeak -memory 30m) in order to force the out-of-memory
>>condition.
>>
>>Open a ProcessBrowser, then evaluate 'Smalltalk createStackOverflow'. 
>>You should see messages on stdout that confirm that the runaway
>>recursion keeps going even after the low space semaphore has been
>>signaled.
>>
>>Now apply the LowSpaceWatcherFix change set, and evaluate 'Smalltalk
>>createStackOverflow'. The low space watcher should catch the runaway
>>method right away.
> 
> 
>>Newer Linux VMs grow memory dynamically, and do not start with any
>>explicit memory limit.
> 
> 
> 
> Unix VM memory settings that affect whether or not the bug results
> in a hard crash:
> 
> 
>>There are two command-line options (with equivalent environment 
>>variables) to control how memory is allocated on Unix:
>>
>>   If no options are given then memory is allocated dynamically with the 
>>limit set at 75% of the available virtual memory.
>>
>>   If -memory N{mk} is given then memory is allocated statically; the 
>>argument to the option defines a hard upper limit.
>>
>>   If -mmap N{mk} is given then memory is allocated dynamically, with an 
>>explicit upper limit to the amount of memory that will be allocated 
>>(but the "75% of available virtual memory" limit still applies).
>>
>>Ian
> 
> 
> 
> ------------------------------------------------------------------------
> 
> 
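
For reference, the three Unix allocation modes Ian describes above could be
written out as command lines roughly as follows. This is only a sketch: the
helper name squeak_cmd and the image name squeak.image are hypothetical, and
only the -memory and -mmap options come from the quoted message. The helper
prints each command line instead of launching a VM, so it can be run without
Squeak installed.

```shell
#!/bin/sh
# Hypothetical image name, assumed for illustration only.
IMAGE="squeak.image"

# Hypothetical helper: print the squeak command line for one allocation mode.
#   dynamic - no option; memory grows dynamically (default limit applies)
#   static  - -memory N{mk}; fixed allocation with a hard upper limit
#   mmap    - -mmap N{mk}; dynamic allocation with an explicit upper limit
squeak_cmd() {
    mode="$1"
    size="$2"   # e.g. 30m or 512k; ignored for "dynamic"
    case "$mode" in
        dynamic) echo "squeak $IMAGE" ;;
        static)  echo "squeak -memory $size $IMAGE" ;;
        mmap)    echo "squeak -mmap $size $IMAGE" ;;
        *)       echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

squeak_cmd dynamic
squeak_cmd static 30m    # the setting used above to force the low-space condition
squeak_cmd mmap 256m
```

Per the reports above, the static form is the one that ended in a hard VM
crash, while the dynamic default left the image unresponsive instead.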



