6293LowSpaceWatcherFix-dtl considered harmful
Andreas Raab
andreas.raab at gmx.de
Fri Apr 1 02:27:02 UTC 2005
Hi David -
Thanks for the history. Is there any post with a thorough analysis of
the sequence of events that made things go wrong, and consequently, why
changing the code the way you did would address the problem? I'm
asking, because I am still not clear about why a change in this
particular place would make any difference on the low space watcher
behavior whatsoever (neither did the debugging facilities you sent along
seem to be related to Project>>interruptName:).
Cheers,
- Andreas
David T. Lewis wrote:
> Hi Andreas,
>
> I am the source of this patch.
>
> If you have a copy of BFAV available, the reviews of this issue
> are under the "update" tab with the subject line:
> "[bug][FIX] stack overflow crashes Squeak. (one-line fix attached)"
>
> In case you do not have a working BFAV handy, I am attaching a
> copy of the change set that I used originally to debug the problem
> (runs on Unix, sorry, but you'll get the idea). I'm also copying
> some of the relevant discussion at the end of this message.
>
> Note that the symptoms of this problem appear differently on
> different VMs and different memory allocation settings. It also
> did not occur until after the introduction of the event tickler
> process into the image, so this appeared to be a pre-existing
> bug that had been masked until the event tickler process was
> added.
>
> I'm short of time right now, but I'll take another look at this
> as soon as I get some free time. Having said that, I'm no expert
> on the process scheduler or memory manager, so a better informed
> explanation and/or fix would be welcome.
>
> On Thu, Mar 31, 2005 at 12:48:38AM -0800, Andreas Raab wrote:
>
>>Hi Folks -
>>
>>Trying to track a problem in Tweak I ran into the above mentioned
>>update. I won't comment on the process of how it got into the image (I
>>think you all know my opinion about eyeballing system-critical changes
>>and I am almost certain that the approval went something like "oh, what
>>possible harm could a one-line change do") but I am rather interested in
>>what problem this is trying to fix.
>
>
> The low space interrupt failed to get through to the image. When running
> with fixed memory allocation, the result was a VM crash.
>
>
>>The comment of the change set states: "The low space watcher is
>>interrupted in the context of the wrong process when the eventTickler
>>process (or other high priority process) is running. This prevents low
>>space detection from functioning properly."
>>
>>But this makes no sense whatsoever.
>
>
> Poor wording and/or misunderstanding on my part. If I remember correctly,
> the low space interrupt was "appearing" in the wrong process context
> after the semaphore was signaled.
>
>
>>Whatever this update is trying to address, it cannot possibly be what is
>>being claimed in the preamble. So what problem is this update trying to
>>solve?
>
>
> Uninterruptible recursion followed by a hard VM crash on some VMs and
> memory settings. Not a good thing for naive users who would be most
> likely to make mistakes of this kind.
>
> HTH,
> Dave
>
> -------------------------------------
> Snippings from BFAV:
>
> Background on how and when the bug first became evident:
>
>
>>Hi Doug,
>>
>>I have not reconstructed from old images, but my best guess is that
>>the bug entered the image with the Project class>>interruptName:
>>method, which is time stamped 9/5/2001. The bug was present at that
>>point, but was not manifested until someone else added a high priority
>>background process into the system, which just happened to be the
>>otherwise blameless EventTickler process (EventSensor>>eventTickler),
>>which runs all the time at lowIOPriority. This was introduced in
>>update 5000 in April 2004, so I would expect that people started
>>noticing the problem after that time.
>>
>>The current thread in BFAV dates back to September 2003, which
>>suggests that the symptoms of this problem were being seen before
>>the EventTickler was introduced (or maybe I just did not follow
>>the trail all the way back). So I'm not entirely sure how long
>>people have been seeing symptoms of the problem. I'm reasonably
>>sure (call it about 90% confidence level, gut feel) that the
>>bug/fix that I posted addresses the underlying issue, although
>>I would not be surprised if it turns out that there are other
>>lurking buglets that might lead to similar symptoms.
>>
>>Important: This one is timing-dependent, and you may see different
>>symptoms depending on the VM and any memory settings you may have
>>used on the command line. On my Linux system, if I force a limit
>>on the amount of memory used (with "squeak -memory 10m"), I end
>>up with a real VM crash, stack dump and all. If I don't limit the
>>memory (which would be the normal mode of use), the image just
>>becomes unresponsive when it gets into an infinite recursion, and
>>cannot be interrupted. Presumably the VM is busy trying to allocate
>>more memory from the OS, but does not actually crash while it's
>>chewing away on this problem.
>>
>>I thought that I had read an earlier report that (John's) Mac
>>OS X does not exhibit the problem, apparently due to its use of
>>a threaded VM that is more responsive to UI events. However, your
>>description of the behavior on your OS X system sounds quite similar
>>to what I see on Linux.
>>
>>RiscOS seems to behave similarly to Linux, and I don't know what
>>Windows does.
>
>
>
> How to reproduce the bug:
>
>
>>Attached is a change set that I used to debug the stack overflow problem
>>and confirm the fix. This only runs on a Unix VM with OSProcess loaded,
>>but the overflow problem is a bit tricky to debug so I'm posting this in
>>case someone wants to reproduce what I did.
>>
>>Basically this just writes debug trace messages to standard output so I
>>can keep track of what process is running what method in what order.
>>Just some good ol' fashioned Fortran debugging, but what the heck, it
>>worked.
>>
>>From the preamble:
>>
>>This is what I used to debug the stack overflow problem. Load OSProcess
>>first, then load this change set.
>>
>>Intended for use on Unix/Linux. Run the Squeak vm with a fixed memory
>>allocation (squeak -memory 30m) in order to force the out-of-memory
>>condition.
>>
>>Open a ProcessBrowser, then evaluate 'Smalltalk createStackOverflow'.
>>You should see messages on stdout that confirm that the runaway
>>recursion keeps going even after the low space semaphore has been
>>signaled.
>>
>>Now apply the LowSpaceWatcherFix change set, and evaluate 'Smalltalk
>>createStackOverflow'. The low space watcher should catch the runaway
>>method right away.
>
>
>>Newer Linux VMs grow memory dynamically, and do not start with any
>>explicit memory limit.
>
>
>
> Unix VM memory settings that affect whether or not the bug results
> in a hard crash:
>
>
>>There are two command-line options (with equivalent environment
>>variables) to control how memory is allocated on Unix:
>>
>> If no options are given then memory is allocated dynamically with the
>>limit set at 75% of the available virtual memory.
>>
>> If -memory N{mk} is given then memory is allocated statically; the
>>argument to the option defines a hard upper limit.
>>
>> If -mmap N{mk} is given then memory is allocated dynamically, with an
>>explicit upper limit to the amount of memory that will be allocated
>>(but the "75% of available virtual memory" limit still applies).
>>
>>Ian
>
>
>
> ------------------------------------------------------------------------
>
>