[Vm-dev] Image freeze because handleTimerEvent and Seaside process gone?!
David T. Lewis
lewis at mail.msen.com
Tue Dec 7 01:06:02 UTC 2010
On Mon, Dec 06, 2010 at 12:33:59PM -0800, Andreas Raab wrote:
>
> At a guess, I'd say it's either one of two issues:
>
> 1) Your STOP/CONT handling. This sounds suspicious and it could affect
> the timer handling. I'm assuming that the issue happens after receiving
> the CONT signal, no? If you can, you might want to a) make sure that you
> only get the STOP signal when the VM is in ioRelinquish() and not (for
> example) currently executing the delay process and b) consider dumping
> the call stacks whenever the VM gets the CONT signal to see what the
> status is.
>
> 2) Some set of incomplete process/delay/semaphore changes in Pharo. One
> of the problems with processes and delays is that this part of the
> system reacts very badly to random "cleaning". I.e., changing "foo ==
> nil" to "foo isNil" can have dramatic effects (since it introduces a
> suspension point) with just the kind of weird issue you're seeing.

Actually #2 does seem like a likely culprit. I found a Pharo 1.1 image
and loaded the CommandShell and OSProcess test suites. The CommandShell
tests put a heavy load on process switching, and are rather timing
dependent. On Pharo 1.1 I get intermittent and non-reproducible errors
and test failures, and I can't get a clean run of the test suite. The
errors seem to be different each time.

On Pharo 1.1.1 and 1.2 I can get clean runs of the CommandShell/OSProcess
tests, so I think there must be some issues in Pharo 1.1. If you are
using PharoCore 1.1 now and have the option of moving to Pharo 1.1.1
or 1.2, I suspect you may see the problems go away.

Dave

>
> With regards to these processes not being printed, that's a side effect
> of how printAllStacks gathers the processes - it will not print
> suspended processes which explains why the UI process doesn't print and
> most likely handleTimerEvent is suspended in a debugger.
>
> Depending on how important this issue is you can also try to dissect the
> object memory itself. If you call writeImageFile (or is it
> writeImageFileIO?) from gdb it will dump the .image file and you can use
> the simulator to look at it more closely. Most likely you'll be able to
> find the processes and look at their stacks.
>
> Cheers,
> - Andreas
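
A note on suggestion 1b above (dumping the call stacks whenever the VM
gets the CONT signal): one way to approximate it from outside the VM is to
have whatever sends the CONT signal also run gdb in batch mode against the
VM process. The sketch below only builds and runs such a gdb command line.
The helper names are made up for illustration; the gdb expression is the
one used in the session quoted further down. As Andreas points out,
printAllStacks will not show suspended processes, and the stacks are
printed by the VM itself, so they may end up on the VM's own output rather
than in gdb's.

```python
import subprocess


def gdb_stack_dump_command(pid):
    """Build a gdb batch command line that attaches to `pid` and calls
    printAllStacks() inside the VM (hypothetical helper name)."""
    return [
        "gdb",
        "-p", str(pid),   # attach to the running VM process
        "-batch",         # run the -ex commands, then detach and exit
        "-ex", "call (int) printAllStacks()",
    ]


def dump_stacks(pid, timeout=30):
    """Run gdb against the VM and return gdb's exit code.

    Assumes gdb is installed and the caller is allowed to attach to pid.
    """
    result = subprocess.run(gdb_stack_dump_command(pid), timeout=timeout)
    return result.returncode
```

Calling dump_stacks(vm_pid) right after sending CONT would give a record of
what the image was doing each time it was resumed.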
>
> On 12/6/2010 2:55 AM, Adrian Lienhard wrote:
> >
> >Hi all,
> >
> >We've been experiencing an "interesting" problem: the image freezes and
> >does not respond to HTTP requests anymore after it has been running for
> >days.
> >
> >Here is some basic information about our setup:
> >
> >Squeak VM 4.0.3-2202 compiled with gcc 4.3.2
> >PharoCore 1.1
> >OS Debian Lenny amd64 (CPUs are 4 Intel Xeon E5530 2.40GHz)
> >
> >- We have never seen the problem with the Squeak VM 3.9-9 and Squeak 3.9
> >on the identical machine and with the same application source (modulo some
> >adaptations to make it run on Pharo).
> >- We run the VM with -mmap 512m -vm-sound-null -vm-display-null, and the
> >UI process is suspended (Project uiProcess suspend)
> >- VM does not hog the CPU and memory usage is normal
> >- The mean time between failures is several weeks and we haven't managed to
> >reproduce the problem
> >- The application mainly serves HTTP requests. When the image does not
> >receive requests for some time we send it a STOP signal, when a request
> >comes in it is sent a CONT signal.
> >- lsof shows
> > TCP *:9093 (LISTEN)
> > TCP server:9093->server:46930 (CLOSE_WAIT)
> >
> >Below is a GDB backtrace and the Smalltalk stacks from an image that was
> >frozen (the VM had been running for almost 100 hours):
> >
> >=============================================================
> >(gdb) bt
> >#0 0x08072020 in ?? ()
> >#1 <signal handler called>
> >#2 0xb766f5e0 in malloc () from /lib/libc.so.6
> >#3 <function called from gdb>
> >#4 0xb76c50c8 in select () from /lib/libc.so.6
> >#5 0x08071063 in aioPoll ()
> >#6 0xb778bb8d in ?? () from /usr/lib/squeak/4.0.3-2202//so.vm-display-null
> >#7 0x000003e8 in ?? ()
> >#8 0x997b5a34 in ?? ()
> >#9 0xbfe7cb28 in ?? ()
> >#10 0x08074575 in ioRelinquishProcessorForMicroseconds ()
> >Backtrace stopped: frame did not save the PC
> >
> >(gdb) call printCallStack()
> >-1719969228>idleProcess
> >-1719969320>startUp
> >-1740134028 BlockClosure>newProcess
> >$3 = -1755344892
> >
> >(gdb) call (int) printAllStacks()
> >Process
> >-1719969228>idleProcess
> >-1719969320>startUp
> >-1740134028 BlockClosure>newProcess
> >
> >Process
> >-1740113860>finalizationProcess
> >-1740113952>restartFinalizationProcess
> >-1740113532 BlockClosure>newProcess
> >
> >Process
> >-1740134424 SmalltalkImage>lowSpaceWatcher
> >-1740134516 SmalltalkImage>installLowSpaceWatcher
> >-1740134300 BlockClosure>newProcess
> >
> >Process
> >-1719451488 Delay>wait
> >-1719451580 BlockClosure>ifCurtailed:
> >-1719451704 Delay>wait
> >-1719451796 InputEventPollingFetcher>waitForInput
> >-1740126940 InputEventFetcher>eventLoop
> >-1740127032 InputEventFetcher>installEventLoop
> >-1740126816 BlockClosure>newProcess
> >
> >Process
> >-1719557780 UnixOSProcessAccessor>grimReaperProcess
> >-1740113624 BlockClosure>repeat
> >-1740113716 UnixOSProcessAccessor>grimReaperProcess
> >-1740117340 BlockClosure>newProcess
> >
> >[omitted many newlines between output above]
> >=============================================================
> >
> >What is striking from the above process listing is that two processes are
> >missing: the handleTimerEvent process and the Seaside process (that is,
> >the TCP listener loop). How come these processes vanished?
> >
> >This may be related to Pharo or to the Squeak VM.
> >
> >Has anybody else seen this problem? Any idea how to debug/fix this issue
> >is very much appreciated!
> >
> >Cheers,
> >Adrian
> >
> >
> >CCed to pharo-dev since this may be related to Pharo; please respond on
> >the squeak-vm list
> >
> >
> >
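
For anyone who wants to experiment with the STOP/CONT arrangement Adrian
describes, here is a minimal sketch of the signalling side, run against a
throwaway child process rather than a real VM. The function names and the
sleeping child are assumptions made for the example and say nothing about
how Adrian's setup is actually implemented. It assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys


def pause(pid):
    """Suspend the process, as done when no requests arrive for a while."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Let the process run again, as done when a request comes in."""
    os.kill(pid, signal.SIGCONT)


def demo():
    # Stand-in for the VM: a child process that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        pause(child.pid)
        # Block until the kernel reports the child as stopped.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)

        resume(child.pid)
        # Block until the kernel reports the child as continued.
        _, status = os.waitpid(child.pid, os.WCONTINUED)
        continued = os.WIFCONTINUED(status)
    finally:
        child.kill()
        child.wait()
    return stopped, continued
```

Note that the sender has no say in what the target is doing when STOP
arrives, which is why Andreas asks whether the VM might be stopped while it
is executing the delay process rather than sitting in ioRelinquish().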