[Vm-dev] Image freeze because handleTimerEvent and Seaside process gone?!

Tue Dec 7 01:06:02 UTC 2010

On Mon, Dec 06, 2010 at 12:33:59PM -0800, Andreas Raab wrote:
> 
> At a guess, I'd say it's either one of two issues:
> 
> 1) Your STOP/CONT handling. This sounds suspicious and it could affect 
> the timer handling. I'm assuming that the issue happens after receiving 
> the CONT signal, no? If you can, you might want to a) make sure that you 
> only get the STOP signal when the VM is in ioRelinquish() and not (for 
> example) currently executing the delay process and b) consider dumping 
> the call stacks whenever the VM gets the CONT signal to see what the 
> status is.
> 
> 2) Some set of incomplete process/delay/semaphore changes in Pharo. One 
> of the problems with processes and delays is that this part of the 
> system reacts very badly to random "cleaning". I.e., changing "foo == 
> nil" to "foo isNil" can have dramatic effects (since it introduces a 
> suspension point) with just the kind of weird issue you're seeing.

Actually #2 does seem like a likely culprit. I found a Pharo 1.1 image
and loaded the CommandShell and OSProcess test suites. The CommandShell
tests put a heavy load on process switching, and are rather timing
dependent. On Pharo 1.1 I get intermittent and non-reproducible errors
and test failures, and I can't get a clean run of the test suite. The
errors seem to be different each time.

On Pharo 1.1.1 and 1.2 I can get clean runs of the CommandShell/OSProcess
tests, so I think there must be some issues in Pharo 1.1. If you are
using PharoCore 1.1 now and have the option of moving to Pharo 1.1.1
or 1.2, I suspect you may see the problems go away.

Dave

> 
> With regard to these processes not being printed, that's a side effect 
> of how printAllStacks gathers the processes - it will not print 
> suspended processes which explains why the UI process doesn't print and 
> most likely handleTimerEvent is suspended in a debugger.
> 
> Depending on how important this issue is you can also try to dissect the 
> object memory itself. If you call writeImageFile (or is it 
> writeImageFileIO?) from gdb it will dump the .image file and you can use 
> the simulator to look at it more closely. Most likely you'll be able to 
> find the processes and look at their stacks.
> 
> Cheers,
>   - Andreas
> 
> On 12/6/2010 2:55 AM, Adrian Lienhard wrote:
> >
> >Hi all,
> >
> >We've been experiencing an "interesting" problem: the image freezes and 
> >does not respond to HTTP requests anymore after it has been running for 
> >days.
> >
> >Here is some basic information about our setup:
> >
> >Squeak VM 4.0.3-2202 compiled with gcc 4.3.2
> >PharoCore 1.1
> >OS Debian Lenny amd64 (CPUs are 4 Intel Xeon E5530 2.40GHz)
> >
> >- We have never seen the problem with the Squeak VM 3.9-9 and Squeak 3.9 
> >on the identical machine and with the same application source (modulo some 
> >adaptations to make it run on Pharo).
> >- We run the VM with -mmap 512m -vm-sound-null -vm-display-null, and the 
> >UI process is suspended (Project uiProcess suspend)
> >- VM does not hog the CPU and memory usage is normal
> >- The mean time between failures is several weeks and we haven't managed to 
> >reproduce the problem
> >- The application mainly serves HTTP requests. When the image does not 
> >receive requests for some time we send it a STOP signal, when a request 
> >comes in it is sent a CONT signal.
> >- lsof shows
> >	TCP *:9093 (LISTEN)
> >	TCP server:9093->server:46930 (CLOSE_WAIT)
> >
> >Below is a GDB backtrace and the Smalltalk stacks from an image that was 
> >frozen (the VM had been running for almost 100 hours):
> >
> >=============================================================
> >(gdb) bt
> >#0  0x08072020 in ?? ()
> >#1  <signal handler called>
> >#2  0xb766f5e0 in malloc () from /lib/libc.so.6
> >#3  <function called from gdb>
> >#4  0xb76c50c8 in select () from /lib/libc.so.6
> >#5  0x08071063 in aioPoll ()
> >#6  0xb778bb8d in ?? () from /usr/lib/squeak/4.0.3-2202//so.vm-display-null
> >#7  0x000003e8 in ?? ()
> >#8  0x997b5a34 in ?? ()
> >#9  0xbfe7cb28 in ?? ()
> >#10 0x08074575 in ioRelinquishProcessorForMicroseconds ()
> >Backtrace stopped: frame did not save the PC
> >
> >(gdb) call printCallStack()
> >-1719969228>idleProcess
> >-1719969320>startUp
> >-1740134028 BlockClosure>newProcess
> >$3 = -1755344892
> >
> >(gdb) call (int) printAllStacks()
> >Process
> >-1719969228>idleProcess
> >-1719969320>startUp
> >-1740134028 BlockClosure>newProcess
> >
> >Process
> >-1740113860>finalizationProcess
> >-1740113952>restartFinalizationProcess
> >-1740113532 BlockClosure>newProcess
> >
> >Process
> >-1740134424 SmalltalkImage>lowSpaceWatcher
> >-1740134516 SmalltalkImage>installLowSpaceWatcher
> >-1740134300 BlockClosure>newProcess
> >
> >Process
> >-1719451488 Delay>wait
> >-1719451580 BlockClosure>ifCurtailed:
> >-1719451704 Delay>wait
> >-1719451796 InputEventPollingFetcher>waitForInput
> >-1740126940 InputEventFetcher>eventLoop
> >-1740127032 InputEventFetcher>installEventLoop
> >-1740126816 BlockClosure>newProcess
> >
> >Process
> >-1719557780 UnixOSProcessAccessor>grimReaperProcess
> >-1740113624 BlockClosure>repeat
> >-1740113716 UnixOSProcessAccessor>grimReaperProcess
> >-1740117340 BlockClosure>newProcess
> >
> >[omitted many newlines between output above]
> >=============================================================
> >
> >What is striking from the above process listing is that two processes are 
> >missing: the handleTimerEvent process and the Seaside process (that is, 
> >the TCP listener loop). How come these processes vanished?
> >
> >This may be related to Pharo or to the Squeak VM.
> >
> >Has anybody else seen this problem? Any idea how to debug/fix this issue 
> >is very much appreciated!
> >
> >Cheers,
> >Adrian
> >
> >
> >CCed to pharo-dev since this may be related to Pharo; please respond on 
> >the squeak-vm list
> >
> >
> >