[squeak-dev] OSProcess endless loop

David T. Lewis lewis at mail.msen.com
Fri Apr 15 13:44:26 UTC 2022

Hi Chris,

On Thu, Apr 14, 2022 at 09:18:21PM -0400, David T. Lewis wrote:
> Hi Chris,
> <ACK> This might be tricky to debug, but message received. See below.
> On Wed, Apr 13, 2022 at 09:36:57PM -0500, Chris Muller wrote:
> > Hi Dave,
> > 
> > I've been having occasional issues with OSProcess somehow getting
> > hosed up and becoming unusable in my image.  Tonight it started again,
> > and I'm rather stuck in the water at the moment, unsure how to break
> > out of it other than building a new image.
> > 
> > Magma now uses OSProcess "outputOf: 'free -wb'" every few seconds to
> > get ahead of any potential OutOfMemory signals.  But somehow my image
> > got into a state where the simplest uses of OSProcess lock up.
> > Whenever it happens, I see these messages in the console:
> > 
> > 364147968:982663168:[] in
> > AioEventHandler>>initializeForExceptions:readEvents:writeEvents::aio
> > event forwarding not supported
> > 364147968:37728768:[] in
> > AioEventHandler>>initializeForExceptions:readEvents:writeEvents::aio
> > event forwarding not supported
> > 
> > From my feeble debugging, it seems #primRead:into:startingAt:count: is
> > returning a 0 count, which is what leads to OSProcess's
> > logic-flow to never be able to break out of the while loop in
> > BufferedAsyncFileReadStream>>#upToEndOfFile.  Here's a rough stack
> > trace of that loop:
> > 
> >     BufferedAsyncFileReadStream>>#upToEndOfFile
> >     BufferedAsyncFileReadStream>>#atEndOfFile
> >         (readBuffer atEnd = true, OSProcess accessor isAtEndOfFile:
> > fileID returns false)
> >     BufferedAsyncFileReadStream>>#readAvailableDataFrom:into:
> >     primRead:into:startingAt:count: (---> answers 0)
> >     OSProcessAccessor>>#isAtEndOfFile: (---> answers false)
> >     (restart loop in BufferedAsyncFileReadStream>>#upToEndOfFile)
> > 
> > It would be nice if OSProcess could detect this situation and signal
> > some kind of error.  With the endless loop, it sometimes takes a while
> > to get to the bottom of why something isn't responsive.
> > 
> > I'm running production 5.3 with the latest OSProcess and CommandShell.
> > I thought it might be a resource issue on my laptop, but rebooting
> > didn't help.  Rebuilding from fresh 5.3 image always works, however,
> > the weirdest thing is, the problem seems to clear ITSELF up.  Like,
> > OMG, right now, it's working again!  I had just run a test in a fresh
> > image to test multiple processes hitting OSProcess outputOf:.  It
> > worked fine and when I came back to my problem image, it's suddenly
> > working again!
> > 
> > Can you think of anything I might be doing to get into this situation
> > and/or how to break out of it?  Something to avoid or initialize?
> > 
> > The other thing I noticed, when I would break into the locked up
> > OSProcess with Cmd+. (dot), there were TWO processes stuck
> > in the loop, one from my DoIt, the other originating from the line:
> > 
> >    "self changed: #childProcessStatus"
> > 
> > of #grimReaperProcess.  Sigh..  I apparently already closed those
> > debuggers and now can't reproduce the issue to paste their bug report
> > stack traces!  Sorry.
> > 
> > I hope I'm not the only one experiencing this issue so we can
> > hopefully track this down.  It's insidious when it happens.
> > 
> > Thanks,
> >   Chris
> > 
> A bit off the top of my head and totally untested, but the first thing
> that comes to mind is that the check for  #isAtEndOfFile is calling
> <primitive: 'primitiveTestEndOfFileFlag' module: 'UnixOSProcessPlugin'>
> which is probably going to see an end of file on the pipe only if the
> external child OSProcess has actually terminated normally. Terminating
> normally means not just that the child process (running your 'free -wb'
> command in a bash shell) has exited, but also that the parent process
> (the Squeak VM) has noticed it and has harvested the child exit status.
> There is a Smalltalk process ('the child OSProcess watcher' in your process
> browser) that is responsible for harvesting the child process to get its
> exit status. This is event driven and can sometimes be a bit fragile. So
> the first thing that comes to mind if you have an image stuck in the
> condition that you describe would be to give that process a kick and the
> pants and see if things get better.
> If you can catch one of your images in a wedged condition, here are some
> do-its that might possibly break it free. If either of these work, it
> will give us an idea of where the underlying problem  might lie.
> 	OSProcess thisOSProcess updateActiveChildren.
> 	OSProcess accessor restartChildWatcherProcess.
> My guess is that the #updateActiveChildren method is most likely to
> un-wedge things but this is definitely just a guess.
> Dave

A couple of other things you might check:

1) There is a registry of current and past child processes:

	"OSProcess thisOSProcess allMyChildren"

This grows until the next image restart (ThisOSProcess class>>startUp:)
and could be a big memory leak if you are running a lot of commands
in a long-running Squeak image.  If that is a problem, see the methods
in ThisOSProcess category 'child process management' for ideas.

2) Check and see if a file handle leak may be happening:

	"(OSProcess outputOf: 'ls /proc/', OSProcess thisOSProcess pid asString, '/fd | wc -l') asInteger"

OSProcess tries to keep file handles cleaned up, but it's worth checking
in case there is a leak.


More information about the Squeak-dev mailing list