[Box-Admins] Re: box4 is inaccessible

David T. Lewis lewis at mail.msen.com
Wed Dec 2 02:14:18 UTC 2015


On Tue, Dec 01, 2015 at 05:39:01PM -0600, Chris Muller wrote:
> It is running a much newer image (4.5 instead of 3.11), I don't know if it
> can support the special chars or not, but as I went to test it, I found the
> cause of the box4 crash, and a number of other problems:
> 
>    - my own "repcopy" script which is running on an hourly cron is getting
> some error, but leaving the process (squeak image) running.  My bet is box4
> ran out of memory and crashed.  I just killed about 50 of those processes.
> The script redirects stdout and stderr to "repcopy.log" and "repcopy.err",
> respectively.  However, upon the next scheduled run, I found the process
> hung again but those two files were empty.  However, there was a
> SqueakDebug.log file...

Yep, that would do it. System out of memory, maybe swapping or maybe not,
but either way not able to fork processes to serve new ssh connections.
The system is still alive and responding to pings, but nothing else works.
That's exactly what we were seeing.

Squeak images interacting with MCZ files are not failure proof. If you run
one from a cron, you'll want to add some extra protection.  You could do
some sort of process lock so the cron job first kills off any lingering
processes from the last run. Or you can maybe make do by adding a watchdog
process (*) in the image itself to exit unconditionally after N minutes if
the image is still alive.
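
To make the first option concrete, here is a minimal sketch of a wrapper that
the cron job could call instead of starting the image directly. It is not the
repcopy script itself. The names kill_stale and run_guarded, the pid-file
approach, and the choice of Python are all assumptions made for illustration.
The wrapper kills whatever the previous run left behind, records the new
process ID, and enforces a time limit on the current run.

```python
import os
import signal
import subprocess


def kill_stale(pidfile):
    """Send SIGKILL to the PID recorded by the previous run, if any.

    Returns True if a signal was delivered, False otherwise.
    Caveat: PIDs can be reused, so a production version should also
    verify that the PID still belongs to the expected command.
    """
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # no previous run recorded, or unreadable contents
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return False  # previous run already exited
    return True


def run_guarded(cmd, pidfile, timeout_s):
    """Run cmd, killing any leftover from the last run and enforcing a time limit.

    Returns the command's exit status, or None if it had to be killed on timeout.
    """
    kill_stale(pidfile)
    proc = subprocess.Popen(cmd)
    with open(pidfile, "w") as f:
        f.write(str(proc.pid))
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()  # reap the child so it does not linger as a zombie
        return None
    finally:
        # Remove the pid file on every exit path of this wrapper. If the
        # wrapper itself dies, the file stays and the next run cleans up.
        try:
            os.remove(pidfile)
        except FileNotFoundError:
            pass
```

With something like this in place, a stuck image costs at most one process
rather than one more per hour, since each run either finishes, is killed at
the time limit, or is killed by the next run.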

> 
>   - The SqueakDebug.log file shows *Squeak-Version-kfr.4712* is triggering
> the Warning, "About to serialize empty mcz".  This version has a comment
> of, "Messed up last version, try again", and so is a piece of litter in our
> ancestry and trunk.  I again ask y'all to support me on this littering
> issue and help me address it with community.  Please!
> 

Well and good, but you cannot rely on that. Bad stuff will happen, people
will make mistakes, and images will get stuck.

Dave

(*) As an example, here is the watchdog process that I am using in the
VMUnixBuild.st script in the InterpreterVM job on build.squeak.org:

"If exitWhenDone is true, image will exit without saving when this
script is complete or if an error is encountered."

exitWhenDone := true.
watchdogMinutes := 20. "Kill image if not done in this many minutes"

"Set a watchdog timer to kill this image if it gets stuck while running
headless. Applies only if the image is set to exit when complete."
exitWhenDone ifTrue: [
        [(Delay forSeconds: 60 * watchdogMinutes) wait.
        log value: 'Watchdog: killing image after ', watchdogMinutes asString, ' minutes'.
        OSProcess thisOSProcess sigkill: OSProcess thisOSProcess
        ] forkAt: Processor userInterruptPriority].




