[Box-Admins] Build.squeak.org jobs stuck

David T. Lewis lewis at mail.msen.com
Thu Nov 21 12:35:19 UTC 2013


On Thu, Nov 21, 2013 at 10:11:20AM +0000, Frank Shearar wrote:
> On 21 November 2013 00:47, David T. Lewis <lewis at mail.msen.com> wrote:
> > On Wed, Nov 20, 2013 at 05:39:35PM -0600, Ken Causey wrote:
> >> I would really appreciate it if someone with Jenkins expertise would
> >> look at the situation with build.squeak.org (aka box3).  Multiple jobs
> >> are stuck and some date back to the 17th.  I would just kill them but it
> >> seems to me the problem will be right back in a matter of hours or no
> >> more than a day.  If nothing else maybe the problem job(s) can simply be
> >> suspended temporarily.
> >>
> >> Ken
> >
> > I don't know the underlying problem, but there were lots of ruby and squeakvm
> > processes reparented to root, and no clear indication of what is going wrong.
> > I killed off as many of the reparented processes as I could find.
> 
> The underlying problem is this:
> * rake starts running a build
> * it spawns a process to fire up a Squeak image, running tests. Call this A.
> * it also spawns a thread that will, after "too long" has passed -
> 240s by default - theoretically
> ** send a USR1 to A
> ** dump the pstree info for A (extra debug info while we try get this
> process working reliably)
> ** send a KILL to A
> * and yet the squeakvm process is not killed
> 

This analysis sounds right to me.

The command "ps xjf -u jenkins" will show a process tree that gives some insight
as to what is going on. We currently have one runaway condition, and it looks like
this:


dlewis at box3-squeak:~$ ps xjf -u jenkins
PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
 1081  1084  1081  1081 ?           -1 S     1003   0:00 sshd: davidlewis at pts/0
 1084  1085  1085  1085 pts/0     1136 Ss    1003   0:00  \_ -bash
 1085  1136  1136  1085 pts/0     1136 R+    1003   0:00      \_ ps xjf -u jenkins
    1 12249 27120 27120 ?           -1 Sl     103   0:00 /var/lib/jenkins/.rvm/rubies/ruby-1.9.3-p392/bin/ruby -S rspec test/image_test.rb
12249 12463 27120 27120 ?           -1 Z      103   0:00  \_ [uname] <defunct>
12249 12484 27120 27120 ?           -1 S      103   0:00  \_ sh -c /var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/Squeak-4.10.2.2614-src-32/bld/squeak.sh -vm-sound-null -vm-display-null "/var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/PostTestTrunkImage.image" ../tests.st && echo command 1 finished
12484 12487 27120 27120 ?           -1 S      103   9:32      \_ /var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/Squeak-4.10.2.2614-src-32/bld/squeakvm -vm-sound-null -vm-display-null /var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/PostTestTrunkImage.image ../tests.st
    1 27123 27120 27120 ?           -1 S      103   0:03 /usr/bin/daemon --name=jenkins --inherit --env=JENKINS_HOME=/var/lib/jenkins --output=/var/log/jenkins/jenkins.log --pidfile=/var/run/jenkins/jenkins.pid -- /usr/bin/java -jar /usr/share/jenkins/jenkins.war --webroot=/var/cache/jenkins/war --httpPort=8080 --ajp13Port=-1
27123 27125 27120 27120 ?           -1 Sl     103 174:47  \_ /usr/bin/java -jar /usr/share/jenkins/jenkins.war --webroot=/var/cache/jenkins/war --httpPort=8080 --ajp13Port=-1

There are two things that look interesting here. First, process 12463
was started by 12249 (the ruby interpreter process). It has exited, but
its parent process (12249) has not yet cleaned up after it (see "man 2 wait"
for background on how this works in unix). This may indicate some issues
with the way ruby manages its child processes. Or it may not be anything
significant at all, I can't say for sure.

Second, the ruby process is running the squeakvm executable by starting
another unix shell, then running squeakvm in that shell. This is the
line that says "sh -c /var/lib/jenkins/...". This may be the normal way
in which ruby runs system commands, but in this case it means that the
subprocess that ruby is monitoring ("process A") is actually that unix
shell that it started, not the squeakvm itself. So it seems likely that
watchdog thread in the ruby process is killing the unix shell process,
but is not killing the squeakvm process.

I am not sure the best way to address this, but at least it gives us
an idea of where the problem is originating.

Dave

> You can see in the build logs that the thread does attempt to kill the
> process - look for "!!!" - and yet the tests keep on rolling.
> 
> What would be useful is finding out to which builds the squeakvm
> processes belong. We ought to be able to do that quite easily since
> each squeakvm process will include the job name in its path.
> 
> I did just discover a typo in the kill-it thread which would result in
> the "nil doesn't understand the #puts method" error disappearing and
> nothing happening. I've just committed that change and rerun
> SqueakTrunk. Hopefully that will help us.
> 
> > Currently there is one Jenkins job running, and one squeakvm process corresponding
> > to that job. I'll look again in a day or so and see how many runaway processes
> > may have come back.
> 
> Ditto.
> 
> frank
> 
> > Dave
> >


More information about the Box-Admins mailing list