[Box-Admins] Build.squeak.org jobs stuck
David T. Lewis
lewis at mail.msen.com
Thu Nov 21 12:35:19 UTC 2013
On Thu, Nov 21, 2013 at 10:11:20AM +0000, Frank Shearar wrote:
> On 21 November 2013 00:47, David T. Lewis <lewis at mail.msen.com> wrote:
> > On Wed, Nov 20, 2013 at 05:39:35PM -0600, Ken Causey wrote:
> >> I would really appreciate it if someone with Jenkins expertise would
> >> look at the situation with build.squeak.org (aka box3). Multiple jobs
> >> are stuck and some date back to the 17th. I would just kill them but it
> >> seems to me the problem will be right back in a matter of hours or no
> >> more than a day. If nothing else maybe the problem job(s) can simply be
> >> suspended temporarily.
> >> Ken
> > I don't know the underlying problem, but there were lots of ruby and squeakvm
> > processes reparented to root, and no clear indication of what is going wrong.
> > I killed off as many of the reparented processes as I could find.
> The underlying problem is this:
> * rake starts running a build
> * it spawns a process to fire up a Squeak image, running tests. Call this A.
> * it also spawns a thread that will, after "too long" has passed -
> 240s by default - theoretically
> ** send a USR1 to A
> ** dump the pstree info for A (extra debug info while we try to get this
> process working reliably)
> ** send a KILL to A
> * and yet the squeakvm process is not killed
This analysis sounds right to me.
The command "ps xjf -u jenkins" will show a process tree that gives some insight
as to what is going on. We currently have one runaway condition, and it looks like this:
dlewis@box3-squeak:~$ ps xjf -u jenkins
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
 1081  1084  1081  1081 ?           -1 S     1003   0:00 sshd: davidlewis@pts/0
 1084  1085  1085  1085 pts/0     1136 Ss    1003   0:00  \_ -bash
 1085  1136  1136  1085 pts/0     1136 R+    1003   0:00      \_ ps xjf -u jenkins
    1 12249 27120 27120 ?           -1 Sl     103   0:00 /var/lib/jenkins/.rvm/rubies/ruby-1.9.3-p392/bin/ruby -S rspec test/image_test.rb
12249 12463 27120 27120 ?           -1 Z      103   0:00  \_ [uname] <defunct>
12249 12484 27120 27120 ?           -1 S      103   0:00  \_ sh -c /var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/Squeak-126.96.36.19914-src-32/bld/squeak.sh -vm-sound-null -vm-display-null "/var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/PostTestTrunkImage.image" ../tests.st && echo command 1 finished
12484 12487 27120 27120 ?           -1 S      103   9:32      \_ /var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/Squeak-188.8.131.5214-src-32/bld/squeakvm -vm-sound-null -vm-display-null /var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/PostTestTrunkImage.image ../tests.st
    1 27123 27120 27120 ?           -1 S      103   0:03 /usr/bin/daemon --name=jenkins --inherit --env=JENKINS_HOME=/var/lib/jenkins --output=/var/log/jenkins/jenkins.log --pidfile=/var/run/jenkins/jenkins.pid -- /usr/bin/java -jar /usr/share/jenkins/jenkins.war --webroot=/var/cache/jenkins/war --httpPort=8080 --ajp13Port=-1
27123 27125 27120 27120 ?           -1 Sl     103 174:47  \_ /usr/bin/java -jar /usr/share/jenkins/jenkins.war --webroot=/var/cache/jenkins/war --httpPort=8080 --ajp13Port=-1
There are two things that look interesting here. First, process 12463
was started by 12249 (the ruby interpreter process). It has exited, but
its parent process (12249) has not yet cleaned up after it (see "man 2 wait"
for background on how this works in unix). This may indicate some issues
with the way ruby manages its child processes. Or it may not be anything
significant at all; I can't say for sure.
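For anyone unfamiliar with the wait(2) mechanism, here is a minimal Ruby
sketch of it. This is not taken from the build scripts: the method names are
made up for illustration and the child command is a stand-in. It only shows
that an exited child stays in the process table (STAT "Z", <defunct>) until
its parent collects the exit status, and two ways a parent can do that.

```ruby
# Sketch only; method names are hypothetical and not from the build scripts.

# Spawn a child, then block until it exits and collect its status.
# Until the parent waits, an exited child is listed by ps as <defunct>.
def run_and_reap(*cmd)
  pid = Process.spawn(*cmd)
  _pid, status = Process.wait2(pid)
  status.exitstatus
end

# Spawn a child without waiting for it in the calling code.
# Process.detach starts a thread whose only job is to wait on the pid,
# so the child is still reaped when it exits.
def run_detached(*cmd)
  pid = Process.spawn(*cmd)
  Process.detach(pid)
end
```

Whether the [uname] zombie above comes from a missing wait in the build
scripts or from something inside ruby itself is not something this sketch
can tell us.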
Second, the ruby process is running the squeakvm executable by starting
another unix shell, then running squeakvm in that shell. This is the
line that says "sh -c /var/lib/jenkins/...". This may be the normal way
in which ruby runs system commands, but in this case it means that the
subprocess that ruby is monitoring ("process A") is actually that unix
shell that it started, not the squeakvm itself. So it seems likely that the
watchdog thread in the ruby process is killing the unix shell process,
but is not killing the squeakvm process.
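One possible approach, sketched below in Ruby under stated assumptions, is to
start the command in its own process group and have the watchdog signal the
whole group, so that the intermediate shell and whatever it launched are
signalled together. The method name and return values are hypothetical, the
240 second default simply mirrors the figure quoted above, and this is not
the code currently in the build scripts.

```ruby
require "timeout"

# Sketch only; run_with_watchdog is a made-up name for illustration.
# Runs cmd in a new process group and kills the whole group on timeout.
def run_with_watchdog(cmd, limit = 240)
  # pgroup: true makes the child the leader of a new process group,
  # so the group id is the same as the child's pid.
  pid = Process.spawn(cmd, pgroup: true)
  begin
    Timeout.timeout(limit) { Process.wait(pid) }
    :finished
  rescue Timeout::Error
    # A signal name starting with "-" tells Process.kill to signal the
    # process group rather than the single process.
    Process.kill("-KILL", pid)
    # Reap the direct child (the shell) so it does not linger as <defunct>.
    Process.wait(pid)
    :killed
  end
end
```

This only helps if the VM stays in the process group it was started in. A
different option would be to pass the command to ruby as separate arguments
so that no intermediate "sh -c" is involved at all, but I have not tried
either against the real build.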
I am not sure of the best way to address this, but at least it gives us
an idea of where the problem is originating.
> You can see in the build logs that the thread does attempt to kill the
> process - look for "!!!" - and yet the tests keep on rolling.
> What would be useful is finding out to which builds the squeakvm
> processes belong. We ought to be able to do that quite easily since
> each squeakvm process will include the job name in its path.
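As a rough illustration of that idea, a small Ruby sketch that pulls the job
name out of a squeakvm command line. It assumes the workspaces live under
/var/lib/jenkins/workspace/<job>/ as in the listing above; the constant and
method names are made up for this example.

```ruby
# Sketch only; names are hypothetical.
# Assumes Jenkins workspaces are at /var/lib/jenkins/workspace/<job>/.
WORKSPACE = %r{/var/lib/jenkins/workspace/([^/]+)/}

# Return the job name for a squeakvm command line, or nil otherwise.
def job_for(command_line)
  return nil unless command_line.include?("squeakvm")
  m = WORKSPACE.match(command_line)
  m && m[1]
end

# Map pid => job name, given lines of the form "PID COMMAND ARGS...".
def jobs_by_pid(ps_lines)
  ps_lines.each_with_object({}) do |line, acc|
    pid, cmd = line.strip.split(/\s+/, 2)
    job = cmd && job_for(cmd)
    acc[pid.to_i] = job if job
  end
end

# Possible usage on the build box:
#   jobs_by_pid(`ps -u jenkins -o pid=,args=`.lines)
```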
> I did just discover a typo in the kill-it thread which would result in
> the "nil doesn't understand the #puts method" error disappearing and
> nothing happening. I've just committed that change and rerun
> SqueakTrunk. Hopefully that will help us.
> > Currently there is one Jenkins job running, and one squeakvm process corresponding
> > to that job. I'll look again in a day or so and see how many runaway processes
> > may have come back.
> > Dave