[Box-Admins] Build.squeak.org jobs stuck

Frank Shearar frank.shearar at gmail.com
Thu Nov 21 13:06:35 UTC 2013


On 21 November 2013 12:35, David T. Lewis <lewis at mail.msen.com> wrote:
> On Thu, Nov 21, 2013 at 10:11:20AM +0000, Frank Shearar wrote:
>> On 21 November 2013 00:47, David T. Lewis <lewis at mail.msen.com> wrote:
>> > On Wed, Nov 20, 2013 at 05:39:35PM -0600, Ken Causey wrote:
>> >> I would really appreciate it if someone with Jenkins expertise would
>> >> look at the situation with build.squeak.org (aka box3).  Multiple jobs
>> >> are stuck and some date back to the 17th.  I would just kill them but it
>> >> seems to me the problem will be right back in a matter of hours or no
>> >> more than a day.  If nothing else maybe the problem job(s) can simply be
>> >> suspended temporarily.
>> >>
>> >> Ken
>> >
>> > I don't know the underlying problem, but there were lots of ruby and squeakvm
>> > processes reparented to root, and no clear indication of what is going wrong.
>> > I killed off as many of the reparented processes as I could find.
>>
>> The underlying problem is this:
>> * rake starts running a build
>> * it spawns a process to fire up a Squeak image, running tests. Call this A.
>> * it also spawns a thread that will, after "too long" has passed -
>> 240s by default - theoretically
>> ** send a USR1 to A
>> ** dump the pstree info for A (extra debug info while we try get this
>> process working reliably)
>> ** send a KILL to A
>> * and yet the squeakvm process is not killed
>>
>
> This analysis sounds right to me.
>
> The command "ps xjf -u jenkins" will show a process tree that gives some insight
> as to what is going on. We currently have one runaway condition, and it looks like
> this:
>
>
> dlewis at box3-squeak:~$ ps xjf -u jenkins
> PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
>  1081  1084  1081  1081 ?           -1 S     1003   0:00 sshd: davidlewis at pts/0
>  1084  1085  1085  1085 pts/0     1136 Ss    1003   0:00  \_ -bash
>  1085  1136  1136  1085 pts/0     1136 R+    1003   0:00      \_ ps xjf -u jenkins
>     1 12249 27120 27120 ?           -1 Sl     103   0:00 /var/lib/jenkins/.rvm/rubies/ruby-1.9.3-p392/bin/ruby -S rspec test/image_test.rb
> 12249 12463 27120 27120 ?           -1 Z      103   0:00  \_ [uname] <defunct>
> 12249 12484 27120 27120 ?           -1 S      103   0:00  \_ sh -c /var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/Squeak-4.10.2.2614-src-32/bld/squeak.sh -vm-sound-null -vm-display-null "/var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/PostTestTrunkImage.image" ../tests.st && echo command 1 finished
> 12484 12487 27120 27120 ?           -1 S      103   9:32      \_ /var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/Squeak-4.10.2.2614-src-32/bld/squeakvm -vm-sound-null -vm-display-null /var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/PostTestTrunkImage.image ../tests.st
>     1 27123 27120 27120 ?           -1 S      103   0:03 /usr/bin/daemon --name=jenkins --inherit --env=JENKINS_HOME=/var/lib/jenkins --output=/var/log/jenkins/jenkins.log --pidfile=/var/run/jenkins/jenkins.pid -- /usr/bin/java -jar /usr/share/jenkins/jenkins.war --webroot=/var/cache/jenkins/war --httpPort=8080 --ajp13Port=-1
> 27123 27125 27120 27120 ?           -1 Sl     103 174:47  \_ /usr/bin/java -jar /usr/share/jenkins/jenkins.war --webroot=/var/cache/jenkins/war --httpPort=8080 --ajp13Port=-1
>
> There are two things that look interesting here. First, process 12463
> was started by 12249 (the ruby interpreter process). It has exited, but
> its parent process (12249) has not yet cleaned up after it (see "man 2 wait"
> for background on how this works in unix). This may indicate some issues
> with the way ruby manages its child processes. Or it may not be anything
> significant at all, I can't say for sure.
>
> Second, the ruby process is running the squeakvm executable by starting
> another unix shell, then running squeakvm in that shell. This is the
> line that says "sh -c /var/lib/jenkins/...". This may be the normal way
> in which ruby runs system commands, but in this case it means that the
> subprocess that ruby is monitoring ("process A") is actually that unix
> shell that it started, not the squeakvm itself. So it seems likely that
> watchdog thread in the ruby process is killing the unix shell process,
> but is not killing the squeakvm process.

Right - indeed, I'm shelling out to run the squeakvm, and you've just
confirmed my suspicion that I don't know what I'm doing :). So when I
spawn the process, Ruby's giving me the pid of the shell, and not the
pid of the squeakvm. Thus when I USR1 the pid, I don't get the thing I
want. I'd hoped that pstree would tell me what you just told me; pity
the typo ensured that that line was never reached.

frank

> I am not sure the best way to address this, but at least it gives us
> an idea of where the problem is originating.
>
> Dave
>
>> You can see in the build logs that the thread does attempt to kill the
>> process - look for "!!!" - and yet the tests keep on rolling.
>>
>> What would be useful is finding out to which builds the squeakvm
>> processes belong. We ought to be able to do that quite easily since
>> each squeakvm process will include the job name in its path.
>>
>> I did just discover a typo in the kill-it thread which would result in
>> the "nil doesn't understand the #puts method" error disappearing and
>> nothing happening. I've just committed that change and rerun
>> SqueakTrunk. Hopefully that will help us.
>>
>> > Currently there is one Jenkins job running, and one squeakvm process corresponding
>> > to that job. I'll look again in a day or so and see how many runaway processes
>> > may have come back.
>>
>> Ditto.
>>
>> frank
>>
>> > Dave
>> >


More information about the Box-Admins mailing list