[Box-Admins] Runaway ruby process on box3 (was: [squeak-dev] trunk down again)

Frank Shearar frank.shearar at gmail.com
Sat Nov 9 19:45:31 UTC 2013


On 9 November 2013 17:49, David T. Lewis <lewis at mail.msen.com> wrote:
> On Sat, Nov 09, 2013 at 09:58:33AM +0000, Frank Shearar wrote:
>> Any process like that - with /var/lib/jenkins/workspace/ in its path -
>> that has been running for more than an hour is a renegade and should
>> be shot on sight.
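
The rule of thumb quoted above (a process with /var/lib/jenkins/workspace/ in its path that has run for more than an hour) could be checked mechanically. What follows is only a sketch, not something taken from the actual box3 setup: the helper name `renegade_pids` is hypothetical, and it assumes a procps-ng `ps` that supports the `etimes` (elapsed seconds) column.

```ruby
# Hypothetical helper, not part of the Jenkins jobs discussed in this thread.
# Takes the text printed by `ps -eo pid=,etimes=,args=` (assumes procps-ng,
# where `etimes` is the elapsed running time in seconds) and returns the pids
# of processes whose command line contains `prefix` and that are older than
# `max_age_s` seconds.
def renegade_pids(ps_output, prefix: "/var/lib/jenkins/workspace/", max_age_s: 3600)
  ps_output.each_line.map do |line|
    pid, etimes, args = line.strip.split(/\s+/, 3)
    next nil if args.nil?                      # skip blank or malformed lines
    next nil unless args.include?(prefix)      # only jobs under the workspace
    next nil unless etimes.to_i > max_age_s    # only long-running ones
    pid.to_i
  end.compact
end
```

On the box itself this might be driven by something like `renegade_pids(%x(ps -eo pid=,etimes=,args=))`, with a human deciding what to do with the resulting pids.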
>
> Agreed. I am mainly just reporting these things so that all of us on the
> list are aware of issues as they arise. I don't think that any specific
> follow-up is required for any of them, aside from keeping them in mind
> as we move forward. I have a vague hunch - possibly wrong - that there
> is some interaction between the issues we saw on source.squeak.org and
> the issues that showed up on build.squeak.org. I have no idea of the
> root cause(s) but I expect that if we pay attention to the symptoms
> we'll put the pieces of the puzzle together eventually.

I can well imagine an interaction between the two, since builds
typically follow a pattern of update-some-base-image, test-that-image.
"update-some-base-image" of course involves source.squeak.org quite
heavily.

frank

> Dave
>
>>
>> I expect to see two threads in the Ruby process: one for running the
>> tests, and a watchdog that is _supposed_ to kill long-running jobs.
>>
>> As it happens, I realised only yesterday that if you nice something in
>> Ruby, through, say, spawn("nice echo 1"), you get the pid of the
>> _nice_, not the pid of the _echo_. If you replace "echo" with "invoke
>> a squeak", then you can't SIGUSR1 the Squeak because you don't know
>> its pid. I removed the nice-ing, in the hope that with the actual pid
>> of the squeak process (as opposed to the pid of the nice, its parent),
>> I could more reliably kill these jobs.
>>
>> frank
>>
>>
>> On 9 November 2013 04:44, David T. Lewis <lewis at mail.msen.com> wrote:
>> > I killed the runaway ruby process, and confirmed that the InterpreterVM job
>> > once again runs successfully.
>> >
>> > In addition to the ruby process, there are four reparented squeakvm processes
>> > that have been running for a couple of days, although not consuming much CPU.
>> > I will kill these also. The four processes are:
>> >
>> > UID        PID  PPID  C STIME TTY          TIME CMD
>> > jenkins   7983     1  0 Nov07 ?        00:21:01 /var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/Squeak-4.10.2.2614-src-32/bld/squeakvm -vm-sound-null -vm-display-null /var/lib/jenkins/workspac
>> > jenkins   8861     1  0 Nov07 ?        00:20:46 /var/lib/jenkins/workspace/ExternalPackages/target/Squeak-4.10.2.2614-src-32/bld/squeakvm -vm-sound-null -vm-display-null /var/lib/jenkins/workspace/
>> > jenkins   8972     1  0 Nov07 ?        00:21:06 /var/lib/jenkins/workspace/ExternalPackages/target/Squeak-4.10.2.2614-src-32/bld/squeakvm -vm-sound-null -vm-display-null /var/lib/jenkins/workspace/
>> > jenkins  19136     1  0 Nov07 ?        00:15:27 /var/lib/jenkins/workspace/ExternalPackages-Squeak4.3/target/Squeak-4.10.2.2614-src-32/bld/squeakvm -vm-sound-null -vm-display-null /var/lib/jenkins/
>> >
>> > Dave
>> >
>> >
>> > On Fri, Nov 08, 2013 at 09:16:52PM -0500, David T. Lewis wrote:
>> >> Thanks Ken,
>> >>
>> >> The runaway process is producing no strace output at all, so it is not
>> >> making any system calls. According to /proc/28923/status there is a very
>> >> large amount of context switching going on:
>> >>
>> >>   nonvoluntary_ctxt_switches:     61281919
>> >>
>> >> And the process has two active threads.
>> >>
>> >> I'm going to kill the stuck process and see if that clears up some resource
>> >> for the other Jenkins jobs.
>> >>
>> >> Dave
>> >>
>> >> On Fri, Nov 08, 2013 at 05:35:47PM -0700, Ken Causey wrote:
>> >> > strace and strace64 are now installed on box3.  Of course anyone with
>> >> > sudo access could have done the same.
>> >> >
>> >> > Ken
>> >> >
>> >> > > -------- Original Message --------
>> >> > > Subject: Re: [Box-Admins] Runaway ruby process on box3 (was:
>> >> > > [squeak-dev] trunk down again)
>> >> > > From: "David T. Lewis" <lewis at mail.msen.com>
>> >> > > Date: Fri, November 08, 2013 6:23 pm
>> >> > > To: Squeak Hosting Support <box-admins at lists.squeakfoundation.org>
>> >> > >
>> >> > >
>> >> > > On Fri, Nov 08, 2013 at 11:16:34PM +0000, Frank Shearar wrote:
>> >> > > > On 8 November 2013 23:06, David T. Lewis <lewis at mail.msen.com> wrote:
>> >> > > > > On Wed, Nov 06, 2013 at 10:02:38PM -0600, Chris Muller wrote:
>> >> > > > >> Trunk stopped responding again.  box2 might need to be rebooted again.
>> >> > > > >
>> >> > > > > We have a ruby process running on box3 that is consuming all available CPU,
>> >> > > > > that has been reparented to init, and that has been running for a long time.
>> >> > > > >
>> >> > > > >   jenkins at box3-squeak:~$ ps -aef | grep ruby
>> >> > > > >   jenkins  19054 18955 0 22:48 pts/1    00:00:00 grep ruby
>> >> > > > >   jenkins  28923     1 87 Nov06 ?        1-19:18:12 /var/lib/jenkins/.rvm/rubies/ruby-1.9.3-p392/bin/ruby -S rspec test/image_test.rb
>> >> > > > >   jenkins at box3-squeak:~$ ps -l -p 28923
>> >> > > > >   F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
>> >> > > > >   0 R   103 28923     1 87  80   0 -  5205 -      ?        1-19:18:19 ruby
>> >> > > > >   jenkins at box3-squeak:~$ top -p 28923 -b -n 1
>> >> > > > >   top - 22:48:47 up 199 days,  7:38,  2 users,  load average: 7.11, 7.11, 7.16
>> >> > > > >   Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
>> >> > > > >   Cpu(s):  2.0%us,  0.2%sy,  2.0%ni, 95.3%id,  0.4%wa,  0.0%hi,  0.0%si,  0.1%st
>> >> > > > >   Mem:   1032140k total,  1009656k used,    22484k free,    74156k buffers
>> >> > > > >   Swap:   524280k total,    10732k used,   513548k free,   209540k cached
>> >> > > > >
>> >> > > > >     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> >> > > > >   28923 jenkins   20   0 20820  12m  828 R 87.0  1.3   2598:23 ruby
>> >> > > > >
>> >> > > > > This appears to be a process that got disconnected from one of our Jenkins
>> >> > > > > jobs and has been stuck burning cpu for the last couple of days. That also
>> >> > > > > happens to be roughly the time frame in which our source.squeak.org service
>> >> > > > > got hung up. The Jenkins jobs (e.g. SqueakTrunk) are interacting with
>> >> > > > > source.squeak.org, so it is possible that the two problems are related.
>> >> > > > >
>> >> > > > > I noticed this because the InterpreterVM and CogVM jobs are failing after
>> >> > > > > their watchdog timers expire, but the actual jobs succeed if I run them on
>> >> > > > > my own local PC. Those jobs run Squeak at low priority (nice) and it is
>> >> > > > > possible that their failures are due to the runaway ruby job consuming all
>> >> > > > > available resource.
>> >> > > > >
>> >> > > > > I have not killed the runaway process yet, in case anyone wants to have a
>> >> > > > > look at it first.
>> >> > > >
>> >> > > > I would be happy if you attached strace to it, collected some data,
>> >> > > > and then killed it. It's a runaway SqueakTrunk job. (Well, it's clear
>> >> > > > you know that, but I just had to point it out.) Hopefully the strace
>> >> > > > would give enough clues to find what looks like a tight loop...
>> >> > > >
>> >> > >
>> >> > > We don't have strace installed on box3, sorry.
>> >> > >
>> >> > > Dave

