[Box-Admins] Runaway ruby process on box3 (was: [squeak-dev] trunk down again)

David T. Lewis lewis at mail.msen.com
Sat Nov 9 02:16:52 UTC 2013


Thanks Ken,

The runaway process is producing no strace output at all, so it is not
making any system calls. According to /proc/28923/status there is a very
large amount of context switching going on:

  nonvoluntary_ctxt_switches:     61281919

And the process has two active threads.
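
For anyone who wants to pull the same numbers from another process, here is a
minimal Python sketch of reading the thread count and context switch counters
out of /proc/<pid>/status. It is an illustration only, not something that was
run on box3. The function names are hypothetical, and it assumes a Linux /proc
filesystem where each line of the status file has the form "Key:<tab>value".

```python
from pathlib import Path

# Fields of interest in /proc/<pid>/status (documented in proc(5)).
FIELDS = ("Threads", "voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")


def parse_proc_status(text):
    """Parse the 'Key:<tab>value' lines of a status file into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip any line that has no colon
            fields[key.strip()] = value.strip()
    return fields


def scheduling_summary(text):
    """Return the thread count and context switch counters as integers."""
    fields = parse_proc_status(text)
    return {name: int(fields[name]) for name in FIELDS if name in fields}


def read_status(pid):
    """Read /proc/<pid>/status as text (hypothetical helper, Linux only)."""
    return Path(f"/proc/{pid}/status").read_text()
```

Calling scheduling_summary(read_status(28923)) on the affected host would
return a dict with those three fields. A large nonvoluntary count together
with no system calls is consistent with a CPU-bound loop that only leaves the
CPU when the scheduler preempts it.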

I'm going to kill the stuck process and see if that frees up some resources
for the other Jenkins jobs.

Dave

On Fri, Nov 08, 2013 at 05:35:47PM -0700, Ken Causey wrote:
> strace and strace64 are now installed on box3.  Of course anyone with
> sudo access could have done the same.
> 
> Ken
> 
> > -------- Original Message --------
> > Subject: Re: [Box-Admins] Runaway ruby process on box3 (was:
> > [squeak-dev] trunk down again)
> > From: "David T. Lewis" <lewis at mail.msen.com>
> > Date: Fri, November 08, 2013 6:23 pm
> > To: Squeak Hosting Support <box-admins at lists.squeakfoundation.org>
> > 
> > 
> > On Fri, Nov 08, 2013 at 11:16:34PM +0000, Frank Shearar wrote:
> > > On 8 November 2013 23:06, David T. Lewis <lewis at mail.msen.com> wrote:
> > > > On Wed, Nov 06, 2013 at 10:02:38PM -0600, Chris Muller wrote:
> > > >> Trunk stopped responding again.  box2 might need to be rebooted again.
> > > >
> > > > We have a ruby process running on box3 that is consuming all available CPU,
> > > > that has been reparented to init, and that has been running for a long time.
> > > >
> > > >   jenkins at box3-squeak:~$ ps -aef | grep ruby
> > > >   jenkins  19054 18955 0 22:48 pts/1    00:00:00 grep ruby
> > > >   jenkins  28923     1 87 Nov06 ?        1-19:18:12 /var/lib/jenkins/.rvm/rubies/ruby-1.9.3-p392/bin/ruby -S rspec test/image_test.rb
> > > >   jenkins at box3-squeak:~$ ps -l -p 28923
> > > >   F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
> > > >   0 R   103 28923     1 87  80   0 -  5205 -      ?        1-19:18:19 ruby
> > > >   jenkins at box3-squeak:~$ top -p 28923 -b -n 1
> > > >   top - 22:48:47 up 199 days,  7:38,  2 users,  load average: 7.11, 7.11, 7.16
> > > >   Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> > > >   Cpu(s):  2.0%us,  0.2%sy,  2.0%ni, 95.3%id,  0.4%wa,  0.0%hi,  0.0%si,  0.1%st
> > > >   Mem:   1032140k total,  1009656k used,    22484k free,    74156k buffers
> > > >   Swap:   524280k total,    10732k used,   513548k free,   209540k cached
> > > >
> > > >     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > > >   28923 jenkins   20   0 20820  12m  828 R 87.0  1.3   2598:23 ruby
> > > >
> > > > This appears to be a process that got disconnected from one of our Jenkins
> > > > jobs and has been stuck burning cpu for the last couple of days. That also
> > > > happens to be roughly the time frame in which our source.squeak.org service
> > > > got hung up. The Jenkins jobs (e.g. SqueakTrunk) are interacting with
> > > > source.squeak.org, so it is possible that the two problems are related.
> > > >
> > > > I noticed this because the InterpreterVM and CogVM jobs are failing after
> > > > their watchdog timers expire, but the actual jobs succeed if I run them on
> > > > my own local PC. Those jobs run Squeak at low priority (nice) and it is
> > > > possible that their failures are due to the runaway ruby job consuming all
> > > > available resources.
> > > >
> > > > I have not killed the runaway process yet, in case anyone wants to have a
> > > > look at it first.
> > > 
> > > I would be happy if you attached strace to it, collected some data,
> > > and then killed it. It's a runaway SqueakTrunk job. (Well, it's clear
> > > you know that, but I just had to point it out.) Hopefully the strace
> > > would give enough clues to find what looks like a tight loop...
> > >
> > 
> > We don't have strace installed on box3, sorry.
> > 
> > Dave

