[Box-Admins] Runaway ruby process on box3 (was: [squeak-dev] trunk down again)

Sat Nov 9 00:23:17 UTC 2013

On Fri, Nov 08, 2013 at 11:16:34PM +0000, Frank Shearar wrote:
> On 8 November 2013 23:06, David T. Lewis <lewis at mail.msen.com> wrote:
> > On Wed, Nov 06, 2013 at 10:02:38PM -0600, Chris Muller wrote:
> >> Trunk stopped responding again.  box2 might need to be rebooted again.
> >
> > We have a ruby process running on box3 that is consuming all available CPU,
> > that has been reparented to init, and that has been running for a long time.
> >
> >   jenkins at box3-squeak:~$ ps -aef | grep ruby
> >   jenkins  19054 18955 0 22:48 pts/1    00:00:00 grep ruby
> >   jenkins  28923     1 87 Nov06 ?        1-19:18:12 /var/lib/jenkins/.rvm/rubies/ruby-1.9.3-p392/bin/ruby -S rspec test/image_test.rb
> >   jenkins at box3-squeak:~$ ps -l -p 28923
> >   F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
> >   0 R   103 28923     1 87  80   0 -  5205 -      ?        1-19:18:19 ruby
> >   jenkins at box3-squeak:~$ top -p 28923 -b -n 1
> >   top - 22:48:47 up 199 days,  7:38,  2 users,  load average: 7.11, 7.11, 7.16
> >   Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> >   Cpu(s):  2.0%us,  0.2%sy,  2.0%ni, 95.3%id,  0.4%wa,  0.0%hi,  0.0%si,  0.1%st
> >   Mem:   1032140k total,  1009656k used,    22484k free,    74156k buffers
> >   Swap:   524280k total,    10732k used,   513548k free,   209540k cached
> >
> >     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >   28923 jenkins   20   0 20820  12m  828 R 87.0  1.3   2598:23 ruby
> >
> > This appears to be a process that got disconnected from one of our Jenkins
> > jobs and has been stuck burning CPU for the last couple of days. That also
> > happens to be roughly the time frame in which our source.squeak.org service
> > got hung up. The Jenkins jobs (e.g. SqueakTrunk) are interacting with
> > source.squeak.org, so it is possible that the two problems are related.
> >
> > I noticed this because the InterpreterVM and CogVM jobs are failing after
> > their watchdog timers expire, but the actual jobs succeed if I run them on
> > my own local PC. Those jobs run Squeak at low priority (nice) and it is
> > possible that their failures are due to the runaway ruby job consuming all
> > available resources.
> >
> > I have not killed the runaway process yet, in case anyone wants to have a
> > look at it first.
> 
> I would be happy if you attached strace to it, collected some data,
> and then killed it. It's a runaway SqueakTrunk job. (Well, it's clear
> you know that, but I just had to point it out.) Hopefully the strace
> would give enough clues to find what looks like a tight loop...
>

We don't have strace installed on box3, sorry.
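
For what it's worth, some of the same clues could probably be collected
from /proc instead. The following is only a rough sketch, assuming a Linux
/proc filesystem; proc_snapshot is a hypothetical helper name, not something
that exists on box3, and it has not been run against the process above.

```shell
#!/bin/sh
# Sketch: sample a process's state from /proc when strace is not available.
# Assumes Linux; proc_snapshot is an illustrative name only.

# Print state, parent PID, and accumulated user/system CPU ticks for one PID.
proc_snapshot() {
    pid="$1"
    [ -r "/proc/$pid/stat" ] || { echo "no such process: $pid" >&2; return 1; }
    # Drop everything up to the last ") " so that a command name containing
    # spaces or parentheses cannot shift the remaining fields.
    rest=$(sed 's/^.*) //' "/proc/$pid/stat")
    set -- $rest
    # After the strip: $1=state, $2=ppid, $12=utime, $13=stime (clock ticks).
    echo "pid=$pid state=$1 ppid=$2 utime=${12} stime=${13}"
}

# Example (not executed here): take two samples a few seconds apart.
#   proc_snapshot 28923; sleep 5; proc_snapshot 28923
```

If utime climbs between two samples while stime stays flat, the loop is
most likely spinning in user space without making system calls, in which
case strace would not have shown much anyway.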

Dave