[Box-Admins] Runaway ruby process on box3 (was: [squeak-dev] trunk down again)

David T. Lewis lewis at mail.msen.com
Fri Nov 8 23:06:42 UTC 2013


On Wed, Nov 06, 2013 at 10:02:38PM -0600, Chris Muller wrote:
> Trunk stopped responding again.  box2 might need to be rebooted again.

We have a ruby process running on box3 that is consuming all available CPU,
that has been reparented to init, and that has been running for a long time.

  jenkins@box3-squeak:~$ ps -aef | grep ruby
  jenkins  19054 18955  0 22:48 pts/1    00:00:00 grep ruby
  jenkins  28923     1 87 Nov06 ?        1-19:18:12 /var/lib/jenkins/.rvm/rubies/ruby-1.9.3-p392/bin/ruby -S rspec test/image_test.rb
  jenkins@box3-squeak:~$ ps -l -p 28923
  F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
  0 R   103 28923     1 87  80   0 -  5205 -      ?        1-19:18:19 ruby
  jenkins@box3-squeak:~$ top -p 28923 -b -n 1
  top - 22:48:47 up 199 days,  7:38,  2 users,  load average: 7.11, 7.11, 7.16
  Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
  Cpu(s):  2.0%us,  0.2%sy,  2.0%ni, 95.3%id,  0.4%wa,  0.0%hi,  0.0%si,  0.1%st
  Mem:   1032140k total,  1009656k used,    22484k free,    74156k buffers
  Swap:   524280k total,    10732k used,   513548k free,   209540k cached
  
    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  28923 jenkins   20   0 20820  12m  828 R 87.0  1.3   2598:23 ruby
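
The same check (a process owned by jenkins, reparented to init, and using a
lot of CPU) can be sketched as a small filter over ps output. This is only a
sketch: filter_orphans is a hypothetical helper name, the 50% threshold is
arbitrary, and it assumes a procps-style ps whose columns are requested in
the order user, pid, ppid, %cpu.

```shell
# Hypothetical helper: read `ps` output on stdin (header line first, then
# columns user, pid, ppid, %cpu, ...) and print the PIDs of processes that
# belong to the given user, have PPID 1 (reparented to init), and use at
# least the given percentage of CPU.
filter_orphans() {
    user="$1"
    min_cpu="$2"
    awk -v user="$user" -v min="$min_cpu" \
        'NR > 1 && $1 == user && $3 == 1 && $4 + 0 >= min + 0 { print $2 }'
}
```

It would be fed from something like
"ps -eo user,pid,ppid,pcpu,etime,args | filter_orphans jenkins 50".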

This appears to be a process that got disconnected from one of our Jenkins
jobs and has been stuck burning CPU for the last couple of days. That also
happens to be roughly the time frame in which our source.squeak.org service
got hung up. The Jenkins jobs (e.g. SqueakTrunk) are interacting with
source.squeak.org, so it is possible that the two problems are related.

I noticed this because the InterpreterVM and CogVM jobs are failing after
their watchdog timers expire, but the actual jobs succeed if I run them on
my own local PC. Those jobs run Squeak at low priority (nice), so it is
possible that their failures are due to the runaway ruby job consuming all
available CPU and starving them until the watchdog fires.
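
To illustrate the failure mode, here is a minimal sketch of a low-priority
command run under a watchdog. It is not the actual Jenkins job configuration:
run_with_watchdog is a hypothetical name, and it assumes the coreutils
timeout and nice commands are available.

```shell
# Hypothetical wrapper: run a command at the lowest scheduling priority
# (nice 19) and terminate it if it has not finished within LIMIT seconds.
# coreutils `timeout` exits with status 124 when the limit is reached.
run_with_watchdog() {
    limit="$1"
    shift
    timeout "$limit" nice -n 19 "$@"
}
```

A job that would finish quickly on an idle machine can still hit the limit
if a higher-priority process keeps the CPU busy, which would show up as a
watchdog failure rather than as an error in the job itself.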

I have not killed the runaway process yet, in case anyone wants to have a
look at it first.

Dave


