Trunk stopped responding again. box2 might need to be rebooted again.
On Wed, Nov 06, 2013 at 10:02:38PM -0600, Chris Muller wrote:
Trunk stopped responding again. box2 might need to be rebooted again.
We have a ruby process running on box3 that is consuming all available CPU, that has been reparented to init, and that has been running for a long time.
jenkins@box3-squeak:~$ ps -aef | grep ruby
jenkins  19054 18955  0 22:48 pts/1    00:00:00 grep ruby
jenkins  28923     1 87 Nov06 ?        1-19:18:12 /var/lib/jenkins/.rvm/rubies/ruby-1.9.3-p392/bin/ruby -S rspec test/image_test.rb
jenkins@box3-squeak:~$ ps -l -p 28923
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 R   103 28923     1 87  80   0 -  5205 -      ?      1-19:18:19 ruby
jenkins@box3-squeak:~$ top -p 28923 -b -n 1
top - 22:48:47 up 199 days, 7:38, 2 users, load average: 7.11, 7.11, 7.16
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.0%us,  0.2%sy,  2.0%ni, 95.3%id,  0.4%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:   1032140k total,  1009656k used,    22484k free,    74156k buffers
Swap:   524280k total,    10732k used,   513548k free,   209540k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28923 jenkins   20   0 20820  12m  828 R 87.0  1.3   2598:23 ruby
This appears to be a process that got disconnected from one of our Jenkins jobs and has been stuck burning CPU for the last couple of days. That also happens to be roughly the time frame in which our source.squeak.org service got hung up. The Jenkins jobs (e.g. SqueakTrunk) are interacting with source.squeak.org, so it is possible that the two problems are related.
I noticed this because the InterpreterVM and CogVM jobs are failing after their watchdog timers expire, but the actual jobs succeed if I run them on my own local PC. Those jobs run Squeak at low priority (nice), so it is possible that their failures are due to the runaway ruby job consuming all available resources.
I have not killed the runaway process yet, in case anyone wants to have a look at it first.
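For anyone who wants to poke at it before it goes away, something along these lines should show whether it is crowding out the niced Squeak jobs (a rough sketch, not run on box3; 28923 is the PID from the output above):

ps -eo pid,ni,pcpu,etime,comm --sort=-pcpu | head    # the runaway ruby at NI 0 should sit at the top, above the niced squeak processes
ps -o pid,ppid,ni,pcpu,stat,etime,args -p 28923      # confirm it is still reparented to init (PPID 1) and running flat out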
Dave
On 8 November 2013 23:06, David T. Lewis lewis@mail.msen.com wrote:
On Wed, Nov 06, 2013 at 10:02:38PM -0600, Chris Muller wrote:
Trunk stopped responding again. box2 might need to be rebooted again.
We have a ruby process running on box3 that is consuming all available CPU, that has been reparented to init, and that has been running for a long time.
jenkins@box3-squeak:~$ ps -aef | grep ruby
jenkins  19054 18955  0 22:48 pts/1    00:00:00 grep ruby
jenkins  28923     1 87 Nov06 ?        1-19:18:12 /var/lib/jenkins/.rvm/rubies/ruby-1.9.3-p392/bin/ruby -S rspec test/image_test.rb
jenkins@box3-squeak:~$ ps -l -p 28923
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 R   103 28923     1 87  80   0 -  5205 -      ?      1-19:18:19 ruby
jenkins@box3-squeak:~$ top -p 28923 -b -n 1
top - 22:48:47 up 199 days, 7:38, 2 users, load average: 7.11, 7.11, 7.16
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.0%us,  0.2%sy,  2.0%ni, 95.3%id,  0.4%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:   1032140k total,  1009656k used,    22484k free,    74156k buffers
Swap:   524280k total,    10732k used,   513548k free,   209540k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28923 jenkins   20   0 20820  12m  828 R 87.0  1.3   2598:23 ruby
This appears to be a process that got disconnected from one of our Jenkins jobs and has been stuck burning CPU for the last couple of days. That also happens to be roughly the time frame in which our source.squeak.org service got hung up. The Jenkins jobs (e.g. SqueakTrunk) are interacting with source.squeak.org, so it is possible that the two problems are related.
I noticed this because the InterpreterVM and CogVM jobs are failing after their watchdog timers expire, but the actual jobs succeed if I run them on my own local PC. Those jobs run Squeak at low priority (nice), so it is possible that their failures are due to the runaway ruby job consuming all available resources.
I have not killed the runaway process yet, in case anyone wants to have a look at it first.
I would be happy if you attached strace to it, collected some data, and then killed it. It's a runaway SqueakTrunk job. (Well, it's clear you know that, but I just had to point it out.) Hopefully the strace would give enough clues to find what looks like a tight loop...
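Roughly like this, I'd imagine (just a sketch; the PID is from Dave's output and the output path is arbitrary):

strace -tt -f -p 28923 -o /tmp/ruby-28923.strace
# let it collect for a minute or two, then Ctrl-C to detach
tail -100 /tmp/ruby-28923.strace   # a tight loop should show as the same few syscalls repeating, or nothing at all if it spins purely in user space
kill 28923                         # and only then put it out of its misery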
frank
Dave
On Fri, Nov 08, 2013 at 11:16:34PM +0000, Frank Shearar wrote:
On 8 November 2013 23:06, David T. Lewis lewis@mail.msen.com wrote:
On Wed, Nov 06, 2013 at 10:02:38PM -0600, Chris Muller wrote:
Trunk stopped responding again. box2 might need to be rebooted again.
We have a ruby process running on box3 that is consuming all available CPU, that has been reparented to init, and that has been running for a long time.
jenkins@box3-squeak:~$ ps -aef | grep ruby
jenkins  19054 18955  0 22:48 pts/1    00:00:00 grep ruby
jenkins  28923     1 87 Nov06 ?        1-19:18:12 /var/lib/jenkins/.rvm/rubies/ruby-1.9.3-p392/bin/ruby -S rspec test/image_test.rb
jenkins@box3-squeak:~$ ps -l -p 28923
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 R   103 28923     1 87  80   0 -  5205 -      ?      1-19:18:19 ruby
jenkins@box3-squeak:~$ top -p 28923 -b -n 1
top - 22:48:47 up 199 days, 7:38, 2 users, load average: 7.11, 7.11, 7.16
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.0%us,  0.2%sy,  2.0%ni, 95.3%id,  0.4%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:   1032140k total,  1009656k used,    22484k free,    74156k buffers
Swap:   524280k total,    10732k used,   513548k free,   209540k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28923 jenkins   20   0 20820  12m  828 R 87.0  1.3   2598:23 ruby
This appears to be a process that got disconnected from one of our Jenkins jobs and has been stuck burning CPU for the last couple of days. That also happens to be roughly the time frame in which our source.squeak.org service got hung up. The Jenkins jobs (e.g. SqueakTrunk) are interacting with source.squeak.org, so it is possible that the two problems are related.
I noticed this because the InterpreterVM and CogVM jobs are failing after their watchdog timers expire, but the actual jobs succeed if I run them on my own local PC. Those jobs run Squeak at low priority (nice), so it is possible that their failures are due to the runaway ruby job consuming all available resources.
I have not killed the runaway process yet, in case anyone wants to have a look at it first.
I would be happy if you attached strace to it, collected some data, and then killed it. It's a runaway SqueakTrunk job. (Well, it's clear you know that, but I just had to point it out.) Hopefully the strace would give enough clues to find what looks like a tight loop...
We don't have strace installed on box3, sorry.
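If it would still be useful, something like this would at least capture what /proc exposes before the process goes away (a sketch, not yet run on box3):

cat /proc/28923/status                    # state, memory usage, pending/blocked signals
ls -l /proc/28923/fd                      # open file descriptors (files, pipes, sockets)
tr '\0' ' ' < /proc/28923/cmdline; echo   # full command line
kill 28923; sleep 5
kill -0 28923 2>/dev/null && kill -9 28923   # escalate to KILL only if TERM is ignored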
Dave