strace and strace64 are now installed on box3. Of course anyone with sudo access could have done the same.
Ken
-------- Original Message --------
Subject: Re: [Box-Admins] Runaway ruby process on box3 (was: [squeak-dev] trunk down again)
From: "David T. Lewis" <lewis@mail.msen.com>
Date: Fri, November 08, 2013 6:23 pm
To: Squeak Hosting Support <box-admins@lists.squeakfoundation.org>
On Fri, Nov 08, 2013 at 11:16:34PM +0000, Frank Shearar wrote:
On 8 November 2013 23:06, David T. Lewis <lewis@mail.msen.com> wrote:
On Wed, Nov 06, 2013 at 10:02:38PM -0600, Chris Muller wrote:
Trunk stopped responding again. box2 might need to be rebooted again.
We have a ruby process running on box3 that is consuming all available CPU, that has been reparented to init, and that has been running for a long time.
jenkins@box3-squeak:~$ ps -aef | grep ruby
jenkins  19054 18955  0 22:48 pts/1    00:00:00 grep ruby
jenkins  28923     1 87 Nov06 ?        1-19:18:12 /var/lib/jenkins/.rvm/rubies/ruby-1.9.3-p392/bin/ruby -S rspec test/image_test.rb
jenkins@box3-squeak:~$ ps -l -p 28923
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 R   103 28923     1 87  80   0 -  5205 -      ?        1-19:18:19 ruby
jenkins@box3-squeak:~$ top -p 28923 -b -n 1
top - 22:48:47 up 199 days,  7:38,  2 users,  load average: 7.11, 7.11, 7.16
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.0%us,  0.2%sy,  2.0%ni, 95.3%id,  0.4%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:   1032140k total,  1009656k used,    22484k free,    74156k buffers
Swap:   524280k total,    10732k used,   513548k free,   209540k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28923 jenkins   20   0 20820  12m  828 R 87.0  1.3  2598:23  ruby
This appears to be a process that got disconnected from one of our Jenkins jobs and has been stuck burning CPU for the last couple of days. That also happens to be roughly the time frame in which our source.squeak.org service got hung up. The Jenkins jobs (e.g. SqueakTrunk) interact with source.squeak.org, so it is possible that the two problems are related.
I noticed this because the InterpreterVM and CogVM jobs are failing after their watchdog timers expire, but the actual jobs succeed if I run them on my own local PC. Those jobs run Squeak at low priority (nice), and it is possible that their failures are due to the runaway ruby job consuming all available resources.
I have not killed the runaway process yet, in case anyone wants to have a look at it first.
I would be happy if you attached strace to it, collected some data, and then killed it. It's a runaway SqueakTrunk job. (Well, it's clear you know that, but I just had to point it out.) Hopefully the strace would give enough clues to find what looks like a tight loop...
We don't have strace installed on box3, sorry.
Dave
Thanks, Ken,
The runaway process is producing no strace output at all, so it is not making any system calls. According to /proc/28923/status there is a very large amount of context switching going on:
nonvoluntary_ctxt_switches: 61281919
And the process has two active threads.
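For the record, the numbers above came out of /proc more or less like this (a rough Ruby sketch rather than exactly what I typed; pid 28923 is the runaway process, and the figures in the comments are the ones quoted in this mail):

  pid = 28923
  status = File.read("/proc/#{pid}/status")
  puts status.lines.grep(/^Threads:|^nonvoluntary_ctxt_switches:/)
  #=> Threads:                      2
  #=> nonvoluntary_ctxt_switches:   61281919

  # One entry per thread under task/ confirms the thread count:
  puts Dir.glob("/proc/#{pid}/task/*").length   #=> 2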
I'm going to kill the stuck process and see if that clears up some resource for the other Jenkins jobs.
Dave
I killed the runaway ruby process, and confirmed that the InterpreterVM job once again runs successfully.
In addition to the ruby process, there are four reparented squeakvm processes that have been running for a couple of days, although not consuming much CPU. I will kill these also. The four processes are:
UID        PID  PPID  C STIME TTY          TIME CMD
jenkins   7983     1  0 Nov07 ?        00:21:01 /var/lib/jenkins/workspace/ReleaseSqueakTrunk/target/Squeak-4.10.2.2614-src-32/bld/squeakvm -vm-sound-null -vm-display-null /var/lib/jenkins/workspac
jenkins   8861     1  0 Nov07 ?        00:20:46 /var/lib/jenkins/workspace/ExternalPackages/target/Squeak-4.10.2.2614-src-32/bld/squeakvm -vm-sound-null -vm-display-null /var/lib/jenkins/workspace/
jenkins   8972     1  0 Nov07 ?        00:21:06 /var/lib/jenkins/workspace/ExternalPackages/target/Squeak-4.10.2.2614-src-32/bld/squeakvm -vm-sound-null -vm-display-null /var/lib/jenkins/workspace/
jenkins  19136     1  0 Nov07 ?        00:15:27 /var/lib/jenkins/workspace/ExternalPackages-Squeak4.3/target/Squeak-4.10.2.2614-src-32/bld/squeakvm -vm-sound-null -vm-display-null /var/lib/jenkins/
Dave
Any process like that - with /var/lib/jenkins/workspace/ in its path - that has been running for more than an hour is a renegade and should be shot on sight.
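To make that concrete, the sort of thing I have in mind is a sweep like this (an untested Ruby sketch, nothing actually deployed; the one-hour limit and the use of the /proc directory's mtime as a stand-in for process start time are assumptions):

  Dir.glob("/proc/[0-9]*").each do |dir|
    begin
      # /proc/<pid>/cmdline is NUL-separated; flatten it for matching.
      cmdline = File.read("#{dir}/cmdline").tr("\0", " ")
      next unless cmdline.include?("/var/lib/jenkins/workspace/")
      if Time.now - File.stat(dir).mtime > 3600   # older than an hour
        Process.kill("KILL", File.basename(dir).to_i)
      end
    rescue Errno::ENOENT, Errno::EPERM, Errno::EACCES
      next   # process exited mid-scan, or not ours to touch
    end
  end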
I expect to see two threads in the Ruby process: one for running the tests, and a watchdog that is _supposed_ to kill long-running jobs.
As it happens, I realised only yesterday that if you nice something in Ruby through, say, spawn("nice echo 1"), you get the pid of the _nice_, not the pid of the _echo_. If you replace "echo" with "invoke a squeak", then you can't SIGUSR1 the Squeak because you don't know its pid. I removed the nice-ing, in the hope that with the actual pid of the squeak process (as opposed to the pid of the nice, its parent), I could more reliably kill these jobs.
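The intent, roughly, is something like this (a simplified sketch, not the actual job runner code; the VM path, image name, and one-hour limit are illustrative):

  require 'timeout'

  # Spawn the VM directly, argv-style and without the "nice" wrapper,
  # so the pid that spawn returns really is the squeakvm process:
  pid = spawn("bld/squeakvm", "-vm-sound-null", "-vm-display-null",
              "trunk.image")
  begin
    Timeout.timeout(60 * 60) { Process.wait(pid) }   # the watchdog
  rescue Timeout::Error
    Process.kill("USR1", pid)   # now this actually reaches the VM
    sleep 5
    Process.kill("KILL", pid)
    Process.wait(pid)           # reap the renegade
  end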
frank
On Sat, Nov 09, 2013 at 09:58:33AM +0000, Frank Shearar wrote:
Any process like that - with /var/lib/jenkins/workspace/ in its path - that has been running for more than an hour is a renegade and should be shot on sight.
Agreed. I am mainly just reporting these things so that all of us on the list are aware of issues as they arise. I don't think that any specific follow-up is required for any of them, aside from keeping them in mind as we move forward. I have a vague hunch - possibly wrong - that there is some interaction between the issues we saw on source.squeak.org and the issues that showed up on build.squeak.org. I have no idea of the root cause(s) but I expect that if we pay attention to the symptoms we'll put the pieces of the puzzle together eventually.
Dave
On 9 November 2013 17:49, David T. Lewis <lewis@mail.msen.com> wrote:
I can well imagine an interaction between the two, since builds typically follow a pattern of update-some-base-image, test-that-image. "update-some-base-image" of course involves source.squeak.org quite heavily.
frank