Roughly every day or two I log in to box3, look things over, and check for package updates. With rare exceptions the system is quiet: I check for updates, apply any found, and move on. But today I found this (from ps auwx):
jenkins 29126 99.7 2.3 1054380 24552 ? R 03:20 1000:40 /var/lib/jenkins/workspace/CogVM/tmp/lib/squeak/4.0-2636/squeak -vm-sound-null -vm-display-null /var/lib/jenkins/workspace/SqueakTrunkOnBleedingEdgeCog/target/TrunkImage.image /var/lib/jenkins/workspace/Sq
As you can see, this process has used 1000+ minutes of CPU time (which is still less than its actual running time). I've not seen this before on the server. Is it perhaps the result of a new build project, and therefore expected, or an actual problem? Out of caution, and since the system is already busy, I haven't checked for package updates yet today (I think the last time I did so was Friday).
Ken
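For anyone repeating this kind of check, sorting the process list by CPU usage makes a runaway job stand out immediately. A minimal sketch, assuming the GNU procps ps typically found on a Linux server like box3:

    # List the top CPU consumers first; a wedged build lands at the top.
    ps auwx --sort=-pcpu | head -n 5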
Sorry, that process line was unintentionally chopped off. Here is the full line:
jenkins 29126 99.6 2.3 1054380 24552 ? R 03:20 1032:16 /var/lib/jenkins/workspace/CogVM/tmp/lib/squeak/4.0-2636/squeak -vm-sound-null -vm-display-null /var/lib/jenkins/workspace/SqueakTrunkOnBleedingEdgeCog/target/TrunkImage.image /var/lib/jenkins/workspace/SqueakTrunkOnBleedingEdgeCog/tests.st
Ken
It's probably a process that's brought up a debugger. I'd really like to adjust the UIManager to do things like dump stack traces to the console and kill processes in the event of failed tests. I'm happy for anyone to kill these kinds of jobs; I don't always notice them.
frank
OK, can you give some guidance as to how long is too long for the build processes? Does it vary much from job to job?
Ken
I just killed the job. I'll need to add more output to the script, such as the precise Cog version involved. I expect that particular job to be less stable than SqueakTrunk: it _is_ bleeding edge on both the image _and_ the VM side, after all.
frank
Great. Also, I should point out that I don't think this was just a case of an uncaught exception: the process was pegging the CPU (running flat out at 99%+ usage).
Ken
Ah, no, that's not a debugger then.
I'm going to slap a 15-minute kill time on the jobs later today: our longest-running jobs so far take around 9 minutes.
frank
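A minimal sketch of such a kill time, assuming GNU coreutils' timeout is available on box3. SQUEAK_VM here is a placeholder for the full path to the squeak binary, and WORKSPACE is the variable Jenkins already sets for each job; the real build step will differ:

    # Kill the run if it exceeds 15 minutes; timeout exits with status 124 on expiry.
    timeout 15m "$SQUEAK_VM" -vm-sound-null -vm-display-null \
        "$WORKSPACE/target/TrunkImage.image" "$WORKSPACE/tests.st"
    if [ $? -eq 124 ]; then
        echo "Build killed after exceeding the 15 minute limit" >&2
        exit 1
    fi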
Good idea to add a watchdog timer. Another good practice is to use the 'nice' command (/usr/bin/nice) in the command lines that run Squeak. This runs the tests at a lower scheduling priority, so if a process gets stuck consuming close to 100% CPU, its impact on other system users will be reduced (it will still gobble up all the CPU; it just won't drag the system down so badly).
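A sketch of what that looks like on the command line (paths abbreviated here; the real jobs use the full workspace paths):

    # Run the tests at reduced scheduling priority; -n 10 raises the niceness by 10.
    /usr/bin/nice -n 10 squeak -vm-sound-null -vm-display-null TrunkImage.image tests.st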
I don't know what the problem was in this particular case, but one thing that can cause Squeak to consume 100% CPU is an error in the image that leads to runaway memory use, such as a recursion error. Squeak keeps asking for more memory, the VM asks the OS for more, and eventually you are swapping. If this turns out to have been the problem, you can prevent the runaway memory condition with the '-memory' command-line option to the VM (but don't do that unless we can confirm that it really *is* the problem; I'm just mentioning it for future reference).
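For future reference, the cap would look something like this; the 512m figure is an arbitrary example, not a measured recommendation:

    # Limit object memory to 512 MB so runaway allocation fails fast instead of swapping.
    squeak -memory 512m -vm-sound-null -vm-display-null TrunkImage.image tests.st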
Dave
It's a repeatable problem, at least: http://squeakci.org/job/SqueakTrunkOnBleedingEdgeCog/17/console. I haven't had a chance to add debug info, though.
frank
I could be misunderstanding, but I don't think this is it. From the ps output: '1054380 24552'. These are, respectively, the VSZ (virtual memory size) and the RSS (resident set size), both in kilobytes. The relevant number is the second one, which is really quite low for a Squeak process, while still being within the expected range, of course.
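To pull just those columns for a single process, something like this works with the stock procps ps (29126 is the PID from the earlier listing):

    # Report virtual size, resident set size, CPU share, and elapsed time for one PID.
    ps -o pid,vsz,rss,pcpu,etime,comm -p 29126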
Ken
I looked at what logging _is_ in place in that script, and we're not even reaching the tests themselves. We're hanging during this expression in tests.st:
FileDirectory default fullNameFor: 'HudsonBuildTools.st'
frank
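When a run wedges like this, it may be possible to see where the image is stuck without killing it. Newer Cog VMs install a handler that dumps the Smalltalk call stacks to stderr on SIGUSR1, though that is worth verifying against the exact VM build on box3 before relying on it:

    # Ask a (sufficiently recent) Cog VM to print its call stacks to its stderr.
    kill -USR1 29126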
It also helps a whole lot if one writes things correctly: I broke the tests.st script by logging incorrectly. I'm going to have to fix this 'log errors to the console' thing.
frank