On 8 October 2013 17:48, David T. Lewis lewis@mail.msen.com wrote:
On Tue, Oct 08, 2013 at 04:33:28PM +0100, Frank Shearar wrote:
On 6 October 2013 19:00, David T. Lewis lewis@mail.msen.com wrote:
On Sun, Oct 06, 2013 at 11:18:25AM -0400, David T. Lewis wrote:
On Sun, Oct 06, 2013 at 04:52:21PM +0200, Tobias Pape wrote:
So, uptime said:
root@box3-squeak:/home/ssdotcom# uptime
 16:32:49 up 166 days, 22 min, 1 user, load average: 0.70, 4.74, 8.52
And the _last two_ numbers are concerning. Basically, the server was overloaded. The Squeak VM uses about a gig of virtual memory (really?) and seems to compete with the Jenkins running on the server. htop says Jenkins uses 25% of the system's memory while Squeak uses 19% (both of which I deem high). So in the event of some Jenkins jobs firing off and Squeaksource answering some requests, the server might become unresponsive?
Something like that I think. I'm not sure what was generating the load, although there is no question that adding squeaksource to box3 adds a significant resource demand above that of the Jenkins jobs.
Allocating a big address space (1G) is normal for the VM, and in this case the image is actually using a bit under 200MB, which is 20% of the system memory. If there is some combination of squeaksource and jenkins activity that pushes the total demand to the point of requiring swapping, then it's possible that this would make the system unresponsive as I was seeing.
A number of the Jenkins jobs run Squeak VMs in addition to the Java stuff, so some combination of these might add up to a problem.
I am now running top every 30 seconds for the next 24 hours, with output directed to ~ssdotcom/tmp/top.out. Possibly this will show us something interesting.
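For anyone wanting to repeat this kind of 24 hour watch without top, here is a minimal Python sketch of the same idea. It is not the command that was actually run on box3: it swaps in os.getloadavg() (Unix only) for top's output, and the function names and the sleep parameter are made up for illustration.

```python
import os
import time

INTERVAL_S = 30            # sampling interval, as in the message above
DURATION_S = 24 * 60 * 60  # 24 hours


def sample_count(duration_s=DURATION_S, interval_s=INTERVAL_S):
    """Number of samples needed to cover the duration at the given interval."""
    return duration_s // interval_s


def format_sample(timestamp, loads):
    """Render one log line: local time, then the 1, 5 and 15 minute loads."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    return "%s load average: %.2f, %.2f, %.2f" % ((stamp,) + tuple(loads))


def record(path, count, interval_s=INTERVAL_S, sleep=time.sleep):
    """Append `count` samples to `path`, pausing `interval_s` between them."""
    with open(path, "a") as log:
        for i in range(count):
            log.write(format_sample(time.time(), os.getloadavg()) + "\n")
            log.flush()  # keep the file readable while sampling is in progress
            if i < count - 1:
                sleep(interval_s)
```

Calling record("loadavg.out", sample_count()) would take 2880 samples over 24 hours; "loadavg.out" is a hypothetical file name, not the top.out mentioned above.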
If that's still running, it's probably saying "ow! ow! stop it!" right now. If Tony Garnock-Jones & I could figure out why jobs are failing on his slaves, I'd suggest moving builds off the box entirely. I'll probably turn my old laptop into a build slave... once I can get it up & running again. That too will help with keeping work off the box.
No worries, I only ran it for a 24-hour period. I saw occasional load increases, but nothing like the "load average: 0.70, 4.74, 8.52" that Tobias spotted right after the outage. I think we just need to keep our eyes open for problems in case it comes back ... usually if a thing can fail once, it will fail again eventually ;-)
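Checking a day's worth of output for spikes like that can be scripted. The following is a rough Python sketch, assuming each sample contains a "load average: a, b, c" summary in the same form as the uptime output quoted in this thread; parse_load and peak_loads are hypothetical helper names, not part of any existing tool.

```python
import re

# Matches the summary printed by uptime, e.g. "load average: 0.70, 4.74, 8.52"
LOAD_RE = re.compile(r"load average: ([\d.]+), ([\d.]+), ([\d.]+)")


def parse_load(line):
    """Return the (1 min, 5 min, 15 min) load averages in a line, or None."""
    match = LOAD_RE.search(line)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


def peak_loads(lines):
    """Return the highest 1, 5 and 15 minute load seen across all lines."""
    samples = [load for load in map(parse_load, lines) if load is not None]
    if not samples:
        return None
    return tuple(max(column) for column in zip(*samples))
```

Feeding the lines of the log file to peak_loads gives the worst value in each column, which can then be compared against the figures Tobias saw after the outage.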
Hm, OK, so build.squeak.org could be unavailable for a different reason!
frank
Dave