[Box-Admins] box3.squeak.org offline - HELP needed

Tue Oct 8 16:48:53 UTC 2013

On Tue, Oct 08, 2013 at 04:33:28PM +0100, Frank Shearar wrote:
> On 6 October 2013 19:00, David T. Lewis <lewis at mail.msen.com> wrote:
> > On Sun, Oct 06, 2013 at 11:18:25AM -0400, David T. Lewis wrote:
> >> On Sun, Oct 06, 2013 at 04:52:21PM +0200, Tobias Pape wrote:
> >> >
> >> > So, uptime said:
> >> > root@box3-squeak:/home/ssdotcom# uptime
> >> >  16:32:49 up 166 days, 22 min,  1 user,  load average: 0.70, 4.74, 8.52
> >> >
> >> > And the _last two_ numbers (the 5- and 15-minute load averages) are concerning.
> >> > Basically, the server was overloaded. The Squeak VM uses about a gig of virtual
> >> > memory (really?) and seems to compete with the Jenkins running on the server.
> >> > htop says Jenkins uses 25% of the system's memory while Squeak uses 19% (both of
> >> > which I deem high).
> >> >   So in the event of some Jenkins jobs firing off and Squeaksource answering some
> >> > requests, the server might become unresponsive?
> >> >
> >>
> >> Something like that I think. I'm not sure what was generating the load, although
> >> there is no question that adding squeaksource to box3 adds a significant resource
> >> demand above that of the Jenkins jobs.
> >>
> >> Allocating a big address space (1G) is normal for the VM, and in this case the
> >> image is actually using a bit under 200MB, which is 20% of the system memory.
> >> If there is some combination of squeaksource and jenkins activity that pushes
> >> the total demand to the point of requiring swapping, then it's possible that
> >> this would make the system unresponsive as I was seeing.
> >>
> >> A number of the Jenkins jobs run squeak VMs in addition to the Java stuff,
> >> so some combination of these might add up to a problem.
> >>
> >
> > I am now running top every 30 seconds for the next 24 hours, with output directed
> > to ~ssdotcom/tmp/top.out. Possibly this will show us something interesting.
> 
> If that's still running, it's probably saying "ow! ow! stop it!" right
> now. If Tony Garnock-Jones & I could figure out why jobs are failing
> on his slaves, I'd suggest moving builds off the box entirely. I'll
> probably turn my old laptop into a build slave... once I can get it up
> & running again. That too will help with keeping work off the box.
> 

No worries, I only ran it for a 24-hour period. I saw occasional load
increases, but nothing like the "load average: 0.70, 4.74, 8.52" that
Tobias spotted right after the outage. I think we just need to keep
our eyes open in case the problem comes back ... usually if a thing
can fail once, it will fail again eventually ;-)
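
If anyone wants to repeat the exercise, here is a minimal sketch of how such
a log could be scanned for load spikes. It assumes the samples were collected
with top in batch mode (for example "top -b -d 30 -n 2880", i.e. 2880 samples
30 seconds apart), so that every sample contains a header line with
"load average:". The function names and the threshold of 4.0 are made up for
illustration and are not taken from the box3 setup.

```python
import re

# Matches the "load average: 1min, 5min, 15min" triple printed by uptime and top.
LOAD_RE = re.compile(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")

def parse_load(line):
    """Return (1min, 5min, 15min) load averages as floats, or None if absent."""
    m = LOAD_RE.search(line)
    if m is None:
        return None
    return tuple(float(x) for x in m.groups())

def high_load_samples(lines, threshold=4.0):
    """Collect every sample in which any of the three averages reaches threshold.

    The threshold is a hypothetical value; pick one that suits the machine.
    """
    hits = []
    for line in lines:
        loads = parse_load(line)
        if loads is not None and max(loads) >= threshold:
            hits.append(loads)
    return hits

if __name__ == "__main__":
    import sys
    # Usage: python loadscan.py top.out   (the script name is hypothetical)
    with open(sys.argv[1]) as f:
        for sample in high_load_samples(f):
            print(sample)
```

Fed the uptime line that Tobias quoted, parse_load returns the three values
0.70, 4.74 and 8.52, and high_load_samples flags that sample because the 5-
and 15-minute averages are both above the example threshold.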

Dave