[Vm-dev] Problem with OSSubprocess / signals / heartbeat ?

Holger Freyther holger at freyther.de
Mon Dec 19 16:55:28 UTC 2016


> On 19 Dec 2016, at 14:41, Mariano Martinez Peck <marianopeck at gmail.com> wrote:
> 
> Hi guys, 
> 
> Guille Polito kept one of these images if someone can give us a hand. He also proposed the great idea of using `strace` to see what was going on. He  (together with Pable Tesone) suspected that the heartbeat could be interrupting the `clone()` function which is (I think) called internally by the 'posix_spawn()' which is the one used by OSSubprocess.
> 
> When these images are "hung" they found at a infinitive loop like this:

Okay, how many child processes do you have at that point? How many processes does the system have?


> 
> [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
> [pid 17477] gettimeofday({1482152630, 593498}, NULL) = 0
> [pid 17477] sigreturn() (mask [])       = 120
> [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
> [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
> [pid 17477] gettimeofday({1482152630, 600126}, NULL) = 0
> [pid 17477] sigreturn() (mask [])       = 120
> [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)

so above.. 7ms between the two gettimeofday calls? Nothing else? Set a breakpoint on clone/fork in gdb and look at the c-stack at this point? Could you strace with timestamps to see how much time is spent? Is the process suspicious in other ways?

So yes.. sounds like clone doesn't complete.. the question is why? Is it out of resources? Is something in the VM blocking longer than the heartbeat, is the heartbeat more frequent than expected?


> As you can see, there is a SIGALARM involved. It also looks like the `gettimeofday` is used by the heartbeat ?  Could it be that someone the heartbeat is interrupting the `clone()` ?
> 
> Guille also showed me the `strace` output with a regular / working image:
> 
> 
> [pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
> [pid 18647] gettimeofday({1482152829, 481014}, NULL) = 0
> [pid 18647] sigreturn() (mask [])       = -1 EINTR (Interrupted system call)
> [pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 1917}}) = 0
> [pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily unavailable)
> [pid 18647] select(4, [3], [], [3], {0, 1000}) = 0 (Timeout)
> [pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 797}}) = 0
> [pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily unavailable)
> [pid 18647] select(4, [3], [], [3], {0, 1000}) = ? ERESTARTNOHAND (To be restarted if no handler)
> [pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
> 
> 
> Does anyone have any hint here?  

Get timestamps in there. How long does it take to fail/end in this situation?

holger





More information about the Vm-dev mailing list