[Vm-dev] Problem with OSSubprocess / signals / heartbeat ?

Mariano Martinez Peck marianopeck at gmail.com
Mon Dec 19 13:41:20 UTC 2016


Hi guys,

There is a bug happening in OSSubprocess since quite some time which we
were never able to fix. What happens is that at certain point (after having
used OSSubprocess for a while), a simple command like the following would
"hang" Pharo:

OSSUnixSubprocess new
    command: 'ls';
    arguments: #();
    redirectStdout;
    runAndWaitOnExitDo: [ :process :outString  | outString ]

Once the image does that, it keeps doing it in a 80% of the cases (so it is
easy enough to debug).

Guille Polito kept one of these images if someone can give us a hand. He
also proposed the great idea of using `strace` to see what was going on. He
 (together with Pable Tesone) suspected that the heartbeat could be
interrupting the `clone()` function which is (I think) called internally by
the 'posix_spawn()' which is the one used by OSSubprocess.

When these images are "hung" they found at a infinitive loop like this:

[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 593498}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 600126}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 606282}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 612528}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 618689}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 624922}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 631145}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 637241}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 643578}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 649974}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---

As you can see, there is a SIGALARM involved. It also looks like the
`gettimeofday` is used by the heartbeat ?  Could it be that someone the
heartbeat is interrupting the `clone()` ?

Guille also showed me the `strace` output with a regular / working image:


[pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 18647] gettimeofday({1482152829, 481014}, NULL) = 0
[pid 18647] sigreturn() (mask [])       = -1 EINTR (Interrupted system call)
[pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0,
1917}}) = 0
[pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily
unavailable)
[pid 18647] select(4, [3], [], [3], {0, 1000}) = 0 (Timeout)
[pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0,
797}}) = 0
[pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily
unavailable)
[pid 18647] select(4, [3], [], [3], {0, 1000}) = ? ERESTARTNOHAND (To be
restarted if no handler)
[pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---


Does anyone have any hint here?

Thanks in advance,


-- 
Mariano
http://marianopeck.wordpress.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20161219/209399c9/attachment.html>


More information about the Vm-dev mailing list