Hi guys,
There is a bug happening in OSSubprocess since quite some time which we were never able to fix. What happens is that at certain point (after having used OSSubprocess for a while), a simple command like the following would "hang" Pharo:
OSSUnixSubprocess new command: 'ls'; arguments: #(); redirectStdout; runAndWaitOnExitDo: [ :process :outString | outString ]
Once the image does that, it keeps doing it in a 80% of the cases (so it is easy enough to debug).
Guille Polito kept one of these images if someone can give us a hand. He also proposed the great idea of using `strace` to see what was going on. He (together with Pable Tesone) suspected that the heartbeat could be interrupting the `clone()` function which is (I think) called internally by the 'posix_spawn()' which is the one used by OSSubprocess.
When these images are "hung" they found at a infinitive loop like this:
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- [pid 17477] gettimeofday({1482152630, 593498}, NULL) = 0 [pid 17477] sigreturn() (mask []) = 120 [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted) [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- [pid 17477] gettimeofday({1482152630, 600126}, NULL) = 0 [pid 17477] sigreturn() (mask []) = 120 [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted) [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- [pid 17477] gettimeofday({1482152630, 606282}, NULL) = 0 [pid 17477] sigreturn() (mask []) = 120 [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted) [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- [pid 17477] gettimeofday({1482152630, 612528}, NULL) = 0 [pid 17477] sigreturn() (mask []) = 120 [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted) [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- [pid 17477] gettimeofday({1482152630, 618689}, NULL) = 0 [pid 17477] sigreturn() (mask []) = 120 [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted) [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- [pid 17477] gettimeofday({1482152630, 624922}, NULL) = 0 [pid 17477] sigreturn() (mask []) = 120 [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted) [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- [pid 17477] gettimeofday({1482152630, 631145}, NULL) = 0 [pid 17477] sigreturn() (mask []) = 120 [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted) [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- [pid 17477] gettimeofday({1482152630, 637241}, NULL) = 0 [pid 17477] sigreturn() (mask []) = 120 [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted) [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- [pid 17477] gettimeofday({1482152630, 643578}, NULL) = 0 [pid 17477] sigreturn() (mask []) = 120 [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted) [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- [pid 17477] gettimeofday({1482152630, 649974}, NULL) = 0 [pid 17477] sigreturn() (mask []) = 120 [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted) [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
As you can see, there is a SIGALARM involved. It also looks like the `gettimeofday` is used by the heartbeat ? Could it be that someone the heartbeat is interrupting the `clone()` ?
Guille also showed me the `strace` output with a regular / working image:
[pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- [pid 18647] gettimeofday({1482152829, 481014}, NULL) = 0 [pid 18647] sigreturn() (mask []) = -1 EINTR (Interrupted system call) [pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 1917}}) = 0 [pid 18647] recvmsg(3, 0xff7b0734, 0) = -1 EAGAIN (Resource temporarily unavailable) [pid 18647] select(4, [3], [], [3], {0, 1000}) = 0 (Timeout) [pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 797}}) = 0 [pid 18647] recvmsg(3, 0xff7b0734, 0) = -1 EAGAIN (Resource temporarily unavailable) [pid 18647] select(4, [3], [], [3], {0, 1000}) = ? ERESTARTNOHAND (To be restarted if no handler) [pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
Does anyone have any hint here?
Thanks in advance,