[Vm-dev] [Pharo-dev] Problem with OSSubprocess / signals / heartbeat ?

Guille Polito guillermopolito at gmail.com
Wed Dec 21 15:29:35 UTC 2016


OK, following up a bit on this. I'll summarize some of our findings; 
some of them may be obvious to some people on this list.

I saw that there are actually two different kinds of VMs for *nix [1]:
  - threaded heartbeat
  - itimer + signal heartbeat

In particular, I'd like to quote the following paragraph for the lazy:

    A distinction on linux is between VMs with an itimer heartbeat or a
    threaded heartbeat. VMs with an itimer heartbeat use setitimer to
    deliver a SIGALRM signal at regular intervals to interrupt the VM to
    check for events. These signals can be troublesome, interrupting
    foreign code that cannot cope with such signals. VMs with a threaded
    heartbeat use a high-priority thread that loops, blocking on
    nanosleep and then interrupting the VM, performing the same function
    as the itimer heartbeat but without using signals. These VMs are to
    be preferred but support for multiple thread priorities in user-level
    processes has only been available on linux in kernels later than 2.6.12.
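
To make the difference concrete, here is a minimal sketch of the two
styles in Python. It is not the VM's code: the function names are made
up, the 2 ms interval is borrowed from the getitimer output quoted
further down, and Python threads have no priorities, so the
"high-priority" part is not modelled at all.

```python
import signal
import threading
import time


def itimer_heartbeat(duration=0.05, interval=0.002):
    """Count SIGALRM ticks delivered by setitimer while the main thread is busy."""
    ticks = 0

    def on_alarm(signum, frame):
        nonlocal ticks
        ticks += 1  # stands in for "interrupt the VM to check for events"

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, interval, interval)
    try:
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            pass  # busy loop; the signal interrupts whatever runs here
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)
    return ticks


def threaded_heartbeat(duration=0.05, interval=0.002):
    """Count ticks produced by a thread that sleeps and then 'interrupts'."""
    ticks = 0
    stop = threading.Event()

    def beat():
        nonlocal ticks
        while not stop.is_set():
            time.sleep(interval)  # the VM blocks on nanosleep here
            ticks += 1            # no signal is delivered to anybody

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    time.sleep(duration)
    stop.set()
    thread.join()
    return ticks
```

Both functions tick at roughly the same rate; the point is that only
the first one delivers a signal to the thread that may be in the
middle of foreign code or a system call.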

So, I downloaded the threaded heartbeat Squeak VM from bintray [2]. This 
VM requires deploying some configuration files in /etc/security [3].

Under this configuration OSSubprocess worked like a charm (or at least I 
have not hit the issue again so far).
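
My understanding, which is an assumption on my side and not something
stated above, is that those files raise the real-time priority limit
(rtprio) so that the VM is allowed to run its heartbeat thread at a
higher priority. Under that assumption, a rough way to check what the
current session is allowed to do (function names are hypothetical):

```python
import resource


def rtprio_limit():
    """Return the (soft, hard) RLIMIT_RTPRIO pair, or None if unsupported."""
    limit = getattr(resource, "RLIMIT_RTPRIO", None)  # Linux-only constant
    if limit is None:
        return None
    return resource.getrlimit(limit)


def can_raise_thread_priority(minimum=1):
    """True if the soft rtprio limit allows at least `minimum`."""
    limits = rtprio_limit()
    if limits is None:
        return False
    soft, _hard = limits
    return soft == resource.RLIM_INFINITY or soft >= minimum
```

If can_raise_thread_priority() is False after deploying the files, the
limits have probably not been picked up by the session yet.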

Now, this threaded heartbeat VM is the one recommended in the README 
file, and we see that with the itimer VM OSSubprocess hits exactly the 
issue stated there. The main problem remains for the moment, since 
Pharo's default download includes not this VM but the itimer one. I 
talked with Esteban about it and he was aware of these two VM flavours; 
the reason we are using the itimer one is the need to deploy those 
permission files in /etc, which makes installation a bit less automatic.
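
In the hung trace quoted below, every clone() attempt ends in
ERESTARTNOINTR right after a SIGALRM. A toy model of that suspicion,
assuming the call is aborted and restarted from scratch whenever a
tick lands before it finishes (the function and its parameters are
made up for illustration; this is not a measurement of the real VM):

```python
def restarts_before_completion(work_us, interval_us, start_us=0,
                               handler_us=0, max_restarts=1000):
    """Number of restarts a call needs before it completes, or None.

    work_us     -- uninterrupted time the call needs to finish
    interval_us -- period of the timer signal
    start_us    -- when the call is first issued, relative to timer start
    handler_us  -- time spent in the signal handler before the restart
    """
    now = start_us
    next_tick = (start_us // interval_us + 1) * interval_us
    for restarts in range(max_restarts + 1):
        if now + work_us <= next_tick:
            return restarts  # finished before the next tick
        # A tick fired mid-call: abort, run the handler, start over.
        now = next_tick + handler_us
        while next_tick <= now:
            next_tick += interval_us
    return None  # never got a long enough quiet window
```

With a 2000 us interval, a call needing 1500 us completes after at
most one restart, while a call needing 2500 us never completes in this
model, which would look just like the loop in the trace.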

Cheers,
Guille

[1] 
https://github.com/OpenSmalltalk/opensmalltalk-vm/blob/e17db79411cfec767e04f3d94d12a642d920a30e/build.linux64x64/HowToBuild
[2] 
https://bintray.com/opensmalltalk/vm/download_file?file_path=cog_linux32x86_squeak.sista.spur_201612170124.tar.gz
[3] 
https://github.com/OpenSmalltalk/opensmalltalk-vm/releases/tag/r3732#linux


-------- Original Message --------
>
>
>> I asked Santa an OSSubprocess that won't hang :)
>>
>
>
> I really appreciate seeing such not-so-small polar elves helping Santa :)
>
> Stef
>
>> Thanks,
>>
>> Guille
>>
>>
>>
>>
>>
>> -------- Original Message --------
>>
>>>
>>>
>>>> On 19 Dec 2016, at 14:41, Mariano Martinez Peck 
>>>> <marianopeck at gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Hi guys,
>>>>
>>>>
>>>>
>>>> Guille Polito kept one of these images if someone can give us a 
>>>> hand. He also proposed the great idea of using `strace` to see what 
>>>> was going on. He (together with Pablo Tesone) suspected that the 
>>>> heartbeat could be interrupting the `clone()` function which is (I 
>>>> think) called internally by `posix_spawn()`, which is the one 
>>>> used by OSSubprocess.
>>>>
>>>> When these images are "hung" they are stuck in an infinite loop like this:
>>>>
>>> Okay, how many child processes do you have at that point? How many 
>>> processes does the system have?
>>>
>>>> [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>>>> [pid 17477] gettimeofday({1482152630, 593498}, NULL) = 0
>>>> [pid 17477] sigreturn() (mask [])       = 120
>>>> [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
>>>> [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>>>> [pid 17477] gettimeofday({1482152630, 600126}, NULL) = 0
>>>> [pid 17477] sigreturn() (mask [])       = 120
>>>> [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
>>>>
>>> So above: about 7 ms between the two gettimeofday calls? Nothing else? 
>>> Set a breakpoint on clone/fork in gdb and look at the C stack at that 
>>> point? Could you strace with timestamps to see how much time is 
>>> spent? Is the process suspicious in other ways?
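
For reference, the gap can be computed from the two gettimeofday lines
in the hung trace. A small sketch (the helper name is mine; the two
trace lines are copied from the quoted output):

```python
import re

TRACE = """\
[pid 17477] gettimeofday({1482152630, 593498}, NULL) = 0
[pid 17477] gettimeofday({1482152630, 600126}, NULL) = 0
"""

# strace prints the struct timeval as {seconds, microseconds}.
GETTIMEOFDAY = re.compile(r"gettimeofday\(\{(\d+), (\d+)\}")


def gettimeofday_gaps_us(trace):
    """Return the gaps, in microseconds, between consecutive gettimeofday calls."""
    stamps = [int(sec) * 1_000_000 + int(usec)
              for sec, usec in GETTIMEOFDAY.findall(trace)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

gettimeofday_gaps_us(TRACE) gives [6628], i.e. about 6.6 ms between
the two calls.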
>>>
>>>
>>>
>>> So yes, it sounds like clone doesn't complete. The question is why. 
>>> Is it out of resources? Is something in the VM blocking longer than 
>>> the heartbeat? Is the heartbeat more frequent than expected?
>>>
>>>> As you can see, there is a SIGALRM involved. It also looks like 
>>>> `gettimeofday` is used by the heartbeat? Could it be that 
>>>> somehow the heartbeat is interrupting the `clone()`?
>>>>
>>>>
>>>>
>>>> Guille also showed me the `strace` output with a regular / working 
>>>> image:
>>>>
>>>> [pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>>>> [pid 18647] gettimeofday({1482152829, 481014}, NULL) = 0
>>>> [pid 18647] sigreturn() (mask [])       = -1 EINTR (Interrupted system call)
>>>> [pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 1917}}) = 0
>>>> [pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily unavailable)
>>>> [pid 18647] select(4, [3], [], [3], {0, 1000}) = 0 (Timeout)
>>>> [pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 797}}) = 0
>>>> [pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily unavailable)
>>>> [pid 18647] select(4, [3], [], [3], {0, 1000}) = ? ERESTARTNOHAND (To be restarted if no handler)
>>>> [pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>>>>
>>>> Does anyone have any hint here?
>>>>
>>> Get timestamps in there. How long does it take to fail/end in this 
>>> situation?
>>>
>>>
>>>
>>> holger

