[Vm-dev] Socket's readSemaphore is losing signals with Cog on Linux

Eliot Miranda eliot.miranda at gmail.com
Mon Aug 15 18:10:05 UTC 2011


Thanks, Levente (and Colin for the reproducible case).  I should be able to
look at this towards the end of the week.  Anyone else who wants to eyeball
aio.c in the Cog branch against aio.c in the trunk VM is most welcome.

On Sun, Aug 14, 2011 at 12:44 PM, Levente Uzonyi <leves at elte.hu> wrote:

>
> On Sun, 14 Aug 2011, Andreas Raab wrote:
>
>
>> On 8/13/2011 13:42, Levente Uzonyi wrote:
>>
>>> Socket's readSemaphore is losing signals with CogVMs on Linux. We found
>>> several cases (RFB, PostgreSQL) where processes get stuck in the following
>>> method:
>>>
>>> Socket >> waitForDataIfClosed: closedBlock
>>>    "Wait indefinitely for data to arrive.  This method will block until
>>>    data is available or the socket is closed."
>>>
>>>    [
>>>        (self primSocketReceiveDataAvailable: socketHandle)
>>>            ifTrue: [^self].
>>>        self isConnected
>>>            ifFalse: [^closedBlock value].
>>>        self readSemaphore wait ] repeat
>>>
>>> When we inspect the contexts, the process is waiting for the
>>> readSemaphore, but evaluating (self primSocketReceiveDataAvailable:
>>> socketHandle) yields true. Signaling the readSemaphore makes the process
>>> run again. As a workaround we replaced #wait with #waitTimeoutMSecs: and
>>> all our problems disappeared.
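>>>
>>> A minimal sketch of that workaround (the 500 ms value below is just an
>>> example; any bounded wait would do):
>>>
>>> Socket >> waitForDataIfClosed: closedBlock
>>>    "Wait for data to arrive, but wake up periodically so that a lost
>>>    readSemaphore signal cannot block the process forever."
>>>
>>>    [
>>>        (self primSocketReceiveDataAvailable: socketHandle)
>>>            ifTrue: [^self].
>>>        self isConnected
>>>            ifFalse: [^closedBlock value].
>>>        self readSemaphore waitTimeoutMSecs: 500 ] repeat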
>>>
>>> The interpreter VM doesn't seem to have this bug, so I guess the bug was
>>> introduced with the changes to aio.c.
>>>
>>
>> Oh, interesting. We know this problem fairly well and have always worked
>> around it by changing the wait in the above to a "waitTimeoutMSecs: 500" which
>> turns it into a soft busy loop. It would be interesting to see if there's a
>>
>
> It took a while for us to realize that _this_ bug was responsible for
> our problems.
>
> With RFB we found that when the bug happens, which is every few hours,
> the server stops accepting input from the client while it keeps
> sending the changes of the view. We thought it was a side effect of
> some changes in recent Squeak versions and we just didn't care about
> it, since restarting the RFB client can be done in a second.
>
> With PostgreSQL we thought that our Postgres V3 client had a bug. Our
> old system uses the Postgres V2 client, Seaside 2.8, Squeak 3.9 and
> the interpreter VM, and it didn't have such problems for years. We
> recently started migrating it to the Postgres V3 client, a custom web
> framework, Squeak 4.2 and CogVM. The main differences between these
> systems are the VM (interpreter vs. Cog) and the Postgres client (V2
> vs. V3). We assumed that Cog behaves identically in this respect and
> tried to debug the Postgres protocol, but when I saw where the
> processes got stalled I remembered your email from 2009 in which you
> mentioned a similar bug [1].
>
> So I'm pretty sure this bug is Cog-specific. Reproducing it seems to
> be pretty hard, so a code review (with sufficient knowledge :)) is
> more likely to help solve this issue.
>
>
> Levente
>
> [1] http://lists.squeakfoundation.org/pipermail/vm-dev/2009-May/002619.html
>
>
>> bug in Cog which causes this. FWIW, here is the relevant portion:
>>
>>           "Soft 500ms busy loop - to protect against AIO probs;
>>           occasionally, VM-level AIO fails to trip the semaphore"
>>           self readSemaphore waitTimeoutMSecs: 500.
>>
>> Cheers,
>>  - Andreas
>>
>>
>>


-- 
best,
Eliot