[Vm-dev] Re: [squeak-dev] Re: Socket clock rollover issues

Wed May 6 02:09:29 UTC 2009

John M McIntosh wrote:
> So is this on windows, or unix?

Windows Qwaq Forums client.

> So how did you measure that?

Our original issue was that Python apps in Forums would "stop working" 
after running for hours. These apps work by calling from Forums into 
Python, with callback facilities that allow Python code to invoke 
methods inside Forums. When the apps stopped working I could observe 
that it happened while a callback was being executed, i.e., from the 
Python side everything was set up and the VM had entered the interpreter 
loop again, except that the Python callback semaphore wasn't signaled.

I then changed that code to use waitTimeout: and count the number of 
times the callback semaphore was signaled (i.e., didn't time out) vs. 
the number of times we had callback data waiting. These numbers should 
be exactly the same and they weren't.
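
A minimal sketch of that bookkeeping, in Python rather than the Squeak code actually used. The names CallbackMonitor, wait_once and data_waiting are hypothetical, and threading.Semaphore.acquire(timeout=...) only stands in for waitTimeout:.

```python
import threading


class CallbackMonitor:
    """Counts semaphore signals received vs. times callback data was waiting.

    Hypothetical illustration; not the code used in Forums.
    """

    def __init__(self, timeout_s=0.05):
        self.sem = threading.Semaphore(0)
        self.timeout_s = timeout_s
        self.signaled = 0   # waits that ended because the semaphore was signaled
        self.data_seen = 0  # waits after which callback data was found waiting

    def wait_once(self, data_waiting):
        # Stand-in for waitTimeout: True means signaled, False means timed out.
        got_signal = self.sem.acquire(timeout=self.timeout_s)
        if got_signal:
            self.signaled += 1
        if data_waiting():
            self.data_seen += 1
        return got_signal

    def missed(self):
        # Data was waiting but no signal arrived: a lost signal.
        return self.data_seen - self.signaled
```

If no signal is ever lost, missed() stays at zero; any positive value means data arrived without its signal.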

Since all of this is code that is under our control it means that I am 
100% certain that we've been calling signalSemaphoreWithIndex() and that 
this wasn't delivered to the image. And obviously it's not a common 
event (2 out of 400k callbacks missed the signal).

> I think you said you had a VM that does proper locking of the queues? 

Yes. I don't think that's the problem. Right now my theory is that we're 
indeed overflowing the VM's semaphore buffer because a Python callout may 
take a long, long time. I think what happens then might be that over that 
period of time the (few) sockets generate multiple semaphore signals 
which overflow the VM's buffer, and then there is no room left in the 
buffer when the callback executes.
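
To make the theory concrete, here is a toy model of it. Everything in it is an assumption: the capacity of 4, the policy of silently dropping a signal when the buffer is full, and the names SignalBuffer, SOCKET_SEM and CALLBACK_SEM are all made up for illustration and are not taken from the VM source.

```python
class SignalBuffer:
    """Toy fixed-size buffer holding one entry per pending semaphore signal."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = []
        self.dropped = 0

    def signal(self, index):
        # Assumed policy: a signal that finds the buffer full is lost.
        if len(self.pending) >= self.capacity:
            self.dropped += 1
            return False
        self.pending.append(index)
        return True

    def drain(self):
        # What the interpreter loop would do once it runs again.
        delivered, self.pending = self.pending, []
        return delivered


SOCKET_SEM, CALLBACK_SEM = 1, 2  # hypothetical semaphore indices

buf = SignalBuffer(capacity=4)
# The interpreter is blocked in a long callout, so nothing drains the buffer
# while the sockets keep signaling the same semaphore.
for _ in range(10):
    buf.signal(SOCKET_SEM)
# The callback signal arrives last and finds no room.
callback_queued = buf.signal(CALLBACK_SEM)
```

In this model the callback signal is the one that gets lost, even though almost every buffer entry is a redundant socket signal.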

If that's true then I should be able to recreate the problem by calling 
an OS-level sleep() function via FFI (i.e., block the main interpreter 
loop) while performing heavy network activity and see if that overflows 
the VM's buffer.

And if that's indeed the case then I think there are two actions to 
take: one is to fix the Windows sockets code to not do that ;-) (i.e., 
not signal an already signaled semaphore a gazillion times); the other 
is to keep track of the number of signals on a particular semaphore 
instead of keeping an entry in the buffer each time the semaphore is 
signaled (which would completely solve this type of problem in general).
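
A rough sketch of the second idea, keeping one counter per semaphore instead of one buffer entry per signal. This is illustrative Python, not VM code; CountingSignals and the semaphore indices are hypothetical, and a real implementation would also need whatever locking the VM requires.

```python
from collections import Counter


class CountingSignals:
    """Keeps a pending-signal count per semaphore index."""

    def __init__(self):
        self.counts = Counter()

    def signal(self, index):
        # Storage grows with the number of distinct semaphores,
        # not with the number of signals, so repeats cannot crowd out others.
        self.counts[index] += 1

    def drain(self):
        # Hand all pending counts to the interpreter and start over.
        delivered, self.counts = dict(self.counts), Counter()
        return delivered


SOCKET_SEM, CALLBACK_SEM = 1, 2  # hypothetical semaphore indices

signals = CountingSignals()
for _ in range(10_000):
    signals.signal(SOCKET_SEM)
signals.signal(CALLBACK_SEM)
pending = signals.drain()
```

However many times the socket semaphore is signaled, the single callback signal is still recorded and delivered.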

The next step for me will be to instrument our Python callback facilities 
to keep track of the time that passes between entering Python and 
getting back to Forums and see if that correlates with the lost signals. 
I'll also do the sleep() test via FFI to see how long the call needs to 
take before we overflow the VM buffer.
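
The timing instrumentation could look roughly like this. It is a Python-side sketch only; CalloutTimer and its methods are hypothetical names, and the real measurement would wrap the Forums-to-Python callout rather than an arbitrary function.

```python
import time


class CalloutTimer:
    """Records how long each wrapped call takes, using a monotonic clock."""

    def __init__(self):
        self.durations = []

    def timed(self, fn, *args):
        start = time.monotonic()
        try:
            return fn(*args)
        finally:
            # Record the elapsed time even if the call raises.
            self.durations.append(time.monotonic() - start)

    def longest(self):
        return max(self.durations, default=0.0)
```

Comparing longest() with the times at which signals go missing would show whether long callouts and lost signals line up.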

Cheers,
   - Andreas

> On 5-May-09, at 5:52 PM, Andreas Raab wrote:
> 
>> Folks -
>>
>> Just as a follow-up to this note I now have proof that we're losing 
>> semaphore signals occasionally. What I was able to detect was that 
>> when running Forums over a period of 20 hours we lost 2 out of 421355 
>> signals. We'll have the follow-on discussion on vm-dev since I don't 
>> think most people here are interested in discussing the possibilities 
>> of how this could happen and what to do about it. Please send any 
>> follow-ups to vm-dev (and vm-dev only).
>>
>> Cheers,
>>  - Andreas
> 
> -- 
> ===========================================================================
> John M. McIntosh <johnmci at smalltalkconsulting.com>   Twitter:  
> squeaker68882
> Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
> ===========================================================================