[Vm-dev] Re: [squeak-dev] Re: Socket clock rollover issues
andreas.raab at gmx.de
Wed May 6 02:09:29 UTC 2009
John M McIntosh wrote:
> So is this on windows, or unix?
Windows Qwaq Forums client.
> So how did you measure that?
Our original issue was that Python apps in Forums would "stop working"
after running for hours. These apps work by calling from Forums into
Python, with callback facilities that allow Python code to invoke
methods inside Forums. When the apps stopped working I could observe
that it happened while a callback was being executed, i.e., from the
Python side everything was set up and the VM had entered the interpreter
loop again, except that the Python callback semaphore wasn't signaled.
I then changed that code to use waitTimeout: and count the number of
times the callback semaphore was signaled (i.e., didn't time out) vs.
the number of times we had callback data waiting. These numbers should
be exactly the same and they weren't.
Since all of this code is under our control, I am 100% certain that we
called signalSemaphoreWithIndex() and that the signal wasn't delivered
to the image. And obviously it's not a common event (2 out of 400k
callbacks missed the signal).
> I think you said you had a VM that does proper locking of the queues?
Yes. I don't think that's the problem. Right now my theory is that we're
indeed overflowing the VM's semaphore buffer because a Python callout may
take a long, long time. What may happen is that over that period the
(few) sockets generate multiple semaphore signals, which overflows the
VM's buffer, so that no room is left in the buffer by the time the
callback executes.
If that's true, I should be able to recreate the problem by calling an
OS-level sleep() function via FFI (i.e., blocking the main interpreter
loop) while performing heavy network activity, and see if that overflows
the VM's buffer.
And if that's indeed the case, I think there are two actions to take:
one is to fix the Windows sockets code to not do that ;-) (i.e., not
signal an already-signaled semaphore a gazillion times); the other is to
keep track of the number of signals on a particular semaphore instead of
adding an entry to the buffer each time the semaphore is signaled (which
would completely solve this class of problem in general).
The next step for me is to instrument our Python callback facilities to
track the time that passes between entering Python and getting back to
Forums, and see if that correlates with the lost signals. Plus doing the
sleep() test via FFI to see how long the sleep needs to be before we
overflow the VM buffer.
> On 5-May-09, at 5:52 PM, Andreas Raab wrote:
>> Folks -
>> Just as a follow-up to this note I now have proof that we're losing
>> semaphore signals occasionally. What I was able to detect was that
>> when running Forums over a period of 20 hours we lost 2 out of 421355
>> signals. We'll have the follow-on discussion on vm-dev since I don't
>> think most people here are interested in discussing the possibilities
>> of how this could happen and what to do about it. Please send any
>> follow-ups to vm-dev (and vm-dev only).
>> - Andreas
> John M. McIntosh <johnmci at smalltalkconsulting.com> Twitter:
> Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com