John M McIntosh wrote:
So is this on Windows, or Unix?
Windows Qwaq Forums client.
So how did you measure that?
Our original issue was that Python apps in Forums would "stop working" after running for hours. These apps work by calling from Forums into Python, with callback facilities that allow Python code to invoke methods inside Forums. When the apps stopped working I observed that it happened while a callback was being executed, i.e., from the Python side everything was set up and the VM had entered the interpreter loop again, except that the Python callback semaphore was never signaled.
I then changed that code to use waitTimeout: and count the number of times the callback semaphore was signaled (i.e., didn't time out) vs. the number of times we had callback data waiting. These numbers should be exactly the same and they weren't.
Since all of this code is under our control, I am 100% certain that we've been calling signalSemaphoreWithIndex() and that the signal wasn't delivered to the image. And obviously it's not a common event (2 out of 400k callbacks missed the signal).
I think you said you had a VM that does proper locking of the queues?
Yes. I don't think that's the problem. Right now my theory is that we're indeed overflowing the VM's semaphore buffer, because a Python callout may take a long, long time. What may happen is that over that period the (few) sockets generate multiple semaphore signals, which overflows the VM's buffer so that there is no room left in the buffer when the callback executes.
If that's true then I should be able to reproduce the problem by calling an OS-level sleep() function via FFI (i.e., blocking the main interpreter loop) while performing heavy network activity, and see whether that overflows the VM's buffer.
And if that's indeed the case then I think there are two actions to take: one is to fix the Windows sockets code to not do that ;-) (i.e., not signal an already-signaled semaphore a gazillion times); the other is to keep track of the number of signals on a particular semaphore instead of adding an entry to the buffer each time the semaphore is signaled (which would completely solve this type of problem in general).
The next step for me will be to instrument our Python callback facilities to keep track of the time elapsed between entering Python and getting back to Forums, and see if that correlates. Plus doing the sleep() test via FFI to see how long a callout needs to take before we overflow the VM's buffer.
Cheers, - Andreas
On 5-May-09, at 5:52 PM, Andreas Raab wrote:
Folks -
Just as a follow-up to this note, I now have proof that we're losing semaphore signals occasionally. What I was able to detect was that when running Forums over a period of 20 hours we lost 2 out of 421355 signals. We'll have the follow-on discussion on vm-dev since I don't think most people here are interested in discussing the possibilities of how this could happen and what to do about it. Please send any follow-ups to vm-dev (and vm-dev only).
Cheers,
- Andreas
--
John M. McIntosh <johnmci@smalltalkconsulting.com>  Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com