http://forum.world.st/Socket-s-readSemaphore-is-losing-signals-with-Cog-on-L...
On 8/13/2011 13:42, Levente Uzonyi wrote:
Socket's readSemaphore is losing signals with CogVMs on Linux. We found several cases (RFB, PostgreSQL) in which processes get stuck in the following method:
Socket >> waitForDataIfClosed: closedBlock
	"Wait indefinitely for data to arrive. This method will block until data is available or the socket is closed."
	[ (self primSocketReceiveDataAvailable: socketHandle) ifTrue: [^self].
	self isConnected ifFalse: [^closedBlock value].
	self readSemaphore wait ] repeat
When we inspect the contexts, the process is waiting on the readSemaphore, yet evaluating (self primSocketReceiveDataAvailable: socketHandle) yields true. Signaling the readSemaphore makes the process run again. As a workaround we replaced #wait with #waitTimeoutMSecs:, and all our problems disappeared.
The interpreter VM doesn't seem to have this bug, so I guess the bug was introduced with the changes to aio.c.
Oh, interesting. We know this problem fairly well and have always worked around it by changing the wait in the above to a "waitTimeoutMSecs: 500", which turns it into a soft busy loop. It would be interesting to see if there's a bug in Cog which causes this. FWIW, here is the relevant portion:

	"Soft 500ms busy loop - to protect against AIO probs; occasionally, VM-level AIO fails to trip the semaphore"
	self readSemaphore waitTimeoutMSecs: 500.

Cheers, - Andreas
_,,,^..^,,,_ best, Eliot
Hi Eliot,
What sort of reproducibility are you looking for? Is it enough if it happens once every few hours or do you need something that you can trigger on demand?
Levente
Hi Levente,
On Wed, Mar 23, 2016 at 11:31 AM, Levente Uzonyi leves@caesar.elte.hu wrote:
Hi Eliot,
What sort of reproducibility are you looking for? Is it enough if it happens once every few hours or do you need something that you can trigger on demand?
I'll take every few hours, but I'd prefer "in under 30 minutes". Getting warm and fuzzy feelings when trying to prove a negative with something that takes hours to run is very difficult. Let's say you have a case which reproduces in 8 hours 50% of the time. To reach 99% confidence level in a fix I'd have to run it for 8 * (50 log: 2) hours without seeing it reproduce, right? That's nearly 2 days; it could take weeks to fix :-(
_,,,^..^,,,_ best, Eliot
Hi Eliot,
The snippet below, evaluated from a workspace, triggered the issue in less than a minute for me, three times in a row. Both processes will halt if #sloppyWaitForDataIfClosed: doesn't return within a second. If you send #dataAvailable to the socket, you'll find that it has data ready to be read, but its readSemaphore has no signal.
Levente
Socket compile: 'sloppyWaitForDataIfClosed: closedBlock
	[(socketHandle ~~ nil and: [self primSocketReceiveDataAvailable: socketHandle]) ifTrue: [^self].
	self isConnected ifFalse: [^closedBlock value].
	self readSemaphore wait] repeat' classified: 'waiting'.

[ listenerSocket := Socket newTCP.
listenerSocket listenOn: 0 backlogSize: 4 interface: #[127 0 0 1].
clientSocket := Socket newTCP.
clientSocket connectTo: #[127 0 0 1] port: listenerSocket localPort.
clientSocket waitForConnectionFor: 1.
self assert: clientSocket isConnected.
serverSocket := listenerSocket waitForAcceptFor: 1.
self assert: serverSocket isConnected ] ensure: [ listenerSocket destroy ].

serverProcess := [ | shouldRun buffer bytesReceived waitDuration |
	shouldRun := true.
	buffer := ByteString new: 10.
	waitDuration := 1 second.
	[ [ serverSocket sloppyWaitForDataIfClosed: [ shouldRun := false ] ]
		valueWithin: waitDuration onTimeout: [ self halt ].
	buffer atAllPut: (Character value: 0).
	bytesReceived := serverSocket receiveDataInto: buffer.
	self assert: bytesReceived = 4.
	self assert: (buffer first: 4) = 'PING'.
	serverSocket sendData: 'PONG' ] repeat ] newProcess.
clientProcess := [ | shouldRun buffer bytesReceived waitDuration |
	shouldRun := true.
	buffer := ByteString new: 10.
	waitDuration := 1 second.
	[ clientSocket sendData: 'PING'.
	[ clientSocket sloppyWaitForDataIfClosed: [ shouldRun := false ] ]
		valueWithin: waitDuration onTimeout: [ self halt ].
	buffer atAllPut: (Character value: 0).
	bytesReceived := clientSocket receiveDataInto: buffer.
	self assert: bytesReceived = 4.
	self assert: (buffer first: 4) = 'PONG' ] repeat ] newProcess.
clientProcess priority: 39; resume.
serverProcess priority: 39; resume.
"Evaluate these after debugging: clientSocket destroy. serverSocket destroy."
On 24 Mar 2016, at 10:16, Levente Uzonyi leves@caesar.elte.hu wrote:
Hi Eliot,
The snippet below, evaluated from a workspace, triggered the issue in less than a minute for me, three times in a row. Both processes will halt if #sloppyWaitForDataIfClosed: doesn't return within a second. If you send #dataAvailable to the socket, you'll find that it has data ready to be read, but its readSemaphore has no signal.
does it happen if you have strace -o output.txt -f on the VM as well?
kind regards holger
Hi Holger,
The snippet doesn't seem to trigger the issue while strace is attached to the process, but it usually runs into the issue shortly after I stop strace.
Levente
Hi Levente,
Fabulous, thank you! Replace the self halts with e.g. self assert: (clientSocket dataAvailable = (clientSocket readSemaphore excessSignals > 0)), and we even have a test. I have work to do tomorrow, but I hope to be able to debug this soon. I'll add kqueue and epoll support when I fix it.
(Stephan, interesting suggestion to throw hardware at the problem, thank you).
On 25 Mar 2016, at 00:35, Eliot Miranda eliot.miranda@gmail.com wrote:
Dear Eliot,
Fabulous, thank you! replace the self halts with e.g. self assert: (clientSocket dataAvailable = (clientSocket readSemaphore excessSignals > 0)), and we even have a test. I have work to do tomorrow but hope I should be able to debug this soon. I'll add kqueue and epoll support when I fix it.
before going to kqueue, could you have a look at my hack to create a completely idle VM? Once the VM has moved to kqueue, it seems one cannot get a SIGIO for the "kqueue descriptor"[1] itself, and I don't know whether SIGIO keeps working for the individual file descriptors.
holger
[1] https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_event.c#L1735
Hi Holger,
let's collaborate on this. You and I, and lots of others, want this done. But in fixing the bug below, my priorities are to a) deliver a fix promptly, and b) write the code so that kqueue, epoll, or select is used based on the HAVE_KQUEUE and HAVE_EPOLL defines.
So if I deliver a fix don't think that I'm against having a proper event-driven or idling VM, I just don't have it in my head yet and want the bug fixed and kqueue/epoll available asap.
Hi Levente,
the first thing to report is that this isn't a Cog-specific bug. It also reproduces using the trunk VM sources (4.15.3) and a freshly built Linux VM with a 4.6 image.
Hi Eliot,
AFAIK the workaround came from the Qwaq project, and it had been there long before it was added to the Trunk.
Levente
Hi Levente,
Thanks very much. In Cog, the bug was that the sqAtomicOps.h implementation of atomic increment had been changed to use an intrinsic that operates on 32-bit or 64-bit quantities, but not on the 16-bit quantities that sqExternalSemaphores.c uses to coordinate the multiple-writers/single-reader signalSemaphoreWithIndex scheme. There may still be occasions in which signals are lost, but at least Levente's snippet runs for several minutes without a problem. Dave, it would be great to know why it fails in the Interpreter VM on Unix; the platform subsystem is quite different, so it's odd that we see a similar symptom.
On Mon, Mar 28, 2016 at 03:46:22PM -0700, Eliot Miranda wrote:
Hi Eliot,
I set up Levente's test with an interpreter VM, and I'll see if I can get any further insight. On my first test it ran for maybe 5 or 10 minutes before encountering a halt. A second test ran for about 10 minutes and now seems to have locked up my image, though the VM is running normally according to /proc/<pid>/status. I'm not sure what to make of that, but I'll try it a few more times and see if there is any repeatable pattern. The image I am using for this is a 4.6 image updated to more or less current trunk level, so I'm not sure how stable it is.
Dave
-- _,,,^..^,,,_ best, Eliot
vm-dev@lists.squeakfoundation.org