[Vm-dev] Re: Squeak socket problem ... help!

Max Leske maxleske at gmail.com
Fri Oct 10 05:13:21 UTC 2014


Disclaimer:

I just realized I’m not 100% sure the scenario I described was the actual cause of errno 9. It might have been another thing we did, where we tried to reuse the memory handle for the socket (don’t judge… we duly facepalmed ourselves when we realized how silly that was :) ).

Still, the basic idea remains the same: make sure you don’t mess with sockets.
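
If you use systemd socket activation, the robust way around the fixed-fd assumption described in the quoted mail below is to ask systemd which descriptors it passed instead of assuming the first one. A minimal C sketch with libsystemd (illustrative only, not the code we actually ran; the file name is made up):

    /* activation_fd.c -- query systemd for activated sockets instead of
       hard-coding descriptor 3.  Build with: gcc activation_fd.c -lsystemd */
    #include <stdio.h>
    #include <systemd/sd-daemon.h>

    int main(void)
    {
        int n = sd_listen_fds(0);   /* how many fds were passed to this process */
        if (n < 1) {
            fprintf(stderr, "no socket passed by systemd\n");
            return 1;
        }
        /* passed sockets occupy SD_LISTEN_FDS_START .. SD_LISTEN_FDS_START + n - 1 */
        printf("first activated socket is fd %d (%d passed)\n", SD_LISTEN_FDS_START, n);
        return 0;
    }

sd_listen_fds() also checks (via LISTEN_PID) that the descriptors were really meant for this process, which is a lot safer than trusting that fd 3 is still what you think it is.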

> Just my two cents:
> 
> We ran into the “errno 9 select: Bad file descriptor” error two weeks back while playing around with systemd and socket activation.
> In our case it was the following:
> 1. send request to socket -> image starts through socket activation
> 2. image initializes itself with the special systemd socket handle and keeps track of it (because we want to reuse it after restarting the image)
> 3. save the image
> 4. errno 9
> 
> What we found was that when the image was being saved, stderr would be closed and then reopened by the system, as would the systemd socket (which was a different problem). Unfortunately we had always assumed a fixed fd number for the systemd socket, namely 3. Now, when the image reopened stderr it could happen that another socket had taken fd 2 and stderr ended up on fd 3, so we tried to use stderr as a TCP socket.
> 
> What I want to say is: really make sure that you’re not doing something funny with sockets (e.g. a fixed fd number) before you look at the VM.
> 
> Good luck!
> 
> 
> Cheers,
> Max
> 
>> On 10.10.2014, at 05:56, vm-dev-request at lists.squeakfoundation.org wrote:
>> 
>> Date: Thu, 9 Oct 2014 23:56:03 -0400
>> From: "Ron Teitelbaum" <ron at usmedrec.com>
>> Subject: RE: [Vm-dev] Squeak socket problem ... help!
>> To: "'Squeak Virtual Machine Development Discussion'"
>> 	<vm-dev at lists.squeakfoundation.org>
>> Message-ID: <009601cfe43e$1f0ffb00$5d2ff100$@usmedrec.com>
>> Content-Type: text/plain;	charset="iso-8859-1"
>> 
>> Hi All!
>> 
>> Thanks for your help!!
>> 
>> Trying to reproduce the error, I ran 1000 clients against our test cluster of
>> 7 servers.
>> 
>> The fd count hovered around 250 and never made it above 490 (there were 4
>> service providers, so the main server should only handle around 250 people).
>> David, thanks for the pointers on how to find the fd count.  Very cool.
>> Since the fd count stayed well within proper limits and went both up
>> and down, I think you are correct that we may be looking at the wrong thing.
>> Could it be that we are closing a socket twice?  (See 0xffa5e0cc M
>> Socket>destroy 0xd9ba02d4: a(n) Socket below.)  Göran mentioned finding something
>> that said that on a very busy server you may end up closing an already closed socket
>> whose descriptor has already been reused.  On a not-so-busy system you can get away with
>> it, so you may not see this error until you put a very large load on it.
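
One way a double destroy on a reused descriptor could produce exactly the repeating error captured below: once a descriptor that is still registered for select() has been closed, every subsequent select() call fails immediately with EBADF, so a polling loop that just retries prints the same two lines over and over. A stand-alone C toy that reproduces the symptom (this is not the VM's actual aio loop, only the generic select() behaviour):

    /* ebadf_spin.c -- a closed fd left in a select() set makes every call
       fail with errno 9 (EBADF); a retry loop then spins on the error.   */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/select.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = dup(0);                 /* any valid descriptor ...            */
        close(fd);                       /* ... closed elsewhere (second close) */
        for (int i = 0; i < 3; i++) {    /* stand-in for an event/poll loop     */
            fd_set rd;
            struct timeval tv = { 0, 10000 };
            FD_ZERO(&rd);
            FD_SET(fd, &rd);             /* stale fd still registered           */
            if (select(fd + 1, &rd, NULL, NULL, &tv) < 0)
                fprintf(stderr, "errno %d\nselect: %s\n", errno, strerror(errno));
        }
        return 0;
    }

Each iteration prints "errno 9" / "select: Bad file descriptor", which would explain the message repeating until something else finally brought the VM down.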
>> 
>> I captured the error: 
>> 
>> This repeated about 1000 times (no idea why it would repeat so many times):
>> 
>> errno 9
>> select: Bad file descriptor
>> 
>> Then:
>> 
>> Segmentation fault Fri Oct 10 02:46:17 2014
>> 
>> 
>> Squeak VM version: 4.0-2798 #1 Wed Nov 13 12:31:45 UTC 2013 gcc 4.4.7 [Production ITHB VM]
>> Built from: CoInterpreter VMMaker.oscog-eem.496 uuid: f0401045-f2f7-470e-9940-3535be9c0334 Nov 13 2013
>> With: StackToRegisterMappingCogit VMMaker.oscog-eem.496 uuid: f0401045-f2f7-470e-9940-3535be9c0334 Nov 13 2013
>> Revision: VM: r2798 http://www.squeakvm.org/svn/squeak/branches/Cog
>> Plugins: r2545 http://squeakvm.org/svn/squeak/trunk/platforms/Cross/plugins
>> Build host: Linux cogvm 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>> plugin path: /home/openqwaq/server/bin/forums/../qwaqvm/plugins [default: /home/openqwaq/server/bin/cogvm/lib/squeak/4.0-2798/]
>> 
>> 
>> C stack backtrace & registers:
>>        eax 0xffa191a4 ebx 0xffa190c0 ecx 0xffa19158 edx 0xffa1910c
>>        edi 0xffa18f90 esi 0xffa18f90 ebp 0xffa19028 esp 0xffa19074
>>        eip 0xffa19288
>> *[0xffa19288]
>> /home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805e54e]
>> /home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805e84f]
>> [0x5ad410]
>> /home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(sqSocketCloseConnection+0x26)[0x811f0a6]
>> /home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(sqSocketDestroy+0x40)[0x811fad0]
>> /home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x811c9f8]
>> [0xd7444848]
>> /home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(interpret+0x76e)[0x8084eee]
>> /home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(main+0x2b4)[0x805efa4]
>> /lib/libc.so.6(__libc_start_main+0xe6)[0x2f4d36]
>> /home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805bd41]
>> [0xffa18c80]
>> 
>> 
>> Smalltalk stack dump:
>> 0xffa5e0cc M Socket>destroy 0xd9ba02d4: a(n) Socket
>> 0xffa5e0e4 M QServiceProviderClient>destroy 0xd9c33dd4: a(n) QServiceProviderClient
>> 0xffa5e0fc M QServiceProviderClient>connectionClosed 0xd9c33dd4: a(n) QServiceProviderClient
>> 0xffa5e114 M [] in QServiceProviderClient>processRequest 0xd9c33dd4: a(n) QServiceProviderClient
>> 0xffa5e138 M Socket>waitForDataFor:ifClosed:ifTimedOut: 0xd9ba02d4: a(n) Socket
>> 0xffa5e164 M QServiceProviderClient>processRequest 0xd9c33dd4: a(n) QServiceProviderClient
>> 0xffa5e17c M [] in QServiceProviderClient>on:in: 0xd9c33dd4: a(n) QServiceProviderClient
>> 0xffa5e194 M [] in QSafeProcessLoop>run 0xd9c35efc: a(n) QSafeProcessLoop
>> 0xffa5e1b0 M BlockClosure>on:do: 0xda118824: a(n) BlockClosure
>> 0xffa5e1d0 M QSafeProcessLoop>run 0xd9c35efc: a(n) QSafeProcessLoop
>> 0xd9c646dc s [] in QSafeProcessLoop(Object)>fork:at:
>> 0xd9c36020 s [] in BlockClosure>newProcess
>> 
>> Most recent primitives
>> primSocketConnectionStatus:
>> primSocketSendDone:
>> primSocket:sendData:startIndex:count:
>> millisecondClockValue
>> primitiveWait
>> millisecondClockValue
>> millisecondClockValue
>> primitiveEnterCriticalSection
>> primitiveEnterCriticalSection
>> at:put:
>> findNextUnwindContextUpTo:
>> tempAt:
>> tempAt:put:
>> tempAt:
>> terminateTo:
>> primitiveExitCriticalSection
>> findNextUnwindContextUpTo:
>> terminateTo:
>> primitiveExitCriticalSection
>> new:
>> at:put:
>> at:put:
>> updateCrc:from:to:in:
>> updateCrc:from:to:in:
>> bitXor:
>> at:put:
>> at:put:
>> at:put:
>> at:put:
>> at:put:
>> at:put:
>> at:put:
>> at:put:
>> replaceFrom:to:with:startingAt:
>> primitiveEnterCriticalSection
>> at:put:
>> primitiveExitCriticalSection
>> primitiveEnterCriticalSection
>> =
>> at:put:
>> primitiveExitCriticalSection
>> primitiveEnterCriticalSection
>> at:put:
>> primitiveExitCriticalSection
>> replaceFrom:to:with:startingAt:
>> primitiveEnterCriticalSection
>> at:put:
>> primitiveExitCriticalSection
>> primitiveEnterCriticalSection
>> =
>> at:put:
>> primitiveExitCriticalSection
>> primitiveEnterCriticalSection
>> at:put:
>> primitiveExitCriticalSection
>> signal
>> primitiveEnterCriticalSection
>> primitiveEnterCriticalSection
>> findNextUnwindContextUpTo:
>> tempAt:
>> tempAt:put:
>> tempAt:
>> terminateTo:
>> primitiveExitCriticalSection
>> findNextUnwindContextUpTo:
>> terminateTo:
>> at:put:
>> at:put:
>> primitiveExitCriticalSection
>> primSocketConnectionStatus:
>> primSocketSendDone:
>> primSocketConnectionStatus:
>> primSocketSendDone:
>> primSocket:sendData:startIndex:count:
>> millisecondClockValue
>> primitiveWait
>> millisecondClockValue
>> primitiveWait
>> primitiveWait
>> primitiveWait
>> primitiveWait
>> primitiveWait
>> primitiveWait
>> millisecondClockValue
>> millisecondClockValue
>> basicNew:
>> at:put:
>> basicNew:
>> integerAt:put:size:signed:
>> basicNew:
>> primitiveARC4Process:from:to:m:x:y:
>> findSubstring:in:startingAt:matchTable:
>> basicNew
>> stringHash:initialHash:
>> hashBytes:startingWith:
>> basicNew
>> at:put:
>> replaceFrom:to:with:startingAt:
>> replaceFrom:to:with:startingAt:
>> stringHash:initialHash:
>> basicNew
>> stringHash:initialHash:
>> primitiveEnterCriticalSection
>> primitiveEnterCriticalSection
>> at:put:
>> primitiveExitCriticalSection
>> at:put:
>> at:put:
>> signal
>> millisecondClockValue
>> millisecondClockValue
>> primitiveEnterCriticalSection
>> primitiveExitCriticalSection
>> primitiveEnterCriticalSection
>> at:put:
>> findNextUnwindContextUpTo:
>> tempAt:
>> tempAt:put:
>> tempAt:
>> terminateTo:
>> primitiveExitCriticalSection
>> findNextUnwindContextUpTo:
>> terminateTo:
>> primitiveExitCriticalSection
>> new:
>> at:put:
>> at:put:
>> updateCrc:from:to:in:
>> updateCrc:from:to:in:
>> bitXor:
>> at:put:
>> at:put:
>> at:put:
>> at:put:
>> at:put:
>> at:put:
>> at:put:
>> at:put:
>> replaceFrom:to:with:startingAt:
>> primitiveEnterCriticalSection
>> at:put:
>> primitiveExitCriticalSection
>> primitiveEnterCriticalSection
>> =
>> at:put:
>> primitiveExitCriticalSection
>> primitiveEnterCriticalSection
>> at:put:
>> primitiveExitCriticalSection
>> replaceFrom:to:with:startingAt:
>> primitiveEnterCriticalSection
>> at:put:
>> primitiveExitCriticalSection
>> primitiveEnterCriticalSection
>> =
>> at:put:
>> primitiveExitCriticalSection
>> primitiveEnterCriticalSection
>> at:put:
>> primitiveExitCriticalSection
>> primitiveEnterCriticalSection
>> primitiveEnterCriticalSection
>> findNextUnwindContextUpTo:
>> tempAt:
>> tempAt:put:
>> tempAt:
>> terminateTo:
>> primitiveExitCriticalSection
>> findNextUnwindContextUpTo:
>> terminateTo:
>> at:put:
>> at:put:
>> primitiveExitCriticalSection
>> primSocketConnectionStatus:
>> primSocketSendDone:
>> primSocketConnectionStatus:
>> primSocketSendDone:
>> primSocket:sendData:startIndex:count:
>> millisecondClockValue
>> primitiveWait
>> basicNew
>> basicNew
>> primitiveEnterCriticalSection
>> new:
>> replaceFrom:to:with:startingAt:
>> at:put:
>> new:
>> replaceFrom:to:with:startingAt:
>> at:put:
>> primitiveExitCriticalSection
>> basicNew
>> basicNew
>> primitiveEnterCriticalSection
>> new:
>> replaceFrom:to:with:startingAt:
>> at:put:
>> new:
>> replaceFrom:to:with:startingAt:
>> at:put:
>> primitiveExitCriticalSection
>> basicNew
>> millisecondClockValue
>> primitiveWait
>> primitiveWait
>> primitiveWait
>> signal
>> primitiveWait
>> signal
>> primitiveWait
>> signal
>> primitiveWait
>> signal
>> primitiveWait
>> signal
>> primitiveWait
>> signal
>> signal
>> primitiveWait
>> primitiveEnterCriticalSection
>> replaceFrom:to:with:startingAt:
>> primSocketConnectionStatus:
>> primitiveExitCriticalSection
>> suspend
>> primSocketReceiveDataAvailable:
>> primSocketReceiveDataAvailable:
>> primSocketConnectionStatus:
>> basicNew:
>> replaceFrom:to:with:startingAt:
>> replaceFrom:to:with:startingAt:
>> basicNew
>> millisecondClockValue
>> =
>> digitMultiply:neg:
>> <
>> basicNew
>> pointsTo:
>> basicNew:
>> replaceFrom:to:with:startingAt:
>> replaceFrom:to:with:startingAt:
>> basicNew:
>> replaceFrom:to:with:startingAt:
>> replaceFrom:to:with:startingAt:
>> basicNew:
>> replaceFrom:to:with:startingAt:
>> replaceFrom:to:with:startingAt:
>> primitiveWait
>> at:put:
>> signal
>> signal
>> basicNew
>> primitiveWait
>> at:put:
>> signal
>> signal
>> primSocketConnectionStatus:
>> primSocketDestroy:
>> 
>> stack page bytes 4096 available headroom 3300 minimum unused headroom 2164
>> 
>>        (Segmentation fault)
>> 
>>> From: David T. Lewis
>>> Sent: Thursday, October 09, 2014 8:22 PM
>>> 
>>> On Fri, Oct 10, 2014 at 02:09:08AM +0200, Göran Krampe wrote:
>>>> 
>>>> Hi guys!
>> 
>> Hi Eliot, hope you are happy and well!
>> 
>>>> 
>>>> On 10/10/2014 01:28 AM, David T. Lewis wrote:
>>>>>>> Ron and I (3DICC) have a problem with the Unix VM networking and I
>>>>>>> am reaching out before burning too many hours on something one of
>>>>>>> you C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck
>>>>>>> for your trouble.
>>>>>> 
>>>>>> Cool.  This is likely easy to fix.  Your image is running out of
>>>>>> file descriptors.  Track open and close calls, e.g. add logging
>>>>>> around at least StandardFileStream>>#primOpen:writable:,
>>>>>> AsyncFile>>#primOpen:forWrite:semaIndex:,
>>>>>> Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:
>>>>>> and their associated close calls and see what's being opened without
>>>>>> being closed.  It should be easy to track down, but may be more
>>>>>> difficult to fix.
>>>>>> 
>>>>>> Good luck!
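
Eliot's suggestion above is image-side logging around the file and socket primitives. A complementary trick at the OS level, not something proposed in the thread but occasionally handy when you'd rather not touch the image, is an LD_PRELOAD shim that logs every socket()/close() with its descriptor, so leaks and double closes show up in plain text. A rough sketch (file name and log format are made up):

    /* fdlog.c -- interpose socket() and close() to trace descriptor usage.
       Build: gcc -shared -fPIC -o fdlog.so fdlog.c -ldl
       Run:   LD_PRELOAD=./fdlog.so squeak my.image 2> fdlog.txt            */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>

    int socket(int domain, int type, int protocol)
    {
        static int (*real_socket)(int, int, int);
        if (!real_socket)
            real_socket = (int (*)(int, int, int)) dlsym(RTLD_NEXT, "socket");
        int fd = real_socket(domain, type, protocol);
        fprintf(stderr, "socket() -> %d\n", fd);
        return fd;
    }

    int close(int fd)
    {
        static int (*real_close)(int);
        if (!real_close)
            real_close = (int (*)(int)) dlsym(RTLD_NEXT, "close");
        fprintf(stderr, "close(%d)\n", fd);
        return real_close(fd);
    }

Two close() calls on the same number with no intervening socket()/accept() is the double-destroy pattern suspected earlier in this mail.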
>>>> 
>>>> Aha. Soo... am I understanding this correctly - we are probably
>>>> leaking fds and when we go above 1024 this makes select() go bonkers
>>>> and eventually leads to the "Bad file descriptor" error?
>>>> 
>>>>> I agree with what Eliot is saying and would add a few thoughts:
>>>> 
>>>>> - Don't fix the wrong problem (DFtWP). Unless you have some reason to
>>>>> believe that this server application would realistically have a need
>>>>> to handle anything close to a thousand concurrent TCP sessions, don't
>>>>> fix it by raising the per-process file handle limit, and don't fix it
>>>>> by reimplementing the socket listening code.
>>>> 
>>>> We haven't done the exact numbers, but we could probably hit several
>>>> hundred concurrent connections at least. 1024 seemed a bit "over the top" though
>>>> :)
>>>> 
>>>> The system in question is meant to serve more than 1000 concurrent
>>>> users, so we are in fact moving into this territory. We have been up
>>>> to around 600 so far.
>>>> 
>>>>> - It is entirely possible that no one before you has ever tried to
>>>>> run a server application with the per-process file handle limit
>>>>> bumped up above the default 1024. So if that configuration does not
>>>>> play nicely with the select() mechanism, you may well be the first to
>>>>> have encountered this as an issue. But see above, don't fix it if it
>>>>> ain't broke.
>>>> 
>>>> Well, it most probably *is* broke - I mean - I haven't read anywhere
>>>> that our Socket code is limited to 1024 concurrent sockets and that
>>>> going above that limit causes the Socket code to stop working? :)
>>>> 
>>>> But I agree - I don't want to touch that code if we can simply avoid
>>>> this bug by making sure we stay below 1024.
>>>> 
>>>> But it sounds broke to me, nevertheless. ;)
>>> 
>>> Indeed it probably is.
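
Some background on why 1024 is the magic number: select() works on an fd_set, a fixed-size bitmap of FD_SETSIZE slots (1024 on typical Linux), so a descriptor numbered 1024 or above simply cannot be registered with select(), no matter how high the per-process limit is raised. The usual escape hatch is poll() (or epoll on Linux), which takes an explicit array and has no such ceiling. A minimal wait-for-readable in C, just to show the shape of the alternative (not a claim about what the VM's event loop should do):

    #include <poll.h>

    /* wait until fd is readable or timeout_ms elapses;
       returns > 0 if readable, 0 on timeout, -1 on error */
    int wait_readable(int fd, int timeout_ms)
    {
        struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
        return poll(&p, 1, timeout_ms);
    }

So the two honest options really are: keep the descriptor count below FD_SETSIZE, or move the event loop off select().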
>>> 
>>>> 
>>>>> - Most "out of file descriptor" problems involve resource leaks (as
>>>>> Eliot is suggesting), and in those cases you will see a gradual
>>>>> increase in file descriptors in /proc/<vmpid>/fd/ over time.
>> 
>> Thank you for the pointer.  That's a pretty cool directory to look at!
>> 
>>>>> Eventually you run out of descriptors and something horrible happens.
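
To turn David's /proc pointer into something you can sample while load-testing, here is a tiny illustrative helper in C (a shell one-liner over the same directory works just as well; the function name is made up):

    /* count the open descriptors of a process by listing /proc/<pid>/fd */
    #include <dirent.h>
    #include <stdio.h>

    int count_fds(long pid)
    {
        char path[64];
        struct dirent *e;
        int n = 0;
        snprintf(path, sizeof path, "/proc/%ld/fd", pid);
        DIR *d = opendir(path);
        if (!d)
            return -1;                   /* no such pid, or no permission */
        while ((e = readdir(d)) != NULL)
            if (e->d_name[0] != '.')     /* skip "." and ".."             */
                n++;
        closedir(d);
        return n;
    }

Sampled every few seconds during a test run, a leak shows up as a number that climbs and never comes back down, unlike the healthy up-and-down pattern reported above.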
>>>> 
>>>> We will start looking at that and other tools too.
>>>> 
>>>> 
>>>>> - Sorry to repeat myself but this is by far the most important point:
>>>>> DFtWP.
>>>> 
>>>> Sure :). This is why I posted - to get your input. And I have a
>>>> suspicion that the SAML issue I mentioned may be where the code is leaking; we
>>>> will start looking.
>>>> 
>>>> regards, Göran
>>> 
>>> Cool. Please keep us posted on progress. In the unlikely event that free
>>> advice turns out to be worth anything, there is plenty more where that
>>> came from ;-)
>>> 
>>> Dave
>>> 
>> 
>> Ron Teitelbaum
> 

