[Vm-dev] Squeak socket problem ... help!

Ron Teitelbaum ron at usmedrec.com
Fri Oct 10 03:56:03 UTC 2014


Hi All!

Thanks for your help!!

Trying to reproduce the error, I ran 1000 clients against our test cluster of
7 servers.

The fd count hovered around 250 and never made it above 490 (there were 4
service providers, so the main server should only handle around 250 people).
David, thanks for the pointers on how to find the fd count.  Very cool.
Since the fd count stayed well within proper limits and went both up and
down, I think you are correct that we may be looking at the wrong thing.
Could it be that we are closing a socket twice?  (See "0xffa5e0cc M
Socket>destroy 0xd9ba02d4: a(n) Socket" in the stack dump below.)  Göran
mentioned finding a report that on a very busy server you may end up closing
an already-closed socket whose descriptor has already been reused.  On a not
so busy system you can get away with it, so you may not see this error until
you put a very large load on it.
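
For anyone wondering why destroying twice would matter so much: once the
first close returns, the kernel is free to hand the same descriptor number to
the next socket that is created, so a second close can silently shut down a
completely unrelated connection.  A minimal C sketch of just that OS behavior
(not VM code, purely an illustration):

/* Minimal sketch (not VM code): closing the same descriptor number twice
 * can close an unrelated, reused socket out from under you. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int a = socket(AF_INET, SOCK_STREAM, 0);
    close(a);                                /* first close: fine */

    int b = socket(AF_INET, SOCK_STREAM, 0); /* kernel reuses the number */
    printf("a=%d b=%d\n", a, b);             /* typically the same value */

    close(a);                                /* "double close" of a really
                                                closes b                  */
    /* any later select() or read() on b now fails with EBADF (errno 9) */
    return 0;
}

After that, any select() on the victim socket fails with errno 9 (EBADF),
which is exactly what the log below shows.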

I captured the error: 

This repeated about 1000 times (no idea why it would repeat so many times):

errno 9
select: Bad file descriptor

Then:

Segmentation fault Fri Oct 10 02:46:17 2014


Squeak VM version: 4.0-2798 #1 Wed Nov 13 12:31:45 UTC 2013 gcc 4.4.7
[Production ITHB VM]
Built from: CoInterpreter VMMaker.oscog-eem.496 uuid:
f0401045-f2f7-470e-9940-3535be9c0334 Nov 13 2013
With: StackToRegisterMappingCogit VMMaker.oscog-eem.496 uuid:
f0401045-f2f7-470e-9940-3535be9c0334 Nov 13 2013
Revision: VM: r2798 http://www.squeakvm.org/svn/squeak/branches/Cog
Plugins: r2545 http://squeakvm.org/svn/squeak/trunk/platforms/Cross/plugins
Build host: Linux cogvm 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux
plugin path: /home/openqwaq/server/bin/forums/../qwaqvm/plugins [default:
/home/openqwaq/server/bin/cogvm/lib/squeak/4.0-2798/]


C stack backtrace & registers:
        eax 0xffa191a4 ebx 0xffa190c0 ecx 0xffa19158 edx 0xffa1910c
        edi 0xffa18f90 esi 0xffa18f90 ebp 0xffa19028 esp 0xffa19074
        eip 0xffa19288
*[0xffa19288]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805e54e]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805e84f]
[0x5ad410]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(sqSocketCloseConnection+0x26)[0x811f0a6]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(sqSocketDestroy+0x40)[0x811fad0]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x811c9f8]
[0xd7444848]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(interpret+0x76e)[0x8084eee]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(main+0x2b4)[0x805efa4]
/lib/libc.so.6(__libc_start_main+0xe6)[0x2f4d36]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805bd41]
[0xffa18c80]


Smalltalk stack dump:
0xffa5e0cc M Socket>destroy 0xd9ba02d4: a(n) Socket
0xffa5e0e4 M QServiceProviderClient>destroy 0xd9c33dd4: a(n) QServiceProviderClient
0xffa5e0fc M QServiceProviderClient>connectionClosed 0xd9c33dd4: a(n) QServiceProviderClient
0xffa5e114 M [] in QServiceProviderClient>processRequest 0xd9c33dd4: a(n) QServiceProviderClient
0xffa5e138 M Socket>waitForDataFor:ifClosed:ifTimedOut: 0xd9ba02d4: a(n) Socket
0xffa5e164 M QServiceProviderClient>processRequest 0xd9c33dd4: a(n) QServiceProviderClient
0xffa5e17c M [] in QServiceProviderClient>on:in: 0xd9c33dd4: a(n) QServiceProviderClient
0xffa5e194 M [] in QSafeProcessLoop>run 0xd9c35efc: a(n) QSafeProcessLoop
0xffa5e1b0 M BlockClosure>on:do: 0xda118824: a(n) BlockClosure
0xffa5e1d0 M QSafeProcessLoop>run 0xd9c35efc: a(n) QSafeProcessLoop
0xd9c646dc s [] in QSafeProcessLoop(Object)>fork:at:
0xd9c36020 s [] in BlockClosure>newProcess

Most recent primitives
primSocketConnectionStatus:
primSocketSendDone:
primSocket:sendData:startIndex:count:
millisecondClockValue
primitiveWait
millisecondClockValue
millisecondClockValue
primitiveEnterCriticalSection
primitiveEnterCriticalSection
at:put:
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
primitiveExitCriticalSection
new:
at:put:
at:put:
updateCrc:from:to:in:
updateCrc:from:to:in:
bitXor:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
signal
primitiveEnterCriticalSection
primitiveEnterCriticalSection
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
at:put:
at:put:
primitiveExitCriticalSection
primSocketConnectionStatus:
primSocketSendDone:
primSocketConnectionStatus:
primSocketSendDone:
primSocket:sendData:startIndex:count:
millisecondClockValue
primitiveWait
millisecondClockValue
primitiveWait
primitiveWait
primitiveWait
primitiveWait
primitiveWait
primitiveWait
millisecondClockValue
millisecondClockValue
basicNew:
at:put:
basicNew:
integerAt:put:size:signed:
basicNew:
primitiveARC4Process:from:to:m:x:y:
findSubstring:in:startingAt:matchTable:
basicNew
stringHash:initialHash:
hashBytes:startingWith:
basicNew
at:put:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
stringHash:initialHash:
basicNew
stringHash:initialHash:
primitiveEnterCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
at:put:
at:put:
signal
millisecondClockValue
millisecondClockValue
primitiveEnterCriticalSection
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
primitiveExitCriticalSection
new:
at:put:
at:put:
updateCrc:from:to:in:
updateCrc:from:to:in:
bitXor:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
primitiveEnterCriticalSection
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
at:put:
at:put:
primitiveExitCriticalSection
primSocketConnectionStatus:
primSocketSendDone:
primSocketConnectionStatus:
primSocketSendDone:
primSocket:sendData:startIndex:count:
millisecondClockValue
primitiveWait
basicNew
basicNew
primitiveEnterCriticalSection
new:
replaceFrom:to:with:startingAt:
at:put:
new:
replaceFrom:to:with:startingAt:
at:put:
primitiveExitCriticalSection
basicNew
basicNew
primitiveEnterCriticalSection
new:
replaceFrom:to:with:startingAt:
at:put:
new:
replaceFrom:to:with:startingAt:
at:put:
primitiveExitCriticalSection
basicNew
millisecondClockValue
primitiveWait
primitiveWait
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
signal
primitiveWait
primitiveEnterCriticalSection
replaceFrom:to:with:startingAt:
primSocketConnectionStatus:
primitiveExitCriticalSection
suspend
primSocketReceiveDataAvailable:
primSocketReceiveDataAvailable:
primSocketConnectionStatus:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
basicNew
millisecondClockValue
=
digitMultiply:neg:
<
basicNew
pointsTo:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
primitiveWait
at:put:
signal
signal
basicNew
primitiveWait
at:put:
signal
signal
primSocketConnectionStatus:
primSocketDestroy:

stack page bytes 4096 available headroom 3300 minimum unused headroom 2164

        (Segmentation fault)
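
A note on the repeated message above: errno 9 is EBADF, and "select: Bad
file descriptor" is exactly what you get when select() is asked to wait on a
descriptor that has already been closed.  Presumably the VM's polling loop
just keeps retrying, which would explain the message repeating about 1000
times before the crash.  A minimal C sketch of just the OS-level behavior
(not VM code):

/* Minimal sketch (not VM code): select() fails with EBADF (errno 9) when a
 * descriptor that has already been closed is still registered in one of its
 * fd sets.  perror() then prints "select: Bad file descriptor". */
#include <stdio.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    close(fd);                      /* destroyed, but still in the read set */

    struct timeval timeout = { 0, 0 };
    if (select(fd + 1, &readfds, NULL, NULL, &timeout) < 0)
        perror("select");           /* prints: select: Bad file descriptor  */
    return 0;
}

So any code path that destroys a socket but leaves it registered for polling
would produce a flood of exactly these messages.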

> From: David T. Lewis
> Sent: Thursday, October 09, 2014 8:22 PM
> 
> On Fri, Oct 10, 2014 at 02:09:08AM +0200, Göran Krampe wrote:
> >
> > Hi guys!

Hi Eliot, hope you are happy and well!

> >
> > On 10/10/2014 01:28 AM, David T. Lewis wrote:
> > >>>Ron and I (3DICC) have a problem with the Unix VM networking and I
> > >>>am reaching out before burning too many hours on something one of
> > >>>you C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck
> > >>>for your trouble.
> > >>
> > >>Cool.  This is likely easy to fix.  Your image is running out of
> > >>file descriptors.  Track open and close calls, e.g. add logging
> > >>around at least StandardFileStream>>#primOpen:writable:
> > >>, AsyncFile>>#primOpen:forWrite:semaIndex:,
> > >>Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:
> > >>and their associated close calls and see what's being opened without
> > >>being closed.  It should be easy to track down, but may be more
> > >>difficult to fix.
> > >>
> > >>Good luck!
> >
> > Aha. Soo... am I understanding this correctly - we are probably
> > leaking fds and when we go above 1024 this makes select() go bonkers
> > and eventually leads to the "Bad file descriptor" error?
> >
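
On Eliot's suggestion to track the opens and closes: besides logging in the
image, one OS-level option we may try is an LD_PRELOAD shim that logs every
socket() and close() the VM makes.  A rough sketch, assuming Linux/glibc (the
file name fdlog.c is just for the example, not anything that exists today):

/* Minimal sketch: an LD_PRELOAD shim that logs socket() and close() calls.
 * Build with:  gcc -shared -fPIC -o fdlog.so fdlog.c -ldl
 * Run the VM with LD_PRELOAD pointing at fdlog.so. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int socket(int domain, int type, int protocol)
{
    static int (*real_socket)(int, int, int);
    if (!real_socket)
        real_socket = (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");
    int fd = real_socket(domain, type, protocol);
    fprintf(stderr, "socket() -> fd %d\n", fd);
    return fd;
}

int close(int fd)
{
    static int (*real_close)(int);
    if (!real_close)
        real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");
    fprintf(stderr, "close(fd %d)\n", fd);
    return real_close(fd);
}

Grepping that log for descriptors that never get a matching close, or get two
of them, should point at the same culprit Eliot's image-side logging would.
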
> > >I agree with what Eliot is saying and would add a few thoughts:
> >
> > >- Don't fix the wrong problem (DFtWP). Unless you have some reason to
> > >believe that this server application would realistically have a need
> > >to handle anything close to a thousand concurrent TCP sessions, don't
> > >fix it by raising the per-process file handle limit, and don't fix it
> > >by reimplementing the socket listening code.
> >
> > We haven't done the exact numbers, but we could probably hit several
> > hundreds concurrent at least. 1024 seemed a bit "over the top" though
> > :)
> >
> > The system in question is meant to serve more than 1000 concurrent
> > users, so we are in fact moving into this territory. We have been up
> > to around 600 so far.
> >
> > >- It is entirely possible that no one before you has ever tried to
> > >run a server application with the per-process file handle limit
> > >bumped up above the default 1024. So if that configuration does not
> > >play nicely with the select() mechanism, you may well be the first to
> > >have encountered this as an issue. But see above, don't fix it if it
> > >ain't broke.
> >
> > Well, it most probably *is* broke - I mean - I haven't read anywhere
> > that our Socket code is limited to 1024 concurrent sockets and that
> > going above that limit causes the Socket code to stop working? :)
> >
> > But I agree - I don't want to touch that code if we can simply avoid
> > this bug by making sure we stay below 1024.
> >
> > But it sounds broke to me, nevertheless. ;)
> 
> Indeed it probably is.
> 
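On the 1024 question: the ceiling Göran is bumping into is FD_SETSIZE.  On
Linux/glibc an fd_set is a fixed bit array of 1024 bits, so select() simply
cannot watch a descriptor numbered 1024 or higher, and calling FD_SET() on
one writes outside the fd_set, which is exactly the kind of memory corruption
that could end in a segfault.  Raising the per-process limit alone doesn't
help a select()-based loop; it has to stay below FD_SETSIZE or move to
poll()/epoll.  A small C check, just to illustrate the boundary:

/* Minimal sketch: FD_SETSIZE is the hard ceiling for select().  Descriptors
 * at or above it must never be passed to FD_SET(), or the macro writes past
 * the end of the fd_set bit array. */
#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    fd_set readfds;
    FD_ZERO(&readfds);
    printf("FD_SETSIZE = %d\n", (int)FD_SETSIZE);  /* 1024 on Linux/glibc */

    int fd = 1500;                      /* a descriptor above the ceiling */
    if (fd < FD_SETSIZE)
        FD_SET(fd, &readfds);
    else
        fprintf(stderr, "fd %d cannot be used with select()\n", fd);
    return 0;
}
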
> >
> > >- Most "out of file descriptor" problems involve resource leaks (as
> > >Eliot is suggesting), and in those cases you will see a gradual
> > >increase in file descriptors in /proc/<vmpid>/fd/ over time.

Thank you for the pointer.  That's a pretty cool directory to look at!
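
For anyone else who wants to watch it, here is a little sketch of the idea:
count the entries in /proc/<pid>/fd (the program name and pid argument are
just for the example):

/* Minimal sketch: count a process's open descriptors by listing the entries
 * in /proc/<pid>/fd.  Pass the VM's pid as the first argument, otherwise it
 * falls back to /proc/self/fd. */
#include <dirent.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%s/fd", argc > 1 ? argv[1] : "self");

    DIR *dir = opendir(path);
    if (dir == NULL) { perror(path); return 1; }

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL)
        if (entry->d_name[0] != '.')         /* skip "." and ".." */
            count++;
    closedir(dir);

    printf("%s: %d open descriptors\n", path, count);
    return 0;
}

Running it every few seconds is enough to see whether the count creeps up
over time, which is the leak signature Dave describes.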

> > >Eventually you run out of descriptors and something horrible happens.
> >
> > We will start looking at that and other tools too.
> >
> >
> > >- Sorry to repeat myself but this is by far the most important point:
> > >DFtWP.
> >
> > Sure :). This is why I posted - to get your input. And I have a
> > suspicion that the SAML issue I mentioned may be the code that is
> > leaking; we will start looking.
> >
> > regards, Göran
> 
> Cool. Please keep us posted on progress. In the unlikely event that free
> advice turns out to be worth anything, there is plenty more where that
> came from ;-)
> 
> Dave
> 

Ron Teitelbaum



