[Vm-dev] Squeak socket problem ... help!

David T. Lewis lewis at mail.msen.com
Thu Oct 9 23:28:14 UTC 2014


On Thu, Oct 09, 2014 at 03:27:05PM -0700, Eliot Miranda wrote:
>  
> Hi Both,
> 
> On Thu, Oct 9, 2014 at 3:05 PM, Göran Krampe <goran at krampe.se> wrote:
> 
> >
> > Hi guys!
> >
> > Long email but... work for hire here! :)
> >
> > In short:
> >
> > Ron and I (3DICC) have a problem with the Unix VM networking and I am
> > reaching out before burning too many hours on something one of you
> > C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your
> > trouble.
> >
> 
> Cool.  This is likely easy to fix.  Your image is running out of file
> descriptors.  Track open and close calls, e.g. add logging around at
> least StandardFileStream>>#primOpen:writable:
> , AsyncFile>>#primOpen:forWrite:semaIndex:,
> Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:
> and their associated close calls and see what's being opened without being
> closed.  It should be easy to track down, but may be more difficult to fix.
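[The same tracking can be done at the C level if instrumenting the image is
inconvenient. A minimal sketch, not anything that exists in the VM source: a
wrapper macro that logs every close() with its call site, so stderr shows
which descriptors get opened but never closed. The same pattern works for
socket(), accept() and open().

    /* Hypothetical logging wrapper; LOGGED_CLOSE is not part of the
       Squeak VM -- it only illustrates the tracking described above. */
    #include <stdio.h>
    #include <unistd.h>

    #define LOGGED_CLOSE(fd) \
        (fprintf(stderr, "close(%d) at %s:%d\n", (fd), __FILE__, __LINE__), \
         close(fd))
]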
> 
> Good luck!

I agree with what Eliot is saying and would add a few thoughts:

- Don't fix the wrong problem (DFtWP). Unless you have some reason to
believe that this server application would realistically have a need to
handle anything close to a thousand concurrent TCP sessions, don't fix
it by raising the per-process file handle limit, and don't fix it by
reimplementing the socket listening code.

- It is entirely possible that no one before you has ever tried to run
a server application with the per-process file handle limit bumped up
above the default 1024. So if that configuration does not play nicely
with the select() mechanism, you may well be the first to have encountered
this as an issue. But see above, don't fix it if it ain't broke.
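One concrete check along these lines: compare the capacity select() was
compiled with against the limit the process actually runs under. A small
standalone sketch, not part of the VM:

    /* Print the compiled-in select() capacity next to the per-process
       descriptor limit; any fd numbered >= FD_SETSIZE is beyond what
       select() can watch, no matter how high the ulimit goes. */
    #include <stdio.h>
    #include <sys/select.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        printf("FD_SETSIZE    = %d\n", FD_SETSIZE);
        printf("RLIMIT_NOFILE = %lu (hard %lu)\n",
               (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);
        return 0;
    }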

- Most "out of file descriptor" problems involve resource leaks (as Eliot
is suggesting), and in those cases you will see a gradual increase in file
descriptors in /proc/<vmpid>/fd/ over time. Eventually you run out of
descriptors and something horrible happens.

- Before doing anything else, you must confirm whether this is a resource leak,
with file descriptor use continuously increasing (which is what Eliot and
I are both assuming to be the case here), or a real resource
issue in which your server has a legitimate need to maintain a very large
number of TCP connections concurrently. Given that you have a running
application with real users, you will probably want to do this with something
like a shell script keeping track of the /proc/<pid>/fd/ directory for
the running VM. (On squeaksource.com, there is an undiagnosed file handle
leak similar to what I think you are experiencing. My kludgy workaround is
a process in the image that uses OSProcess to count entries in /proc/<pid>/fd/
and restart the image when the file descriptor situation becomes dire.)
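For reference, the counting part of such a watchdog is tiny. Here is a
standalone C sketch of the same idea - the pid and the alarm threshold are
placeholders, and this is not the actual squeaksource hack:

    /* Count open descriptors of a process by listing /proc/<pid>/fd. */
    #include <stdio.h>
    #include <dirent.h>

    static int count_fds(const char *pid)
    {
        char path[64];
        struct dirent *e;
        int n = 0;
        DIR *d;

        snprintf(path, sizeof path, "/proc/%s/fd", pid);
        if ((d = opendir(path)) == NULL)
            return -1;
        while ((e = readdir(d)) != NULL)
            if (e->d_name[0] != '.')        /* skip "." and ".." */
                n++;
        closedir(d);
        return n;
    }

    int main(void)
    {
        int n = count_fds("12345");         /* hypothetical VM pid */
        if (n > 900)                        /* illustrative threshold */
            fprintf(stderr, "fd count %d: time to restart\n", n);
        return 0;
    }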

- Finding file (socket) handle leaks is tricky, and if you have customers
depending on this, you probably do not have the luxury of fixing it right.
Is there any way to periodically restart the server image without causing
pain to the customers? If so, consider a kludgy workaround like the one I
did for squeaksource. Monitor the VM process for file handle leaks and restart
it proactively rather than waiting for a catastrophic failure. You can
do this all from within the image; I will dig out my squeaksource hack if
you think it may be of any help.

- Sorry to repeat myself but this is by far the most important point: DFtWP.

Dave

> 
> 
> > In looong:
> >
> > So... we are running a large deployment of Terf (yay!) that is "pushing"
> > the server side VMs a bit. The large load has caused us to experience some
> > issues.
> >
> >
> > Our Possibly Faulted Analysis So Far
> > ====================================
> >
> > One of our server side VMs, the one that unfortunately is a crucial
> > resource, locks up and stops responding on its most important listening
> > socket port. The VM does not crash, however. We reboot it, because it's a
> > stressful situation with LOTS of users being affected, so we haven't looked
> > "inside".
> >
> > Unfortunately the shell script starting the VMs wasn't catching stderr to
> > a log file (argh! now it does though, so we will see if we get more info
> > later), so we have missed some info here, but Ron "got lucky" and saw this
> > on his terminal (the stderr of the VM going to his terminal instead of a
> > log file):
> >
> > "errno 9
> > select: Bad file descriptor"
> >
> > It took us quite some time before we realized this was indeed Squeak
> > talking, and that it was from inside aio.c - a call from aioPoll():
> >         perror("select")
> >
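[As an aside, that exact output is easy to reproduce in isolation: select()
fails with EBADF when any descriptor in its sets is not open. A minimal
standalone sketch:

    /* select() on an already-closed descriptor produces precisely
       "errno 9" / "select: Bad file descriptor". */
    #include <stdio.h>
    #include <errno.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/select.h>

    int main(void)
    {
        fd_set fds;
        int fd = open("/dev/null", O_RDONLY);

        close(fd);                  /* fd is now stale */
        FD_ZERO(&fds);
        FD_SET(fd, &fds);
        if (select(fd + 1, &fds, NULL, NULL, NULL) < 0) {
            fprintf(stderr, "errno %d\n", errno);
            perror("select");
        }
        return 0;
    }
]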
> > ...ok. Some important background info:
> >
> > Before this we hit the default ulimit of 1024 open files per user (duh!),
> > causing "Too Many Files Open", so we raised the limits silly high. That
> > kept the system running - but in retrospect we think another issue (related
> > to SAML auth) caused tons of client requests to be spawned from this VM,
> > which is what made us reach the limit in the first place. It may also have
> > been the factor (= many, many sockets) that in the end caused the errno 9
> > described above - see below for our reasoning.
> >
> > After perusing the IMHO highly confusing :) Socket code (no offense of
> > course, it's probably trivial to a Unix C-guru) we at least understand that
> > the code uses select() and not a more modern poll() or epoll(). In fact
> > there is also a call to select() in sqUnixSocket.c, but... probably not
> > relevant.
> >
> > Yeah, epoll() is not portable etc, we know, but frankly we only care for
> > Linux here.
> >
> > Googling shows us further that select() has issues, I mean, yikes. And the
> > thing I think we might be hitting here is the fact that select() doesn't
> > handle more than 1024 file descriptors!!! (as far as I can understand the
> > writing on the Internet) and to make it worse, it seems to be able to go
> > bananas if you push it there...
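[The 1024 figure is FD_SETSIZE. On Linux with glibc, an fd_set is a fixed
bitmap of 1024 bits, and FD_SET on a descriptor numbered 1024 or higher
writes past the end of that bitmap - silent memory corruption, which fits
the "goes bananas" description. A defensive sketch of the range check any
select()-based loop needs:

    #include <stdio.h>
    #include <sys/select.h>

    /* Guard every FD_SET with a range check; a descriptor at or above
       FD_SETSIZE can never be watched with select(). */
    static int safe_fd_set(int fd, fd_set *set)
    {
        if (fd < 0 || fd >= FD_SETSIZE) {
            fprintf(stderr, "fd %d out of range for select() "
                            "(FD_SETSIZE = %d)\n", fd, FD_SETSIZE);
            return -1;
        }
        FD_SET(fd, set);
        return 0;
    }
]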
> >
> > Now... again, the Internet seems to imply that the "usual" cause of "errno
> > 9" is doing a select() on an fd that has already been closed. The typical
> > bug causing this is accidentally closing an fd twice - and thus, if you are
> > unlucky, the second close() lands after the fd number has already been
> > reused and is thus open again. Oops.
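[That double-close scenario is worth spelling out, because the second close()
does not fail - it silently closes whatever the kernel has since handed out
under the same number. A minimal sketch:

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void)
    {
        char buf[1];
        int a = open("/dev/null", O_RDONLY);   /* say fd 5 */

        close(a);                              /* fine */
        int b = open("/dev/zero", O_RDONLY);   /* kernel reuses fd 5 */
        close(a);                              /* bug: silently closes b */
        if (read(b, buf, 1) < 0)
            perror("read");                    /* read: Bad file descriptor */
        return 0;
    }
]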
> >
> > But it seems unreasonable to think that *such* a bug exists in this code
> > after all these years. And I am simply guessing it's the >1024 fd problem
> > biting us, but yeah, we don't know. And I also guess it was that SAML
> > issue, in combination with raised ulimits, that made us even get over 1024.
> >
> > Things we are planning:
> >
> > - Come up with a test case showing it blowing up. Will try to do that next
> > week.
> > - Start looking at how to use poll() or epoll() instead (see the sketch
> > below), because we need to be SOLID here and we can't afford the 1024 limit.
> >
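[For reference, the epoll() replacement being described is straightforward at
the C level, and descriptors of any number can be watched. A bare-bones
sketch, not the actual aio.c rework - serve() and listen_fd are assumed
names, with listen_fd an already-bound, listening socket:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/epoll.h>

    #define MAX_EVENTS 64

    void serve(int listen_fd)
    {
        struct epoll_event ev, events[MAX_EVENTS];
        int epfd = epoll_create1(0);    /* no FD_SETSIZE ceiling */

        if (epfd < 0) { perror("epoll_create1"); exit(1); }
        ev.events = EPOLLIN;
        ev.data.fd = listen_fd;
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0) {
            perror("epoll_ctl"); exit(1);
        }
        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            if (n < 0) { perror("epoll_wait"); break; }
            for (int i = 0; i < n; i++) {
                /* events[i].data.fd is readable: accept() on the
                   listening socket, or read from a client socket */
            }
        }
    }
]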
> > So... anyone interested? Any brilliant thoughts? AFAICT we can rest
> > assured that there is a bug here somewhere, because otherwise we wouldn't
> > be able to get "errno 9 Bad file descriptor", right?
> >
> > regards, Göran
> >
> > PS. Googling this in relation to the Squeak VM didn't show any recent
> > hits, only some from Stephen Pair fixing a bug in X11 code etc.
> >
> 
> 
> 
> -- 
> best,
> Eliot


