[Vm-dev] Squeak socket problem ... help!

Eliot Miranda eliot.miranda at gmail.com
Fri Oct 10 00:42:07 UTC 2014


On Thu, Oct 9, 2014 at 4:28 PM, David T. Lewis <lewis at mail.msen.com> wrote:

>
> On Thu, Oct 09, 2014 at 03:27:05PM -0700, Eliot Miranda wrote:
> >
> > Hi Both,
> >
> > On Thu, Oct 9, 2014 at 3:05 PM, Göran Krampe <goran at krampe.se> wrote:
> >
> > >
> > > Hi guys!
> > >
> > > Long email but... work for hire here! :)
> > >
> > > In short:
> > >
> > > Ron and I (3DICC) have a problem with the Unix VM networking and I am
> > > reaching out before burning too many hours on something one of you
> > > C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for
> your
> > > trouble.
> > >
> >
> > Cool.  This is likely easy to fix.  Your image is running out of file
> > descriptors.  Track open and close calls, e.g. add logging around at
> > least StandardFileStream>>#primOpen:writable:
> > , AsyncFile>>#primOpen:forWrite:semaIndex:,
> >
> Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:
> > and their associated close calls and see what's being opened without
> being
> > closed.  It should be easy to track down, but may be more difficult to
> fix.
> >
> > Good luck!
>
> I agree with what Eliot is saying and would add a few thoughts:
>
> - Don't fix the wrong problem (DFtWP). Unless you have some reason to
> believe that this server application would realistically have a need to
> handle anything close to a thousand concurrent TCP sessions, don't fix
> it by raising the per-process file handle limit, and don't fix it by
> reimplementing the socket listening code.
>
> - It is entirely possible that no one before you has ever tried to run
> a server application with the per-process file handle limit bumped up
> above the default 1024. So if that configuration does not play nicely
> with the select() mechanism, you may well be the first to have encountered
> this as an issue. But see above, don't fix it if it ain't broke.
>
> - Most "out of file descriptor" problems involve resource leaks (as Eliot
> is suggesting), and in those cases you will see a gradual increase in file
> descriptors in /proc/<vmpid>/fd/ over time. Eventually you run out of
> descriptors and something horrible happens.
>
> - Before doing anything else, you must confirm if this is a resource leak,
> with file descriptor use continuously increasing (which is what Eliot and
> I are both assuming to be the case here), or if it is a real resource
> issue in which your server has a legitimate need to maintain a very large
> number of TCP connections concurrently. Given that you have a running
> application with real users, you will probably want to do this with
> something
> like a shell script keeping track of the /proc/<pid>/fd/ directory for
> the running VM. (In squeaksource.com, there is an undiagnosed file handle
> leak similar to what I think you are experiencing. My kludgy workaround is
> a process in the image that uses OSProcess to count entries in
> /proc/<pid>/fd/
> and restart the image when the file descriptor situation becomes dire).
>
> - Finding file (socket) handle leaks is tricky, and if you have customers
> depending on this, you probably do not have the luxury of fixing it right.
> Is there any way to periodically restart the server image without causing
> pain to the customer? If so, consider a kludgy workaround like I did for
> squeaksource. Monitor the VM process for file handle leaks and restart
> it proactively rather than waiting for a catastrophic failure. You can
> do this all from within the image, I will dig out my squeaksource hack if
> you think it may be of any help.
>
> - Sorry to repeat myself but this is by far the most important point:
> DFtWP.
>

Great message David.  You've nailed it.
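
To make the /proc-based monitoring David describes concrete: below is a
minimal, untested sketch in C that counts the entries in /proc/<pid>/fd for
a given pid (the same thing a quick shell loop over that directory would
tell you). Run it periodically against the VM's pid; a steadily climbing
count points at a leak rather than legitimate load.

  /* fdcount.c - count open descriptors of a process via /proc/<pid>/fd.
   * Sketch only: assumes a Linux /proc layout, no error recovery. */
  #include <dirent.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      char path[64];
      struct dirent *entry;
      int count = 0;

      if (argc != 2) {
          fprintf(stderr, "usage: %s <pid>\n", argv[0]);
          return 1;
      }
      snprintf(path, sizeof(path), "/proc/%s/fd", argv[1]);

      DIR *dir = opendir(path);
      if (!dir) {
          perror(path);
          return 1;
      }
      while ((entry = readdir(dir)) != NULL)
          if (entry->d_name[0] != '.')      /* skip "." and ".." */
              count++;
      closedir(dir);

      printf("%d descriptors open in %s\n", count, path);
      return 0;
  }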


>
> Dave
>
> >
> >
> > > In looong:
> > >
> > > So... we are running a large deployment of Terf (yay!) that is
> "pushing"
> > > the server side VMs a bit. The large load has caused us to experience
> some
> > > issues.
> > >
> > >
> > > Our Possibly Flawed Analysis So Far
> > > ====================================
> > >
> > > One of our server side VMs, the one that unfortunately is a crucial
> > > resource, locks up and doesn't respond on its most important listening
> > > Socket port. The VM does not crash, however. We reboot it, because it's a
> > > stressful situation with LOTS of users being affected, so we haven't
> looked
> > > "inside".
> > >
> > > Unfortunately the shell script starting the VMs wasn't catching stderr
> to
> > > a log file (Argh! Now it does though so we will see if we get more info
> > > later) so we have missed some info here but Ron "got lucky" and saw
> this on
> > > his terminal (stderr of the VM going to his terminal instead of log
> file):
> > >
> > > "errno 9
> > > select: Bad file descriptor"
> > >
> > > It took us quite some time before we realized this was indeed Squeak
> > > talking, and that it was from inside aio.c - a call from aioPoll():
> > >         perror("select")
> > >
> > > ...ok. Some important background info:
> > >
> > > Before this we hit the default ulimit of 1024 per user (duh!), causing
> > > "Too Many Files Open", so we raised them silly high. That did make the
> > > system handle itself - but in retrospect we think another issue
> (related to
> > > SAML auth) caused tons of client requests getting spawned from this VM
> and
> > > thus is what made us reach the limit in the first place. It may also
> have
> > > been the factor (=many many sockets) that in the end caused the errno 9
> > > described above - see below for reasoning.
> > >
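
For what it's worth, one cheap sanity check when juggling ulimits is to have
the process itself report the limit it actually inherited, since a limit
raised in a shell profile does not always reach a daemonized VM. An untested
sketch using the standard getrlimit() call:

  #include <stdio.h>
  #include <sys/resource.h>
  #include <sys/select.h>

  int main(void)
  {
      struct rlimit rl;
      if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
          perror("getrlimit");
          return 1;
      }
      printf("open-file soft limit: %llu, hard limit: %llu, FD_SETSIZE: %d\n",
             (unsigned long long)rl.rlim_cur,
             (unsigned long long)rl.rlim_max,
             FD_SETSIZE);
      return 0;
  }

Printing FD_SETSIZE next to the rlimit makes the mismatch discussed below
obvious: the limit can be raised far above what select() can address.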
> > > After perusing the IMHO highly confusing :) Socket code (no offense of
> > > course, it's probably trivial to a Unix C-guru) at least we understand
> that
> > > the code uses select() and not a more modern poll() or epoll(). In fact
> > > there is also a call to select() in sqUnixSocket.c, but... probably not
> > > relevant.
> > >
> > > Yeah, epoll() is not portable etc, we know, but frankly we only care
> for
> > > Linux here.
> > >
> > > Googling shows us further that select() has issues, I mean, yikes. And
> the
> > > thing I think we might be hitting here is the fact that select()
> doesn't
> > > handle more than 1024 file descriptors!!! (as far as I can understand
> the
> > > writing on the Internet) and to make it worse, it seems to be able to
> go
> > > bananas if you push it there...
> > >
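
The 1024 number is not folklore: on glibc, fd_set is a fixed bitmap of
FD_SETSIZE bits (normally 1024), and FD_SET() on a descriptor >= FD_SETSIZE
writes outside the structure - undefined behaviour, which would match the
"goes bananas" symptom once the ulimit is raised past it. A small
illustrative sketch (not VM code) of the guard any select()-based loop
needs:

  #include <stdio.h>
  #include <sys/select.h>

  int main(void)
  {
      int fd = 1500;                /* pretend accept() just returned this */
      fd_set readfds;
      FD_ZERO(&readfds);

      printf("FD_SETSIZE on this system: %d\n", FD_SETSIZE);

      if (fd >= FD_SETSIZE) {
          /* without this check, FD_SET(fd, &readfds) would scribble
             past the end of readfds */
          fprintf(stderr, "fd %d cannot be handled by select()\n", fd);
          return 1;
      }
      FD_SET(fd, &readfds);
      return 0;
  }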
> > > Now... again, Internet seems to imply that the "usual" cause of "errno
> 9"
> > > is doing a select on an fd that has already been closed. Typical bug
> > > causing this is accidentally closing an fd twice - and thus, if you are
> > > unlucky, accidentally the second time closing the fd when it actually
> has
> > > already managed to get reused and thus is open again. Oops.
> > >
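
That explanation is easy to reproduce outside the VM: "select: Bad file
descriptor" is exactly what perror() prints when select() fails with
errno 9 (EBADF), and the simplest way to get there is a descriptor that is
still sitting in the fd_set after it has been closed. An untested minimal
repro, standing in for whatever aioPoll() is hitting:

  #include <stdio.h>
  #include <sys/select.h>
  #include <unistd.h>

  int main(void)
  {
      int fds[2];
      fd_set readfds;
      struct timeval tv = {0, 0};

      if (pipe(fds) != 0) {
          perror("pipe");
          return 1;
      }
      FD_ZERO(&readfds);
      FD_SET(fds[0], &readfds);
      close(fds[0]);               /* closed, but still in the set */

      if (select(fds[0] + 1, &readfds, NULL, NULL, &tv) < 0)
          perror("select");        /* prints: select: Bad file descriptor */

      close(fds[1]);
      return 0;
  }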
> > > But it seems unreasonable to think that *such* a bug exists in this
> code
> > > after all these years. And I am simply guessing it's the >1024 fd
> problem
> > > biting us, but yeah, we don't know. And I also guess it was that SAML
> > > issue, in combination with raised ulimits, that made us even get over
> 1024.
> > >
> > > Things we are planning:
> > >
> > > - Come up with a test case showing it blowing up. Will try to do that
> next
> > > week.
> > > - Start looking at how to use poll() or epoll() instead, because we
> need
> > > to be SOLID here and we can't afford the 1024 limit.
> > >
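
If it helps the epoll investigation: the basic loop shape is small, and epoll
has no FD_SETSIZE-style ceiling - only RLIMIT_NOFILE applies. A rough,
untested sketch of the idea (not a drop-in replacement for aio.c):

  #include <stdio.h>
  #include <sys/epoll.h>
  #include <unistd.h>

  #define MAX_EVENTS 64

  int main(void)
  {
      struct epoll_event ev, events[MAX_EVENTS];
      int epfd = epoll_create1(0);
      if (epfd < 0) {
          perror("epoll_create1");
          return 1;
      }

      /* register stdin for readability; a real server would register
         each listening and connected socket the same way */
      ev.events = EPOLLIN;
      ev.data.fd = STDIN_FILENO;
      if (epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) < 0) {
          perror("epoll_ctl");
          return 1;
      }

      int n = epoll_wait(epfd, events, MAX_EVENTS, 1000 /* ms */);
      for (int i = 0; i < n; i++)
          printf("fd %d is ready\n", events[i].data.fd);

      close(epfd);
      return 0;
  }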
> > > So... anyone interested? Any brilliant thoughts? AFAICT we can rest
> > > assured that there is a bug here somewhere, because otherwise we
> wouldn't
> > > be able to get "errno 9 Bad file descriptor", right?
> > >
> > > regards, Göran
> > >
> > > PS. Googling this in relation to the Squeak VM didn't show any recent
> > > hits, only some from Stephen Pair fixing a bug in X11 code etc.
> > >
> >
> >
> >
> > --
> > best,
> > Eliot
>
>


-- 
best,
Eliot