<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Oct 9, 2014 at 4:28 PM, David T. Lewis <span dir="ltr"><<a href="mailto:lewis@mail.msen.com" target="_blank">lewis@mail.msen.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>
On Thu, Oct 09, 2014 at 03:27:05PM -0700, Eliot Miranda wrote:<br>
><br>
> Hi Both,<br>
><br>
</span><span class="">> On Thu, Oct 9, 2014 at 3:05 PM, G??ran Krampe <<a href="mailto:goran@krampe.se">goran@krampe.se</a>> wrote:<br>
><br>
> ><br>
> > Hi guys!<br>
> ><br>
> > Long email but... work for hire here! :)<br>
> ><br>
> > In short:<br>
> ><br>
> > Ron and I (3DICC) have a problem with the Unix VM networking and I am<br>
> > reaching out before burning too many hours on something one of you<br>
> > C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your<br>
> > trouble.<br>
> ><br>
><br>
> Cool. This is likely easy to fix. Your image is running out of file<br>
> descriptors. Track open and close calls, e.g. add logging around at<br>
> least StandardFileStream>>#primOpen:writable:<br>
> , AsyncFile>>#primOpen:forWrite:semaIndex:,<br>
> Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:<br>
> and their associated close calls and see what's being opened without being<br>
> closed. It should be easy to track down, but may be more difficult to fix.<br>
><br>
> Good luck!<br>
<br>
</span>I agree with what Eliot is saying and would add a few thoughts:<br>
<br>
- Don't fix the wrong problem (DFtWP). Unless you have some reason to<br>
believe that this server application would realistically have a need to<br>
handle anything close to a thousand concurrent TCP sessions, don't fix<br>
it by raising the per-process file handle limit, and don't fix it by<br>
reimplementing the socket listening code.<br>
<br>
- It is entirely possible that no one before you has ever tried to run<br>
a server application with the per-process file handle limit bumped up<br>
above the default 1024. So if that configuration does not play nicely<br>
with the select() mechanism, you may well be the first to have encountered<br>
this as an issue. But see above, don't fix it if it ain't broke.<br>
<br>
- Most "out of file descriptor" problems involve resource leaks (as Eliot<br>
is suggesting), and in those cases you will see a gradual increase in file<br>
descriptors in /proc/<vmpid>/fd/ over time. Eventually you run out of<br>
descriptors and something horrible happens.<br>
<br>
- Before doing anything else, you must confirm if this is a resource leak,<br>
with file descriptor use continuously increasing (which is what Eliot and<br>
I are both assuming to be the case here), or if it is a real resource<br>
issue in which your server has a legitimate need to maintain a very large<br>
number of TCP connections concurrently. Given that you have a running<br>
application with real users, you will probably want to do this with something<br>
like a shell script keeping track of the /proc/<pid>/fd/ directory for<br>
the running VM. (In <a href="http://squeaksource.com" target="_blank">squeaksource.com</a>, there is an undiagnosed file handle<br>
leak similar to what I think you are experiencing. My kludgy workaround is<br>
a process in the image that uses OSProcess to count entries in /proc/<pid>/fd/<br>
and restart the image when the file descriptor situation becomes dire).<br>
<br>
- Finding file (socket) handle leaks is tricky, and if you have customers<br>
depending on this, you probably do not have the luxury of fixing it right.<br>
Is there any way to periodically restart the server image without causing<br>
pain to the customer? If so, consider a kludgy workaround like I did for<br>
squeaksource. Monitor the VM process for file handle leaks and restart<br>
it proactively rather than waiting for a catastrophic failure. You can<br>
do this all from within the image, I will dig out my squeaksource hack if<br>
you think it may be of any help.<br>
<br>
- Sorry to repeat myself but this is by far the most important point: DFtWP.<br></blockquote><div><br></div><div>Great message David. You've nailed it.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Dave<br>
<div><div class="h5"><br>
><br>
><br>
> > In looong:<br>
> ><br>
> > So... we are running a large deployment of Terf (yay!) that is "pushing"<br>
> > the server side VMs a bit. The large load has caused us to experience some<br>
> > issues.<br>
> ><br>
> ><br>
> > Our Possibly Faulted Analysis So Far<br>
> > ====================================<br>
> ><br>
> > One of our server side VMs, the one that unfortunately is a crucial<br>
> > resource, locks up and doesn't respond on its most important listening<br>
> > Socket port. The VM does not crash, however. We reboot it, because it's a<br>
> > stressful situation with LOTS of users being affected, so we haven't looked<br>
> > "inside".<br>
> ><br>
> > Unfortunately the shell script starting the VMs wasn't catching stderr to<br>
> > a log file (Argh! Now it does though so we will see if we get more info<br>
> > later) so we have missed some info here but Ron "got lucky" and saw this on<br>
> > his terminal (stderr of the VM going to his terminal instead of log file):<br>
> ><br>
> > "errno 9<br>
> > select: Bad file descriptor"<br>
> ><br>
> > It took us quite some time before we realized this was indeed Squeak<br>
> > talking, and that it was from inside aio.c - a call from aioPoll():<br>
> > perror("select")<br>
> ><br>
> > ...ok. Some important background info:<br>
> ><br>
> > Before this we hit the default ulimit of 1024 per user (duh!), causing<br>
> > "Too Many Files Open", so we raised them silly high. That did make the<br>
> > system handle itself - but in retrospect we think another issue (related to<br>
> > SAML auth) caused tons of client requests getting spawned from this VM and<br>
> > thus is what made us reach the limit in the first place. It may also have<br>
> > been the factor (=many many sockets) that in the end caused the errno 9<br>
> > described above - see below for reasoning.<br>
> ><br>
> > After perusing the IMHO highly confusing :) Socket code (no offense of<br>
> > course, it's probably trivial to a Unix C-guru) at least we understand that<br>
> > the code uses select() and not a more modern poll() or epoll(). In fact<br>
> > there is also a call to select() in sqUnixSocket.c, but... probably not<br>
> > relevant.<br>
> ><br>
> > Yeah, epoll() is not portable etc, we know, but frankly we only care for<br>
> > Linux here.<br>
> ><br>
> > Googling shows us further that select() has issues, I mean, yikes. And the<br>
> > thing I think we might be hitting here is the fact that select() doesn't<br>
> > handle more than 1024 file descriptors!!! (as far as I can understand the<br>
> > writing on the Internet) and to make it worse, it seems to be able to go<br>
> > bananas if you push it there...<br>
> ><br>
> > Now... again, Internet seems to imply that the "usual" cause of "errno 9"<br>
> > is doing a select on an fd that has already been closed. Typical bug<br>
> > causing this is accidentally closing an fd twice - and thus, if you are<br>
> > unlucky, accidentally the second time closing the fd when it actually has<br>
> > already managed to get reused and thus is open again. Oops.<br>
> ><br>
> > But it seems unreasonable to think that *such* a bug exists in this code<br>
> > after all these years. And I am simply guessing its the >1024 fd problem<br>
> > biting us, but yeah, we don't know. And I also guess it was that SAML<br>
> > issue, in combination with raised ulimits, that made us even get over 1024.<br>
> ><br>
> > Things we are planning:<br>
> ><br>
> > - Come up with a test case showing it blowing up. Will try to do that next<br>
> > week.<br>
> > - Start looking at how to use poll() or epoll() instead, because we need<br>
> > to be SOLID here and we can't afford the 1024 limit.<br>
> ><br>
> > So... anyone interested? Any brilliant thoughts? AFAICT we can rest<br>
> > assured that there is a bug here somewhere, because otherwise we wouldn't<br>
> > be able to get "errno 9 Bad file descriptor", right?<br>
> ><br>
</div></div>> > regards, Göran<br>
<div class="HOEnZb"><div class="h5">> ><br>
> > PS. Googling this in relation to the Squeak VM didn't show any recent<br>
> > hits, only some from Stephen Pair fixing a bug in X11 code etc.<br>
> ><br>
><br>
><br>
><br>
> --<br>
> best,<br>
> Eliot<br>
<br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>best,<div>Eliot</div>
</div></div>