[Vm-dev] Squeak socket problem ... help!
eliot.miranda at gmail.com
Thu Oct 9 22:27:05 UTC 2014
On Thu, Oct 9, 2014 at 3:05 PM, Göran Krampe <goran at krampe.se> wrote:
> Hi guys!
> Long email but... work for hire here! :)
> In short:
> Ron and I (3DICC) have a problem with the Unix VM networking and I am
> reaching out before burning too many hours on something one of you
> C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your
Cool. This is likely easy to fix. Your image is running out of file
descriptors. Track open and close calls, e.g. add logging around at
and their associated close calls and see what's being opened without being
closed. It should be easy to track down, but may be more difficult to fix.
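A minimal sketch of that kind of logging, in C. Every name here (tracked_socket, tracked_close, dump_live_fds, MAX_TRACKED) is hypothetical and not part of the VM sources; the idea is only to route opens and closes through wrappers that remember which call site opened each descriptor, so that anything still listed later is a leak candidate.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_TRACKED 65536   /* arbitrary bound chosen for this sketch */

/* opened_at[fd] is the call-site label that opened fd, or NULL if not open. */
static const char *opened_at[MAX_TRACKED];
static int live_count = 0;

/* Record a successful open; failed opens (fd < 0) are ignored. */
void note_open(int fd, const char *site)
{
    if (fd < 0 || fd >= MAX_TRACKED)
        return;
    if (opened_at[fd] == NULL)
        live_count++;
    opened_at[fd] = site;
    fprintf(stderr, "open  fd %d at %s (%d live)\n", fd, site, live_count);
}

/* Hypothetical wrapper around socket() that logs the new descriptor. */
int tracked_socket(int domain, int type, int protocol, const char *site)
{
    int fd = socket(domain, type, protocol);
    note_open(fd, site);
    return fd;
}

/* Hypothetical wrapper around close() that flags closes of untracked fds. */
int tracked_close(int fd, const char *site)
{
    if (fd >= 0 && fd < MAX_TRACKED) {
        if (opened_at[fd] == NULL) {
            fprintf(stderr, "close fd %d at %s: not tracked as open\n", fd, site);
        } else {
            opened_at[fd] = NULL;
            live_count--;
            fprintf(stderr, "close fd %d at %s (%d live)\n", fd, site, live_count);
        }
    }
    return close(fd);
}

/* Print every descriptor still open and where it came from; returns the count. */
int dump_live_fds(void)
{
    int n = 0;
    for (int fd = 0; fd < MAX_TRACKED; fd++) {
        if (opened_at[fd] != NULL) {
            fprintf(stderr, "LIVE  fd %d opened at %s\n", fd, opened_at[fd]);
            n++;
        }
    }
    return n;
}
```

The same note_open() call could follow accept() or any other call that returns a descriptor. Calling dump_live_fds() periodically, or from a signal handler in a debug build, would show which call sites accumulate descriptors over time.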
> In looong:
> So... we are running a large deployment of Terf (yay!) that is "pushing"
> the server side VMs a bit. The large load has caused us to experience some
> Our Possibly Faulted Analysis So Far
> One of our server side VMs, the one that unfortunately is a crucial
> resource, locks up and doesn't respond on its most important listening
> Socket port. The VM does not crash, however. We reboot it, because it's a
> stressful situation with LOTS of users being affected, so we haven't looked
> Unfortunately the shell script starting the VMs wasn't catching stderr to
> a log file (Argh! Now it does though so we will see if we get more info
> later) so we have missed some info here but Ron "got lucky" and saw this on
> his terminal (stderr of the VM going to his terminal instead of log file):
> "errno 9
> select: Bad file descriptor"
> It took us quite some time before we realized this was indeed Squeak
> talking, and that it was from inside aio.c - a call from aioPoll():
> ...ok. Some important background info:
> Before this we hit the default ulimit of 1024 per user (duh!), causing
> "Too Many Files Open", so we raised them silly high. That did make the
> system handle itself - but in retrospect we think another issue (related to
> SAML auth) caused tons of client requests getting spawned from this VM and
> thus is what made us reach the limit in the first place. It may also have
> been the factor (=many many sockets) that in the end caused the errno 9
> described above - see below for reasoning.
> After perusing the IMHO highly confusing :) Socket code (no offense of
> course, it's probably trivial to a Unix C-guru) at least we understand that
> the code uses select() and not a more modern poll() or epoll(). In fact
> there is also a call to select() in sqUnixSocket.c, but... probably not
> Yeah, epoll() is not portable etc, we know, but frankly we only care for
> Linux here.
> Googling shows us further that select() has issues, I mean, yikes. And the
> thing I think we might be hitting here is the fact that select() doesn't
> handle more than 1024 file descriptors!!! (as far as I can understand the
> writing on the Internet) and to make it worse, it seems to be able to go
> bananas if you push it there...
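For reference, the limit comes from the fixed size of fd_set: POSIX leaves the behavior of FD_SET, FD_CLR and FD_ISSET undefined for a descriptor that is negative or not less than FD_SETSIZE (commonly 1024 with glibc on Linux). In practice an out-of-range FD_SET can write past the end of the set, which would explain the "bananas" part. A small guard, sketched here with hypothetical helper names, makes the failure explicit instead:

```c
#include <sys/select.h>

/* Returns 1 if fd may legally be passed to FD_SET/FD_CLR/FD_ISSET, else 0. */
int fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Add fd to set only when it is in range; returns 0 on success, -1 if refused. */
int checked_fd_set(int fd, fd_set *set)
{
    if (!fd_fits_in_fd_set(fd))
        return -1;
    FD_SET(fd, set);
    return 0;
}
```

A guard like this does not lift the limit; it only turns silent memory corruption into an error that can be logged.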
> Now... again, Internet seems to imply that the "usual" cause of "errno 9"
> is doing a select on an fd that has already been closed. Typical bug
> causing this is accidentally closing an fd twice - and thus, if you are
> unlucky, accidentally the second time closing the fd when it actually has
> already managed to get reused and thus is open again. Oops.
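Both halves of that scenario can be reproduced in isolation. The sketch below is standalone C with hypothetical function names and is not taken from aio.c: the first function runs select() on a descriptor that has already been closed and returns the resulting errno, and the second shows that a closed descriptor number is handed out again by the next call that allocates one, which is what makes a stray second close() dangerous.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns the errno produced by select() on a closed descriptor,
   0 if select() unexpectedly succeeded, or -1 if setup failed. */
int errno_from_select_on_closed_fd(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    int stale = fds[0];
    close(fds[0]);                      /* stale no longer names an open file */
    if (stale >= FD_SETSIZE) {          /* FD_SET would be undefined here */
        close(fds[1]);
        return -1;
    }

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(stale, &readfds);
    struct timeval tv = {0, 0};         /* poll once, do not block */
    int rc = select(stale + 1, &readfds, NULL, NULL, &tv);
    int err = (rc == -1) ? errno : 0;

    close(fds[1]);
    return err;
}

/* Returns 1 if the number of a just-closed descriptor is reused by the
   next allocation, 0 if not, -1 if setup failed. */
int closed_fd_number_is_reused(void)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;
    int old = p[0];
    close(p[0]);
    int fresh = dup(p[1]);              /* dup() returns the lowest free number */
    int reused = (fresh == old);
    /* A second close(old) at this point would silently close 'fresh'. */
    if (fresh >= 0)
        close(fresh);
    close(p[1]);
    return reused;
}
```

On a POSIX system the first function is expected to return EBADF, which is errno 9 on Linux and matches the message seen on stderr.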
> But it seems unreasonable to think that *such* a bug exists in this code
> after all these years. And I am simply guessing it's the >1024 fd problem
> biting us, but yeah, we don't know. And I also guess it was that SAML
> issue, in combination with raised ulimits, that made us even get over 1024.
> Things we are planning:
> - Come up with a test case showing it blowing up. Will try to do that next
> - Start looking at how to use poll() or epoll() instead, because we need
> to be SOLID here and we can't afford the 1024 limit.
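As a rough illustration of the poll() direction, here is a minimal sketch. It is not a patch against aio.c, and wait_readable is a hypothetical name. poll() takes an array of struct pollfd instead of a fixed-size bit set, so descriptor values above FD_SETSIZE are not a problem, and a closed descriptor is reported per entry through POLLNVAL rather than failing the whole call with EBADF.

```c
#include <poll.h>

/* Wait up to timeout_ms for fd to become readable.
   Returns 1 if a read would not block, 0 on timeout,
   -1 on error or if fd is not an open descriptor. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;                      /* poll() itself failed, e.g. EINTR */
    if (rc == 0)
        return 0;                       /* timed out, nothing ready */
    if (pfd.revents & (POLLNVAL | POLLERR))
        return -1;                      /* closed or broken descriptor */
    return (pfd.revents & (POLLIN | POLLHUP)) ? 1 : 0;
}
```

A real replacement would keep one pollfd array for all registered descriptors and would retry on EINTR; this sketch only shows the shape of the call and how an invalid descriptor surfaces.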
> So... anyone interested? Any brilliant thoughts? AFAICT we can rest
> assured that there is a bug here somewhere, because otherwise we wouldn't
> be able to get "errno 9 Bad file descriptor", right?
> regards, Göran
> PS. Googling this in relation to the Squeak VM didn't show any recent
> hits, only some from Stephen Pair fixing a bug in X11 code etc.