[Vm-dev] Squeak socket problem ... help!
goran at krampe.se
Thu Oct 9 22:05:12 UTC 2014
Long email but... work for hire here! :)
Ron and I (3DICC) have a problem with the Unix VM networking and I am
reaching out before burning too many hours on something one of you
C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your
So... we are running a large deployment of Terf (yay!) that is "pushing"
the server side VMs a bit. The large load has caused us to experience
Our Possibly Faulted Analysis So Far
One of our server side VMs, the one that unfortunately is a crucial
resource, locks up and doesn't respond on its most important listening
Socket port. VM does not crash however. We reboot it, because it's a
stressful situation with LOTS of users being affected, so we haven't
Unfortunately the shell script starting the VMs wasn't catching stderr
to a log file (Argh! Now it does though so we will see if we get more
info later) so we have missed some info here but Ron "got lucky" and saw
this on his terminal (stderr of the VM going to his terminal instead of
select: Bad file descriptor"
It took us quite some time before we realized this was indeed Squeak
talking, and that it was from inside aio.c - a call from aioPoll():
...ok. Some important background info:
Before this we hit the default ulimit of 1024 per user (duh!), causing
"Too many open files", so we raised them silly high. That did make the
system handle itself - but in retrospect we think another issue (related
to SAML auth) caused tons of client requests getting spawned from this
VM and thus is what made us reach the limit in the first place. It may
also have been the factor (=many many sockets) that in the end caused
the errno 9 described above - see below for reasoning.
After perusing the IMHO highly confusing :) Socket code (no offense of
course, it's probably trivial to a Unix C-guru) at least we understand
that the code uses select() and not a more modern poll() or epoll(). In
fact there is also a call to select() in sqUnixSocket.c, but... probably
Yeah, epoll() is not portable etc, we know, but frankly we only care for
Googling shows us further that select() has issues, I mean, yikes. And
the thing I think we might be hitting here is the fact that select()
doesn't handle more than 1024 file descriptors!!! (as far as I can
understand the writing on the Internet) and to make it worse, it seems
to be able to go bananas if you push it there...
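
To make the 1024 concern concrete: POSIX leaves FD_SET() undefined for a descriptor that is negative or >= FD_SETSIZE (typically 1024 with glibc on Linux), so the limit is really on the fd *number*, not on how many fds are being watched. A minimal sketch of a guard follows. The helper name safe_fd_set is hypothetical and is not something found in aio.c.

```c
#include <sys/select.h>

/* Hypothetical helper (not from the VM sources): add fd to the set only
 * when it fits inside an fd_set. Returns 0 on success, -1 otherwise. */
static int safe_fd_set(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;          /* FD_SET() here would be undefined behaviour */
    FD_SET(fd, set);
    return 0;
}
```

A caller that gets -1 back would have to refuse or close the connection instead of handing the fd to select().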
Now... again, Internet seems to imply that the "usual" cause of "errno
9" is doing a select on an fd that has already been closed. Typical bug
causing this is accidentally closing an fd twice - and thus, if you are
unlucky, accidentally the second time closing the fd when it actually
has already managed to get reused and thus is open again. Oops.
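
The "select on a closed fd" case is easy to reproduce in isolation. The sketch below is standalone and not taken from the VM: it opens a pipe, closes both ends, and then asks select() to watch the stale descriptor. The function name is made up for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns the errno reported by select() when it is asked to watch a
 * descriptor that has already been closed, 0 if select() did not fail,
 * or -1 if the pipe could not be created. */
static int select_errno_on_closed_fd(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    int stale = fds[0];
    close(fds[0]);
    close(fds[1]);              /* both ends are now closed */

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(stale, &readfds);    /* still registered, like a leftover handler */

    struct timeval tv = { 0, 0 };   /* poll once, do not block */
    int rc = select(stale + 1, &readfds, NULL, NULL, &tv);
    return rc == -1 ? errno : 0;
}
```

On Linux this should come back as EBADF (errno 9), the same error text Ron saw on his terminal.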
But it seems unreasonable to think that *such* a bug exists in this code
after all these years. And I am simply guessing it's the >1024 fd problem
biting us, but yeah, we don't know. And I also guess it was that SAML
issue, in combination with raised ulimits, that made us even get over 1024.
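
One way to check whether a process is even allowed to get into this territory is to compare its RLIMIT_NOFILE soft limit against FD_SETSIZE. This is only a sketch with a hypothetical function name. It says nothing about what the VM does today.

```c
#include <sys/resource.h>
#include <sys/select.h>

/* Returns 1 if the soft open-file limit lets the kernel hand out fd
 * numbers that do not fit in an fd_set, 0 if it does not, -1 on error. */
static int nofile_limit_exceeds_fd_setsize(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;
    return rl.rlim_cur > (rlim_t)FD_SETSIZE ? 1 : 0;
}
```

With the default ulimit of 1024 this would normally return 0. After raising the limit it returns 1, which is the situation described above.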
Things we are planning:
- Come up with a test case showing it blowing up. Will try to do that
- Start looking at how to use poll() or epoll() instead, because we need
to be SOLID here and we can't afford the 1024 limit.
So... anyone interested? Any brilliant thoughts? AFAICT we can rest
assured that there is a bug here somewhere, because otherwise we
wouldn't be able to get "errno 9 Bad file descriptor", right?
PS. Googling this in relation to the Squeak VM didn't show any recent
hits, only some from Stephen Pair fixing a bug in X11 code etc.