[Vm-dev] Squeak socket problem ... help!

Göran Krampe goran at krampe.se
Thu Oct 9 22:05:12 UTC 2014


Hi guys!

Long email but... work for hire here! :)

In short:

Ron and I (3DICC) have a problem with the Unix VM networking and I am 
reaching out before burning too many hours on something one of you 
C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your 
trouble.

In looong:

So... we are running a large deployment of Terf (yay!) that is "pushing" 
the server side VMs a bit. The large load has caused us to experience 
some issues.


Our Possibly Faulty Analysis So Far
===================================

One of our server side VMs, the one that unfortunately is a crucial
resource, locks up and stops responding on its most important listening
Socket port. The VM does not crash, however. We reboot it, because it's
a stressful situation with LOTS of users being affected, so we haven't
looked "inside".

Unfortunately the shell script starting the VMs wasn't catching stderr 
to a log file (Argh! Now it does though so we will see if we get more 
info later) so we have missed some info here but Ron "got lucky" and saw 
this on his terminal (stderr of the VM going to his terminal instead of 
log file):

"errno 9
select: Bad file descriptor"

It took us quite some time before we realized this was indeed Squeak 
talking, and that it was from inside aio.c - a call from aioPoll():
	perror("select")

...ok. Some important background info:

Before this we hit the default ulimit of 1024 open files per process
(duh!), causing "Too many open files", so we raised the limits silly
high. That did let the system cope - but in retrospect we think another
issue (related to SAML auth) caused tons of client requests to be
spawned from this VM, and that is what made us reach the limit in the
first place. It may also have been the factor (=many many sockets) that
in the end caused the errno 9 described above - see below for reasoning.
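
A small sketch of how one could check whether the raised limit lets
the process get descriptors that select() cannot represent. The helper
name is made up for illustration and is not from the VM sources; it
only uses getrlimit() and the FD_SETSIZE constant.

```c
#include <sys/resource.h>
#include <sys/select.h>

/* Hypothetical helper, not part of the Squeak VM.
 * Returns 1 if the soft open-file limit permits descriptor numbers
 * that do not fit in an fd_set, 0 if it does not, -1 on error. */
static int nofile_limit_exceeds_fd_setsize(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    return rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > (rlim_t)FD_SETSIZE;
}
```

If this returns 1, the process can end up holding descriptors that are
unsafe to pass to select(), which would match our situation after the
ulimit change.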

After perusing the IMHO highly confusing :) Socket code (no offense of
course, it's probably trivial to a Unix C-guru) at least we understand
that the code uses select() and not a more modern poll() or epoll(). In
fact there is also a call to select() in sqUnixSocket.c, but... probably
not relevant.

Yeah, epoll() is not portable etc, we know, but frankly we only care for 
Linux here.

Googling shows us further that select() has issues, I mean, yikes. And
the thing I think we might be hitting here is the fact that select()
can't handle file descriptors numbered 1024 (FD_SETSIZE) or higher!!!
(as far as I can understand the writing on the Internet) and to make it
worse, it seems to be able to go bananas if you push it there...
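
To make the limit concrete, here is a minimal sketch of a guard around
FD_SET(). The helper name is hypothetical and I have not checked
whether aio.c does anything like this; the point is only that a
descriptor outside 0..FD_SETSIZE-1 must never reach FD_SET().

```c
#include <stdio.h>
#include <sys/select.h>

/* Hypothetical helper, not taken from aio.c.
 * Adds fd to the set only if an fd_set can represent it.
 * Returns 0 on success, -1 if fd is outside [0, FD_SETSIZE). */
static int safe_fd_set(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE) {
        fprintf(stderr, "fd %d does not fit in an fd_set (FD_SETSIZE=%d)\n",
                fd, (int)FD_SETSIZE);
        return -1;
    }
    FD_SET(fd, set);
    return 0;
}
```

Without such a check, FD_SET() on a too-large descriptor writes outside
the fd_set, which would explain the "go bananas" part.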

Now... again, the Internet seems to imply that the "usual" cause of
"errno 9" is doing a select() on an fd that has already been closed. The
typical bug causing this is accidentally closing an fd twice - and, if
you are unlucky, the second close hits the fd after its number has
already been reused, so you close something that is open again. Oops.
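
A small sketch of that failure mode, plus a way to hunt for the stale
descriptor once select() has failed. Both helper names are made up for
illustration; fcntl(fd, F_GETFD) is a cheap way to ask the kernel
whether a descriptor number is currently open.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: returns 1 if fd is an open descriptor,
 * 0 if the kernel reports EBADF for it. */
static int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/* Hypothetical helper: checks one fd for readability with a zero
 * timeout. Returns select()'s result; on failure errno stays set.
 * The caller must ensure 0 <= fd < FD_SETSIZE. */
static int select_one(int fd)
{
    fd_set rd;
    struct timeval tv = {0, 0};

    FD_ZERO(&rd);
    FD_SET(fd, &rd);
    return select(fd + 1, &rd, NULL, NULL, &tv);
}
```

Calling select_one() on a descriptor that has been closed fails with
errno set to EBADF, the same "Bad file descriptor" we saw. When that
happens, looping over the registered descriptors with fd_is_open()
would show which one went stale.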

But it seems unreasonable to think that *such* a bug exists in this code
after all these years. And I am simply guessing it's the >1024 fd
problem biting us, but yeah, we don't know. And I also guess it was that
SAML issue, in combination with raised ulimits, that made us even get
over 1024.

Things we are planning:

- Come up with a test case showing it blowing up. Will try to do that 
next week.
- Start looking at how to use poll() or epoll() instead, because we need 
to be SOLID here and we can't afford the 1024 limit.
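
To give an idea of the second item, here is a rough sketch of waiting
on a set of descriptors with poll() instead of select(). This is not a
patch for aio.c, and the function name and its state array are
invented for the example. poll() takes an array of descriptors, so no
FD_SETSIZE limit applies, and it flags a closed descriptor per entry
(POLLNVAL) instead of failing the whole call.

```c
#include <errno.h>
#include <poll.h>
#include <stdlib.h>

/* Hypothetical helper, not from the VM sources.
 * Waits up to timeout_ms for activity on fds[0..n-1].
 * On return state[i] is 1 if fds[i] is readable (or hung up / in
 * error), -1 if fds[i] is not an open descriptor, 0 otherwise.
 * Returns poll()'s result: the number of flagged entries, 0 on
 * timeout, -1 on error. Returns 0 immediately if n <= 0. */
static int wait_readable(const int *fds, int n, int timeout_ms, int *state)
{
    if (n <= 0)
        return 0;

    struct pollfd *pfds = calloc((size_t)n, sizeof *pfds);
    if (pfds == NULL)
        return -1;

    for (int i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        state[i] = 0;
    }

    int rc = poll(pfds, (nfds_t)n, timeout_ms);
    for (int i = 0; rc > 0 && i < n; i++) {
        if (pfds[i].revents & POLLNVAL)
            state[i] = -1;      /* stale or closed descriptor */
        else if (pfds[i].revents & (POLLIN | POLLHUP | POLLERR))
            state[i] = 1;       /* a read will not block */
    }

    int saved_errno = errno;    /* keep poll()'s errno across free() */
    free(pfds);
    errno = saved_errno;
    return rc;
}
```

A real replacement would keep the pollfd array around between calls
rather than rebuilding it every time, and on Linux epoll would avoid
rescanning the whole array, but the per-descriptor POLLNVAL reporting
alone would have told us which fd was bad.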

So... anyone interested? Any brilliant thoughts? AFAICT we can rest 
assured that there is a bug here somewhere, because otherwise we 
wouldn't be able to get "errno 9 Bad file descriptor", right?

regards, Göran

PS. Googling this in relation to the Squeak VM didn't show any recent 
hits, only some from Stephen Pair fixing a bug in X11 code etc.

