[Vm-dev] Squeak socket problem ... help!

Eliot Miranda eliot.miranda at gmail.com
Thu Oct 9 22:27:05 UTC 2014


Hi Both,

On Thu, Oct 9, 2014 at 3:05 PM, Göran Krampe <goran at krampe.se> wrote:

>
> Hi guys!
>
> Long email but... work for hire here! :)
>
> In short:
>
> Ron and I (3DICC) have a problem with the Unix VM networking and I am
> reaching out before burning too many hours on something one of you
> C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your
> trouble.
>

Cool.  This is likely easy to fix.  Your image is running out of file
descriptors.  Track open and close calls, e.g. add logging around at
least StandardFileStream>>#primOpen:writable:,
AsyncFile>>#primOpen:forWrite:semaIndex:,
Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:
and their associated close calls and see what's being opened without being
closed.  It should be easy to track down, but may be more difficult to fix.
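
A minimal sketch of that bookkeeping idea, written in Python only because it
is compact; in the image the same logging would wrap the primitives named
above. FdLedger, tracked_pipe and tracked_close are hypothetical names made
up for this illustration, not part of any Squeak or VM API.

```python
import os


class FdLedger:
    """Remembers every descriptor that was opened and not yet closed."""

    def __init__(self):
        self.open_sites = {}  # fd number -> label describing who opened it

    def opened(self, fd, label):
        self.open_sites[fd] = label
        return fd

    def closed(self, fd):
        # A close with no matching open is suspicious (possible double close).
        if self.open_sites.pop(fd, None) is None:
            print(f"close of untracked fd {fd}")

    def leaks(self):
        return dict(self.open_sites)


ledger = FdLedger()


def tracked_pipe(label):
    r, w = os.pipe()
    ledger.opened(r, label + " (read end)")
    ledger.opened(w, label + " (write end)")
    return r, w


def tracked_close(fd):
    os.close(fd)
    ledger.closed(fd)


r, w = tracked_pipe("demo pipe")
tracked_close(w)          # the read end is forgotten on purpose
leaked = ledger.leaks()
print(leaked)             # whatever is still listed here was never closed
```

Dumping the ledger periodically on a loaded server should show which kind of
descriptor keeps accumulating.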

Good luck!


> In looong:
>
> So... we are running a large deployment of Terf (yay!) that is "pushing"
> the server side VMs a bit. The large load has caused us to experience some
> issues.
>
>
> Our Possibly Faulty Analysis So Far
> ===================================
>
> One of our server side VMs, the one that unfortunately is a crucial
> resource, locks up and doesn't respond on its most important listening
> Socket port. The VM does not crash, however. We reboot it, because it's a
> stressful situation with LOTS of users being affected, so we haven't looked
> "inside".
>
> Unfortunately the shell script starting the VMs wasn't catching stderr to
> a log file (Argh! Now it does though so we will see if we get more info
> later) so we have missed some info here but Ron "got lucky" and saw this on
> his terminal (stderr of the VM going to his terminal instead of log file):
>
> "errno 9
> select: Bad file descriptor"
>
> It took us quite some time before we realized this was indeed Squeak
> talking, and that it was from inside aio.c - a call from aioPoll():
>         perror("select")
>
> ...ok. Some important background info:
>
> Before this we hit the default ulimit of 1024 open files per process
> (duh!), causing "Too Many Files Open", so we raised it silly high. That did
> make the system handle itself - but in retrospect we think another issue
> (related to SAML auth) caused tons of client requests to be spawned from
> this VM, and that is what made us reach the limit in the first place. It
> may also have been the factor (=many many sockets) that in the end caused
> the errno 9 described above - see below for reasoning.
>
> After perusing the IMHO highly confusing :) Socket code (no offense of
> course, it's probably trivial to a Unix C-guru) at least we understand that
> the code uses select() and not a more modern poll() or epoll(). In fact
> there is also a call to select() in sqUnixSocket.c, but... probably not
> relevant.
>
> Yeah, epoll() is not portable etc, we know, but frankly we only care for
> Linux here.
>
> Googling shows us further that select() has issues, I mean, yikes. And the
> thing I think we might be hitting here is the fact that select() doesn't
> handle more than 1024 file descriptors!!! (as far as I can understand the
> writing on the Internet) and to make it worse, it seems to be able to go
> bananas if you push it there...
>
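
For illustration, a small sketch of the limit described above. select()
tracks descriptors in a fixed-size bitmap (fd_set), so the constraint is on
the descriptor *number*: in C, POSIX leaves FD_SET() undefined for numbers at
or above FD_SETSIZE. Python's select module is used here because it refuses
such numbers with a ValueError instead of corrupting memory. The value 1024
is an assumption (the usual FD_SETSIZE on Linux), and select_accepts is a
hypothetical helper name.

```python
import select


def select_accepts(fd_number):
    """Return True if select() is willing to look at this descriptor number."""
    try:
        select.select([fd_number], [], [], 0)
    except ValueError:
        # CPython rejects numbers >= FD_SETSIZE before calling the kernel.
        return False
    except OSError:
        # The number fits in an fd_set; the descriptor just is not open.
        return True
    return True


# Assumption: FD_SETSIZE is 1024, as on typical Linux systems.
print(select_accepts(1023))
print(select_accepts(1024))
```
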
> Now... again, Internet seems to imply that the "usual" cause of "errno 9"
> is doing a select on an fd that has already been closed. The typical bug
> causing this is accidentally closing an fd twice - and thus, if you are
> unlucky, the second close hits the fd after its number has already been
> reused, closing a descriptor that is legitimately open again. Oops.
>
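
Both effects described above can be reproduced in a few lines. This is a
sketch using Python's thin wrappers over the same system calls, assuming a
single-threaded process so that descriptor numbers are handed out
predictably; it says nothing about where such a bug might live in the VM.

```python
import errno
import os
import select

# 1. select() on a descriptor that has been closed fails with EBADF (errno 9).
r, w = os.pipe()
os.close(r)
try:
    select.select([r], [], [], 0)
    select_errno = None
except OSError as exc:
    select_errno = exc.errno
os.close(w)
print(select_errno == errno.EBADF)

# 2. A stale second close can hit an unrelated descriptor that reused the number.
old_r, old_w = os.pipe()
os.close(old_r)               # legitimate close
new_r, new_w = os.pipe()      # the lowest free number is handed out again
reused = (new_r == old_r)
victim_still_open = True
if reused:
    os.close(old_r)           # buggy second close: this really closes new_r
    try:
        os.fstat(new_r)
    except OSError as exc:
        victim_still_open = exc.errno != errno.EBADF
else:
    os.close(new_r)           # not reached in a single-threaded run
os.close(old_w)
os.close(new_w)
print(reused, victim_still_open)
```
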
> But it seems unreasonable to think that *such* a bug exists in this code
> after all these years. And I am simply guessing it's the >1024 fd problem
> biting us, but yeah, we don't know. And I also guess it was that SAML
> issue, in combination with raised ulimits, that made us even get over 1024.
>
> Things we are planning:
>
> - Come up with a test case showing it blowing up. Will try to do that next
> week.
> - Start looking at how to use poll() or epoll() instead, because we need
> to be SOLID here and we can't afford the 1024 limit.
>
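
As a rough sketch of the poll() direction mentioned in the plan above:
poll() takes a list of descriptors rather than a fixed-size bitmap, so it
has no FD_SETSIZE ceiling, and a descriptor that is not open is reported
per entry (POLLNVAL) instead of failing the whole call. The example uses
Python's select.poll wrapper; wait_readable is a hypothetical helper name,
and the equivalent change in the VM's C code is not shown here.

```python
import os
import select


def wait_readable(fds, timeout_ms):
    """Return the descriptors from fds that poll() reports as readable."""
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    ready = []
    for fd, events in poller.poll(timeout_ms):
        if events & select.POLLNVAL:
            # Reported for this entry only; other descriptors are unaffected.
            print(f"fd {fd} is not open")
        elif events & select.POLLIN:
            ready.append(fd)
    return ready


r, w = os.pipe()
print(wait_readable([r], 0))   # nothing written yet, so nothing is ready
os.write(w, b"x")
print(wait_readable([r], 0))   # now the read end is reported
```
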
> So... anyone interested? Any brilliant thoughts? AFAICT we can rest
> assured that there is a bug here somewhere, because otherwise we wouldn't
> be able to get "errno 9 Bad file descriptor", right?
>
> regards, Göran
>
> PS. Googling this in relation to the Squeak VM didn't show any recent
> hits, only some from Stephen Pair fixing a bug in X11 code etc.
>



-- 
best,
Eliot