[Vm-dev] Squeak socket problem ... help!

Fri Oct 10 00:50:24 UTC 2014

On Thu, Oct 9, 2014 at 5:09 PM, Göran Krampe <goran at krampe.se> wrote:

>
> Hi guys!
>
> On 10/10/2014 01:28 AM, David T. Lewis wrote:
>
>> Ron and I (3DICC) have a problem with the Unix VM networking and I am
>>>> reaching out before burning too many hours on something one of you
>>>> C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your
>>>> trouble.
>>>>
>>>
>>> Cool.  This is likely easy to fix.  Your image is running out of file
>>> descriptors.  Track open and close calls, e.g. add logging around at
>>> least StandardFileStream>>#primOpen:writable:,
>>> AsyncFile>>#primOpen:forWrite:semaIndex:,
>>> Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:
>>> and their associated close calls and see what's being opened without
>>> being closed.  It should be easy to track down, but may be more difficult
>>> to fix.
>>>
>>> Good luck!
>>>
>>
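The open/close bookkeeping suggested above can be sketched roughly as follows. This is a language-neutral illustration in Python, not Squeak code: `HandleTracker` and its methods are hypothetical names, and in the image the equivalent logging would wrap the primitives listed above and their close counterparts.

```python
import traceback
from collections import Counter


class HandleTracker:
    """Remembers where each handle was opened so unclosed ones can be listed.

    Hypothetical helper for illustration; `key` is any hashable handle id,
    such as a descriptor number.
    """

    def __init__(self):
        self._open = {}  # key -> (kind, "function:line" of the opener)

    def opened(self, key, kind):
        # extract_stack(limit=2) yields [caller, this method]; keep the caller.
        caller = traceback.extract_stack(limit=2)[0]
        self._open[key] = (kind, f"{caller.name}:{caller.lineno}")

    def closed(self, key):
        # Closing an unknown key is ignored rather than treated as an error.
        self._open.pop(key, None)

    def open_count(self):
        return len(self._open)

    def leaks_by_site(self):
        """Count still-open handles grouped by (kind, opening site)."""
        return Counter(self._open.values())
```

Dumping `leaks_by_site()` periodically should show which opening site keeps accumulating handles that never see a matching close.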
> Aha. Soo... am I understanding this correctly - we are probably leaking
> fds and when we go above 1024 this makes select() go bonkers and eventually
> leads to the "Bad file descriptor" error?

I'm not sure, but that you're needing to up the per-process file handle
limit is worrying.  That you should diagnose first.  If you solve it (and
it's likely to be easy; you can do things like maintain sets of open
sockets, etc, and close the least recently used when reaching some
high-tide mark, etc) then I suspect the select problems will go away too.
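A minimal sketch of the "close the least recently used at a high-tide mark" idea, written in Python for illustration only. `LruHandlePool`, `high_tide`, and `close_fn` are hypothetical names; the real server would apply the same policy to its Socket instances.

```python
from collections import OrderedDict


class LruHandlePool:
    """Keeps at most `high_tide` handles open, closing the least recently used."""

    def __init__(self, high_tide, close_fn):
        self.high_tide = high_tide
        self.close_fn = close_fn          # called with each evicted handle
        self._handles = OrderedDict()     # key -> handle, oldest use first

    def add(self, key, handle):
        # Replacing an existing key closes the handle it previously held.
        old = self._handles.pop(key, None)
        if old is not None:
            self.close_fn(old)
        self._handles[key] = handle
        # Evict from the least recently used end until under the mark.
        while len(self._handles) > self.high_tide:
            _, oldest = self._handles.popitem(last=False)
            self.close_fn(oldest)

    def touch(self, key):
        """Mark a handle as just used so it is evicted last."""
        self._handles.move_to_end(key)

    def remove(self, key):
        handle = self._handles.pop(key, None)
        if handle is not None:
            self.close_fn(handle)

    def __len__(self):
        return len(self._handles)
```

Whether evicting a live connection is acceptable depends on the application; the sketch only shows the bookkeeping, not a policy for which sessions may be dropped.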

>> I agree with what Eliot is saying and would add a few thoughts:
>>
>
>  - Don't fix the wrong problem (DFtWP). Unless you have some reason to
>> believe that this server application would realistically have a need to
>> handle anything close to a thousand concurrent TCP sessions, don't fix
>> it by raising the per-process file handle limit, and don't fix it by
>> reimplementing the socket listening code.
>>
>
> We haven't done the exact numbers, but we could probably hit several
> hundreds concurrent at least. 1024 seemed a bit "over the top" though :)
>

But each connexion could have a few sockets, and then there may be file
connexions etc.  Best look see what you have there and sanity-check.
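The sanity check can start as simple arithmetic. In the sketch below only the figure of 600 users comes from this thread; the per-user socket count and the baseline are made-up placeholders to be replaced with measured values, and `estimate_fds` is a hypothetical helper.

```python
def estimate_fds(users, sockets_per_user, files_per_user=0, baseline=0):
    """Rough count of descriptors needed for a given number of users.

    `baseline` covers descriptors the process holds regardless of load
    (image file, logs, listening sockets). All inputs are estimates.
    """
    return baseline + users * (sockets_per_user + files_per_user)


# Assumed for illustration: 2 sockets per user and 50 baseline descriptors.
needed = estimate_fds(users=600, sockets_per_user=2, baseline=50)
over_default_limit = needed > 1024
```

With those assumed figures, 600 users would already need more than 1024 descriptors without any leak, which is why the real per-connexion numbers are worth measuring.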

>
> The system in question is meant to serve more than 1000 concurrent users,
> so we are in fact moving into this territory. We have been up to around 600
> so far.
>
>  - It is entirely possible that no one before you has ever tried to run
>> a server application with the per-process file handle limit bumped up
>> above the default 1024. So if that configuration does not play nicely
>> with the select() mechanism, you may well be the first to have encountered
>> this as an issue. But see above, don't fix it if it ain't broke.
>>
>
> Well, it most probably *is* broke - I mean - I haven't read anywhere that
> our Socket code is limited to 1024 concurrent sockets and that going above
> that limit causes the Socket code to stop working? :)
>

"Broke" is the wrong word.  You're running into a soft resource limit that
you can raise.  But you should only raise it if you know the server code is
correct, because raising the limit when incorrect code causes a gradual
increase in open file descriptors will simply postpone the inevitable.  And
the longer the server appears to run healthily the more mysterious and
annoying the crash may appear :-).
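For reference, the soft and hard limits can be inspected from within a process. The sketch below uses Python's standard `resource` module on Unix; `nofile_limits` and `raise_soft_limit` are hypothetical helper names. For the Squeak VM itself the soft limit would normally be set with `ulimit -n` in the shell that launches it.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limit on open descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft limit to `target`, capped at the hard limit.

    Never lowers the limit; returns the limits in force afterwards.
    """
    soft, hard = nofile_limits()
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return nofile_limits()
```

Separately from the soft limit, glibc on Linux defines `FD_SETSIZE` as 1024, and `select()` cannot be used with descriptors at or above that value. Whether that accounts for the behaviour reported here is not established in this thread.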

> But I agree - I don't want to touch that code if we can simply avoid this
> bug by making sure we stay below 1024.
>

You may have to go above 1024 when you have >= 1024 users (or 512 or
whatever). But you should understand the relationship between connexions and
open file descriptors and know that there are no leaks before you up the
limit.

> But it sounds broke to me, nevertheless. ;)

If it is broke it is the OS that is broke.  But not really.  Having a soft
limit is a great way to find problems (leaks) while providing flexibility.
With no limit the system runs until catastrophe, probably bringing the OS
down with it.  This is a bit like Smalltalk catching infinite recursion
through the low space condition; by the time the recursion is stopped
there's precious little resource to do anything about it (not to mention
that the stack will be huge).  So only lift the limit when you know the
code is correct and you know you need more resources to serve the
anticipated number of clients.

>>  - Most "out of file descriptor" problems involve resource leaks (as Eliot
>> is suggesting), and in those cases you will see a gradual increase in file
>> descriptors in /proc/<vmpid>/fd/ over time. Eventually you run out of
>> descriptors and something horrible happens.
>>
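Watching /proc/<vmpid>/fd/ over time can be automated. A minimal sketch, assuming Linux and read access to the target process's /proc entry; `open_fd_summary` is a hypothetical helper name.

```python
import os
from collections import Counter


def open_fd_summary(pid):
    """Count a process's open descriptors by kind, read from /proc (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        if target.startswith("socket:"):
            kinds["socket"] += 1
        elif target.startswith("pipe:"):
            kinds["pipe"] += 1
        else:
            kinds["file/other"] += 1
    return kinds
```

Sampling this every minute or so and logging the counts gives the trend: a socket count that keeps climbing while the number of connected users stays flat points to a leak.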
>
> We will start looking at that and other tools too.
>
>
>  - Sorry to repeat myself but this is by far the most important point:
>> DFtWP.
>>
>
> Sure :). This is why I posted - to get your input. And I have a suspicion
> that the SAML issue I mentioned may be the code leaking, we will start
> looking.
>
> regards, Göran
>

-- 
best,
Eliot