<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Oct 9, 2014 at 5:09 PM, Göran Krampe <span dir="ltr">&lt;<a href="mailto:goran@krampe.se" target="_blank">goran@krampe.se</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

Hi guys!<span class=""><br>

<br>

On 10/10/2014 01:28 AM, David T. Lewis wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Ron and I (3DICC) have a problem with the Unix VM networking and I am<br>

reaching out before burning too many hours on something one of you<br>

C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your<br>

trouble.<br>

</blockquote>

<br>

Cool.  This is likely easy to fix.  Your image is running out of file<br>

descriptors.  Track open and close calls, e.g. add logging around at<br>

least StandardFileStream&gt;&gt;#primOpen:writable:,<br>

AsyncFile&gt;&gt;#primOpen:forWrite:semaIndex:,<br>

Socket&gt;&gt;#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:<br>

and their associated close calls and see what&#39;s being opened without being<br>

closed.  It should be easy to track down, but may be more difficult to fix.<br>

<br>

Good luck!<br>

</blockquote></blockquote>
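<div>The open/close bookkeeping suggested above can be sketched roughly as follows. This is a Python analogy, not Squeak code: the <code>HandleTracker</code> class and its method names are hypothetical, and in the image the equivalent calls would sit around the open primitives listed above and their close counterparts.</div>
<pre>
```python
import traceback


class HandleTracker:
    """Remembers every handle that was opened and not yet closed."""

    def __init__(self):
        # handle id -> (label, short stack summary captured at open time)
        self._open = {}

    def opened(self, handle_id, label):
        # Drop the last frame, which is this method itself.
        stack = "".join(traceback.format_stack(limit=4)[:-1])
        self._open[handle_id] = (label, stack)

    def closed(self, handle_id):
        # pop() with a default tolerates a close that was never tracked.
        self._open.pop(handle_id, None)

    def leaks(self):
        """Handles that were opened but never closed, sorted by id."""
        return sorted((hid, label) for hid, (label, _) in self._open.items())


# Hypothetical usage: ids 3 and 4 stand in for real descriptors.
tracker = HandleTracker()
tracker.opened(3, "file")
tracker.opened(4, "socket")
tracker.closed(3)
print(tracker.leaks())  # [(4, 'socket')]
```
</pre>
<div>Whatever is still listed after a period of normal traffic is a candidate leak, and the stored stack summary says where it was opened.</div>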

<br></span>

Aha. Soo... am I understanding this correctly - we are probably leaking fds and when we go above 1024 this makes select() go bonkers and eventually leads to the &quot;Bad file descriptor&quot; error?</blockquote><div><br></div><div>I&#39;m not sure, but the fact that you need to raise the per-process file handle limit is worrying.  Diagnose that first.  If you solve it (and it&#39;s likely to be easy; you can do things like maintain sets of open sockets and close the least recently used when reaching some high-tide mark, etc) then I suspect the select problems will go away too.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I agree with what Eliot is saying and would add a few thoughts:<br>

</blockquote>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

- Don&#39;t fix the wrong problem (DFtWP). Unless you have some reason to<br>

believe that this server application would realistically have a need to<br>

handle anything close to a thousand concurrent TCP sessions, don&#39;t fix<br>

it by raising the per-process file handle limit, and don&#39;t fix it by<br>

reimplementing the socket listening code.<br>

</blockquote>

<br></span>

We haven&#39;t done the exact numbers, but we could probably hit several hundreds concurrent at least. 1024 seemed a bit &quot;over the top&quot; though :)<br></blockquote><div><br></div><div>But each connexion could have a few sockets, and then there may be file connexions etc.  Best look see what you have there and sanity-check.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

The system in question is meant to serve more than 1000 concurrent users, so we are in fact moving into this territory. We have been up to around 600 so far.<span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

- It is entirely possible that no one before you has ever tried to run<br>

a server application with the per-process file handle limit bumped up<br>

above the default 1024. So if that configuration does not play nicely<br>

with the select() mechanism, you may well be the first to have encountered<br>

this as an issue. But see above, don&#39;t fix it if it ain&#39;t broke.<br>

</blockquote>

<br></span>

Well, it most probably *is* broke - I mean - I haven&#39;t read anywhere that our Socket code is limited to 1024 concurrent sockets and that going above that limit causes the Socket code to stop working? :)<br></blockquote><div><br></div><div>&quot;Broke&quot; is the wrong word.  You&#39;re running into a soft resource limit that you can raise.  But you should only raise it if you know the server code is correct, because raising the limit when incorrect code causes a gradual increase in open file descriptors will simply postpone the inevitable.  And the longer the server appears to run healthily the more mysterious and annoying the crash may appear :-).</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

But I agree - I don&#39;t want to touch that code if we can simply avoid this bug by making sure we stay below 1024.<br></blockquote><div><br></div><div>You may have to go above 1024 when you have &gt;= 1024 users (or 512 or whatever). But you should understand the relationship between connexions and open file descriptors and know that there are no leaks before you raise the limit.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">But it sounds broke to me, nevertheless. ;)</blockquote><div><br></div><div>If it is broke it is the OS that is broke.  But not really.  Having a soft limit is a great way to find problems (leaks) while providing flexibility.  With no limit the system runs until catastrophe, probably bringing the OS down with it.  This is a bit like Smalltalk catching infinite recursion through the low space condition; by the time the recursion is stopped there&#39;s precious little resource to do anything about it (not to mention that the stack will be huge).  So only lift the limit when you know the code is correct and you know you need more resources to serve the anticipated number of clients.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

- Most &quot;out of file descriptor&quot; problems involve resource leaks (as Eliot<br>

is suggesting), and in those cases you will see a gradual increase in file<br>

descriptors in /proc/&lt;vmpid&gt;/fd/ over time. Eventually you run out of<br>

descriptors and something horrible happens.<br>

</blockquote>
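<div>Watching the descriptor count in /proc/&lt;vmpid&gt;/fd/ over time, as described above, could look something like this minimal sketch. It assumes a Linux /proc filesystem; the helper names <code>count_open_fds</code> and <code>fd_growth</code> are hypothetical, and the sample numbers are made up for illustration.</div>
<pre>
```python
import os


def count_open_fds(pid="self"):
    """Number of entries in /proc/PID/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_growth(samples):
    """Differences between successive samples of the descriptor count."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


# Made-up counts, as if sampled once a minute from the VM process.
print(fd_growth([212, 215, 221, 230]))  # [3, 6, 9]
```
</pre>
<div>A count that keeps climbing while the number of connected users stays flat points at a leak rather than at genuine load.</div>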

<br></span>

We will start looking at that and other tools too.<span class=""><br>

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

- Sorry to repeat myself but this is by far the most important point: DFtWP.<br>

</blockquote>

<br></span>

Sure :). This is why I posted - to get your input. And I have a suspicion that the SAML issue I mentioned may be the code leaking, we will start looking.<br>

<br>

regards, Göran<br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>best,<div>Eliot</div>
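<div>P.S. The "close the least recently used when reaching some high-tide mark" idea from earlier in this message could be sketched as below. It is a Python illustration under stated assumptions, not Squeak code: <code>LruHandlePool</code>, <code>touch</code> and <code>FakeHandle</code> are hypothetical names, and each key is assumed to map to one handle for its whole lifetime.</div>
<pre>
```python
from collections import OrderedDict


class LruHandlePool:
    """Keeps at most high_tide handles open, closing the least recently used."""

    def __init__(self, high_tide):
        self.high_tide = high_tide
        self._handles = OrderedDict()  # key -> handle, least recently used first

    def touch(self, key, handle):
        """Register a handle, or mark an existing one as just used."""
        self._handles[key] = handle
        self._handles.move_to_end(key)
        # Evict from the old end until we are back at the mark.
        while len(self._handles) > self.high_tide:
            _, oldest = self._handles.popitem(last=False)
            oldest.close()

    def __len__(self):
        return len(self._handles)


class FakeHandle:
    """Stand-in for a socket; only records whether close() was called."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


a, b, c = FakeHandle(), FakeHandle(), FakeHandle()
pool = LruHandlePool(high_tide=2)
pool.touch("a", a)
pool.touch("b", b)
pool.touch("a", a)  # "a" is now the most recently used
pool.touch("c", c)  # over the mark: "b" is the oldest, so it gets closed
print(b.closed, a.closed, len(pool))  # True False 2
```
</pre>
<div>This only caps the number of open handles; it does not replace finding out why handles are left open in the first place.</div>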

</div></div>