[Vm-dev] Squeak socket problem ... help!
Göran Krampe
goran at krampe.se
Fri Oct 10 09:46:40 UTC 2014
Hi guys!
Long post but an interesting one. Summary for the impatient:
- Ron seems to prove that:
- We never reach 1024 fds so no leak.
- But select() still fails, so we probably have an obscure close() bug
with fds.
- And I prove that select() is indeed broken above 1024.
- And some other interesting details like why "Socket new" creates *two*
instances of Socket etc.
Details follow:
On 10/10/2014 05:56 AM, Ron Teitelbaum wrote:
> Hi All!
>
> Thanks for your help!!
>
> Trying to reproduce the error I ran 1000 clients against our test cluster of
> 7 servers.
Hehe, this is fairly interesting in itself - Ron is firing up a whole
slew of machines on EC2 and is in fact running 1000 Terf clients on ...
50 servers, I think - and they actually run under Wine on Ubuntu :)
> The fd size hovers around 250 and never made it above 490 (there were 4
> service providers so the main server should only handle around 250 people).
Very interesting. Thus, this is NOT the 1024 limitation of select()
then, nor are we leaking. Unfortunately - since it implies a harder bug
to find.
> David, Thanks for the pointers on how to find the fd size. Very cool.
> Since the fd size seemed to be well within proper limits and went both up
> and down I think you are correct we may be looking at the wrong thing.
Well, David seems to have been both right and wrong - I guessed the
wrong thing: it's probably not the 1024 limit of select() that is the
*cause* of our problem - but it's not a leak either, more likely an
obscure close bug. Right? :)
NOTE: After writing this whole post I am not so sure. Given that I can
reproduce the errno 9 spewage by simply crossing the 1024 limit (see
below) it would still be nice to verify that this is NOT what is
happening when you throw 1000 users at it. You did check the fd count
using David's instructions - can we be sure we stay well below 1024?
Select() *is* broken
====================
Further, just so we don't drop this ball, there *is* a problem with
select() if we raise the ulimit above 1024. If I first run this trivial
code (Apache answers on localhost:80) with the default limit of 1024:
| local |
local := NetNameResolver addressForName: 'localhost' timeout: 20.
(1 to: 1040) collect: [:n | Socket new connectTo: local port: 80]
And do an inspect on that - it will fail when it gets to 1017 (some
descriptors are already taken, I presume) due to the default ulimit of
1024 - as expected. It fails a bit inconsistently with different
errors, but anyway, nothing really surprising.
So... then I raised my limits so that "ulimit -S" reports 2048 for my
user. Then tried again, and behold!!! When I pass 1017 my stderr starts
spewing out:
errno 9
select: Bad file descriptor
errno 9
select: Bad file descriptor
errno 9
select: Bad file descriptor
errno 9
select: Bad file descriptor
errno 9
select: Bad file descriptor
...continuously, so yes, it fails on every aioPoll().
And in Squeak a debugger pops up about 45 seconds later with a
connection timeout - also expected, since the next Socket to open never
connects: select() isn't working anymore, so Squeak is never notified
that the connection has been made.
Now that errno 9 is spewing continuously the VM is still running and
Squeak "works". But if I try to connect somewhere - I just time out.
Testing in Pharo 2.0 with pharo VM
==================================
So let's try in Pharo 2.0. Test code:
| local |
local := NetNameResolver addressForName: 'localhost' timeout: 20.
(1 to: 1017) collect: [:n | Socket new connectTo: local port: 80]
Note that 1016 works, over and over after a Smalltalk garbageCollect, no
problem. But if we try it with 1017 we get a smack in the face:
./pharo shared/Pharo2.0.image
*** buffer overflow detected ***: /home/gokr/pharo/pharo2.0/bin/pharo
terminated
======= Backtrace: =========
/lib/i386-linux-gnu/libc.so.6(+0x696de)[0xf757f6de]
/lib/i386-linux-gnu/libc.so.6(__fortify_fail+0x6b)[0xf761213b]
/lib/i386-linux-gnu/libc.so.6(+0xfafca)[0xf7610fca]
/lib/i386-linux-gnu/libc.so.6(+0xfc02a)[0xf761202a]
/home/gokr/pharo/pharo2.0/bin/pharo(aioEnable+0x31)[0x80978d1]
/home/gokr/pharo/pharo2.0/bin/pharo(sqSocketConnectToPort+0xd6)[0x80f38b6]
/home/gokr/pharo/pharo2.0/bin/pharo[0x80f1518]
[0xb70bd770]
[0xb716fbd9]
[0xb716fb16]
[0xb716d53a]
[0xb70bd700]
[0xb70bd5c0]
.... etc
That's interesting! So trying again with the latest Cog from
mirandabanda...
Testing in Pharo 2.0 with Cog from Eliot
========================================
Aha, so here we first need to execute this (too few external semaphores
by default, only 256, and I think we need about 3 per Socket?):
VirtualMachine maxExternalSemaphoresSilently: 8192
...and then we get the same behavior as using our oldish Cog, namely it
starts spewing out the "errno 9 select: Bad file descriptor" over and
over and the next Socket connection times out.
So... back to you Eliot - am I still wrong? :) :)
Extra Oddity
============
Socket allInstances size => 2038
(Socket allInstances select: [:e | e isConnected]) size => 2032
(Socket allInstances collect: [:e | e socketHandle]) asSet size => 1019
...what? After some investigation it turns out that "Socket new" will
create *two* instances of Socket. This has to do with the WeakRegistry
used in the Socket class :)
executor
"Return an object which can act as executor for finalization of the
receiver"
^self shallowCopy actAsExecutor
...so each Socket gets a shallow copy of itself as the executor. Ok,
that explains it - but IMHO it sure ain't pretty.
> Could be that we are closing a socket twice? (see 0xffa5e0cc M
> Socket>destroy 0xd9ba02d4: a(n) Socket) Göran mentioned finding something
> that said on a very busy server you may try closing an already closed socket
> that had already been reused. On not so busy system you can get away with
> it so you may not see this error until you put a very large load on it.
>
> I captured the error:
>
> This repeated about 1000 times (no idea why it would repeat so many times):
>
> errno 9
> select: Bad file descriptor
It repeats because we call select() many times per second and the bad
(closed) fd is still in the set we pass in, so it barfs every time -
and since it barfs it no longer does its job, this one bad fd blocks
everything, and once this hits, the VM goes unresponsive as far as file
handles are concerned.
I guess one could "clean out" the bad fd if this happens - but that is
of course a hack, and I am not sure how to do it.
Then exactly why the segfault comes, who knows. :)
And oh, the SAML auth that we thought was leaking - no. It just uses
WebClient httpGet and doesn't hold onto anything. Some trivial tests
show that those objects should be GCed properly - although yes, John's
remark about GC coming "later" is also interesting.
regards, Göran