[Vm-dev] Squeak socket problem ... help!

Fri Oct 10 09:46:40 UTC 2014

Hi guys!

Long post but an interesting one. Summary for the impatient:

- Ron seems to prove that:
	- We never reach 1024 fds so no leak.
	- But select() still fails, so we probably have an obscure close() bug 
with fds.

- And I prove that select() is indeed broken above 1024.
- And some other interesting details like why "Socket new" creates *two* 
instances of Socket etc.

Details follows:

On 10/10/2014 05:56 AM, Ron Teitelbaum wrote:
> Hi All!
>
> Thanks for your help!!
>
> Trying to reproduce the error I ran 1000 clients against our test cluster of
> 7 servers.

Hehe, this is fairly interesting in itself - Ron is firing up a whole 
slew of machines on EC2 and is in fact running 1000 Terf clients on ... 
50 servers? I think, and they actually run under Wine in Ubuntu :)

> The fd size hovers around 250 and never made it above 490 (there were 4
> service providers so the main server should only handle around 250 people).

Very interesting. Thus, this is NOT the 1024 limitation of select() 
then, nor are we leaking. Unfortunately - since it implies a harder bug 
to find.

> David, Thanks for the pointers on how to find the fd size.  Very cool.
> Since the fd size seemed to be well within proper limits and went both up
> and down I think you are correct we may be looking at the wrong thing.

Well, David seems to have been both right and wrong - I guessed the 
wrong thing, its probably not the 1024-limit of select() that is the 
*cause* of our problem - but its not a leak, more likely an obscure 
close bug. Right? :)

NOTE: After writing this whole post I am not sure. Given that I can 
reproduce the errorno 9 spewage by simply crossing the 1024 limit (see 
below) it would still be nice to verify that is NOT what is happening 
when you throw 1000 users at it. You did check the fd count using 
David's instructions, can we be sure that we are way below?

Select() *is* broken
====================

Further, just so we don't drop this ball, there *is* a problem with 
select() if we raise the ulimit above 1024. If I first run this trivial 
code (Apache answers on localhost:80) with the default limit of 1024:

| local |
local := NetNameResolver addressForName: 'localhost' timeout: 20.
(1 to: 1040) collect: [:n | Socket new connectTo: local port: 80]

And do an inspect on that - it will fail when it gets to 1017 (some 
descriptors already taken I presume) due to the default ulimit of 1024 - 
as expected. It fails a bit inconsistently with different errors, but 
anyway, nothing really surprising.

So... then I raised my limits so that "ulimit -S" reports 2048 for my 
user. Then tried again, and behold!!! When I pass 1017 my stderr starts 
spewing out:

errno 9
select: Bad file descriptor
errno 9
select: Bad file descriptor
errno 9
select: Bad file descriptor
errno 9
select: Bad file descriptor
errno 9
select: Bad file descriptor

...continuously, so yes, it fails on every aioPoll().

And in Squeak there is a debugger that pops up about 45 seconds later 
(timeout) with a connection timeout - also expected - since the next 
Socket to open doesn't connect since the select() isn't working anymore 
so Squeak never gets notified about the connection being made.

Now that the errno 9 is spewing continuosly the VM is still running and 
Squeak "works". But if I try to connect somewhere - I just timeout.

Testing in Pharo 2.0 with pharo VM
==================================

So let's try in Pharo 2.0. Test code:

| local |
local := NetNameResolver addressForName: 'localhost' timeout: 20.
(1 to: 1017) collect: [:n | Socket new connectTo: local port: 80]

Note that 1016 works, over and over after a Smalltalk garbageCollect, no 
problem. But if we try it with 1017 we get a smack in the face:

./pharo shared/Pharo2.0.image
*** buffer overflow detected ***: /home/gokr/pharo/pharo2.0/bin/pharo 
terminated
======= Backtrace: =========
/lib/i386-linux-gnu/libc.so.6(+0x696de)[0xf757f6de]
/lib/i386-linux-gnu/libc.so.6(__fortify_fail+0x6b)[0xf761213b]
/lib/i386-linux-gnu/libc.so.6(+0xfafca)[0xf7610fca]
/lib/i386-linux-gnu/libc.so.6(+0xfc02a)[0xf761202a]
/home/gokr/pharo/pharo2.0/bin/pharo(aioEnable+0x31)[0x80978d1]
/home/gokr/pharo/pharo2.0/bin/pharo(sqSocketConnectToPort+0xd6)[0x80f38b6]
/home/gokr/pharo/pharo2.0/bin/pharo[0x80f1518]
[0xb70bd770]
[0xb716fbd9]
[0xb716fb16]
[0xb716d53a]
[0xb70bd700]
[0xb70bd5c0]
.... etc

That's interesting? So trying again with the latest Cog from 
mirandabanda...

Testing in Pharo 2.0 with Cog from Eliot
========================================

Aha, so here we first need to execute (too few semaphores, only 256 and 
I think we need about 3 per Socket?):

VirtualMachine maxExternalSemaphoresSilently: 8192

...and then we get the same behavior as using our oldish Cog, namely it 
starts spewing out the "errno 9 select: Bad file descriptor" over and 
over and the next Socket connection times out.

So... back to you Eliot - am I still wrong? :) :)

Extra Oddity
============

Socket allInstances size => 2038
(Socket allInstances select: [:e | e isConnected]) size => 2032
(Socket allInstances collect: [:e | e socketHandle]) asSet size => 1019

...what? After some investigation it turns out that "Socket new" will 
create *two* instances of Socket. This has to do with WeakRegistry used 
in the Socket class :)

executor
	"Return an object which can act as executor for finalization of the 
receiver"
	^self shallowCopy actAsExecutor

...so each Socket gets a shallow copy of itself as the executor. Ok, 
that explains it - but IMHO it sure ain't pretty.

> Could be that we are closing a socket twice?  (see 0xffa5e0cc M
> Socket>destroy 0xd9ba02d4: a(n) Socket)  Göran mentioned finding something
> that said on a very busy server you may try closing an already closed socket
> that had already been reused.  On not so busy system you can get away with
> it so you may not see this error until you put a very large load on it.
>
> I captured the error:
>
> This repeated about 1000 times (no idea why it would repeat so many times):
>
> errno 9
> select: Bad file descriptor

It repeats because we call select() many times per second and the bad 
(closed) fd is still in that set we send in, so it barfs every time and 
since it barfs it doesn't do its job anymore so this bad fd is blocking 
everything - and once this hits the VM goes unresponsive when it comes 
to file handles.

I guess one could "clean out" the bad fd if this happens - but that is 
of course a hack and not sure how to do it.

Then exactly why the segfault comes, who knows. :)

And oh, the SAML auth that we thought was leaking - no. It just uses 
WebClient httpGet and doesn't hold onto anything. Some trivial tests 
shows that they should be GCed properly - although yes, John's remark 
about GC coming "later" is also interesting.

regards, Göran