[Seaside] Squeak async i/o performance, was Re: HTTP Performance

Stephen Pair stephen at pairhome.net
Thu Nov 20 16:36:35 UTC 2003


Ok...maybe I'm crazy (or maybe this exercise is making me crazy)...but I 
now have the following data points to add:

I ran the following test, "ab -n 1024 -c 16 <url>" (1024 requests at a 
concurrency of 16), against both Comanche and Apache on both Windows 
and Linux.

----
Windows:
ab running from a Linux 1 GHz PIII laptop to a Windows 1.8 GHz PII 
desktop... Apache serving a very small hello world file, Comanche serving 
a hello world string (in memory):

Apache 2.0.45: scored consistently between 92 and 131 rps (no failed 
requests)
Comanche 6.2 with priority set to 80: erratic between 100 and 1500 rps, 
with a varying number of failed requests
----
Linux:
ab running from a Linux 1 GHz PIII laptop to a server on the same 
machine... Apache serving a very small hello world file, Comanche serving 
a hello world string (in memory):

Apache 1.3.29: scored consistently around 240 rps (no failed requests)
Comanche 6.2 with priority set to 80: scored consistently around 230 rps 
(no failed requests)
----

What does this mean?  Well, for one, run your Comanche server at a high 
priority.  I think it also means we don't really have the performance 
issues we thought we had.  Also, given that we got lots of failed 
requests and erratic performance (sometimes good, sometimes excellent) 
on the Windows VM, I'd say there's something quirky in the Windows 
socket code.  On Unix, Comanche never had any failed requests and 
performance was not erratic from one test to the next (so run Comanche 
servers on the Unix VM).  And finally, unless something is confounding 
my tests, it means that Comanche's performance is roughly equal to 
Apache's when you take file I/O out of the equation and bump its 
priority up to 80 (Apache is most likely caching the test file in 
memory).
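
For what it's worth, here is a minimal sketch (not from the setup above, 
just an illustration) of running a server loop at priority 80 in Squeak; 
the block body is a hypothetical stand-in for however your Comanche 
service actually accepts and dispatches requests:

"Processor timingPriority answers 80 in a stock Squeak image."
| serverProcess |
serverProcess := [
	"accept/dispatch loop goes here (placeholder, not Comanche's real loop)"
	(Delay forSeconds: 1) wait ] forkAt: Processor timingPriority.

"Or raise the priority of a process that is already running:"
serverProcess priority: Processor timingPriority.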

- Stephen



Andreas Raab wrote:

>>This test really only measures socket connection performance.
>
>Which is 99% bounded by the interface between VM and image. Precisely the
>thing that I wanted to measure. So if there's latency between VM and image
>then this latency should become visible here (note that both connect as well
>as accept require state changes on the socket).
>
>
>>My guess is that something is causing an excessive latency between
>>the time a socket semaphore is marked for signaling in the VM and
>>when the waiting squeak process actually wakes up.
>
>That's possible but not likely according to the results I got. First of all,
>the results show that no matter how hard you try you can't get above 1k
>connections per second (bounded by the time it takes to create and destroy a
>socket). Secondly, if there were latency involved, it should show up as
>reduced CPU usage - however, the CPU maxes out during the test, which is a
>clear indication that Squeak is doing "something". Since the script is run
>in the main process during this test, there is no UI interference. Given
>that there were no other "interesting" processes active, it is reasonable
>to assume that all of the (maxed-out CPU) time is spent within Squeak -
>which means within the primitives observed by the tally, which means that
>latency is probably not the limiting factor. Q.E.D. ;-)
>
>And of course, all of this is on Windows. YMMV.
>
>Cheers,
>  - Andreas
>
>>-----Original Message-----
>>From: seaside-bounces at lists.squeakfoundation.org 
>>[mailto:seaside-bounces at lists.squeakfoundation.org] On Behalf 
>>Of Stephen Pair
>>Sent: Thursday, November 20, 2003 3:53 PM
>>To: The general-purpose Squeak developers list
>>Cc: 'The Squeak Enterprise Aubergines Server - general discussion.'
>>Subject: Re: [Seaside] Squeak async i/o performance, was Re: 
>>HTTP Performance
>>
>>
>>This test really only measures socket connection performance.  RPS as
>>measured by ab (the Apache benchmark tool) is a full HttpRequest and
>>HttpResponse, so that is quite a different thing from what this test
>>measures.  The ab tests we were running only open relatively few
>>socket connections (I was using 4, 8, 16, 32, etc).  The number of
>>simultaneous socket connections seems to have little effect on the HTTP
>>RPS throughput.  Measuring HTTP rps on both inbound connections (to
>>Comanche) and outbound (to an Apache server) yields similar results in
>>terms of RPS...even though ab can hit Apache with over 1000 RPS, Squeak
>>can only hit Apache with ~50 RPS...and we seem to spend a lot of time
>>(~70%) during these benchmarks waiting on socket semaphores.  My guess
>>is that something is causing an excessive latency between the time a
>>socket semaphore is marked for signaling in the VM and when the waiting
>>Squeak process actually wakes up.
>>
>>- Stephen
>>
>>Andreas Raab wrote:
>>
>>>Hi Bruce,
>>>
>>>>I poked at this today and I don't quite get the math to work.
>>>
>>>Interesting you should say this. My impression after running a bunch of
>>>tests with Avi and Stephen yesterday was very similar. I can say for sure
>>>that there's something wrong here but it's not quite clear what. I've been
>>>running an interesting little benchmark which might be good to run on other
>>>machines as well. Here it is:
>>>
>>>"--- the server (run in one image) -----"
>>>server := Socket newTCP.
>>>server listenOn: 12345 backlogSize: 4.
>>>
>>>"wait for the first connection (this allows us to profile both ends)"
>>>[client := server waitForAcceptFor: 10 ifTimedOut:[nil].
>>>client == nil] whileTrue.
>>>client destroy.
>>>
>>>"Now do 1k connections"
>>>MessageTally spyOn:[
>>>	1 to: 1000 do:[:i|
>>>		client := server waitForAcceptFor: 10 ifTimedOut:[nil].
>>>		client destroy.
>>>	].
>>>].
>>>"clean up"
>>>server destroy.
>>>
>>>
>>>"--- the client (run in another one) ---"
>>>addr := NetNameResolver addressForName: '127.0.0.1'.
>>>
>>>"wait for the first connection (to sync with server)"
>>>[client := Socket newTCP.
>>>[client connectTo: addr port: 12345] on: ConnectionTimedOut do:[:ex| ex return].
>>>client isConnected] whileFalse:[client destroy].
>>>
>>>"Now do 1k connections"
>>>MessageTally spyOn:[
>>>	client := Socket newTCP.
>>>	1 to: 1000 do:[:i|
>>>		client := Socket newTCP.
>>>		client connectTo: addr port: 12345.
>>>		client destroy.
>>>	].
>>>].
>>>
>>>Running this shows profiles on both server and client side as well as
>>>(indirectly) the number of connections established. It turns out that this
>>>runs in ~4secs on my machine which makes for about 250 connections per
>>>second. Looking closer, it looked as if much of the time is spent in socket
>>>creation and destroy (which admittedly is a heavy-weight operation on
>>>Windows) and measuring it shows this clearly:
>>>
>>>[1 to: 1000 do:[:i| Socket newTCP destroy]] timeToRun
>>>
>>>This takes about a second (after some warmup time ;) which means that we
>>>need about 1ms for creating and destroying a single socket. Since the above
>>>was run on a single machine it means that 50% of the time in the benchmark
>>>is spent in socket creation/destroy (2k sockets used which means 2secs
>>>overhead out of the 4secs the benchmark took).
>>>
>>>Definitely something to improve but it still doesn't say why Comanche
>>>doesn't get more than 50rps.
>>>
>>>Cheers,
>>> - Andreas
>>>
>>>>-----Original Message-----
>>>>From: seaside-bounces at lists.squeakfoundation.org 
>>>>[mailto:seaside-bounces at lists.squeakfoundation.org] On Behalf 
>>>>Of Bruce ONeel
>>>>Sent: Thursday, November 20, 2003 1:58 PM
>>>>To: The Squeak Enterprise Aubergines Server - general discussion.
>>>>Cc: squeak-dev at lists.squeakfoundation.org
>>>>Subject: [Seaside] Squeak async i/o performance, was Re: 
>>>>HTTP Performance
>>>>
>>>>
>>>>Avi Bryant <avi at beta4.com> wrote:
>>>>
>>>>>Like Stephen, I'm intrigued by your 50rps figure for Comanche - that's
>>>>>the number that I always seem to come up with as well.  There *must* be
>>>>>something throttling that, and we need to get to the bottom of it.
>>>>
>>>>Hi,
>>>>
>>>>I poked at this today and I don't quite get the math to work.
>>>>I looked at the Unix Squeak 3.4-1 VM, so that might distort things.
>>>>
>>>>The basis of the clock in interp.c seems to be in milliseconds.  I'm
>>>>getting that idea from comments and the definition of ioMSecs in
>>>>sqXWindow.c.  So, given that:
>>>>
>>>>checkForInterrupts (interp.c) is called approximately every 1000 bytecodes
>>>>executed.  If it hasn't been 3 milliseconds since the last call, this
>>>>1000 is scaled up; if it's been more, it is scaled down.  We seem
>>>>to be trying to check for interrupts every 3 milliseconds.  This is
>>>>controlled by interruptChecksEveryNms.
>>>>
>>>>Then, if it has been 500 milliseconds (this seems high) since the
>>>>last poll, checkForInterrupts calls ioProcessEvents().
>>>>This comes from these bits in checkForInterrupts():
>>>>if (now >= nextPollTick) {
>>>>	ioProcessEvents();
>>>>	nextPollTick = now + 500;
>>>>}
>>>>
>>>>ioProcessEvents (sqXWindow.c)  calls aioPoll.
>>>>
>>>>aioPoll (aio.c) calls select and then calls the aio handler for
>>>>each descriptor that has pending I/O.
>>>>
>>>>sqSocketListenOnPortBacklogSize has set up the socket with a listen
>>>>and then enabled an aio handler.
>>>>
>>>>TcpListener has created a socket with a backlog default of 10.
>>>>
>>>>This implies that every 500 milliseconds (i.e., twice per second) we can
>>>>get 10 accepts, or 20 connections per second, while processing
>>>>is happening.
>>>>
>>>>On top of that, events are polled when we call
>>>>relinquishProcessorForMicroseconds, which is called by the idle
>>>>process.  This would be called with a wait of 1 millisecond
>>>>(1000 microseconds).  I'm not sure how often this is called, and it
>>>>depends on the load in Squeak as well.
>>>>
>>>>I guess then that the additional 30 connections per sec
>>>>would be accepted while the idle process is running.
>>>>
>>>>It would seem then that a higher backlog in TcpListener might
>>>>give higher connections per second.  If someone has a test bed
>>>>set up that shows the 50 connections per sec, what happens if
>>>>you doIt
>>>>
>>>>TcpListener backLogSize: 1000
>>>>
>>>>do things work?  Do you get more connections per sec?  Do things
>>>>hang?
>>>>
>>>>I'm obviously not sure the above is correct.  With luck someone
>>>>with a chance of knowing will speak up :-)
>>>>
>>>>cheers
>>>>
>>>>bruce
>>>>




