Image freezing problem

Thu Jul 26 21:40:08 UTC 2007

On Jul 26, 2007, at 19:28 , Andreas Raab wrote:

>> For tracing, I added the TraceMessageSends instrumentation from  
>> John and extended it with some more data. (BTW, I vote to  
>> integrate this by default as it proved, with the signaling  
>> facility, very useful!)
>
> I think we need a "production Squeak" list or somesuch. We have a  
> few more of these things in our images and VMs that are immensely  
> helpful for tracking down issues like this and that sounds just  
> like another one that a production server really needs to have.

yep, I agree

>> The following message trace was started when the image got stuck  
>> and it was stopped again a few seconds after Squeak continued  
>> working normal again:
>> http://www.netstyle.ch/dropbox/messageTraceFile-4.txt.gz (4MB)
>> The extended format of the trace is: proc, ctxt, statIncrGCs,  
>> statTenures, statFullGCs, ioMSecs
>>  From this trace, the problem looks to be explainable as follows:
>> - The time elapsed between two calls to  
>> WorldState>interCyclePause: is longer than 20ms. The cause for  
>> this is not the UI process using up a lot of time, but that there  
>> are 3 incr. GCs consuming 6 to 8ms each within each UI cycle.
>
> Is this a low-end server? 6-8ms is a *lot* of time being spent in  
> IGC - we typically see 6-8ms only after increasing the default GC

its a quite old server with 2 x 1.4 Ghz CPUs.

> parameters (# of allocations between GCs, various thresholds) by  
> roughly a factor of ten. Also, the above means that effectively all  
> or almost all of the time is spent in those GCs which seems doubtful

Let's see. This is an excerpt from the very beginning of the trace,  
showing one UI cycle. No other process was active in this period.
The numbers indicate: proc, ctxt, statIncrGCs, statTenures,  
statFullGCs, ioMSecs

1145108440 1189307120 58988 5419 106 796524 WorldState>interCyclePause:
....
1145108440 1189307696 58988 5419 106 796524  
EventSensor>nextEventFromQueue
1145108440 1189302556 58989 5419 106 796531 SharedQueue>isEmpty
....
1145108440 1189304704 58989 5419 106 796537  
UserInputEvent>transformedBy:
1145108440 1189304084 58990 5419 106 796544  
UserInputEvent>transformedBy:
....
1145108440 1189307680 58990 5419 106 796547  
HandMorph>noticeMouseOver:event:
1145108440 1189303900 58991 5419 106 796555  
MouseOverHandler>noticeMouseOver:event:
...
1145108440 1189307120 58991 5419 106 796556 WorldState>interCyclePause:

- The elapsed time between the start and end of the cycle is 796556 -  
796524 = 32
- There has been 3 IGCs consuming: 7 + 7 + 8 = 22

Now, the last conclusion may be debatable as I just looked at the  
difference between the clock value of the two messages between which  
the statIncrGCs was increased.

> unless you have a busy UI (we had this problem once when we left a  
> Transcript open, which is a terribly bad idea on a server image).

no no, the UI is not busy, a workspace is open and maybe a code  
browser (indeed, leaving the Transcript open is deadly).

[...]

>> How could we fix this?
>> -----------------------------------------
>> a) Simply increase the 20ms pause defined by MinCycleLapse (at  
>> least for production systems) or tweak the "pause logic". As a  
>> test I defined MinCycleLapse to be 40ms. I could not reproduce the  
>> problem anymore.
>> b) In production, suspend the UI process and only turn it on again  
>> when you need it (we do this via HTTP). This should also improve  
>> performance a bit. At best this is a workaround.
>> c) Tune the GC policies as they are far from optimal for today's  
>> systems (as John has suggested a couple of times). It seems,  
>> though, that this cannot guarantee to fix the problem but it  
>> should make it less likely to happen(?).
>
> d) Don't use processes that run below user scheduling priority. To  
> be honest, I'm not sure why you'd be running anything below UI  
> priority on a server.

hm, Kom starts processes with priority 30, I guess to keep the UI  
responsive. The long running process I have here should be of lower  
priority than the process handling the Seaside requests. Actually, I  
don't care how long it takes, as long as it does not slow down or  
even block the rest.

> e) Make a preference "lowerPerformance" (or call it "headlessMode"  
> if you wish :^) and have the effect be that in intercyclePause: you  
> *always* wait for a certain amount of time (20-50ms). This will  
> ensure that your UI process can't possibly eat up all of your CPU.

yes, sounds like a reasonable fix (apart from that we have yet  
another preference lookup in the UI loop...)

Cheers,
Adrian

>
>> I'd be interested in getting feedback on:
>> - whether the explanation sounds plausible
>
> It does. There is however a question what could possibly generate  
> enough load on the garbage collector to run IGCs that take on  
> average 7ms and run three of them in a single UI cycle.
>
>> - whether the fix (e.g., a)) workes for other people that have  
>> this problem.
>> - what may be a good fix
>
> I'd probably go with option e) above since it ensures that there is  
> always time left for the lower priority processes (and you don't  
> have to change other code). Everything else seems risky since you  
> can't say for sure if there isn't anything that keeps the UI in a  
> busy-loop.
>
> Cheers,
>   - Andreas
>