Image freezing problem

Andreas Raab andreas.raab at gmx.de
Thu Jul 26 17:28:30 UTC 2007


Adrian Lienhard wrote:
> When seeing Andreas' delay issue, it looked obvious that it was 
> exactly the same problem. But the fix does not resolve it.

I didn't expect it to. If the Delay problem hits, VNC connections 
typically no longer work (which makes the mystery all the more interesting).

> For tracing, I added the TraceMessageSends instrumentation from John and 
> extended it with some more data. (BTW, I vote to integrate this by 
> default as it proved, with the signaling facility, very useful!)

I think we need a "production Squeak" list or somesuch. We have a few 
more of these things in our images and VMs that are immensely helpful 
for tracking down issues like this and that sounds just like another one 
that a production server really needs to have.

> The following message trace was started when the image got stuck and it 
> was stopped again a few seconds after Squeak continued working normally 
> again:
> 
> http://www.netstyle.ch/dropbox/messageTraceFile-4.txt.gz (4MB)
> 
> The extended format of the trace is: proc, ctxt, statIncrGCs, 
> statTenures, statFullGCs, ioMSecs
> 
> From this trace, the problem looks to be explainable as follows:
> - The time elapsed between two calls to WorldState>>interCyclePause: is 
> longer than 20ms. The cause for this is not the UI process using up a 
> lot of time, but that there are 3 incr. GCs consuming 6 to 8ms each 
> within each UI cycle.

Is this a low-end server? 6-8ms is a *lot* of time being spent in IGC - 
we typically see 6-8ms only after increasing the default GC parameters 
(# of allocations between GCs, various thresholds) by roughly a factor 
of ten. Also, the above means that effectively all or almost all of the 
time is spent in those GCs, which seems doubtful unless you have a busy 
UI (we had this problem once when we left a Transcript open, which is a 
terribly bad idea on a server image).

> - Therefore, the UI process does not wait on the delay in 
> WorldState>>interCyclePause:
> - Because the UI main loop only does a yield (see Project 
> class>>spawnNewProcess) the UI process therefore stays runnable all the 
> time as there is no other process with p = 40.
> - Therefore, no process with p < 40 has a chance to be activated (only 
> higher ones, which we find in the trace). This also explains why we see 
> 100% CPU usage, but still the UI responds immediately.

This sounds like a reasonable explanation.
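
The quoted explanation can be checked with a little arithmetic. The sketch below is a simplified model, not the actual Squeak source: it assumes the pause only tops a UI cycle up to MinCycleLapse (20ms), and that a process below UI priority gets the CPU only while the UI process is blocked on that Delay. The function names are made up for illustration.

```python
MIN_CYCLE_LAPSE_MS = 20  # the 20ms pause the thread calls MinCycleLapse


def inter_cycle_wait(cycle_ms, min_lapse_ms=MIN_CYCLE_LAPSE_MS):
    """Milliseconds the UI process would sleep after a cycle of cycle_ms.

    Assumed model: the pause only tops the cycle up to min_lapse_ms, so a
    cycle that already took that long gets no Delay at all.
    """
    return max(0, min_lapse_ms - cycle_ms)


def lower_priority_runs(cycle_ms):
    """True if a process below UI priority gets a chance to run.

    Assumed model: a plain yield only hands control to processes of the
    same priority, so lower ones run only while the UI waits on the Delay.
    """
    return inter_cycle_wait(cycle_ms) > 0


if __name__ == "__main__":
    # Three incremental GCs per UI cycle at 6 to 8 ms each, as in the trace.
    # Time spent in the UI code itself is ignored here.
    for gc_ms in (6, 7, 8):
        cycle_ms = 3 * gc_ms
        print(gc_ms, cycle_ms, inter_cycle_wait(cycle_ms),
              lower_priority_runs(cycle_ms))
```

Under these assumptions, three GCs at 7ms or more already use up the whole 20ms budget, so the UI process never blocks and nothing below its priority is scheduled.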

> Now, why does moving the mouse make it run again? I have no idea... my 
> guess is that the triggered behavior of a mouse move event somehow 
> forces a full GC. In the trace we see that when the 107th full GC is 
> done, there are much fewer incr. GCs later on. Hence, it is much more 
> likely that the UI process pauses again.

Tenuring might fix it, too. And it may just be that your wiggling the 
mouse creates just enough extra garbage to make the VM tenure.

> How could we fix this?
> -----------------------------------------
> a) Simply increase the 20ms pause defined by MinCycleLapse (at least for 
> production systems) or tweak the "pause logic". As a test I defined 
> MinCycleLapse to be 40ms. I could not reproduce the problem anymore.
> 
> b) In production, suspend the UI process and only turn it on again when 
> you need it (we do this via HTTP). This should also improve performance 
> a bit. At best this is a workaround.
> 
> c) Tune the GC policies as they are far from optimal for today's systems 
> (as John has suggested a couple of times). It seems, though, that this 
> cannot be guaranteed to fix the problem but it should make it less 
> likely to happen(?).

d) Don't use processes that run below user scheduling priority. To be 
honest, I'm not sure why you'd be running anything below UI priority on 
a server.

e) Make a preference "lowerPerformance" (or call it "headlessMode" if 
you wish :^) and have the effect be that in interCyclePause: you 
*always* wait for a certain amount of time (20-50ms). This will ensure 
that your UI process can't possibly eat up all of your CPU.
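
To compare options a) and e), here is a rough sketch of how much wall-clock time each pause policy leaves for processes below UI priority. It uses the same simplified model as the discussion (the current pause only tops a cycle up to MinCycleLapse); the function names and the 24ms and 45ms cycle times are illustrative assumptions, not measurements.

```python
def adaptive_wait(cycle_ms, min_lapse_ms=20):
    """Assumed current behaviour: sleep only for what is left of min_lapse_ms."""
    return max(0, min_lapse_ms - cycle_ms)


def fixed_wait(cycle_ms, pause_ms=20):
    """Option e): always sleep pause_ms, however long the cycle took."""
    return pause_ms


def idle_share(cycle_ms, wait_fn):
    """Fraction of wall-clock time during which the UI process is blocked,
    i.e. the time available to lower-priority processes."""
    wait = wait_fn(cycle_ms)
    return wait / (cycle_ms + wait)


def raised_lapse(cycle_ms):
    """Option a): keep the adaptive logic but raise MinCycleLapse to 40ms."""
    return adaptive_wait(cycle_ms, min_lapse_ms=40)


if __name__ == "__main__":
    for cycle_ms in (24, 45):  # hypothetical busy UI cycles
        print(cycle_ms,
              round(idle_share(cycle_ms, adaptive_wait), 2),
              round(idle_share(cycle_ms, raised_lapse), 2),
              round(idle_share(cycle_ms, fixed_wait), 2))
```

In this model a larger MinCycleLapse helps only as long as a cycle stays under the new limit, while the fixed pause leaves a non-zero share for lower-priority processes at any cycle length.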

> I'd be interested in getting feedback on:
> - whether the explanation sounds plausible

It does. There is, however, the question of what could possibly generate 
enough load on the garbage collector to run IGCs that take on average 
7ms and run three of them in a single UI cycle.

> - whether the fix (e.g., a)) works for other people that have this 
> problem.
> - what may be a good fix

I'd probably go with option e) above since it ensures that there is 
always time left for the lower priority processes (and you don't have to 
change other code). Everything else seems risky since you can't say for 
sure if there isn't anything that keeps the UI in a busy-loop.

Cheers,
   - Andreas


