Image freezing problem

Adrian Lienhard adi at netstyle.ch
Thu Jul 26 17:06:04 UTC 2007


Hi,

We recently ran into the image freezing problem, where connecting
with VNC and moving the mouse brings the image back to life. While
stuck, Squeak hogs the CPU; memory consumption stays stable.

There have been various reports matching the described symptoms, e.g.,
http://lists.squeakfoundation.org/pipermail/squeak-dev/2006-August/107618.html

When Andreas' delay issue came up, it seemed obvious that this was
exactly the same problem. But the delay fix does not resolve it.
So, after quite a few hours of debugging, here is an explanation of
what is going on (kudos to Marcus, who helped me put the last pieces
of the puzzle together).

The problem occurred when producing a PDF of around 800 pages in a
background process (priority 25). In production, this process runs
alongside a Seaside application (running at priority 30) and involves
accessing the DB (PostgreSQL). However, I managed to come up with a
reproducible setup without Seaside requests and without DB access.
All it takes is a CPU- and memory-intensive (~130MB) background
process that does some logging to the file system (a rough sketch of
such a load follows below).
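
For reference, here is a minimal sketch of the kind of load I mean.
This is not the actual production code, just an illustration of the
pattern: lots of short-lived allocations plus some file logging,
forked at priority 25:

  "Hypothetical stress load, not the real report generator: allocate
  plenty of garbage and occasionally write to a log file."
  [| log |
   log := FileStream forceNewFileNamed: 'stress.log'.
   [1 to: 500000 do: [:i |
       | junk |
       junk := (1 to: 100) collect: [:j | String new: 1000].
       i \\ 1000 = 0 ifTrue: [log nextPutAll: i printString; cr]]]
       ensure: [log close]] forkAt: 25.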


Setup:
-----------------------------------------

The problem appeared in production with an older VM, so I decided to
build a new one with the most recent fixes:

- VM built from VMM 3.8b6 from SM + VmUpdates-100[1-6]-dtl.1.cs + JMM-VmUpdates32bitclean.2.cs
- Squeak 3.7 image + Andreas' delay fix
- gcc 2.95.4, Linux 2.4.18-686-smp #1 SMP i686
- Squeak is started headful in VNC

For tracing, I added the TraceMessageSends instrumentation from John
and extended it with some more data. (BTW, I vote to integrate this
by default, since it, together with the signaling facility, proved
very useful!)


Explanation for the cause:
-----------------------------------------

The following message trace was started when the image got stuck and
was stopped again a few seconds after Squeak continued working
normally:

http://www.netstyle.ch/dropbox/messageTraceFile-4.txt.gz (4MB)

The extended format of the trace is: proc, ctxt, statIncrGCs,  
statTenures, statFullGCs, ioMSecs

From this trace, the problem looks to be explainable as follows:
- The time elapsed between two calls to WorldState>>interCyclePause:
is longer than 20ms. The cause is not the UI process itself using up
a lot of time, but that there are 3 incremental GCs consuming 6 to
8ms each within each UI cycle.
- Therefore, the UI process does not wait on the delay in
WorldState>>interCyclePause: (see the paraphrased pause logic right
after this list).
- Because the UI main loop only does a yield (see Project
class>>spawnNewProcess), the UI process stays runnable all the time,
as there is no other process at priority 40.
- Therefore, no process with priority < 40 ever gets a chance to be
activated (only higher-priority ones, which we do find in the trace).
This also explains why we see 100% CPU usage while the UI still
responds immediately.
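
For reference, the pause logic looks roughly like this (paraphrased
from memory, not copied verbatim from the 3.7 image):

  interCyclePause: milliSecs
      "Wait only for whatever is left of milliSecs since the last
      cycle started. If the cycle itself already took longer (here
      because of the incremental GCs), no wait happens at all."
      | wait |
      lastCycleTime ifNotNil: [
          wait := lastCycleTime + milliSecs - Time millisecondClockValue.
          wait > 0 ifTrue: [(Delay forMilliseconds: wait) wait]].
      lastCycleTime := Time millisecondClockValue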

So, while the UI process runs, the incremental GCs use up most of the
time and keep the UI process active. There is just not enough space
being requested to trigger a full GC.
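
To watch this happening while the image is stuck, one can poll the GC
counters the trace columns are based on from a process above the UI
priority. The parameter indices below are from memory; please
double-check them against the vmParameterAt: comment in your image
(and vmParameterAt: may live on SmalltalkImage current in newer
images):

  "Log full GC / incremental GC / tenure counts once per second for a
  minute (the Transcript may not redraw while the UI is busy, so
  write to a file). Assumed indices: 7 = full GCs, 9 = incremental
  GCs, 11 = tenures. Forked at priority 50 so it keeps running even
  while the UI process spins at 40."
  [| log |
   log := FileStream forceNewFileNamed: 'gcCounters.log'.
   [60 timesRepeat: [
       log nextPutAll: (Smalltalk vmParameterAt: 7) printString;
           nextPutAll: ' '; nextPutAll: (Smalltalk vmParameterAt: 9) printString;
           nextPutAll: ' '; nextPutAll: (Smalltalk vmParameterAt: 11) printString;
           cr; flush.
       (Delay forSeconds: 1) wait]]
       ensure: [log close]] forkAt: 50.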

Now, why does moving the mouse make it run again? I have no idea...
my guess is that the behavior triggered by a mouse move event somehow
forces a full GC. In the trace we see that after the 107th full GC is
done, there are far fewer incremental GCs. Hence, it is much more
likely that the UI process pauses again.
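
One way to test this guess would be to force a full GC from a process
above the UI priority the next time the image is stuck, and see
whether that alone is enough to get it going again:

  "Force a full GC in ten seconds, from priority 50 so it runs even
  while the UI process spins."
  [(Delay forSeconds: 10) wait. Smalltalk garbageCollect] forkAt: 50.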


How could we fix this?
-----------------------------------------
a) Simply increase the 20ms pause defined by MinCycleLapse (at least
for production systems), or tweak the "pause logic". As a test, I set
MinCycleLapse to 40ms (see the doit sketch below) and could not
reproduce the problem anymore.
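
For those who want to try this without editing the class
initialization, a doit along these lines should do it (assuming
MinCycleLapse is still a plain class variable of WorldState in your
image):

  "Bump the minimum inter-cycle pause from 20ms to 40ms."
  WorldState classPool at: #MinCycleLapse put: 40.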

b) In production, suspend the UI process and only turn it on again
when you need it (we do this via HTTP; see the sketch below). This
should also improve performance a bit. At best, this is a workaround.
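
The basic mechanics look like this; in the sketch the resume is
triggered from a Delay just for illustration, whereas in production
it is triggered from an HTTP handler:

  "Evaluate from within the UI (e.g., a workspace doit): suspend the
  UI process and resume it again after 60 seconds."
  | ui |
  ui := Processor activeProcess.
  [(Delay forSeconds: 60) wait. ui resume] forkAt: 50.
  ui suspend.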

c) Tune the GC policies, as they are far from optimal for today's
systems (as John has suggested a couple of times). It seems, though,
that this cannot be guaranteed to fix the problem, but it should make
it less likely to happen(?) A rough tuning sketch follows after the
statistics below.
I also took GC statistics while running the process (only part of it,
and not while it was stuck):
- http://www.netstyle.ch/dropbox/gcStats-1.txt.gz: standard
configuration (was very slow)
- http://www.netstyle.ch/dropbox/gcStats-2.txt.gz: runActive with
setGCBiasToGrowGCLimit: 16*1026*1024
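
As for the tuning itself, the kind of doit I have in mind is below.
The parameter indices are from memory and the numbers are
placeholders, not recommendations; they would need tuning per
workload (and the bias-to-grow calls may live on SmalltalkImage
current, depending on the image):

  "Sketch of GC policy tuning. Assumed indices: 5 = allocations
  between incremental GCs, 6 = tenuring threshold (survivor count)."
  Smalltalk vmParameterAt: 5 put: 40000.
  Smalltalk vmParameterAt: 6 put: 4000.
  "Bias-to-grow: grow object memory instead of doing a full GC, up
  to the given limit (16MB here just as an example value)."
  Smalltalk setGCBiasToGrow: 1.
  Smalltalk setGCBiasToGrowGCLimit: 16*1024*1024.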

d) ?


I'd be interested in getting feedback on:
- whether the explanation sounds plausible
- whether the fix (e.g., a)) works for other people who have this problem
- what might be a good fix

Cheers,
Adrian

___________________
Adrian Lienhard
www.adrian-lienhard.ch




