Janko and I have been doing some fiddling around and monitoring today, and I am beginning to develop a theory.
First, meet some images:
Image 1: SqueakMap (map.squeak.org). This is our oldest Squeak service image, with the possible exception of wiki.squeak.org. It is a 3.8 image running on a 3.8 VM. SqueakMap saves its data to a separate file, and snapshots of the image are only made when source code changes occur, i.e., manually.
Image 2: SqueakSource (source.squeak.org). This is a 3.11 image running on a 3.11 VM. This image saves its data by snapshotting hourly.
Image 3: www.squeak.org. This is a 3.9 image that had been running for some time on the same 3.8 VM as SqueakMap; a short while ago we decided to try it on the 3.11 VM that SqueakSource uses, and it has been running fine so far.
So for a few hours now I have been monitoring the memory usage and GC behavior of these three images. The short story is that SqueakMap is almost completely rock solid, but the SqueakSource image, despite my earlier claims to the contrary, appears to suffer a much-reduced form of the problem we are observing on www.squeak.org.
To reiterate: the problem is that the resident set size (RSS) of the process appears to grow continually. The operating system will occasionally swap some of this out, giving the impression that the RSS has dropped, but I'm almost certain that the total cost (RAM + swap) of the process stays the same and continues to increase. With all the various services we are running, the available RAM and swap on the server are quite limited.
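For anyone who wants to watch this themselves, one way to track the total cost (RSS plus swap) of a process from the outside is to read /proc/<pid>/status on Linux. This is just a sketch of that approach, not the exact tooling I used; note that the VmSwap line is only reported by newer kernels, so it is treated as optional here.

```python
# Sketch: read a process's resident and swapped-out size from
# /proc/<pid>/status on Linux. VmSwap is only present on newer
# kernels, so it defaults to 0 when missing.
import os
import re

def memory_cost(pid):
    """Return (rss_kb, swap_kb) for `pid`."""
    rss_kb, swap_kb = 0, 0
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            m = re.match(r"(VmRSS|VmSwap):\s+(\d+)\s+kB", line)
            if m:
                if m.group(1) == "VmRSS":
                    rss_kb = int(m.group(2))
                else:
                    swap_kb = int(m.group(2))
    return rss_kb, swap_kb

if __name__ == "__main__":
    # Demo on our own PID; point it at the VM's PID in practice.
    rss, swap = memory_cost(os.getpid())
    print("RSS: %d kB, swap: %d kB, total: %d kB" % (rss, swap, rss + swap))
```

Run it against the VM's PID in a loop and the total should tell you whether the swapped-out pages are really gone or just hidden.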
What I'm observing is a discrepancy between what the VM claims is the end of memory and the RSS. When any of these images starts, this discrepancy is quite small, maybe a megabyte. But whenever a GC occurs, the end-of-memory figure drops while the RSS goes up. So from the image's side it all appears copacetic: we cleaned up the garbage and we are more or less back at our base memory usage level. But as far as the operating system is concerned nothing has happened; in fact, more memory has been claimed.
So my theory is that somehow, when a GC occurs on the Linux VMs (both the 3.8 and 3.11 VMs), the freed-up memory is not released to the operating system, and new objects seem to go into a different memory region from the operating system's perspective, therefore using additional memory on top of what was used before the GC. I only really observe this on full GCs. When few if any full GCs occur, there is no real problem. But when you do things like snapshot the image hourly, it all adds up to several megabytes a day of lost RAM.
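The general effect is easy to demonstrate in other runtimes, for what it's worth: user-space allocators typically keep freed pages on their own free lists rather than returning them to the kernel, especially when the heap is fragmented, so RSS stays high even after the program has "freed" a lot of memory. Here is a small Python (CPython) demonstration of that pattern; this is an analogy, not a claim about what the Squeak VM specifically does internally.

```python
# Sketch of the suspected effect in a different runtime (CPython):
# freeing fragmented small objects usually does not shrink RSS,
# because the allocator holds onto the pages instead of handing
# them back to the kernel. Linux-only (reads /proc).
import os

def rss_kb():
    """Current resident set size in kB, from /proc/<pid>/status."""
    with open("/proc/%d/status" % os.getpid()) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

before = rss_kb()
# Allocate around a million small objects...
objs = [str(i) for i in range(1000000)]
after_alloc = rss_kb()
# ...then "collect" every other one, leaving the heap fragmented.
del objs[::2]
after_free = rss_kb()

print("before: %d kB, after alloc: %d kB, after free: %d kB"
      % (before, after_alloc, after_free))
```

On my understanding, RSS after the deletion stays close to the post-allocation figure, which is the same shape of behavior I'm seeing after a full GC: the image thinks the memory is free, the OS disagrees.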
This may not be quite right; I'm still observing, and of course this theory is built out of black box observation and little else. Feel free to set me straight.
Ken
vm-dev@lists.squeakfoundation.org