Quick summary: The default parameters for the GC are a poor choice for RAM-rich modern computers. A 3x performance gain on GC is obtainable, if you're willing to dedicate RAM to Squeak. (My desktop has 768 MB, so I think nothing of running with 100-500 MB.)
The current Squeak GC does an incremental collection every 4000 allocations. Although this can lead to sub-ms latency, it is excessive, especially for long-running applications or for many GUI applications.
So, running macrobenchmarks, and counting GC time, I get:
22 seconds of GC time (only GC when you absolutely must; 500 MB RAM).
60 seconds of GC time (GC every 4000 allocations).
+15 additional seconds of fullGCs in either case [*]
A typical run of macrobenchmarks is about 360 seconds on the P2-450.
I'm not sure where the threshold is at which GC time drops by 2/3, but the fact that it does is an indication that other people can see similar gains. I even see some gain at 50 MB and 200k allocs/GC.
[*] These additional GCs occur because of root table overflows; with RootTableOverflow.cs, they drop to about 1/4 second total.
I typically run with:
Smalltalk vmParameterAt: 5 put: 400000.
Smalltalk vmParameterAt: 6 put: 12000.
which gives me <100 ms incrGC latency. But, given average object size, that's about 10 MB per 100k objects, so you'll want to run with 40 MB of free space to exploit those parameters, and more memory for higher numbers.
Note that SystemDictionary>>setGCParameters resets these values to the default ones on each restart. You will have to change them there.
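A hedged sketch of doing exactly that: the method below assumes SystemDictionary>>setGCParameters simply issues vmParameterAt:put: calls (as the doits above do), and it reuses the 400000/12000 values quoted above rather than claiming they are right for every machine.

```smalltalk
"Sketch of an edited SystemDictionary>>setGCParameters, so the tuned
values survive image restarts. Parameter 5 is the allocation count
between incremental GCs; the meaning of parameter 6 follows its use
in this thread - check your VM's vmParameterAt: source."
setGCParameters
	Smalltalk vmParameterAt: 5 put: 400000.	"allocations between incrGCs"
	Smalltalk vmParameterAt: 6 put: 12000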
Scott A Crosby crosby@qwes.math.cmu.edu is claimed by the authorities to have written:
Quick summary: The default parameters for the GC are a poor choice for RAM-rich modern computers. A 3x performance gain on GC is obtainable, if you're willing to dedicate RAM to Squeak. (My desktop has 768 MB, so I think nothing of running with 100-500 MB.)
Well, lucky you; my machines don't have that much memory, and in fact my main machine doesn't even use virtual memory. PDAs, embedded devices, network computers and so on are likewise memory-limited. Let's remember that before leaping to any changes.
I typically run with:
Smalltalk vmParameterAt: 5 put: 400000.
Smalltalk vmParameterAt: 6 put: 12000.
Which is of course why those parameters are settable. Maybe some startup code to work out sensible machine defaults would be a good idea. VW has allowed GC tuning for years - actually over a decade - with exquisitely precise policy capabilities.
Almost nobody ever uses them :-(
Almost everybody complains that the defaults don't work.
I'm not at all certain that the current Squeak memory system is really suited to huge memories. I can't help thinking that at least one more generation would be a good idea, along with perhaps a space for large non-pointer objects (like bitmaps). Which is why, for years, I've wanted to get a chance to develop a modular memory & execution engine that can be configured to suit the intended use. Funding offers to the usual email address.
tim
code to work out sensible machine defaults would be a good idea. VW has allowed GC tuning for years - actually over a decade - with exquisitely precise policy capabilities.
Almost nobody ever uses them :-(
Almost everybody complains that the defaults don't work.
The defaults have changed recently, mostly because the defaults, at 12-odd years old, were a bit long in the tooth. And you are right: no one ever looks at the memory policy logic until it's much too late.
However, I've been thinking off and on about Squeak, and I guess I should propose a few VM changes to make some data collection easier, then code up a memory policy creature to adjust things dynamically.
There are a few places where we need to collect millisecond information, and some statistics on loop counters. With that, one can fiddle all one wants. How that affects an application is anyone's guess.
PS: back to my GC paper, where I say I'm waiting for the day when you can allocate a few trillion bytes, run the VM, quit, and never need to do a GC. That's a more effective solution.
John M McIntosh johnmci@smalltalkconsulting.com is claimed by the authorities to have written:
PS: back to my GC paper, where I say I'm waiting for the day when you can allocate a few trillion bytes, run the VM, quit, and never need to do a GC. That's a more effective solution.
That was once a serious suggestion in a paper I read; the suggestion was that virtual memory solved the problem since all you had to do was page out garbage. Slight problem with the (then extant) data rates & disk sizes - Smalltalk would have filled the disk in a few minutes :-)
It might take a little longer these days, but still not long enough to be interesting, I think.
tim
On Tue, 29 Jan 2002, Tim Rowledge wrote:
Scott A Crosby crosby@qwes.math.cmu.edu is claimed by the authorities to have written:
Quick summary: The default parameters for the GC are a poor choice for RAM-rich modern computers. A 3x performance gain on GC is obtainable, if you're willing to dedicate RAM to Squeak. (My desktop has 768 MB, so I think nothing of running with 100-500 MB.)
Well, lucky you; my machines don't have that much memory, and in fact my main machine doesn't even use virtual memory. PDAs, embedded devices, network computers and so on are likewise memory-limited. Let's remember that before leaping to any changes.
I went from 256 MB to 768 MB for $80, 4 months ago.
Which is of course why those parameters are settable. Maybe some startup code to work out sensible machine defaults would be a good idea. VW has allowed GC tuning for years - actually over a decade - with exquisitely precise policy capabilities.
Almost nobody ever uses them :-(
Almost everybody complains that the defaults don't work.
Heh, perhaps it would be good for the profiler to include, on the profiling page, a note along the lines of ``your program spends excessive time in GC; have you considered setting your GC parameters to: ...''
Or have them set dynamically, based on how much RAM you are dedicating to Squeak and on some menu item ('is this image high-latency, mid-latency, or low-latency?'), and have it autodetect whether it's a small (<20 MB), medium (30-70 MB), large (100-200 MB), or very large (>300 MB) image.
This might help it scale to larger images in a more user-friendly way. Have the profiler spit out suggestions if it thinks GC time is excessive. That way they won't be missed.
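A minimal sketch of such an autodetection heuristic, entirely hypothetical: suggestGCParametersForImageMB: is an invented selector, the size brackets are the ones proposed above, and the chosen allocation counts are illustrative, not measured recommendations.

```smalltalk
"Hypothetical helper: choose the allocations-between-incrGCs setting
(VM parameter 5) from the image-size brackets proposed above."
suggestGCParametersForImageMB: imageMB
	| allocs |
	allocs := imageMB < 20
		ifTrue: [4000]	"small: keep the low-latency default"
		ifFalse: [imageMB < 70
			ifTrue: [40000]	"medium"
			ifFalse: [imageMB < 200
				ifTrue: [200000]	"large"
				ifFalse: [400000]]].	"very large"
	Smalltalk vmParameterAt: 5 put: allocs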
I'm not at all certain that the current Squeak memory system is really suited to huge memories. I can't help thinking that at least one more generation would be a good idea, along with perhaps a space for large binary (non-reference-containing) objects.
I've thought that over some... I'm not sure where an extra generation helps. It introduces an extra range check into every write. You'd want to use conservative approximations.
I can see how storing large binary objects externally (through malloc or something similar) would work; primarily, it would save them from being moved in a compaction.
--
A lot depends on what the actual runtime statistics are like for the different stages of GC. I should examine these numbers sometime. Is GC a significant problem? (Let's wait for both BC and my method cache before answering that question.)
Really, what's probably needed eventually is dynamic feedback with the interpreter. If you can tell, based on where the allocation occurs, the GC behavior of the object, you can be more clever with allocation policies.
Eh... I'm not going to touch anything, unless it seems really broken, until BC *and* my new method cache *and* the root table overflow fix are in a VM I can benchmark and profile, to see the costs of GC, or whatever else is a significant performance issue.
If someone has tried integrating these and has a Linux image with all of the above that can generate a valid VM, I'd appreciate a copy of the image & build tree. If the GC is bad or there are some other performance problems, I might have time to do some work on it by V4. Has the VMMaker tree stabilized?
The other question is: what is the hoped-for future of Squeak? Something intended to be an applications-development platform, with 100 MB images, or something intended to run on handhelds and portables?
Scott
On Tuesday, January 29, 2002, at 11:15 PM, Scott A Crosby wrote:
Quick summary: The default parameters for the GC are a poor choice for RAM-rich modern computers. A 3x performance gain on GC is obtainable, if you're willing to dedicate RAM to Squeak. (My desktop has 768 MB, so I think nothing of running with 100-500 MB.)
The current Squeak GC does an incremental collection every 4000 allocations. Although this can lead to sub-ms latency, it is excessive, especially for long-running applications or for many GUI applications.
So, running macrobenchmarks, and counting GC time, I get:
22 seconds of GC time (only GC when you absolutely must; 500 MB RAM).
60 seconds of GC time (GC every 4000 allocations).
+15 additional seconds of fullGCs in either case [*]
A typical run of macrobenchmarks is about 360 seconds on the P2-450.
I'm not sure where the threshold is at which GC time drops by 2/3, but the fact that it does is an indication that other people can see similar gains. I even see some gain at 50 MB and 200k allocs/GC.
Just a thought: since the macro-benchmarks are doing full-GCs between each benchmark (not timed), could it be that the performance improvements are just artifacts of the benchmarking process?
In essence, you're getting a GC that's 'free' between each of the benchmarks. So if you can delay the need for a GC until the end of each benchmark, then you will have 0 GC overhead in the benchmarks, without any accompanying real-world gain.
Marcel
-- Marcel Weiher Metaobject Software Technologies marcel@metaobject.com www.metaobject.com Metaprogramming for the Graphic Arts. HOM, IDEAs, MetaAd etc.
On Mon, 18 Feb 2002, Marcel Weiher wrote:
Just a thought: since the macro-benchmarks are doing full-GCs between each benchmark (not timed), could it be that the performance improvements are just artifacts of the benchmarking process?
Yep... I checked for that and hacked up macrobenchmarks, first to not preallocate all but 10 MB of RAM, and second to not GC at all, and got a slightly greater speedup. (But this is slightly cheating, as it leaves 10-20 MB of garbage behind in oldspace, which I must count by amortizing it as about an additional second or two of GC time to remove later.)
Scott
Interval had a great submillisecond GC that was terrific for real-time apps. Maybe we can get this from them now that they have donated their work to Stanford (per Glenn Edens).
Don't forget that the current GC (and any Squeak GC) has to handle real time demands for animation and music, etc.
Cheers,
Alan
------
At 12:41 PM +0100 2/18/02, Marcel Weiher wrote:
On Tuesday, January 29, 2002, at 11:15 PM, Scott A Crosby wrote:
Quick summary: The default parameters for the GC are a poor choice for RAM-rich modern computers. A 3x performance gain on GC is obtainable, if you're willing to dedicate RAM to Squeak. (My desktop has 768 MB, so I think nothing of running with 100-500 MB.)
The current Squeak GC does an incremental collection every 4000 allocations. Although this can lead to sub-ms latency, it is excessive, especially for long-running applications or for many GUI applications.
So, running macrobenchmarks, and counting GC time, I get:
22 seconds of GC time (only GC when you absolutely must; 500 MB RAM).
60 seconds of GC time (GC every 4000 allocations).
+15 additional seconds of fullGCs in either case [*]
A typical run of macrobenchmarks is about 360 seconds on the P2-450.
I'm not sure where the threshold is at which GC time drops by 2/3, but the fact that it does is an indication that other people can see similar gains. I even see some gain at 50 MB and 200k allocs/GC.
Just a thought: since the macro-benchmarks are doing full-GCs between each benchmark (not timed), could it be that the performance improvements are just artifacts of the benchmarking process?
In essence, you're getting a GC that's 'free' between each of the benchmarks. So if you can delay the need for a GC until the end of each benchmark, then you will have 0 GC overhead in the benchmarks, without any accompanying real-world gain.
Marcel
-- Marcel Weiher Metaobject Software Technologies marcel@metaobject.com www.metaobject.com Metaprogramming for the Graphic Arts. HOM, IDEAs, MetaAd etc.
--
Alan Kay wrote:
Interval had a great submillisecond GC that was terrific for real-time apps. Maybe we can get this from them now that they have donated their work to Stanford (per Glenn Edens).
We (in practical terms, Craig & I) were given permission to release all the applicable Squeak stuff ages ago. With luck the tranche of stuff handed to Stanford might include a copy of the lost GC code; we've never been able to find a copy anywhere.
It wasn't so much a sub-millisecond GC as a constant slow-drip GC. It's a long time ago, but as I recall it involved every object header having a forward and backward pointer. Basically, 'killing' an object involved moving it from the live doubly-linked list to the dead one - changing 4 pointers. This of course does not handle compaction, and I think we concluded that for most of the purposes we were interested in there was very little need to worry about it too often. But why am I blathering about it - Paul McCullough reads this and can surely explain much more, since he led that sub-project.
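What that move might look like, as a hedged sketch: killOnto: is an invented selector, and prev/next are assumed instance variables with accessors; the real Interval code kept these links in the object header rather than in ordinary slots.

```smalltalk
"Hypothetical treadmill-style 'kill': splice the receiver out of the
live doubly-linked ring and into the dead one purely by re-pointing
links (Tim's 'changing 4 pointers'; the exact count depends on the
bookkeeping). No copying or scanning is involved."
killOnto: deadRing
	prev next: next.	"unlink: fix live neighbour 1"
	next prev: prev.	"unlink: fix live neighbour 2"
	next := deadRing next.	"relink just after the dead ring's head..."
	prev := deadRing.
	next prev: self.	"...fixing up both dead-ring neighbours"
	deadRing next: self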
The other important aspect of the real-time nature of the Interval system was the combined Squeak/native process scheduling that allowed interrupt handlers to be written in Squeak. Oh, and of course the reduced-latency changes. I would note that we had the luxury of not having some dumbf&*k OS in the way to cause problems. As soon as some nincompoop had the bright idea of using a 'proper OS' (WinCE in this case) it all went to hell.
Don't forget that the current GC (and any Squeak GC) has to handle real time demands for animation and music, etc.
Exactly; making sure that that works almost always exacts a cost in aggregate performance. The Interval system was really pretty slow because of all the work it had to do to make sure it was possible to 'change its mind' at a microsecond's notice.
Optimisations for straight-run code are often going to slow down heterogeneous situations (this is also involved in the perennial debate about typing and performance); big caches seem great in benchmarks designed to show the benefits of caches, but look a bit lame in very variable code (viz. SPARC register windows - they look wonderful until you have to change context); adding memory seems great until you have a potential customer that wants to run in 8 MB.
tim
At 07:55 AM 2/18/02 -0800, Alan Kay wrote:
Interval had a great submillisecond GC that was terrific for real-time apps. Maybe we can get this from them now that they have donated their work to Stanford (per Glenn Edens).
Not to put a damper on things, but... The Interval real-time collector (of which I am the author) should be viewed as a prototype. It was based on Baker's Treadmill algorithm -- not the Train algorithm.
Due to some design constraints, and due to running out of time (I was quite ill for a while and then left Interval shortly before they terminated the project) I certainly wouldn't call the collector finished. If Glenn Edens is saying that the collector was sub-millisecond, I have no idea where he got those numbers. On paper yes, in reality no one knows. In fact, due to some bugs in the in-liner, much of the code was never in-lined and it ran much, much slower than the standard Squeak collector. But it did run Squeak for hours without crashing so the concept worked.
Other research shows that real-time collectors use considerably more time to run than 'straightforward' collectors: 5 - 8 times more if I recall correctly. The real-time advantage is that the work is spread out and hence much less disruptive to the running system. But you have to pay the price somewhere.
Before anyone asks, I don't have the source code (which was written in Interval's version of Slang). I know where it was on Interval's servers when I left the company, but I've been asked by Tim Rowledge and others if I have a copy, so it may be lost or somewhere in the backup-ether.
I occasionally have a flash of insight as to how to simplify the collector -- just the other day I realized how to deal with the difficulties of dealing with become: in a much simpler way.
If someone wants to pursue putting a real-time collector into Squeak, I can likely provide some guidance (off-list). One thing that would help tremendously would be to simplify & clarify the interface between the Interpreter and ObjectMemory (at least it would have, years ago; I haven't looked at the interface recently).
paul
Paul --
As usual, the truth is often inconvenient. heh heh
Cheers,
Alan
-----
At 3:08 PM -0800 2/18/02, Paul McCullough wrote:
At 07:55 AM 2/18/02 -0800, Alan Kay wrote:
Interval had a great submillisecond GC that was terrific for real-time apps. Maybe we can get this from them now that they have donated their work to Stanford (per Glenn Edens).
Not to put a damper on things, but... The Interval real-time collector (of which I am the author) should be viewed as a prototype. It was based on Baker's Treadmill algorithm -- not the Train algorithm.
Due to some design constraints, and due to running out of time (I was quite ill for a while and then left Interval shortly before they terminated the project) I certainly wouldn't call the collector finished. If Glenn Edens is saying that the collector was sub-millisecond, I have no idea where he got those numbers. On paper yes, in reality no one knows. In fact, due to some bugs in the in-liner, much of the code was never in-lined and it ran much, much slower than the standard Squeak collector. But it did run Squeak for hours without crashing so the concept worked.
Other research shows that real-time collectors use considerably more time to run than 'straightforward' collectors: 5 - 8 times more if I recall correctly. The real-time advantage is that the work is spread out and hence much less disruptive to the running system. But you have to pay the price somewhere.
Before anyone asks, I don't have the source code (which was written in Interval's version of Slang). I know where it was on Interval's servers when I left the company, but I've been asked by Tim Rowledge and others if I have a copy, so it may be lost or somewhere in the backup-ether.
I occasionally have a flash of insight as to how to simplify the collector -- just the other day I realized how to deal with the difficulties of dealing with become: in a much simpler way.
If someone wants to pursue putting a real-time collector into Squeak, I can likely provide some guidance (off-list). One thing that would help tremendously would be to simplify & clarify the interface between the Interpreter and ObjectMemory (at least it would have, years ago; I haven't looked at the interface recently).
paul
--
At 7:55 AM -0800 2/18/02, Alan Kay wrote:
Interval had a great submillisecond GC that was terrific for real-time apps. Maybe we can get this from them now that they have donated their work to Stanford (per Glenn Edens).
Don't forget that the current GC (and any Squeak GC) has to handle real time demands for animation and music, etc.
Alan's point goes directly to the heart of the GC tradeoff: if you want to minimize the total GC overhead, and you don't mind a GC running for a few hundred milliseconds, then the number of allocations between GCs can be set to a high number, as Scott showed in his experiment. If, on the other hand, you're generating 20 or 30 voices of music in real time while doing animation, you want to break the GC work into chunks too short to be noticeable, say under 10 milliseconds. In this case, you're willing to let the GC do more total work in order to avoid noticeable pauses. One might call this the "proactive" approach to garbage collection.
Different applications benefit from different settings, which is one reason the parameters can be changed at run time. (The other reason is to allow experiments such as the one Scott just performed. :-)) The current defaults were set to allow smooth real-time multimedia on a 100 MHz PowerPC or equivalent. Because Squeak was developed for a community of teachers and children, we believed (and still do) that the default settings should support multimedia even on older machines. Now that many schools are upgrading to iMacs, it might be worth increasing the defaults a little, or, as John McIntosh suggested, writing some startup code that automatically sets the GC parameters to values that support low-latency GCs.
The Interval Squeak VM had a custom incremental collector based, I think, on the Train algorithm. I believe their GC also used more processor time in GC overall, but it allowed for such low latency that one could actually write interrupt handlers for hardware devices in Squeak. I'm sure Tim can supply more details if you're interested!
Scott's experiments show that there's a factor of three in GC overhead between the current parameter settings and an extreme setting when Squeak has 500 MB of real memory available. It would be interesting to plot some of the points in between for the same set of benchmarks. Is it a linear relationship or is there a "knee" in the curve? This data might help John McIntosh with his idea of adaptively setting the GC parameters at startup time.
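One way to collect those in-between points, as a sketch: the parameter number and doit style come from this thread, but the benchmark block below is a placeholder you would replace with the macrobenchmark doit you actually run, and the sweep values are arbitrary sample points.

```smalltalk
"Sketch: sweep the allocations-between-incrGCs setting (VM parameter 5)
and time one benchmark run at each point, looking for a knee in the
curve of run time versus setting."
| results benchmark |
benchmark := ["run your macrobenchmarks here"].
results := Dictionary new.
#(4000 12000 40000 120000 400000) do: [:allocs |
	Smalltalk vmParameterAt: 5 put: allocs.
	Smalltalk garbageCollect.	"start each point from a clean heap"
	results at: allocs put: (Time millisecondsToRun: benchmark)].
results inspect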
-- John
P.S. By the way, RAM prices have gone back up. You can no longer get 512 MBytes for under $100. There was apparently a price war going on late last year. But it's still incredibly cheap compared to a few years ago.
On Wed, 18 Dec 2002 John.Maloney@disney.com wrote:
At 7:55 AM -0800 2/18/02, Alan Kay wrote:
Interval had a great submillisecond GC that was terrific for real-time apps. Maybe we can get this from them now that they have donated their work to Stanford (per Glenn Edens).
Don't forget that the current GC (and any Squeak GC) has to handle real time demands for animation and music, etc.
Alan's point goes directly to the heart of the GC tradeoff: if you want to minimize the total GC overhead, and you don't mind a GC running for a few hundred milliseconds, then the number of allocations between GCs can be set to a high number, as Scott showed in his experiment. If, on the other hand, you're generating 20 or 30 voices of music in real time while doing animation, you want to break the GC work into chunks too short to be noticeable, say under 10 milliseconds.
3 MB more RAM, and one incrGC every 40k allocs, will do this on a P2-450.
In this case, you're willing to let the GC do more total work in order to avoid noticeable pauses. One might call this the "proactive" approach to garbage collection.
Different applications benefit from different settings, which is one reason the parameters can be changed at run time. (The other reason is to allow experiments such as the one Scott just performed. :-)) The current defaults were set to allow smooth real-time multimedia on a 100 MHz PowerPC or equivalent. Because Squeak was developed for a community of teachers and children, we believed (and still do) that the default settings should support multimedia even on older machines.
Ah, that explains why they're so low. But I suspect that a lot of the uses of Squeak are not that hard-realtime, or not on that sort of hardware. So the GC params are specifically designed for the hardest task on the slowest hardware, when both of them are relatively uncommon, both individually and especially in combination.
Now that many schools are upgrading to iMacs, it might be worth increasing the defaults a little, or, as John McIntosh suggested, writing some startup code that automatically sets the GC parameters to values that support low-latency GCs.
Yes, I'd have a latency knob, and try to optimize the GC parameters so that the average latency is within, say, one of these ranges:
very low: <4 ms
low: <40 ms
medium: <100 ms
high: <400 ms
very high: >1000 ms
Note that at 'very high', you're feeding Squeak 200 MB or more of RAM, and only paying that GC cost every 20-60 seconds.
Roughly, I profile it at:
Can GC 60 MB in 170 ms, or about 360 MB (6M objects) in a second.
Can incrGC 300 MB in 1600 ms.
Can fullGC 20 MB in 400 ms. (P3-500 laptop)
--
But being adaptive, depending on the application, is best.
Scott's experiments show that there's a factor of three in GC overhead between the current parameter settings and an extreme setting when Squeak has 500 MB of real memory available. It would be interesting to plot some of the points in between for the same set of benchmarks. Is it a linear relationship or is there a "knee" in the curve? This data might help John McIntosh with his idea of adaptively setting the GC parameters at startup time.
It's probably a knee, but the knee will be in a different place for each application, or for different datasets for the same application.
Scott
Hi, Scott,
At 3:17 PM -0500 2/18/02, Scott A Crosby wrote:
On Wed, 18 Dec 2002 John.Maloney@disney.com wrote:
Different applications benefit from different settings, which is one reason the parameters can be changed at run time. (The other reason is to allow experiments such as the one Scott just performed. :-)) The current defaults were set to allow smooth real-time multimedia on a 100 MHz PowerPC or equivalent. Because Squeak was developed for a community of teachers and children, we believed (and still do) that the default settings should support multimedia even on older machines.
Ah, that explains why they're so low. But I suspect that a lot of the uses of Squeak are not that hard-realtime, or not on that sort of hardware. So the GC params are specifically designed for the hardest task on the slowest hardware, when both of them are relatively uncommon, both individually and especially in combination.
True, but teachers and kids are not likely to have the sophistication to tweak GC settings, whereas the serious Smalltalkers on the Squeak list do. So, if we did enough benchmarking to come up with settings for some of the interesting positions on your "latency vs. GC overhead" knob and perhaps make a little UI for selecting these settings, I think we could address the needs of both communities.
Congratulations on getting into graduate school, by the way. Where did you do your undergrad studies?
-- John
P.S. I just noticed that my clock was set wrong this morning. Apologies...
On Mon, 18 Feb 2002 John.Maloney@disney.com wrote:
Hi, Scott,
At 3:17 PM -0500 2/18/02, Scott A Crosby wrote:
Ah, that explains why they're so low. But I suspect that a lot of the uses of Squeak are not that hard-realtime, or not on that sort of hardware. So the GC params are specifically designed for the hardest task on the slowest hardware, when both of them are relatively uncommon, both individually and especially in combination.
True, but teachers and kids are not likely to have the sophistication to tweak GC settings, whereas the serious Smalltalkers on the Squeak list do.
I don't know if I agree. People *really* tend to leave defaults alone.
So, if we did enough benchmarking to come up with settings for some of the interesting positions on your "latency vs. GC overhead" knob and perhaps make a little UI for selecting these settings, I think we could address the needs of both communities.
The GC settings are adaptively chosen so as to make sure the average latency is within the given target.
Then a UI for choosing the target... Say, where the target latency is chosen per-project. (Assume that project authors *will* correctly set the latency for the project.)
Let the default latency target be 40 ms; realtime projects set it lower, and long-running processing projects can set it to 400 ms or higher.
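A minimal sketch of such a feedback loop, under stated assumptions: tuneGCForTargetMilliseconds: is an invented selector, garbageCollectMost is used here merely as a way to force an incremental collection to time, and the doubling/halving step and loop count are arbitrary.

```smalltalk
"Hypothetical adaptive tuner: nudge the allocations-between-incrGCs
setting (VM parameter 5) until a measured incremental-GC pause fits
the latency target."
tuneGCForTargetMilliseconds: targetMs
	| allocs pause |
	allocs := 40000.	"starting guess"
	5 timesRepeat: [
		Smalltalk vmParameterAt: 5 put: allocs.
		pause := Time millisecondsToRun: [Smalltalk garbageCollectMost].
		allocs := pause > targetMs
			ifTrue: [allocs // 2]	"pauses too long: collect more often"
			ifFalse: [allocs * 2]]	"headroom: collect less often"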
Congratulations on getting into graduate school, by the way. Where did you do your undergrad studies?
Thanks... Carnegie Mellon University
Scott
On Monday, February 18, 2002, at 09:17 PM, Scott A Crosby wrote:
Roughly, I profile it at:
Can GC 60 MB in 170 ms, or about 360 MB (6M objects) in a second.
Hmm.... 170 ms * 6 = 1020 ms, or about a second. If these numbers are accurate, there doesn't seem to be any overall performance benefit from delaying the GC (apart from completely avoiding it in a specific period of time). Or very likely I am missing something.
Marcel
-- Marcel Weiher Metaobject Software Technologies marcel@metaobject.com www.metaobject.com Metaprogramming for the Graphic Arts. HOM, IDEAs, MetaAd etc.
On Mon, 18 Feb 2002, Marcel Weiher wrote:
On Monday, February 18, 2002, at 09:17 PM, Scott A Crosby wrote:
Roughly, I profile it at:
Can GC 60 MB in 170 ms, or about 360 MB (6M objects) in a second.
Hmm.... 170 ms * 6 = 1020 ms, or about a second. If these numbers are accurate, there doesn't seem to be any overall performance benefit from delaying the GC (apart from completely avoiding it in a specific period of time). Or very likely I am missing something.
These are raw numbers, and inconsistent with each other... I get fullGC about 4x-8x slower than an incrGC on the same number of bytes.
Can incrGC 300 MB in 1600 ms. Can fullGC 20 MB in 400 ms.
We *know* that increasing these parameters makes macroBenchmarks go faster. It also avoids methodCache and atCache flushes, which would otherwise slow down computation. (This also makes it more feasible to have much larger method & at caches.)
Scott.
On Monday, February 18, 2002, at 11:50 PM, Scott A Crosby wrote:
On Mon, 18 Feb 2002, Marcel Weiher wrote:
On Monday, February 18, 2002, at 09:17 PM, Scott A Crosby wrote:
Roughly, I profile it at:
Can GC 60 MB in 170 ms, or about 360 MB (6M objects) in a second.
Hmm.... 170 ms * 6 = 1020 ms, or about a second. If these numbers are accurate, there doesn't seem to be any overall performance benefit from delaying the GC (apart from completely avoiding it in a specific period of time). Or very likely I am missing something.
These are raw numbers, and inconsistent with each other..
Well, do we have consistent numbers anywhere? If the numbers are so inconsistent, what conclusions are we drawing from them?
I get fullGC about 4x-8x slower than a incrGC on the same number of bytes.
Can incrGC 300 MB in 1600 ms. Can fullGC 20 MB in 400 ms.
Sure. So? How does the difference between incremental and full GC relate to my observation that the numbers posted above show a linear relationship?
We *know* that increasing these parameters makes macroBenchmarks go faster.
We *really* do?
It also avoids methodCache and atCache flushes, which will slow down computation. (This also makes it more feasible to have much larger method&at caches.)
Hmm...
Marcel
On Tue, 19 Feb 2002, Marcel Weiher wrote:
On Monday, February 18, 2002, at 11:50 PM, Scott A Crosby wrote:
Well do we have consistent numbers anywhere? If the numbers are so inconsistent, what conclusions are we drawing from them?
The numbers are variable based on the workloads and types of objects. The estimate of 4x-8x is only an estimate.
Sure. So? How does the difference between incremental and full GC relate to my observation that the numbers posted above show a linear relationship.
Oh, duh, yes... When I was doing quick benchmarks of GC performance, I got the raw number 'Can GC 60 MB in 170 ms', from which I calculated 'or about 360 MB (6M objects) in a second'. That explains your observation of the linear dependence.
We *know* that increasing these parameters makes macroBenchmarks go faster.
We *really* do?
Yep... Try increasing them, giving Squeak a couple hundred megs of RAM, and running macrobenchmarks.
Scott
On Tuesday, February 19, 2002, at 12:37 AM, Scott A Crosby wrote:
We *know* that increasing these parameters makes macroBenchmarks go faster.
We *really* do?
Yep... Try increasing them, giving Squeak a couple hundred megs of RAM, and running macrobenchmarks.
Well, I just tried it for the VM parameter 4, and setting that to 40000 was actually measurably *slower* than setting it at 4000 (135s vs. 130s). The numbers of incremental GCs were typically around a factor of 10 lower, and the time devoted to incremental GCs also significantly lower, but the total running time of each of the individual benchmarks increased.
Also setting vmParameter 5 to 12000 (in addition to vmParameter 4 to 40000) further slows down the macro-benchmarks, to 148 seconds.
So this doesn't seem to be a worthwhile 'optimization', at least on my system (dual G4/1GHz, 512 MB RAM, 128MB allocated to Squeak).
Marcel
-- Marcel Weiher Metaobject Software Technologies marcel@metaobject.com www.metaobject.com Metaprogramming for the Graphic Arts. HOM, IDEAs, MetaAd etc.
On Tue, 19 Feb 2002, Marcel Weiher wrote:
On Tuesday, February 19, 2002, at 12:37 AM, Scott A Crosby wrote:
We *know* that increasing these parameters makes macroBenchmarks go faster.
We *really* do?
Yep.. Try increasing them, giving squeak a couple hundred megs of RAM, and running macrobenchmarks.
Well, I just tried it for the VM parameter 4, and setting that to 40000 was actually measurably *slower* than setting it at 4000 (135s vs. 130s). The numbers of incremental GCs were typically around a factor of 10 lower, and the time devoted to incremental GCs also significantly lower, but the total running time of each of the individual benchmarks increased.
Strange, as #4 is a read-only parameter. (See the source for vmParameterAt:.)
Methinks you're seeing measurement noise.
Also setting vmParameter 5 to 12000 (in addition to vmParameter 4 to 40000) further slows down the macro-benchmarks, to 148 seconds.
Methinks you're seeing measurement noise.
Also, macrobenchmarks allocates all but 10mb before benchmarking; remove that code. (And, optionally, remove the GCs it does between the separate benchmarks.)
Then try setting #5 to 800k and #6 to 12k.
Note that these numbers are reset on each image startup.
Then rebenchmark.
Scott
On Tuesday, February 19, 2002, at 02:15 AM, Scott A Crosby wrote:
Well, I just tried it for the VM parameter 4, and setting that to 40000 was actually measurably *slower* than setting it at 4000 (135s vs. 130s). The numbers of incremental GCs were typically around a factor of 10 lower, and the time devoted to incremental GCs also significantly lower, but the total running time of each of the individual benchmarks increased.
Strange, as #4 is a read-only parameter. (See the source for vmParameterAt:.)
It was parameters 5 and 6, not 4 and 5. Late here.
Methinks you're seeing measurement noise.
No, the results are consistent and persistent.
Also setting vmParameter 5 to 12000 (in addition to vmParameter 4 to 40000) further slows down the macro-benchmarks, to 148 seconds.
Methinks you're seeing measurement noise.
No.
Also, macrobenchmarks allocates all but 10mb before benchmarking; remove that code.
OK, did that, then re-ran the benchmarks with a variety of settings.
With CocoaSqueak 3.2, on a dual 1GHz G4/512MB total RAM, 128MB allocated to Squeak, image size 17MB.
The parameters that were varied were VM parameter 5/6, using the following doit (parameters varied accordingly...)
Smalltalk vmParameterAt: 5 put: 2000.
Smalltalk vmParameterAt: 6 put: 1000.
Smalltalk macroBenchmarks.
Default parameters 4K/2K: 128262 , 130944, 127643
Parameters increased to 40K/12K: 141371, 141350, 148059
Smaller increase, smaller slowdown 10K/12K: 136736
Parameters accidentally decreased to 4K/1.2K: 125945
Intrigued, I set param 6 even lower, to 4K/1K, and got: 124051, 124079
"How low can you go", 2K/1K got worse again: 127820, 127739
From these measurements, the sweet spot seems to be at somewhat lower settings than the current default, at 4K/1K.
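Averaging the runs posted above supports that reading. A quick sketch (times are the milliseconds reported in this thread; the grouping is mine):

```python
# Mean macro-benchmark time (ms) per parameter-5/parameter-6 setting,
# using the runs reported in this thread.
runs = {
    "4K/2K (default)": [128262, 130944, 127643],
    "40K/12K":         [141371, 141350, 148059],
    "10K/12K":         [136736],
    "4K/1.2K":         [125945],
    "4K/1K":           [124051, 124079],
    "2K/1K":           [127820, 127739],
}
means = {setting: sum(t) / len(t) for setting, t in runs.items()}
for setting, mean in sorted(means.items(), key=lambda kv: kv[1]):
    print(f"{setting:16s} {mean:9.0f}")
# 4K/1K has the lowest mean, below the 4K/2K default.
```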
Marcel
On Tue, 19 Feb 2002, Marcel Weiher wrote:
On Tuesday, February 19, 2002, at 02:15 AM, Scott A Crosby wrote:
With CocoaSqueak 3.2, on a dual 1GHz G4/512MB total RAM, 128MB allocated to Squeak, image size 17MB.
The parameters that were varied were VM parameter 5/6, using the following doit (parameters varied accordingly...)
[*] You also changed macrobenchmarks to not allocate all-but-10mb of RAM before starting the benchmark?
Default parameters 4K/2K: 128262 , 130944, 127643
Parameters increased to 40K/12K: 141371, 141350, 148059
Smaller increase, smaller slowdown 10K/12K: 136736
Parameters accidentally decreased to 4K/1.2K: 125945
Intrigued, I set param 6 even lower, to 4K/1K, and got: 124051, 124079
Very interesting. Caching issues for the CPU cache? My CPU is only half the speed of yours, so caching issues may be more important for you than for me. I've also got a different architecture. What are the times spent on GC for the respective sizes? Is GC using less time, but the overall benchmark getting worse?
I tried roughly the same type of analysis to see if caching issues had any effect and didn't see any.
The main effect I saw was lower GC times when I set them very high and disabled [*] above to allow it to use all the RAM.
Also, I used much larger numbers: 400k/12k (which is the default I use now) 4000k/12k
with 256mb or more allocated to squeak.
From these measurements, the sweet spot seems to be at somewhat lower settings than the current default, at 4K/1K.
Yup. Experimentation is needed. Thanks for doing this on the PPC, and I'm sorry for not believing you.
Scott
On Tuesday, February 19, 2002, at 07:03 PM, Scott A Crosby wrote:
[*] You also changed macrobenchmarks to not allocate all-but-10mb of RAM before starting the benchmark?
Of course.
Default parameters 4K/2K: 128262 , 130944, 127643
Parameters increased to 40K/12K: 141371, 141350, 148059
Smaller increase, smaller slowdown 10K/12K: 136736
Parameters accidentally decreased to 4K/1.2K: 125945
Intrigued, I set param 6 even lower, to 4K/1K, and got: 124051, 124079
Very interesting. Caching issues for the CPU cache?
That is my guess. On these modern machines, memory-hierarchy issues (cache vs. memory, memory vs. disk) can often be much more significant than raw code efficiency. Very relevant for scientific code, for example.
My CPU is only half the speed of yours, so caching issues may be more important for you than for me. I've also got a different architecture. What are the times spent on GC for the respective sizes? Is GC using less time, but the overall benchmark getting worse?
Generally, yes.
I tried roughly the same type of analysis to see if caching issues had any effect and didn't see any.
The main effect I saw was lower GC times when I set them very high and disabled [*] above to allow it to use all the RAM.
Also, I used much larger numbers: 400k/12k (which is the default I use now) 4000k/12k
with 256mb or more allocated to squeak.
Well, a 10:1 memory overhead seems a bit excessive to save a couple of percent of total running time. My guess is that there are more effective ways we could utilize that sort of memory...
Also, on a system like Mach (most Unixes would probably be similar), gratuitous memory use will typically have other degrading effects (displacing cached files or other programs, or getting displaced yourself, ping-pong fashion).
From these measurements, the sweet spot seems to be at somewhat lower settings than the current default, at 4K/1K.
Yup. Experimentation is needed. Thanks for doing this on the PPC, and I'm sorry for not believing you.
That's OK. It got me to get some real numbers... ;-)
Marcel
On Tuesday, February 19, 2002, at 12:37 AM, Scott A Crosby wrote:
We *know* that increasing these parameters makes macroBenchmarks go faster.
We *really* do?
Yep.. Try increasing them, giving squeak a couple hundred megs of RAM, and running macrobenchmarks.
Mmm, this is all interesting and good. However, based on past experiences with the VisualWorks GC, I can say the first thing you need to do is understand the behavior of the current GC, i.e. how it runs and what it is doing. This can only be done by collecting hard statistics on what the GC is doing.
In the past I've discussed doing something about this. I guess if nothing else comes up I'll look into it. The first step is to actually collect a lot more numbers at interesting places in the GC to understand what it is doing. Based on that, and also having code changes to the VM in place to effect changes, we can better decide what is best.
For example, a change I made a year ago just to understand some things was to look at how long each IGC took, then adjust the object allocation count up or down based on that metric, versus an indicated target.
standardTime: aBlock
	"Times the execution of aBlock in milliseconds, under the following
	standard conditions: exactly 10Mb of free space is available and
	compacted, and the recent VM statistics are reset immediately before
	execution."
	| spaceLeft tieDown |
	spaceLeft _ Smalltalk garbageCollect.
	spaceLeft < 1e7 ifTrue: [self error: 'not enough space for standard conditions'].
	tieDown _ ByteArray new: spaceLeft - 1e7.  "Leave exactly 10MB free"
Now the problem comes in when the VM can autosize, but the autosizing doesn't take virtual memory considerations into account. Smalltalk garbageCollect returns the maximum size we can grow to, versus, say, a size that won't cause page swapping.
So on my mac this returns
517759868
As you can see we then run off and allocate 517759868 - 10000000.
But on a machine that only *has* 512MB of memory, lots of things are paged out and a historically significant event called *page thrashing* occurs.
Now I think Dan wrote this, and perhaps it needs to be rethought.
I'm not sure that on unix machines you can really ask what the 'safe' limit is; after all, the objective of VM operating systems is to allow you to run applications that don't fit into real memory boundaries.
So I'll welcome a change set
Also the max size is used in ImageSegment>>copySmartRootsExport: rootArray for some decision on characteristics of an algorithm to be run.
On Tuesday, February 19, 2002, at 02:40 AM, John M McIntosh wrote:
[standardTime checking for available memory and grabbing most of it]
But on a machine that only *has* 512MB of memory, lots of things are paged out and a historically significant event called *page thrashing* occurs.
Yes ;-)
Incidentally, this will also happen on a system with more RAM if that RAM is actually used by other applications. With an OS like Mach, which always tries to utilize all available RAM (minus a safety margin), grabbing memory like that is always a significant detriment to overall performance, because that memory won't be available for disk caching etc.
Now I think Dan wrote this, and perhaps it needs to be rethought.
Maybe. OTOH, I think it points out a flaw in the memory-grow logic, namely...
I'm not sure on unix machines you can really ask what the 'safe' limit is,
...exactly! That was one of the points I tried to get across in my discussion with Andreas, although I think I didn't do a very good job of it. What is safe is an extremely fluid concept.
At the very least, memory you use is not available to the system for other things such as file mapping. Mach memory maps all files, even the ones accessed via read()/write() and will use all of memory as a disk cache. It will keep files in memory even if they aren't currently being accessed. So there is no such thing as zero cost.
Somewhat more costly (overall) is when you start displacing other programs' data or code. At first the inactive parts, and as long as you're only displacing 'clean' pages such as code or read-only data, things are still fairly cheap. If you displace 'dirty' data, things start getting expensive, because you have to move stuff to disk. It is still somewhat OK if that data is not part of the active set, because then you page it out to the swapfile and forget about it.
Displacing active memory starts getting nasty, because then you have to page it back in soon, with the worst being active read/write memory, because you have to both page it out and read it back in.
So I think John is right in saying that the term 'limit' is largely meaningless on such systems, at least for those limits that are readily available. The lower limit of currently available free memory is meaningless because a good VM subsystem will keep this close to zero. If Squeak is the only significant process running, (installed real memory - headroom) may have some meaning, though you wouldn't really want to grow that much gratuitously. Available swap space is also largely meaningless (20+ gig on my system), because a system like Squeak that actually walks through all of its memory every once in a while (full GC) will page-thrash way before that.
Meaningful parameters are probably best expressed in some kind of weighted pressure model. A small pressure against Squeak keeps it from gratuitously filling memory. If fullGCs become more frequent, the pressure exerted by Squeak to grow its memory space increases. Counter-pressure increases with the paging rate and as (real memory - headroom) is approached. One potentially very elegant way of handling this sort of thing would be to have Squeak involved in managing its own memory via an external pager. Let's hope this facility gets re-enabled at some point.
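To make the idea concrete, here is a hypothetical sketch of such a pressure model in Python. All names, weights, and thresholds are invented for illustration; nothing here is actual Squeak VM code.

```python
# Hypothetical sketch of a weighted-pressure heap-growth policy.
# All names, weights, and thresholds are invented for illustration;
# this is not Squeak VM code.

def should_grow_heap(full_gcs_per_minute, pages_swapped_per_second,
                     heap_mb, real_memory_mb, headroom_mb=64):
    """Return True if the 'grow' pressure outweighs the counter-pressure."""
    # Pressure to grow: frequent full GCs suggest the heap is too small.
    grow_pressure = full_gcs_per_minute

    # Counter-pressure: paging activity, plus how close we already are
    # to (real memory - headroom).
    usable = real_memory_mb - headroom_mb
    fullness = heap_mb / usable          # approaches 1.0 near the limit
    counter_pressure = pages_swapped_per_second + 10 * fullness

    return grow_pressure > counter_pressure

# A heap that full-GCs often on an otherwise idle machine grows...
print(should_grow_heap(12, 0.0, 128, 512))   # True
# ...but not when it is already near the memory limit and paging.
print(should_grow_heap(12, 5.0, 400, 512))   # False
```

The weights are arbitrary; the point is only the shape of the policy: growth pressure and counter-pressure are compared, rather than any single hard limit being consulted.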
after all the objective of VM operating systems is to allow you to run applications that don't fit into real memory boundaries.
...while keeping the working set in real memory. One problem with Squeak (and other GCed systems) is the full-scan that is performed by the copying/compacting fullGCs. This means that when significant memory activity is happening, which is also the point at which the VM would be most useful, you defeat the VM by making all of your memory the working set, and what's worse, a read/write working set. (OTOH, copying GC is probably good for incremental GC because it keeps the hot/active area constant and therefore inside the CPU caches).
I am pretty sure that a non-copying GC would significantly reduce the (real) memory requirements of Squeak on machines with decent VM systems. Better yet would be one that can avoid scanning as well. I think the Boehm collector has an option for using MMU hardware to avoid unnecessary scans. With that, not only would stuff that's not currently used not occupy real memory, it would also just sit on disk without causing any paging or other activity.
So I'll welcome a change set
;-)
Marcel
John,
I've been running into this problem at various times - turns out that #standardTime: on any recent Windows machine will give you anything _but_ standard conditions (due to swapping, etc.). The best thing you can do is just ignore it ;-) With respect to #copySmartRootsExport: you will notice that there's an upper limit set for the initial size, so that actually works quite well.
Cheers, - Andreas
-----Original Message-----
From: squeak-dev-admin@lists.squeakfoundation.org [mailto:squeak-dev-admin@lists.squeakfoundation.org] On Behalf Of John M McIntosh
Sent: Tuesday, February 19, 2002 2:41 AM
To: squeak-dev@lists.squeakfoundation.org
Subject: A problem with standardTime:
standardTime: aBlock
	"Times the execution of aBlock in milliseconds, under the following
	standard conditions: exactly 10Mb of free space is available and
	compacted, and the recent VM statistics are reset immediately before
	execution."
	| spaceLeft tieDown |
	spaceLeft _ Smalltalk garbageCollect.
	spaceLeft < 1e7 ifTrue: [self error: 'not enough space for standard conditions'].
	tieDown _ ByteArray new: spaceLeft - 1e7.  "Leave exactly 10MB free"
Now the problem comes in when the VM can autosize, but the autosizing doesn't take virtual memory considerations into account. Smalltalk garbageCollect returns the maximum size we can grow to, versus, say, a size that won't cause page swapping.
So on my mac this returns
517759868
As you can see we then run off and allocate 517759868 - 10000000.
But on a machine that only *has* 512MB of memory, lots of things are paged out and a historically significant event called *page thrashing* occurs.
Now I think Dan wrote this, and perhaps it needs to be rethought.
I'm not sure that on unix machines you can really ask what the 'safe' limit is; after all, the objective of VM operating systems is to allow you to run applications that don't fit into real memory boundaries.
So I'll welcome a change set
Also the max size is used in ImageSegment>>copySmartRootsExport: rootArray for some decision on characteristics of an algorithm to be run.
--
==============================================================
John M. McIntosh johnmci@smalltalkconsulting.com 1-800-477-2659 Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
==============================================================
At 5:50 PM -0500 2/18/02, Scott A Crosby wrote:
On Mon, 18 Feb 2002, Marcel Weiher wrote:
On Monday, February 18, 2002, at 09:17 PM, Scott A Crosby wrote:
Roughly, I profile it at:
Can GC 60mb in 170ms, or about 360mb (6m objects) in a second.
Hmm.... 170 ms * 6 = 1020 ms, or about a second. If these numbers are accurate, there doesn't seem to be any overall performance benefit from delaying the GC (apart from completely avoiding it in a specific period of time). Or very likely I am missing something.
These are raw numbers and inconsistent with each other. I get fullGC about 4x-8x slower than an incrGC on the same number of bytes.
Can incrGC 300MB in 1600ms. Can fullGC 20MB in 400ms
Incremental GCs usually have less work to do, since most objects die young. Thus there are fewer objects to trace and fewer bytes to move during compaction. A full GC has to trace all the objects in the object memory and, if an object low in memory has died, it must move all the objects above that object down during compaction.
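That asymmetry can be pictured with a toy cost model. This is a sketch under stated assumptions; the 10% survival rate is an illustrative guess, not a Squeak measurement:

```python
# Toy cost model of incremental vs. full GC work. An incremental GC
# scans only the recently allocated objects and keeps their few
# survivors; a full GC must trace every live object in object memory.
# The survival rate is an illustrative assumption, not a measurement.

def incr_gc_work(new_objects, survival_rate=0.1):
    # Work is dominated by tracing the survivors among recent allocations.
    return new_objects * survival_rate

def full_gc_work(live_objects):
    # Work is tracing (and potentially moving) every live object.
    return live_objects

# 4000 allocations per incremental GC (the Squeak default) vs. a heap
# holding a million live objects:
print(incr_gc_work(4000))       # prints 400.0
print(full_gc_work(1_000_000))  # prints 1000000
```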
-- John