do I have to garbageCollect every time I create a large object?

John.Maloney at disney.com
Wed Aug 15 17:54:09 UTC 2001


Carlos and Stephen:

At 9:17 PM -0300 8/8/01, Carlos Sarraute wrote:
>Hello,
>
>I have noticed something curious: after creating a very large object
>(Array new: 25000000, which occupies 100,000,000 bytes of memory),
>everything gets very slow, and the SystemMonitor indicates constant
>incremental garbage collections.
>It is only after performing a full garbage collection (Smalltalk
>garbageCollect) that I can work again normally. My question is: do I
>have to do "Smalltalk garbageCollect" every time I create a large
>object? Or are there other ways to handle this situation?
>
>Thanks,
> Carlos Sarraute

At 9:12 AM -0400 8/9/01, Stephen Pair wrote:
>I've noticed the same thing...
>
>Does incremental GC get triggered based on the size of new space?  It
>seems like incremental GC constantly thinks that it needs to run when
>you allocate a huge object.  Looks like the big object has to make it
>into old space before things start working well again.
>
>- Stephen

Incremental GCs are done every N object allocations (N=4000 by default). They
won't happen any more often when there is a large array in new space, but each
incremental GC will take longer. Why? Because it has to scan that 25,000,000-element
array looking for pointers to objects that need to be traced. Since the array is much
too large to fit into any of the caches, the scanning speed is bounded
by your memory bandwidth. If you are running in virtual memory, scanning
this array may even cause swapping, which would *really* slow things down.
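
If you want to see the effect for yourself, here is a rough workspace
experiment (just a sketch; the exact numbers will vary with your machine
and image):

	| big t1 t2 |
	big := Array new: 25000000.	"pointer array sitting in new space"
	t1 := Time millisecondsToRun:
		[100000 timesRepeat: [Array new: 8]].	"triggers many incremental GCs"
	Smalltalk garbageCollect.	"tenure the big array into old space"
	t2 := Time millisecondsToRun:
		[100000 timesRepeat: [Array new: 8]].
	Transcript show: 'big array in new space: ', t1 printString, ' ms'; cr.
	Transcript show: 'after tenuring: ', t2 printString, ' ms'; cr

The first loop runs slowly because every incremental GC rescans all
25,000,000 slots; the second runs at normal speed.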

If you do a full GC, the big array will be "tenured"--that is, moved to old space.
Now it won't need to be scanned on every incremental GC, and things return
to normal--unless you store a reference to a young object into the array. If
you do that, the array becomes a "root" object and must be scanned on every
incremental GC again. Of course, once you do another full GC, the young object
becomes an old object and things return to normal.
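
In code, the whole life cycle looks like this (illustrative only):

	| big |
	big := Array new: 25000000.	"in new space: scanned by every incremental GC"
	Smalltalk garbageCollect.	"tenured: no longer scanned"
	big at: 1 put: Object new.	"now holds a young pointer: big is a root again"
	Smalltalk garbageCollect.	"the young object is tenured; big stops being a root"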

Now, none of these GC issues arise with large objects that can't contain pointers.
For example, a large Bitmap, String, ByteArray, or FloatArray will not cause
problems because the VM doesn't need to scan it for pointers.
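
For example, any of the following can be allocated without slowing down
the incremental GC, because their contents are raw bits rather than object
pointers:

	Bitmap new: 25000000.	"25,000,000 32-bit words"
	FloatArray new: 25000000.	"25,000,000 floats"
	String new: 100000000.	"100,000,000 bytes"
	ByteArray new: 100000000.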

Some guidelines for avoiding such aberrant GC behavior are:

1. If appropriate, use the non-pointer classes for large arrays.
2. If you need a large array of pointers, do a full GC after you
    allocate it and after every update. This will keep it and everything
    it points to in old space.
3. If you must frequently update a large array with pointers to
    newly created objects, consider breaking the array into smaller
    chunks (see the sketch after this list). If your data set is this
    large, chances are you'll need a more sophisticated addressing
    scheme than a flat array anyhow.
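
For guideline 3, a two-level table is one simple way to chunk. A minimal
sketch (the chunk size of 10,000 is arbitrary; the names here are mine):

	| chunkSize chunks k |
	chunkSize := 10000.	"each chunk is about 40 KB, cheap to scan"
	chunks := (1 to: 2500) collect: [:i | Array new: chunkSize].
	k := 12345678.	"store at logical index k"
	(chunks at: k - 1 // chunkSize + 1)
		at: k - 1 \\ chunkSize + 1
		put: Object new

When you store a young pointer, only the one small chunk that received it
becomes a root, so each incremental GC scans at most 10,000 slots instead
of 25,000,000.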

In practice, this issue is very seldom a problem. Bob Arning did discover one
case of it involving Squeak's Symbol dictionary, but in six years of Squeak
programming, that's the only case that I know of where this has been a
problem for a real application. Even there, the earlier design had no
trouble, because it split the Symbol dictionary into a number of smaller
dictionaries.

Of course, recent computers have more physical memory than they used to,
which allows Squeak to deal with much larger amounts of in-memory data,
so we may start to see this GC issue arise more often in real applications.
There are GC designs--such as the card-marking scheme used by the Self VM--
that reduce the burden of scanning large objects. (In a card-marking scheme,
old space is divided into small, fixed-size "cards"; a store into an object
marks just the card it touches, so the GC rescans only the marked cards
rather than the entire object.) But such schemes add complexity to the GC
and may slow down the common cases. Thus, I recommend sticking with the
current GC design unless we see a lot of real applications that need very
large, frequently updated arrays.

	-- John
