Comments on Smalltalk block closure designs, part 1

Mark van Gulik ghoul6 at home.net
Fri Apr 27 07:27:14 UTC 2001


on 4/26/01 7:36 PM, Richard A. O'Keefe at ok at atlas.otago.ac.nz wrote:

> Allen_Wirfs-Brock at mail.instantiations.com wrote:
> A 100-to-1 performance hit on a cache miss is a rule of thumb I
> have heard people use for modern processor/memory subsystems.
> 
> The figures I saw for UltraSPARC, back in 1996, were
>   load from register => result available  1 cycle  later
>   L1 cache           => result available  2 cycles later
>   L2 cache           => result available  8 cycles later
>   main memory        => result available 30 cycles later
> for a machine that could run 4 instructions per cycle.  Since then,
> CPU speeds have nearly quadrupled, but memory speeds have not.
> 
> Then there's the point that if you try to recycle contexts in the heap,
> you had better make sure that the garbage collector knows to ignore
> references from slots that are not currently in use (more GC complexity)
> or that they are nilled out on exit (more time overhead).

I prepared a much larger post about this, then noticed something interesting
and started again...

There is no advantage to saving Newspace-allocated context corpses during
scavenging.  Thus, at scavenge-time you can simply throw away your context
freelist.  Alternatively, if you've allocated them in a separate
LIFO-Context space, you won't be scanning the dead ones at scavenge time
anyhow.  Since corpses are never scanned, you don't have to worry about
garbage pointers in them.  Still, a debugging version of the VM might want
to fill context corpses with DEADBEEF (hex) for your own sanity.
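
As a minimal sketch in C, assuming a context carries a spare word that can
double as the free-list link (the names below are invented for
illustration, not taken from the real Squeak VM):

#include <stddef.h>
#include <stdint.h>

enum { NUM_SLOTS = 16 };

typedef struct Context {
    struct Context *next_free;    /* free-list link, valid only in corpses */
    uintptr_t slots[NUM_SLOTS];
} Context;

static Context *free_contexts;    /* head of the corpse list */

void release_context(Context *c)
{
#ifdef DEBUG_VM
    /* Corpses are never scanned, so garbage slots are harmless, but
       poisoning them makes any stray use obvious. */
    for (size_t i = 0; i < NUM_SLOTS; i++)
        c->slots[i] = 0xDEADBEEF;
#endif
    c->next_free = free_contexts;
    free_contexts = c;
}

Context *acquire_context(void)    /* NULL means allocate a fresh one */
{
    Context *c = free_contexts;
    if (c != NULL)
        free_contexts = c->next_free;
    return c;
}

void on_scavenge(void)
{
    /* Newspace corpses won't survive the scavenge, so just forget
       the whole list rather than scanning or preserving it. */
    free_contexts = NULL;
}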

Say you nil the slots at allocation time.  Some architectures allow you to
zero an entire cache line with one instruction (e.g., dcbz on the PowerPC),
avoiding the latency and traffic of a memory read when you intend to
clobber all the data anyhow.  To avoid choking the write queue, simply use
this technique to make the cache believe it has current data for the data
block that's, say, 64 bytes ahead of where you're actually allocating.
That's probably small enough that the writes (due to the zeroing) won't get
flushed out before the memory is actually used to house an object.  Leave
128 bytes of headroom at the end of Newspace that never get written to.
Don't worry about precisely zeroing every data block; just never zero a
data block more than 64 bytes beyond the end of allocatable Newspace.  Most
objects are small, so a single zero-data-block instruction per object
allocation should be plenty.  Redundant zeroing of the same data block (due
to very small objects) probably isn't too expensive, since it only affects
the innermost cache.
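
Here is a minimal sketch of that allocation path in C, using memset as a
stand-in for the real zero-cache-line instruction and assuming (for
illustration only) that Newspace starts cache-line aligned; alloc_ptr,
zero_frontier, and newspace_end are names invented for the sketch.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64
#define HEADROOM   (2 * CACHE_LINE)   /* 128 untouched bytes at the end */

static uint8_t *alloc_ptr;       /* next free byte in Newspace */
static uint8_t *zero_frontier;   /* everything below this is already zero */
static uint8_t *newspace_end;    /* one past the last byte of Newspace */

/* Stand-in for a single zero-cache-line instruction such as dcbz. */
static void zero_data_block(uint8_t *line)
{
    memset(line, 0, CACHE_LINE);
}

void *allocate(size_t bytes)
{
    uint8_t *obj = alloc_ptr;
    if (alloc_ptr + bytes > newspace_end - HEADROOM)
        return NULL;             /* out of Newspace: time to scavenge */
    alloc_ptr += bytes;

    /* Keep the zero frontier roughly one line ahead of the allocation
       pointer, never zeroing more than CACHE_LINE bytes past the end
       of allocatable Newspace (the headroom absorbs the overshoot). */
    while (zero_frontier < alloc_ptr + CACHE_LINE
           && zero_frontier + CACHE_LINE <= newspace_end) {
        zero_data_block(zero_frontier);
        zero_frontier += CACHE_LINE;
    }
    return obj;                  /* slots come back pre-nilled (zero) */
}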

If you nil the slots at context-exit time, you run the risk that a scavenge
could intervene and throw away the context corpse, wasting the nilling
effort (though discarding it is still cheaper than *copying* it to
ToSpace).  Also, by the time you reuse the object, its cache line might
have been reused for other data, so you'd have to read it back from the
next layer of cache.

From this I conclude that allocation time is when nilling should happen.


Something else:

Someone in this thread (I think it was Ian Piumarta) mentioned that there
is essentially no such thing as "locality of reference" between cache
lines, because this kind of cache uses associative lookup (typically with
some kind of "wired-or", if MY associative memory serves me).  He then
stated that the cache-miss cost stays fairly low until a memory-hierarchy
boundary is crossed, at which point it spikes way up.  As a worst case,
consider swapping to a rotating disk.  A 10ms disk read is
***1,000,000 times slower*** than accessing cheap 100MHz (=10ns) memory
(approximately).  I've heard horror stories from the "good ol' days" in
which code was *executed* directly from drum memory.  Brrr.

Anyhow, I've been thinking of ways to make garbage collection of
<sarcasm>involuntarily persisted</sarcasm> objects quicker.  If a CPU
running in User Mode is allowed to test the current permissions of a page
(without triggering a page fault), a garbage collector could simply do its
best to mark what's currently in memory while building up a list of what to
fetch later from disk.  In theory, this list could be fed to the OS for
page-prefetching, and the OS might even reorder the reads to minimize
latency.  This would greatly reduce the insane amount of thrashing that
current GCs exhibit when hitting the memory/disk barrier.  It would also
allow the GC to continue doing useful work while the disk stays busy
prefetching pages.  A current GC will block the whole thread for 10ms until
a needed (random) page is completely fetched, then do the same thing a
microsecond later with another page.  Meanwhile, a page gets swapped *out*
to make room, and a few milliseconds later it's needed again...  The
working set is just too large to access randomly when you know almost every
page will be accessed many times per GC.
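
For what it's worth, Unix-like systems typically expose the two primitives
this needs: mincore(2) tests page residency from user mode without
faulting, and madvise(MADV_WILLNEED) hands the OS a prefetch hint.  Below
is a minimal sketch in C along those lines; Object, mark_now, and the
deferred list are hypothetical stand-ins, not anything from an actual
Squeak GC.

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct Object { struct Object *slots[4]; int marked; } Object;

enum { MAX_DEFERRED = 4096 };
static Object *deferred[MAX_DEFERRED];  /* objects on swapped-out pages */
static int n_deferred;
static long page_size;                  /* set once from sysconf */

static void *page_of(const void *addr)
{
    return (void *)((uintptr_t)addr & ~(uintptr_t)(page_size - 1));
}

/* True if the page holding addr is in core, per mincore(2). */
static int page_is_resident(const void *addr)
{
    unsigned char vec = 0;
    return mincore(page_of(addr), (size_t)page_size, &vec) == 0
        && (vec & 1);
}

static void mark_now(Object *obj)       /* trace a resident object */
{
    obj->marked = 1;
    /* ...push obj->slots[i] onto the gray stack, etc... */
}

/* Mark what's in memory now; queue the rest and let the disk prefetch
   in the background while marking continues. */
static void mark(Object *obj)
{
    if (obj == NULL || obj->marked)
        return;
    if (page_is_resident(obj)) {
        mark_now(obj);
    } else if (n_deferred < MAX_DEFERRED) {
        deferred[n_deferred++] = obj;
        madvise(page_of(obj), (size_t)page_size, MADV_WILLNEED);
    }
}

int main(void)
{
    page_size = sysconf(_SC_PAGESIZE);
    /* ...mark() all roots, then repeatedly drain `deferred` as its
       pages arrive, until no work remains... */
    return 0;
}

Sorting the deferred list by address before draining it would approximate
the latency-minimizing read order described above.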

Has anyone heard of any work being done in this vein?  If the rate of
increase of processor speeds continues to be significantly higher than the
rate of increase of memory speed, this idea might soon become applicable at
the cache/main-memory boundary too.  Tied to a futuristic
multiple-path-speculative-execution, multi-threaded processor core, this
might stretch the concept of out-of-order execution to its extreme.

-Mark




