Comments on Smalltalk block closure designs, part 1

Tim Olson tim at jump.net
Fri Apr 27 13:46:17 UTC 2001


Mark van Gulik wrote:

>Someone in this thread (I think it was Ian Piumarta) mentioned that
>basically there is no such thing as "locality of reference" between cache
>lines.  This is due to the fact that this kind of cache uses associative
>lookup (typically with some kind of "wired-or" if MY associative memory
>serves me).  

Cache associativity has some effect here, but it's rare to see highly- 
or fully-associative caches (the ones I can think of are TLBs on MIPS 
processors, and the L1 caches on some ARM designs).  Most caches these 
days have a small amount of associativity, from 1-way "direct mapped" to 
4-way or 8-way.  The future trend is towards on-chip cache hierarchies 
with small, direct-mapped caches at the top and large, 4 to 8-way caches 
at the bottom.
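
For concreteness, here's a rough sketch in C of how a set-associative
cache picks the set for an address; the sizes below are made up for
illustration, and only the lines whose addresses land in the same set
compete for that set's few ways.

#include <stdint.h>

/* Hypothetical cache: 32 KB total, 32-byte lines, 4-way set-associative,
   so 32768 / (32 * 4) = 256 sets. */
#define LINE_SIZE   32u
#define NUM_WAYS    4u
#define CACHE_SIZE  32768u
#define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))

/* The set index is just a slice of the address bits; addresses that
   agree in those bits all compete for the same NUM_WAYS entries. */
static unsigned cache_set(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}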

All small-associativity caches have problematic access patterns which 
cause conflict misses that can lead to greatly reduced performance, even 
when the working set being accessed is smaller than the total cache size. 
 An interesting idea would be to collect "hot spot" access statistics 
while running, for use by the garbage collector to dynamically remap 
heavily-used areas to different cache sets to avoid thrashing.
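
To make the conflict-miss case concrete, continuing the made-up 4-way,
256-set, 32-byte-line cache sketched above: stepping through memory in
strides of LINE_SIZE * NUM_SETS (8 KB here) puts every access in the
same set, so eight such lines thrash a 4-way set even though they total
only 256 bytes out of a 32 KB cache.

#include <stddef.h>

/* All of these addresses differ by a multiple of LINE_SIZE * NUM_SETS,
   so cache_set() above returns the same index for every one of them.
   With only NUM_WAYS = 4 entries per set, the 8 lines touched per pass
   keep evicting one another: conflict misses despite a tiny working set.
   'base' must point to at least 8 * STRIDE bytes. */
#define STRIDE  (LINE_SIZE * NUM_SETS)   /* 8 KB */

static long sum_conflicting(const char *base)
{
    long sum = 0;
    for (int pass = 0; pass < 1000; pass++)
        for (int i = 0; i < 8; i++)
            sum += base[(size_t)i * STRIDE];
    return sum;
}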

>He then stated that until a memory boundary was reached, the
>cache miss cost stayed pretty low, then suddenly spiked way up.  

That's the transition from conflict misses, where the working set is 
small enough to fit in the cache, to capacity misses, where the 
short-term working set exceeds the cache's total capacity.

>If a CPU
>running in User Mode is allowed to test the current permissions of a page
>(without triggering a page fault), a garbage collector could simply do its
>best to mark what's currently in memory while building up a list of what to
>fetch later from disk.
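
A portable permission probe like that isn't generally available to user
code, but on many Unix systems mincore(2) gets close: it reports which
pages of a range are resident without faulting them in, which is enough
for a "mark what's in memory now, queue the rest for later" pass.  A
minimal sketch, with the surrounding GC policy left out:

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>

/* Ask the OS whether the page containing 'addr' is resident, without
   touching it and so without triggering a page-in from disk. */
static bool page_is_resident(void *addr)
{
    long page = sysconf(_SC_PAGESIZE);
    void *base = (void *)((uintptr_t)addr & ~(uintptr_t)(page - 1));
    unsigned char vec;

    if (mincore(base, (size_t)page, &vec) != 0)
        return false;          /* unmapped or error: treat as not resident */
    return (vec & 1) != 0;
}

A collector could trace objects on resident pages immediately and push
the rest onto a to-do list, in the spirit of the paragraph above.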

>If the rate of
>increase of processor speeds continues to be significantly higher than the
>rate of increase of memory speed, this idea might soon become applicable at
>the cache/main-memory boundary too.

It already is: consider the 2GHz processors that will be available in the 
not-too-distant future.  An L1 data cache access will take 1 to 2 cycles 
(0.5ns to 1ns), an on-chip L2 access will take around 20 cycles (10ns), 
but main memory access to a non-open DRAM page will still be at least 
60ns.
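
Put in cycle terms at that 2GHz clock:

    60 ns  x  2 cycles/ns  =  120 cycles per miss to a closed DRAM page,
    vs. 1-2 cycles for an L1 hit and ~20 cycles for an on-chip L2 hit,

so a single trip to main memory costs about as much as 60 to 120 L1 hits
or half a dozen L2 hits.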

Future processor designs are attempting to address this in a way similar 
to what you propose, by incorporating the DRAM controller directly on 
chip, and coordinating it with the cache, reordering the cache <-> memory 
traffic.

     -- Tim Olson




