Comments on Smalltalk block closure designs, part 1
Tim Olson
tim at jump.net
Fri Apr 27 13:46:17 UTC 2001
Mark van Gulik wrote:
>Someone in this thread (I think it was Ian Piumarta) mentioned that
>basically there is no such thing as "locality of reference" between cache
>lines. This is due to the fact that this kind of cache uses associative
>lookup (typically with some kind of "wired-or" if MY associative memory
>serves me).
Cache associativity has some effect here, but it's rare to see highly-
or fully-associative caches (the ones I can think of are TLBs on MIPS
processors, and the L1 caches on some ARM designs). Most caches these
days have a small amount of associativity, from 1-way "direct mapped" to
4-way or 8-way. The future trend is towards on-chip cache hierarchies
with small, direct-mapped caches at the top and large, 4 to 8-way caches
at the bottom.
All low-associativity caches have pathological access patterns that
cause conflict misses, which can greatly reduce performance even
when the working set being accessed is smaller than the total cache size.
An interesting idea would be to collect "hot spot" access statistics
while running, for use by the garbage collector to dynamically remap
heavily-used areas to different cache sets to avoid thrashing.
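To make the conflict-miss problem concrete, here is a toy direct-mapped
cache model (a sketch with made-up parameters -- 64-byte lines, 256 sets,
16 KB total -- none of which come from the discussion above). Two small
buffers placed a multiple of the cache size apart map to the same sets
and evict each other on every access, even though together they occupy
only half the cache:

```python
LINE_SIZE = 64
NUM_SETS = 256
CACHE_SIZE = LINE_SIZE * NUM_SETS  # 16 KB

def set_index(addr):
    """Direct-mapped: the set is chosen by address bits above the line offset."""
    return (addr // LINE_SIZE) % NUM_SETS

def count_misses(addresses):
    """Simulate a direct-mapped cache; each set holds exactly one tag."""
    tags = {}
    misses = 0
    for addr in addresses:
        s = set_index(addr)
        tag = addr // (LINE_SIZE * NUM_SETS)
        if tags.get(s) != tag:
            misses += 1
            tags[s] = tag
    return misses

def sweep(a_base, b_base, reps=4, buf_size=4096):
    """Alternately touch every line of two 4 KB buffers, several times over."""
    trace = []
    for _ in range(reps):
        for off in range(0, buf_size, LINE_SIZE):
            trace += [a_base + off, b_base + off]
    return trace

# Buffers exactly CACHE_SIZE apart: every line of B evicts the matching
# line of A, so all 512 accesses miss despite an 8 KB working set.
conflict_misses = count_misses(sweep(0, CACHE_SIZE))

# Same 8 KB working set, but B placed so it maps to different sets:
# only the 128 cold misses on the first pass remain.
no_conflict_misses = count_misses(sweep(0, 4096))

print(conflict_misses, no_conflict_misses)
```

The second placement is exactly the kind of remapping a statistics-driven
garbage collector could perform: moving a hot object so its cache sets no
longer collide with another hot object's.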
>He then stated that until a memory boundary was reached, the
>cache miss cost stayed pretty low, then suddenly spiked way up.
That's the transition from conflict misses, where the working set is
small enough to fit in the cache, to capacity misses, where the
short-term working set exceeds the cache's total capacity.
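That transition can be illustrated with an idealized fully-associative
LRU cache (again a sketch with invented parameters, not anything from the
post): a cyclic sweep over a working set that fits misses only on the
first pass, while a sweep over a slightly larger working set thrashes,
missing on every single access, because LRU always evicts the line that
will be needed soonest:

```python
from collections import OrderedDict

def lru_misses(addresses, capacity_lines, line_size=64):
    """Count misses in a fully-associative cache with LRU replacement."""
    cache = OrderedDict()   # ordered oldest -> newest
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)         # refresh LRU position
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)   # evict least-recently-used
    return misses

CAPACITY = 256  # lines (16 KB at 64 bytes/line)

# 200-line working set, swept cyclically 4 times: fits, so only the
# 200 cold misses of the first pass occur.
small = [i * 64 for i in range(200)] * 4

# 300-line working set: exceeds capacity, so under LRU every one of
# the 1200 accesses misses.
large = [i * 64 for i in range(300)] * 4

print(lru_misses(small, CAPACITY), lru_misses(large, CAPACITY))
```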
>If a CPU
>running in User Mode is allowed to test the current permissions of a page
>(without triggering a page fault), a garbage collector could simply do its
>best to mark what's currently in memory while building up a list of what to
>fetch later from disk.
>If the rate of
>increase of processor speeds continues to be significantly higher than the
>rate of increase of memory speed, this idea might soon become applicable at
>the cache/main-memory boundary too.
It already is: consider the 2GHz processors that will be available in the
not-too-distant future. An L1 data cache access will take 1 to 2 cycles
(0.5ns to 1ns), an on-chip L2 access will take around 20 cycles (10ns),
but main memory access to a non-open DRAM page will still be at least
60ns.
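The arithmetic behind those figures (using the estimates quoted above,
not measurements):

```python
clock_ghz = 2.0                     # hypothetical 2 GHz processor
cycle_ns = 1.0 / clock_ghz          # 0.5 ns per cycle
l1_ns = 2 * cycle_ns                # L1 hit: ~2 cycles = 1 ns
l2_ns = 10.0                        # on-chip L2: ~20 cycles
dram_ns = 60.0                      # non-open DRAM page
dram_cycles = dram_ns / cycle_ns    # cycles stalled on one main-memory miss
ratio_vs_l1 = dram_ns / l1_ns       # main memory vs. L1 hit
print(dram_cycles, ratio_vs_l1)
```

So a single main-memory access costs on the order of 120 cycles, roughly
60 times an L1 hit -- which is why hiding or avoiding those misses
dominates the design.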
Future processor designs are attempting to address this in a way similar
to what you propose, by incorporating the DRAM controller directly on
chip, coordinating it with the cache, and reordering the cache <-> memory
traffic.
-- Tim Olson
More information about the Squeak-dev mailing list