Cache lines and branch prediction.

Scott A Crosby crosby at qwes.math.cmu.edu
Sun Sep 30 20:00:47 UTC 2001


Whoops. Looks like my memory of CPUs is out of date. Oh, for the
days when cache lines were 16 bytes. I feel old already.

It looks like you can now assume them to be 32 bytes (Pentium II) or even
64 bytes (AMD Athlon).

So, given the bigger cache line, which we can expect to be big enough to
hold a full entry, my idea of using two tables is no longer useful; it
would touch more cache lines per lookup, because each entry would be split
between lines, and the remaining portions of those cache lines would most
likely be wasted.
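
To make that concrete, here is a sketch of what I mean by an entry. The
field names are made up; I have not checked the actual VM structure, but
anything in this ballpark fits in one line:

  struct MethodCacheEntry {   /* hypothetical layout, for illustration only */
      void *selector;         /* selector oop, part of the hash key         */
      void *classOop;         /* receiver class oop, the other key part     */
      void *method;           /* compiled method found by the last lookup   */
      long  primitiveIndex;   /* primitive index, if any                    */
  };
  /* 16 bytes on a 32-bit machine: half of a Pentium II line and a quarter
     of an Athlon line, so a line-aligned table never splits an entry
     across two lines.  A second, parallel table would only add line
     fetches. */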

Given this, my 'cache-conscious' design is no longer relevant, and a
variant of the current scheme will work well:

1. Make the cache hold 2k-16k entries.
2. Upon a cache hit on the first slot, use it as-is.
3. Upon a miss on the first probe but a hit on a later probe, replace
the first slot probed with the entry that hit.
4. If all probes miss, look up the correct method. If there is an
empty slot, use it; otherwise replace either the 2nd or 3rd probe slot
(chosen randomly) with the looked-up information.

Advantages of this: It is still pretty similar to the current scheme and
only requires very minimal changes.  Also, a fresh lookup never displaces
whatever is already sitting in the first probe slot; that slot is only
overwritten when a later probe hits (#3).

Reasons:

 #1: Morphic is very heavyweight and uses a lot of selectors, classes, and
methods.

 #3: If an entry is used a second time before being evicted, put it in the
first slot so we notice it quickly.

 #4: We do not ever want to erase an entry.

Note that this scheme still wastes 1/3 of the slots, but it will probably
increase the hit rate of the full cache (because it is a larger cache) and
increase the hit rate of the first probe... until Jitter comes along and
makes this all irrelevant.
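
For concreteness, here is roughly what I have in mind, written out in C.
The table size, the hash, and the fullMethodLookup() call are all
placeholders, not the real VM code; only the probe, promote, and replace
policy matters:

  #include <stdlib.h>                     /* rand(), NULL */

  #define NCACHE 4096                     /* anywhere in the 2k-16k range */

  struct Entry { void *selector, *classOop, *method; };
  static struct Entry cache[NCACHE];

  /* Placeholder for the real (slow) lookup up the class hierarchy. */
  extern void *fullMethodLookup(void *sel, void *cls);

  /* Three probe positions derived from one hash; the hash itself is a
     placeholder. */
  static unsigned probe(void *sel, void *cls, int i)
  {
      unsigned h = (unsigned)((unsigned long)sel ^ (unsigned long)cls);
      return (h + i) % NCACHE;
  }

  void *cachedLookup(void *sel, void *cls)
  {
      unsigned first = probe(sel, cls, 0);
      int i;
      for (i = 0; i < 3; i++) {
          unsigned p = probe(sel, cls, i);
          if (cache[p].selector == sel && cache[p].classOop == cls) {
              if (i > 0)                       /* #3: hit on a later probe, */
                  cache[first] = cache[p];     /* promote it to first slot  */
              return cache[p].method;
          }
      }
      /* #4: all probes missed.  Do the full lookup, then fill an empty
         slot if there is one, otherwise replace the 2nd or 3rd probe
         slot, chosen at random. */
      {
          void *m = fullMethodLookup(sel, cls);
          unsigned victim = probe(sel, cls, 1 + rand() % 2);
          for (i = 0; i < 3; i++) {
              unsigned p = probe(sel, cls, i);
              if (cache[p].selector == NULL) { victim = p; break; }
          }
          cache[victim].selector = sel;
          cache[victim].classOop = cls;
          cache[victim].method   = m;
          return m;
      }
  }

The effect of the promote step is that the first probe slot tends to hold
entries that have been hit at least twice, which is what should pull the
common case down to a single probe.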


--

Now, for branch prediction:

http://developer.intel.com/design/intarch/techinfo/Pentium/operatn.htm#1004786

A correctly predicted branch on the Intel costs one clock cycle, and may
be run in parallel with any other integer instruction, even a
comparison!

A mispredicted branch costs a full pipeline flush (I do not know the cost
of this, but I'd assume at least 6 ticks, maybe over 10 clock cycles) plus
another 3 to 4 clock cycles.

The example, taken from the page above:

   for (k = i + prime; k <= SIZE; k += prime) flags[k] = FALSE;

can run in 2 clocks per iteration (plus any cache-miss overhead): the store
and add pair up in one tick, and the compare and branch in the second tick.

So a branch prediction miss costs what? About 10x that of a hit.


So, you can see why a branch prediction hit rate of 60% at that critical
if-then is so ugly. Increasing it to (say) 95% (which is our hit rate after
3 probes) would theoretically be about a 3x improvement.
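
Working that out with the numbers above (1 cycle for a hit, call a miss 10
cycles):

  60% hits:  0.60 * 1 + 0.40 * 10 = 4.6  cycles per branch
  95% hits:  0.95 * 1 + 0.05 * 10 = 1.45 cycles per branch

which is where the roughly 3x figure comes from.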

And having a method cache that cannot fit in the CPU cache is not much of
a problem either; the CPU cache will figure out what entries are important
and hold them.


Scott





