More Profiling

Scott A Crosby crosby at qwes.math.cmu.edu
Tue Sep 25 06:04:58 UTC 2001


> As you fit the same amount of information into less space, you waste
> less CPU cache. Given that the single line:
>
> 			if (((methodCache[t5 + 1]) == messageSelector) &&
> ((methodCache[t5 + 2]) == lkupClass)) /* in commonLookup */
>
> is consuming 4.5% of the CPU time, I'd suspect that the CPU is taking L1
> cache misses or the branch prediction unit is mispredicting. I cannot
> easily tell which from the information given.

I have identified it through basic-block-level profiling. I am pretty
sure that the cause of this slowdown is the branch prediction failing.
The hit rates for the three probes are approximately:

P(a lookup gets a hash hit on this probe):
  P(1) = .65  (branch probability .65)
  P(2) = .25  (branch probability .65)
  P(3) = .06  (branch probability .60)
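
For concreteness, here is a minimal sketch of the three-probe lookup
being profiled. The entry layout, hash function, and reprobe shifts
are assumptions patterned after the quoted line, not the exact
generated code:

    #define CACHE_ENTRIES 512
    #define ENTRY_SIZE    4
    #define CACHE_MASK    ((CACHE_ENTRIES - 1) * ENTRY_SIZE)

    static int methodCache[CACHE_ENTRIES * ENTRY_SIZE];

    /* Returns the entry's base index on a hit, -1 on a miss.  Offsets
       +1 (selector), +2 (class), +3 (method) match the quoted line;
       slot +0 is left unused in this sketch.  The mask's low two bits
       are zero, so masking always yields an entry-aligned index. */
    int lookupInCache(int messageSelector, int lkupClass)
    {
        unsigned hash = (unsigned)(messageSelector ^ lkupClass);
        int t5, probe;

        for (probe = 0; probe < 3; probe++) {
            t5 = (int)((hash >> probe) & CACHE_MASK);
            /* This is the comparison eating ~4.5% of CPU time: two
               loads plus a branch taken only ~60-65% of the time, so
               the predictor mispredicts it frequently. */
            if (methodCache[t5 + 1] == messageSelector &&
                methodCache[t5 + 2] == lkupClass)
                return t5;
        }
        return -1;   /* fall through to the full method lookup */
    }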

Also, P(we hit a collision where the cached entry has the same
selector but a different class) = .06.

Which is frankly astounding. That means that 6% of the time,
methodCache[t5 + 1] DOES match messageSelector, but
methodCache[t5 + 2] does NOT match lkupClass.

This is simple Morphic programming; note that as the cache has only
512 entries (and effectively only about 2/3 of that, as discussed
below), if drawing the screen requires more than approximately 330
method*class pairs, we'll thrash the cache. This is fairly apparent
from the cache replacement rate (about 4,000 replacements per second
of runtime).

Furthermore, the fact that the branch probabilities are so uniform
indicates that more probes and more caching would likely improve the
hit rate.

Finally, the fact that the first branch is so close to 50/50 means
that the branch prediction unit absolutely flops and will frequently
mispredict the branch: a branch taken only ~65% of the time
mispredicts on the order of a third of the time. This is why I suspect
that line requires so much CPU time. If it were a hit 99% of the time,
branch prediction would be much more reliable.

Even when profiling tinyBenchmarks (now that is something that MUST
fit in the method cache, no questions asked), the miss rate is still
.16, so prediction is still poor. But the time taken by this line has
decreased roughly proportionally.

Also, there is no CPU penalty for making a very large method cache
(except for any potential CPU penalty for clearing it during GC).
This is because if it is, say, 1 MB, the CPU caches will hold the most
important entries.

Except that in the current scheme, if a frequent method lands on the
second probe (due to a collision), that will corrupt the branch
probability statistics. I'm not sure how much of an effect this will
have in practice. (I need to profile the current scheme with a bigger
cache to see.)

If, though, it is major, then something like my buddycache system
would be better: if we find an entry on the 2nd or 3rd probe, we swap
it with the entry at the first probe point, so it is the first one
checked next time (see the sketch below).
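
A minimal sketch of that swap-to-front idea, reusing the layout from
the lookup sketch above. This is my reading of the description, not
the actual buddycache design; in particular, the displaced entry may
land on a slot outside its own probe sequence and effectively be
evicted, which a real design would have to handle:

    static void swapEntries(int a, int b)
    {
        int i, tmp;
        for (i = 1; i <= 3; i++) {          /* selector, class, method */
            tmp = methodCache[a + i];
            methodCache[a + i] = methodCache[b + i];
            methodCache[b + i] = tmp;
        }
    }

    int lookupWithSwapToFront(int messageSelector, int lkupClass)
    {
        unsigned hash = (unsigned)(messageSelector ^ lkupClass);
        int first = (int)(hash & CACHE_MASK);
        int t5, probe;

        for (probe = 0; probe < 3; probe++) {
            t5 = (int)((hash >> probe) & CACHE_MASK);
            if (methodCache[t5 + 1] == messageSelector &&
                methodCache[t5 + 2] == lkupClass) {
                if (t5 != first)
                    swapEntries(t5, first);  /* promote to probe 1 */
                return first;   /* hot entry now hits the first,
                                   best-predicted branch next time */
            }
        }
        return -1;
    }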

But, if just increasing the size helps, I'd suggest leaving it at that.

Though I should also point out that even though the actual cache is
512 entries, the effective cache is about 2/3 of that, judging from
the basic-block profiling statistics of addNewMethodToCache. This is
because we clean out all 3 probe points upon a miss (see the sketch
below).

So we're at about 30% memory utilization (we lose roughly 30% to
wasted space in the records, and another 30% of the entries will be
empty).
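
A sketch of that install path, again reusing the declarations from the
lookup sketch. The name is patterned after the interpreter's
addNewMethodToCache, but the details are assumptions:

    /* On a miss, void all three probe points for this hash before
       writing the new entry at the first one.  The other two stay
       empty until something else claims them, which is why roughly a
       third of the 512 entries sit unused in steady state. */
    void addNewMethodToCache(int messageSelector, int lkupClass,
                             int newMethod)
    {
        unsigned hash = (unsigned)(messageSelector ^ lkupClass);
        int t5, probe;

        for (probe = 0; probe < 3; probe++) {
            t5 = (int)((hash >> probe) & CACHE_MASK);
            methodCache[t5 + 1] = 0;   /* 0 selector == empty entry */
        }
        t5 = (int)(hash & CACHE_MASK);
        methodCache[t5 + 1] = messageSelector;
        methodCache[t5 + 2] = lkupClass;
        methodCache[t5 + 3] = newMethod;
    }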


Does anyone know how to fiddle with the C-generator to get it to
generate C code with 0-based arrays that are aligned on a 16-byte
boundary? If someone can do that, and help me understand how to test
VM stuff, I'll write the new cache.
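
That isn't the generator change itself, just a question about getting
it to emit something like the following (the aligned attribute is
GCC-specific syntax):

    /* a 0-based array forced onto a 16-byte boundary */
    static int newMethodCache[4096 * 4] __attribute__((aligned(16)));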

Note that the design should be superior to what is there right now
(roughly twice the number of useful entries in the same memory space).
But it is intended to be used as a large cache, perhaps 4k entries
(96 KB), so flushing/invalidating it will be somewhat more expensive,
meaning that allocations between GCs should be increased.
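
To make the flush cost concrete, a sketch; six words per entry is an
assumption chosen only to match the 96 KB figure (4096 * 6 * 4 bytes =
96 KB), since the actual buddycache record layout isn't given here:

    #include <string.h>

    #define BIG_ENTRIES    4096
    #define BIG_ENTRY_SIZE 6

    static int bigMethodCache[BIG_ENTRIES * BIG_ENTRY_SIZE];

    void flushBigMethodCache(void)
    {
        /* Every flush writes all 96 KB, versus 8 KB for the current
           512-entry cache, so invalidating on each GC is noticeably
           more expensive -- hence allocating more between GCs. */
        memset(bigMethodCache, 0, sizeof bigMethodCache);
    }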

How does this sound? Acceptable? Or is Jitter expected soon enough
that this isn't important?

Scott




