More Profiling

Dan Ingalls Dan at SqueakLand.org
Tue Sep 25 17:51:52 UTC 2001


Scott A Crosby <crosby at qwes.math.cmu.edu> wrote...

>I have identified it through basic block level profiling. I am pretty sure
>that the cause of this slowdown is the branch prediction failing. The hit
>rates for the 3 probes are approximately:
>
>P(a lookup gets a hash hit on this probe)
> P(1) = .65  (branch probability .65)
> P(2) = .25  (branch probability .65)
> P(3) = .06  (branch probability .60)
>
>Also, P(we have a collision where both entries have the same selector) = .06.
>
>Which is frankly astounding. That means that 6% of the time,
>methodCache[t5 + 1] DOES MATCH  messageSelector, BUT
>  methodCache[t5 + 2] DOES NOT MATCH lkupClass
>
>This is simple Morphic programming. Note that since the cache has only 512
>entries, drawing the screen will thrash the cache if it requires more than
>approximately 330 method*class pairs. This is fairly apparent given the
>cache replacement rate (4 thousand/sec of runtime).
>
>Furthermore, the fact that the branch probabilities are so uniform
>indicates that more probes and caching would likely improve the hit rate.

Scott, I'm really enjoying this inquiry.

Your study is interesting because I never considered the effects of branch prediction in the current design.  The method cache is managed so as to minimize thrashing, which it does by intentionally spreading entries along their reprobe chains.  The idea is that even if two items collide once, it is extremely unlikely that they will collide again if placed elsewhere in their reprobe chains.  So its figure of merit is freedom from contention among frequently accessed entries.
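
For concreteness, here is a rough sketch in C of that kind of three-probe lookup.  The names and the entry layout (lookupInCache, WORDS_PER_ENTRY, and so on) are made up for illustration, and the hash-and-shift scheme only approximates what the generated interpreter code really does:

    #include <stddef.h>

    typedef long sqInt;                 /* stand-in for the VM's word type */

    #define CACHE_ENTRIES   512
    #define CACHE_MASK      (CACHE_ENTRIES - 1)
    #define WORDS_PER_ENTRY 4           /* hypothetical: selector, class, method, primitive */

    static sqInt methodCache[CACHE_ENTRIES * WORDS_PER_ENTRY];

    /* Probe up to three slots.  The hash is shifted between probes, so two
       pairs that collide at the first probe usually land in different slots
       at the second, which is what keeps hot entries from evicting each other. */
    static sqInt *lookupInCache(sqInt selector, sqInt class)
    {
        sqInt hash = selector ^ class;
        int probe;
        for (probe = 0; probe < 3; probe++) {
            sqInt *entry = &methodCache[(hash & CACHE_MASK) * WORDS_PER_ENTRY];
            if (entry[0] == selector && entry[1] == class)
                return entry;           /* hit: entry[2..3] hold method and primitive */
            hash >>= 1;                 /* reprobe at a different slot */
        }
        return NULL;                    /* miss: fall back to the full lookup */
    }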

Another thing to note is that 4k cache reloads per second may not be a huge value.  My machine does 1.2M sends per second, so 4k reloads is a miss rate of roughly 0.3%, which would be excellent performance, right?

>...
>But, if just increasing the size helps, I'd suggest leaving it at that.

I think this probably makes sense, especially for VMs that are not minimized for running on PDAs and the like.  Before doing it, though, it would be worth getting actual stats, because I think we did a fair amount of measurement when we left the value that low.  The new macroBenchmarks should help in this regard.

Given your points about branch prediction, it is probably also worth revisiting the issue of whether to go with a more conventional approach with a larger cache.  Andreas did some experiments with this a couple of years ago, and it was certainly competitive.

>Though, also, I should point out that even though the actual cache is 512
>entries, the effective cache is 2/3 of that, judging from the basic block
>profiling statistics of addNewMethodToCache. This is because we clean out
>all 3 probe points upon a miss.
>
>So we're now at about 30% memory utilization (we lose about 30% to wasted
>space in the records, and another 30% of the entries will be empty).

... so this may mean that the simpler scheme, even if it has worse collision stats, might win overall.
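
To make "simpler scheme" concrete: I'm picturing something like a direct-mapped, single-probe cache.  A sketch only, reusing the hypothetical names and layout from the earlier snippet:

    /* Direct-mapped variant: one probe, no reprobe chain.  Collisions simply
       overwrite each other, so the hit rate is worse, but the code path is
       short and its single branch is easy for the hardware to predict. */
    static sqInt *lookupDirectMapped(sqInt selector, sqInt class)
    {
        sqInt hash = selector ^ class;
        sqInt *entry = &methodCache[(hash & CACHE_MASK) * WORDS_PER_ENTRY];
        if (entry[0] == selector && entry[1] == class)
            return entry;               /* hit */
        return NULL;                    /* miss */
    }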

>Does anyone know how to fiddle with the C-generator to get it to generate
>C code with 0-based arrays that are aligned on a 16-byte boundary? If
>someone can do that, and help me understand how to test VM stuff, I'll
>write the new cache.

I'm hoping this means you are going to try a simple cache with good alignment.  I think it's worth a try.  If you do comparisons, it would be nice to align the current cache as well.
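
I can't speak for the C-generator side, but for the alignment itself one portable trick is to over-allocate and round the pointer up.  A standalone sketch (allocateMethodCache and methodCacheRaw are invented names, and this would replace the fixed array in the earlier snippet):

    #include <stdint.h>
    #include <stdlib.h>

    typedef long sqInt;

    #define WORDS_PER_ENTRY 4           /* hypothetical, as in the earlier sketch */

    static sqInt *methodCache;          /* 0-based, 16-byte-aligned view */
    static void  *methodCacheRaw;       /* what malloc returned, kept for free() */

    /* Allocate with 15 bytes of slack, then round the pointer up to the
       next 16-byte boundary so each 4-word entry fills one aligned block. */
    static int allocateMethodCache(size_t entries)
    {
        size_t bytes = entries * WORDS_PER_ENTRY * sizeof(sqInt);
        methodCacheRaw = malloc(bytes + 15);
        if (methodCacheRaw == NULL)
            return 0;
        methodCache = (sqInt *)(((uintptr_t)methodCacheRaw + 15) & ~(uintptr_t)15);
        return 1;
    }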

>Note that the design should be superior to what is there right now
>(roughly twice the number of useful entries in the same memory space).
>But it is intended to be used as a large cache, perhaps 4k entries
>(96KB), so flushing/invalidating it will be somewhat more expensive,
>meaning that allocations between GCs should be increased.

One nice thing about the current cache is that it can be (and is) selectively flushed, so we don't lose that much context.  If you do try a new scheme with 4k entries, you certainly need to compare it with an equally boosted version of the current one.  I would love to see the macro benchmarks run for the current cache versus whatever scheme you have in mind, for cache sizes of 512, 1k, 2k and 4k.
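
By selective flushing I mean invalidating just the entries a change could have made stale, rather than wiping the whole cache.  A sketch of the selector case, again with the hypothetical layout from above (the real interpreter's flush code is organized differently), and note the cost is linear in the cache size, which is part of why a 4k-entry cache makes flushing more expensive:

    /* When a method is redefined, only entries that cached that selector can
       be stale, so clear just those and leave the rest of the cache warm. */
    static void flushCacheForSelector(sqInt selector)
    {
        int i;
        for (i = 0; i < CACHE_ENTRIES; i++) {
            sqInt *entry = &methodCache[i * WORDS_PER_ENTRY];
            if (entry[0] == selector)
                entry[0] = 0;           /* mark the slot empty */
        }
    }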

>How does this sound? Acceptable? Or is Jitter expected soon enough that
>this isn't important?

Jitter already exists, and it makes these stats look silly.  However, it's probably fair to say that the interpreter will be around for a long time, and if its cache performance can be improved without too much trouble (especially if it ends up simpler), then it's probably worth doing so.

Keep up the good work!

	- Dan