On the effect of branch mispredictions in the Squeak VM

Andreas Raab andreas.raab at gmx.de
Sat Jul 5 18:03:18 UTC 2003


Hi all (real and wannabe ;-) VM hackers,

David Reed just pointed me to a paper by Ertl and Gregg called "Optimizing
Indirect Branch Prediction Accuracy in Virtual Machine Interpreters" (see
http://www.complang.tuwien.ac.at/papers/) which made me wonder how much time
Squeak spends in branch misprediction.

I will assume that you are familiar with the basics of "switch-based"
dispatches and "threaded" dispatches (and hopefully, both direct and
indirect threading) for dynamic interpreters (if not, read the above paper).
As it happens, I believe that most Squeak VMs today are compiled with
indirect threading (magically applied through the gnuify script - this was
the primary reason for me to switch to GCC on Windows).

The change to threaded dispatch bought us about a factor of two for bytecode
execution, yet I was somewhat sceptical about the claim of a speedup of
another factor of two from eliminating (even more) branch mispredictions. So
I set up a little benchmark (attached) that runs the same bytecodes with
some reordering (applied to maximize/minimize the effect of the dispatch)
and ran it. Here are the results:

iterations    well-predicted (msecs)    mispredicted (msecs)
10                         0                       0
100                        0                       0
1000                       0                       1
10000                      3                       7
100000                    24                      68
1000000                  247                     673
10000000                2463                    6722

The interesting result is that we spend about 2.7 times as long with branch
mispredictions as without them. I'd be delighted if someone could review the
benchmark itself to ensure that I haven't done anything wrong and that the
results themselves are valid.

How does this compare to the overall bytecode speed we measure through,
e.g., something like #tinyBenchmarks? The benchmark contains 45 bytecodes in
the loop (hand-counted; again, someone please double-check and correct me),
which means we execute 450,000,000 bytecodes in 2.463 secs (well predicted)
vs. 6.722 secs (mispredicted), resulting in:

	(450000000 / 2.463) truncated asStringWithCommas
 		=> 182,704,019 bytecodes/sec
	(450000000 / 6.722) truncated asStringWithCommas
 		=> 66,944,361 bytecodes/sec

Comparing this to the measure obtained by #tinyBenchmarks on the same
machine

 	117,323,556 bytecodes/sec
	3,377,222 sends/sec

makes it look as if we may spend a significant amount of time in those
mispredictions.

The above brought back a thought I have had for a long time, namely how
effective it would be to bring Ian's dynamic interpreter (from the Squeak
2.x days) back to life and try to combine a simple direct-threaded
interpreter with eliminating branch mispredictions (for most of the "small"
bytecodes, which can easily be duplicated) in Squeak itself.

Ian's dynamic interpreter had a couple of really interesting things in it
which could be brought back to life too (depending on how much work one
wants to spend), but just the effect of eliminating branch mispredictions
should be interesting. So I was wondering if there's someone out there who
might be interested (and knowledgeable) in the field and willing to write a
simple direct-threaded interpreter that uses bytecode replication, to see
where this gets us.

Anyone interested?

Cheers,
  - Andreas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BranchPrediction-ar.2.cs
Type: application/octet-stream
Size: 1405 bytes
Desc: not available
Url : http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20030705/30b98bf1/BranchPrediction-ar.2.obj

