Optimizing Squeak

Greg & Cindy Gritton gritton at ibm.net
Wed Feb 24 03:08:42 UTC 1999

At 08:42 AM 2/23/99 -0800, you wrote:
>>The best alternative I came up with was what I'll call a message send
>>bytecode set.
>Very interesting!
>>The issue is the processor pipleine
>>uses the address of the branch instruction (usually an unconditional jump
>>to a table indexed address) to "predict" what address a branch will go to.
>Oh, I didn't know that. Interesting...  It does this even for a "branch to
address given in a register" instruction (which, I guess, is what it would
be on a RISC processor)?
>I had been under the mistaken impression that "branch prediction" meant
"predicting whether or not a conditional branch will be taken" not
"predicting which is the next instruction to execute after ANY branch so we
can keep the pipeline filled".
>Thanks for explaining that, Jan.
>Then I read Tim Olson's further details:
>>There are at least 2 kinds of Branch Target Caches:
>>Branch Target Address Cache [BTAC] (P6, PowerPC 604e)
>>     this cache is indexed by the address of the branch instruction, and
>>     returns the address of the predicted target
>>Branch Target Instruction Cache [BTIC] (AMD 29K, PowerPC 750)
>>   this cache is indexed by the address of the branch target, and
>>    returns the first few instructions at that target
>So (let me summarize my understanding here), on current Intel chips (P6)
(which uses BTAC) this branch misprediction (for a byte-code dispatch loop
that used the bytecode to index into a 256 element table of addresses)
WOULD have the bad branch misprediction behavior Jan takes about.
>But current PPC's (which use BTIC) wouldn't have this problem in this kind
of byte-code dispatch loop. In fact since the first few instructions to
handle each bytecode would be cached, they should have really good
pipelining behavior for this kind of byte-code dispatch loop.
>FYI: The BrouHaHa interpreter which Eliot Miranda wrote used (if I
remember correctly) what the referenced document
(http://www.complang.tuwien.ac.at/forth/threaded-code.html) would have us
call "direct token threaded code".  On current PPC chips this kind of
byte-code interpreter should also perform well because of the PPC's BTIC.
>Thanks for the additional details Tim!
>P.S. Gee, following this thread is like taking a course in computer

Not quite.  A BTAC (as used in the P6) can be used to predict indirect
if they are consistent.  The bytecode dispatch branch is so inconsistent
that it would almost always be mispredicted.

The BTIC used in the PPC 750 has a different purpose.  By caching the
branch target instructions, these instructions can be decoded on the
cycle following the decode of a branch instruction rather than two
cycles afterwards.  However, it doesn't help predict addresses of
indirect branches, so you always get a branch misprediction on any
indirect branch.  However, the PPC pipeline is so short that branch
mispredictions are much less expensive than on the P6. 

A processor may have a Branch History Table (BHT) in addition to 
a Branch Target Address Cache (BTAC) or Branch Target Instruction Cache
The PPC 750 has both a BHT and BTIC, while the PPC 604e has a BHT
and a BTAC.  The BHT generally stores only a few bits of information
about a branch, so it can cover more branches than a BTAC or BTIC.
It can only predict the direction of conditional branches, not
their address or destination instructions.  It usually takes an
extra cycle to fetch the branch target with a BHT hit than
a BTAC or BTIC hit, but it is still faster than a full mispredict.

I believe BHTs came before BTACs or BTICs, and are probably more
well known.   This  might explain why you, and Dreisen et al
in their paper, assumed that branch prediction can only prediction
the direction of conditional branches.

Greg Gritton


More information about the Squeak-dev mailing list