Now, for branch prediction:
http://developer.intel.com/design/intarch/techinfo/Pentium/operatn.htm#10047...
A branch prediction hit on the Intel costs one clock cycle, and the branch may be run in parallel with any other integer instruction, even a comparison!
A miss costs a full pipeline flush (I do not know the cost of this, but I'd assume at least 6 ticks, maybe over 10 clock cycles) plus another 3 to 4 clock cycles.
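To get a feel for what those numbers mean in practice, here is a rough back-of-the-envelope sketch of the average cost per branch as a function of the misprediction rate. The one-cycle hit cost is the figure quoted above; the 10-cycle miss cost is only an assumption taken from the guess above, not a measured or documented value, and the function name is made up for illustration.

```python
def expected_branch_cycles(hit_cycles, miss_cycles, miss_rate):
    """Average cycles per branch, weighting hit and miss costs by the miss rate."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return (1.0 - miss_rate) * hit_cycles + miss_rate * miss_cycles


HIT_CYCLES = 1    # from the post: a correctly predicted branch costs one clock
MISS_CYCLES = 10  # ASSUMPTION: the post only guesses "at least 6, maybe over 10"

if __name__ == "__main__":
    # Print the average cost for a few hypothetical misprediction rates.
    for rate in (0.0, 0.05, 0.25, 0.5):
        avg = expected_branch_cycles(HIT_CYCLES, MISS_CYCLES, rate)
        print(f"miss rate {rate:.0%}: {avg:.2f} cycles per branch")
```

Plug in the real penalty for your processor once you know it; the point is only that the average grows linearly with the miss rate, so even a modest miss rate matters when the penalty is large.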
That page on the Intel web site is for the embedded Pentium processor. I believe this is a totally different architecture from the P4 (and maybe even the P2 and P3).
The latest P4 optimization info is at:
http://developer.intel.com/design/pentium4/manuals/248966.htm
- Jan