A few low-level Pentium II performance measurements

Wed Feb 24 22:40:00 UTC 1999

Jan Bottorff <janb at pmatrix.com> wrote:
> The performance of the test machine to '0 tinyBenchmarks' was '12195121
> bytecodes/sec; 671316 sends/sec'. As it's doing 1.1 clocks per machine
> instruction, this implies an average of 266,000,000 / 1.1 / 12195121 =
> 19.82 machine instructions per bytecode, which seems a bit high to me.
> Looking at the generated machine code (thanks to VTune disassembly) I count
> around 12 instructions for a really simple bytecode+dispatch, so maybe the
> data is correct. A hand crafted assembly interpreter could lose about half
> those instructions.

Jan,

Very interesting!

You might want to look at the two 'tiny' benchmarks separately.
The "10 benchmark" one is basically a tight loop that does a lot of
array accessing but has no full message sends in the loop. The
"26 benchFib" one is very send-heavy, with few bytecodes between
sends. A message send/return costs a few hundred machine instructions,
while the average bytecode, with dispatch, costs, as you discovered,
about 12. I think your figure of "19.82 machine instructions per
bytecode" is a weighted average of these two figures.
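
The arithmetic behind that reading can be sketched in C. This is only
an illustration: the function names are made up for this note, and the
send cost of 300 instructions used in the usage remark below is an
assumed stand-in for "a few hundred", not a measured value.

```c
#include <assert.h>

/* Average machine instructions per bytecode implied by a benchmark:
   clock rate / clocks per instruction / bytecodes per second. */
double instructions_per_bytecode(double clock_hz,
                                 double clocks_per_instr,
                                 double bytecodes_per_sec)
{
    assert(clocks_per_instr > 0.0 && bytecodes_per_sec > 0.0);
    return clock_hz / clocks_per_instr / bytecodes_per_sec;
}

/* If plain bytecodes cost plain_cost instructions and full sends cost
   send_cost, the observed average is
       average = (1 - f) * plain_cost + f * send_cost
   Solving for f gives the fraction of bytecodes that are full sends. */
double implied_send_fraction(double average,
                             double plain_cost,
                             double send_cost)
{
    assert(send_cost > plain_cost);
    return (average - plain_cost) / (send_cost - plain_cost);
}
```

With Jan's numbers, instructions_per_bytecode(266e6, 1.1, 12195121.0)
comes out at about 19.8. Feeding that into implied_send_fraction with
12 instructions per plain bytecode and the assumed 300 per send gives
a send fraction of a bit under 3%, so under these assumptions only a
small share of sends is needed to pull the average up from 12.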

> I'd be interested in seeing the G3 generated assembly fragments for
> bytecodes like "push constant 0" and the bytecode dispatching loop. It
> would be interesting to decide if the G3 compiler/instruction set generates
> better code or if the G3 processor is just much faster at executing similar
> code.

Here you are, with annotations (which you probably don't need, but
may interest other folks)!!

Hunk:	Kind=HUNK_GLOBAL_CODE    Align=4  Class=PR  Name=".interpret"(778)  Size=13268
00000000: 7C0802A6  mflr     r0
00000004: BDA1FFB4  stmw     r13,-76(SP)
00000008: 90010008  stw      r0,8(SP)
0000000C: 9421FF70  stwu     SP,-144(SP)
00000010: 83420000  lwz      r26,instructionPointer(RTOC)
00000014: 83020000  lwz      r24,theHomeContext(RTOC)
00000018: 83820000  lwz      r28,stackPointer(RTOC)
0000001C: 827A0000  lwz      r19,0(r26)
00000020: 82C20000  lwz      r22,messageSelector(RTOC)
00000024: 82E20000  lwz      r23,successFlag(RTOC)
00000028: 81A20000  lwz      r13,receiver(RTOC)
0000002C: 83220000  lwz      r25,method(RTOC)
00000030: 83620000  lwz      r27,argumentCount(RTOC)
00000034: 83A20000  lwz      r29,specialObjectsOop(RTOC)
00000038: 829C0000  lwz      r20,0(r28)
0000003C: 82B80000  lwz      r21,0(r24)
;; The next instruction fetches the bytecode into r18 and increments the PC (r19).
00000040: 8E530001  lbzu     r18,1(r19)
;; Here's the start of the dispatch. Note the check for the bytecode being > 255
;; (which is impossible, but the C compiler doesn't realize it). We patch that out
;; on the PPC dispatch, saving 2 instructions, which might explain part of the
;; PPC VM's speed advantage relative to the Pentium VM.
00000044: 281200FF  cmplwi   r18,$00FF
00000048: 4181FFFC  bgt      *-4                     ; $00000044
0000004C: 80620000  lwz      r3,@5762(RTOC)			;; load jump table base address
00000050: 5640103A  slwi     r0,r18,2					;; shift bytecode left by 2 bits to form word offset and...
00000054: 7C63002E  lwzx     r3,r3,r0					;; ... add it to the jump table base address, loading the handler address found there
00000058: 7C6903A6  mtctr    r3						;; load the jump address into the count register (CTR)
0000005C: 4E800420  bctr								;; perform the actual indirect jump through CTR
;; Here's the first bytecode, pushReceiverVariableBytecode (inst var 1)
;; The lbzu fetches the next bytecode into r18 and increments the PC (r19).
;; Apart from that and the load of the receiver, the real work of this
;; bytecode (load the inst var, push it) is done in just two instructions.
00000060: 806D0000  lwz      r3,0(r13)
00000064: 8E530001  lbzu     r18,1(r19)
00000068: 80030004  lwz      r0,4(r3)
0000006C: 94140004  stwu     r0,4(r20)
00000070: 4BFFFFD4  b        *-44                    ; $00000044
;; Here's the second bytecode, pushReceiverVariableBytecode (inst var 2)
00000074: 806D0000  lwz      r3,0(r13)
00000078: 8E530001  lbzu     r18,1(r19)
0000007C: 80030008  lwz      r0,8(r3)
00000080: 94140004  stwu     r0,4(r20)
00000084: 4BFFFFC0  b        *-64                    ; $00000044
....
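
For readers more at home in C than in PowerPC assembly, here is a
minimal sketch of the kind of switch-based dispatch loop that produces
code shaped like the listing above. It is not the Squeak interpreter
source: the opcode names, the run function and the tiny stack are all
hypothetical, chosen only to show the fetch, the jump-table dispatch
and the pre-increment push.

```c
#include <assert.h>

/* Hypothetical opcodes for illustration only. */
enum { OP_PUSH_ZERO = 0, OP_PUSH_ONE = 1, OP_ADD = 2, OP_HALT = 3 };

/* Runs a tiny bytecode program and returns the value on top of the
   stack when OP_HALT is reached, or -1 on an unknown bytecode. */
int run(const unsigned char *code)
{
    int stack[16] = {0};
    int *sp = stack;                 /* points at the current top of stack */
    const unsigned char *ip = code;  /* the "PC" */

    for (;;) {
        /* Fetch the bytecode and advance the PC. Because the value is
           held in an int, the compiler generally cannot assume it is
           at most 255 and emits a range check before indexing the
           jump table. */
        int bytecode = *ip++;

        switch (bytecode) {          /* typically compiled to a jump table */
        case OP_PUSH_ZERO:
            assert(sp < stack + 15);
            *++sp = 0;               /* pre-increment push, like stwu */
            break;
        case OP_PUSH_ONE:
            assert(sp < stack + 15);
            *++sp = 1;
            break;
        case OP_ADD:
            assert(sp - stack >= 2); /* need two operands */
            sp[-1] += sp[0];
            sp--;
            break;
        case OP_HALT:
            return *sp;
        default:
            return -1;
        }
    }
}
```

Running the four-byte program OP_PUSH_ONE, OP_PUSH_ONE, OP_ADD, OP_HALT
through run returns 2. In the listing each bytecode fetches its
successor itself (the lbzu in the middle of each case) and branches
straight back to the dispatch; this sketch fetches at the top of the
loop instead, which is simpler to read but gives the processor less
overlap between the fetch and the bytecode's own work.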

	-- John