A few low-level Pentium II performance measurements
johnm at wdi.disney.com
Wed Feb 24 22:40:00 UTC 1999
Jan Bottorff <janb at pmatrix.com> wrote:
> The performance of the test machine to '0 tinyBenchmarks' was '12195121
> bytecodes/sec; 671316 sends/sec'. As it's doing 1.1 clocks per machine
> instruction, this implies an average of 266,000,000 / 1.1 / 12195121 =
> 19.82 machine instructions per bytecode, which seems a bit high to me.
> Looking at the generated machine code (thanks to VTune disassembly) I count
> around 12 instructions for a really simple bytecode+dispatch, so maybe the
> data is correct. A hand crafted assembly interpreter could lose about half
> those instructions.
Jan,
Very interesting!
You might want to look at the two 'tiny' benchmarks separately.
The "10 benchmark" one is basically a tight loop that does a lot of
array accessing but has no full message sends in the loop. The
"26 benchFib" one is very send-heavy, with few bytecodes between
sends. Message send/return is a few hundred machine instructions.
The average bytecode, with dispatch, is, as you discovered, about
12 instructions. I think your figure of "19.82 machine instructions
per bytecode" is a weighted average of these two figures.
> I'd be interested in seeing the G3 generated assembly fragments for
> bytecodes like "push constant 0" and the bytecode dispatching loop. It
> would be interesting to decide if the G3 compiler/instruction set generates
> better code or if the G3 processor is just much faster at executing similar
> code.
Here you are, with annotations (which you probably don't need, but
may interest other folks)!!
Hunk: Kind=HUNK_GLOBAL_CODE Align=4 Class=PR Name=".interpret"(778) Size=13268
00000000: 7C0802A6 mflr r0
00000004: BDA1FFB4 stmw r13,-76(SP)
00000008: 90010008 stw r0,8(SP)
0000000C: 9421FF70 stwu SP,-144(SP)
00000010: 83420000 lwz r26,instructionPointer(RTOC)
00000014: 83020000 lwz r24,theHomeContext(RTOC)
00000018: 83820000 lwz r28,stackPointer(RTOC)
0000001C: 827A0000 lwz r19,0(r26)
00000020: 82C20000 lwz r22,messageSelector(RTOC)
00000024: 82E20000 lwz r23,successFlag(RTOC)
00000028: 81A20000 lwz r13,receiver(RTOC)
0000002C: 83220000 lwz r25,method(RTOC)
00000030: 83620000 lwz r27,argumentCount(RTOC)
00000034: 83A20000 lwz r29,specialObjectsOop(RTOC)
00000038: 829C0000 lwz r20,0(r28)
0000003C: 82B80000 lwz r21,0(r24)
;; The next instruction fetches the bytecode into r18 and increments the PC (r19).
00000040: 8E530001 lbzu r18,1(r19)
;; Here's the start of the dispatch. Note the check for the bytecode being > 255
;; (which is impossible, but the C compiler doesn't realize it). We patch that out
;; on the PPC dispatch, saving 2 instructions, which might explain part of the
;; PPC VM's speed advantage relative to the Pentium VM.
00000044: 281200FF cmplwi r18,$00FF
00000048: 4181FFFC bgt *-4 ; $00000044
0000004C: 80620000 lwz r3,@5762(RTOC) ;; load jump table base address
00000050: 5640103A slwi r0,r18,2 ;; shift bytecode left by 2 bits to form word offset and...
00000054: 7C63002E lwzx r3,r3,r0 ;; ... add it to the jump table base address
00000058: 7C6903A6 mtctr r3 ;; load the jump address into the jump address register (CTR)
0000005C: 4E800420 bctr ;; perform the actual indirect jump
;; Here's the first bytecode, pushReceiverVariableBytecode (inst var 1)
;; The second instruction (lbzu) fetches the next bytecode into r18 and increments
;; the PC (r19), so the real work of this bytecode is done in just three instructions
;; (load the receiver, load the inst var, push it).
00000060: 806D0000 lwz r3,0(r13)
00000064: 8E530001 lbzu r18,1(r19)
00000068: 80030004 lwz r0,4(r3)
0000006C: 94140004 stwu r0,4(r20)
00000070: 4BFFFFD4 b *-44 ; $00000044
;; Here's the second bytecode, pushReceiverVariableBytecode (inst var 2)
00000074: 806D0000 lwz r3,0(r13)
00000078: 8E530001 lbzu r18,1(r19)
0000007C: 80030008 lwz r0,8(r3)
00000080: 94140004 stwu r0,4(r20)
00000084: 4BFFFFC0 b *-64 ; $00000044
....
-- John