A few low-level Pentium II performance measurements

Thu Feb 18 02:26:25 UTC 1999

I took a quick look at the processor performance counters with Squeak
2.3 running on a Pentium-II-266. Note that no great care was taken to
produce exactly reproducible results; I just thought people might be
interested in ballpark numbers. Running the highly official benchmark (not):

100 timesRepeat:[0 tinyBenchmarks]

Average processor clock cycles/instruction retired = 1.10 (0.98)
Average instructions retired/branch instructions retired = 5.33 (8.59)
Average branch target buffer misses / branch instructions retired = 0.43 (0.20)
Average data memory references / instructions retired = 0.57 (0.50)
Average instructions retired / instructions decoded = 0.61 (0.87)

In words, when running Smalltalk code the Pentium II retires a bit under
1 useful instruction per clock cycle, and about one in every 5
instructions executed is a branch. Nearly half of these branches miss in
the branch target buffer, incurring a performance penalty. Only about 60%
of the instructions decoded give results that are kept.
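
As a rough cross-check, the figures in this description can be re-derived
from the ratios listed above. The Python sketch below only does that
arithmetic; the variable and function names are made up for illustration
and nothing here is a new measurement.

```python
# Measured ratios from the post, for Squeak and for the C video code.
cycles_per_instr = {"squeak": 1.10, "c": 0.98}
instrs_per_branch = {"squeak": 5.33, "c": 8.59}
btb_miss_per_branch = {"squeak": 0.43, "c": 0.20}
retired_per_decoded = {"squeak": 0.61, "c": 0.87}

def summarize(name):
    """Derive the descriptive figures from the raw ratios for one workload."""
    ipc = 1.0 / cycles_per_instr[name]
    # Retired instructions between consecutive BTB misses.
    instrs_per_btb_miss = instrs_per_branch[name] / btb_miss_per_branch[name]
    # Fraction of decoded instructions that never retire.
    discarded = 1.0 - retired_per_decoded[name]
    return ipc, instrs_per_btb_miss, discarded

for name in ("squeak", "c"):
    ipc, per_miss, discarded = summarize(name)
    print(f"{name}: {ipc:.2f} instr/clock, "
          f"one BTB miss per {per_miss:.1f} instr, "
          f"{discarded:.0%} of decoded instr discarded")
```

For Squeak this works out to roughly 0.91 instructions per clock and a
BTB miss about every 12 retired instructions, against about every 43 for
the C code.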

I didn't offhand see any way to measure L1 cache hit ratio or TLB miss ratio.

The numbers in parentheses are the measurements taken while running some
highly optimized, processor-intensive C code, specifically Motion-JPEG
video decompression.

Some interesting observations seem to be:

1) The number of instructions retired per clock isn't that much worse for
the Squeak interpreter code.

2) The branch target buffer logic works much better for C code.

3) Both Squeak and C code access memory a lot.

Because the C code did not seem to have a much higher instructions/sec
rate, yet threw away far fewer instructions (decoded to retired ratio), I
believe something else must have been slowing the C code down
significantly. Some measurements suggested the C code was in a processor
"stalled on resources" state for 20% of the total clocks, while Squeak was
stalled for only 5% of the clocks. This may be an indirect indication of
cache miss activity, especially since the C example code was pretty memory
intensive. So this isn't exactly a great apples-to-apples comparison to C
code. Still, even if the C code were stalled 0% of the time, its retire
rate would not be that much faster.
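
To put a rough number on that last point, here is a small Python sketch
that removes the stalled clocks from each measurement. It rests on a
simplifying assumption that is not from the measurements themselves: no
instruction retires during a stalled clock, and removing the stalls
changes nothing else. Treat the result as an optimistic bound, not a
prediction.

```python
# Measured clocks per retired instruction and the fraction of clocks spent
# "stalled on resources", as reported in the post.
measured = {
    "squeak": {"cpi": 1.10, "stalled": 0.05},
    "c":      {"cpi": 0.98, "stalled": 0.20},
}

def unstalled_ipc(cpi, stalled):
    """Instructions per clock if every stalled clock were simply removed.

    Assumption (not from the post): no instruction retires during a
    stalled clock, and removing stalls has no other effect.
    """
    return 1.0 / (cpi * (1.0 - stalled))

for name, m in measured.items():
    print(f"{name}: measured {1.0 / m['cpi']:.2f} instr/clock, "
          f"stall-free bound {unstalled_ipc(m['cpi'], m['stalled']):.2f}")
```

Under that assumption the C code tops out near 1.28 instructions per
clock and Squeak near 0.96, so the gap widens but stays well under a
factor of two.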

I also ran a dozen different programs and could not find a single one that
achieved as high an instruction retire rate as the Squeak and video
decompression tests. The other programs included 3-D rendering (FP
intensive?) and PostScript to PDF conversion (Acrobat Distiller). This was
very puzzling.

The performance of the test machine on '0 tinyBenchmarks' was '12195121
bytecodes/sec; 671316 sends/sec'. As it's doing 1.1 clocks per machine
instruction, this implies an average of 266,000,000 / 1.1 / 12195121 =
19.83 machine instructions per bytecode, which seems a bit high to me.
Looking at the generated machine code (thanks to VTune disassembly) I count
around 12 instructions for a really simple bytecode+dispatch, so maybe the
data is correct. A hand-crafted assembly interpreter could lose about half
those instructions.
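
The instructions-per-bytecode estimate is easy to reproduce. The Python
sketch below redoes the division using only the numbers quoted above; the
constant names are illustrative. It assumes the 1.1 clocks/instruction
average applies uniformly to the bytecode-rate part of the benchmark,
which the counters as sampled here cannot confirm.

```python
CLOCK_HZ = 266_000_000          # Pentium II clock rate
CYCLES_PER_INSTR = 1.10         # measured clocks per retired instruction
BYTECODES_PER_SEC = 12_195_121  # reported by '0 tinyBenchmarks'

# Retired machine instructions per second at the measured CPI.
instrs_per_sec = CLOCK_HZ / CYCLES_PER_INSTR

# Average machine instructions and clocks spent per Smalltalk bytecode.
instrs_per_bytecode = instrs_per_sec / BYTECODES_PER_SEC
clocks_per_bytecode = CLOCK_HZ / BYTECODES_PER_SEC

print(f"{instrs_per_bytecode:.2f} instructions per bytecode")
print(f"{clocks_per_bytecode:.2f} clocks per bytecode")
```

That is an average over all bytecodes executed, so the heavier bytecodes
would have to pull it well above the roughly 12 instructions counted for
the simplest case.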

According to the comments on Integer tinyBenchmarks, a 292 MHz G3 Mac
gets 22727272 bytecodes/sec; 984169 sends/sec. This is quite a lot faster
than the Pentium II, even adjusting for clock speed.
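
A quick way to make the clock-speed adjustment explicit is to convert
both results to clocks per bytecode and clocks per send. The Python
sketch below does this with the figures quoted above; the names are
illustrative, and it assumes both machines ran the same benchmark code.

```python
# (clock rate in Hz, bytecodes/sec, sends/sec) as quoted in the post.
machines = {
    "Pentium II 266": (266_000_000, 12_195_121, 671_316),
    "G3 292":         (292_000_000, 22_727_272, 984_169),
}

def clocks_per_event(clock_hz, events_per_sec):
    """Average processor clocks spent per bytecode or per send."""
    return clock_hz / events_per_sec

results = {}
for name, (hz, bytecodes, sends) in machines.items():
    results[name] = (clocks_per_event(hz, bytecodes),
                     clocks_per_event(hz, sends))
    print(f"{name}: {results[name][0]:.1f} clocks/bytecode, "
          f"{results[name][1]:.1f} clocks/send")

# Per-clock advantage of the G3 (values above 1 favor the G3).
bytecode_ratio = results["Pentium II 266"][0] / results["G3 292"][0]
send_ratio = results["Pentium II 266"][1] / results["G3 292"][1]
print(f"G3 per-clock advantage: {bytecode_ratio:.2f}x bytecodes, "
      f"{send_ratio:.2f}x sends")
```

On these numbers the G3 needs about 12.8 clocks per bytecode against
about 21.8 on the Pentium II, roughly a 1.7x per-clock advantage on
bytecodes and about 1.3x on sends.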

I'd be interested in seeing the G3 generated assembly fragments for
bytecodes like "push constant 0" and the bytecode dispatching loop. It
would be interesting to determine whether the G3 compiler/instruction set
generates better code or whether the G3 processor is just much faster at
executing similar code.

Hope this has been entertaining.

- Jan

___________________________________________________________________
            Paradigm Matrix Inc., San Ramon California
   "video products and development services for Win32 platforms"
Internet: Jan Bottorff janb at pmatrix.com
          WWW          http://www.pmatrix.com
Phone: voice  (925) 803-9318
       fax    (925) 803-9397
PGP: public key  <http://www-swiss.ai.mit.edu/~bal/pks-toplev.html>
     fingerprint  52 CB FF 60 91 25 F9 44  6F 87 23 C9 AB 5D 05 F6
___________________________________________________________________