Squeak Speed (Was: A few low-level Pentium II performance
measurements)
sqrmax at cvtci.com.ar
Mon Feb 22 00:43:04 UTC 1999
Hi.
>It appears that the x86 gets hit even harder with branch table buffer
>misses. From the numbers posted by Jan Bottorff
> branch instructions / instructions (retired) = 1/5.33 = 0.188
> branch table buffer misses / branch instruction = 0.43
> -> branch table buffer misses / instruction = 0.188 * 0.43 = 0.081, i.e. one
> miss every 12.4 instructions
>This means one miss every 12 instructions, which can also limit
>performance.
This can actually grind performance to a halt. With good pairing and well
chosen instructions, those 12 can be crunched in just 6 cycles, whereas on the
PII a single branch misprediction costs much more than 6 cycles. In Squeak, of
the total references made to objects when sending them messages, how many turn
out to be SmallIntegers? Snips:
~~
22.2.2 Misprediction penalty (PMMX, PPro and PII)
In the PMMX, the penalty for misprediction of a conditional jump is 4 clocks
in the U-pipe, and 5 clocks if it is executed in the V-pipe. For all other
control transfer instructions it is 4 clocks.
In the PPro and PII, the misprediction penalty is very high due to the long
pipeline. A misprediction usually costs between 10 and 20 clock cycles. It is
therefore very important to be aware of poorly predictable branches when
running on PPro or PII.
~~
>Overall, it looks like cache misses aren't hurting squeak much.
Unless the object has a 3 word header (what Greg wrote below), in which case
Intel processors, which read ahead but not back, can incur up to 2 cache
misses, which eat lots of time. Furthermore, because of the way Intel CPU
caches are built, this scenario is even more costly: cache lines are 32 bytes
long. Say you access an object: bang, a 32-byte read into the cache. Then you
determine you have to read one word behind that, and perhaps that is not in
the cache either. And then it can happen that no line in the cache is able to
hold it (because the cache is set-associative), so what follows is a cache
line eviction followed by a reload... horrors... snips:
~~
The data cache consists of 256 or 512 lines of 32 bytes each. Each time you
read a data item which is not cached, the processor will read an entire cache
line from memory. The cache lines are always aligned to a physical address
divisible by 32. When you have read a byte at an address divisible by 32, then
the next 31 bytes can be read or written to at almost no extra cost.
~~
Hint: wonderful for contexts to be aligned at addresses that are multiples
of 32.
Hint: long objects will perform much better on this hardware than small
objects.
~~
The cache is set-associative. This means that a cache line can not be assigned
to an arbitrary memory address. Each cache line has a 7-bit set-value which
must match bits 5 through 11 of the physical RAM address (bits 0-4 define the
32 bytes within a cache line). The PPlain and PPro have two cache lines for
each of the 128 set-values, so there are two possible cache lines to assign to
any RAM address. The PMMX and PII versions have four. The consequence of this
is that the cache can hold no more than two or four different data blocks
which have the same value in bits 5-11 of the address.
Let me illustrate this by the following piece of code, where ESI holds an
address divisible by 32:
AGAIN:  MOV EAX, [ESI]
        MOV EBX, [ESI + 13*4096 + 4]
        MOV ECX, [ESI + 20*4096 + 28]
        DEC EDX
        JNZ AGAIN
The three addresses used here all have the same set-value because the
differences between the truncated addresses are multiples of 4096. This loop
will perform very poorly on the PPlain and PPro. At the time you read ECX
there is no free cache line with the proper set-value, so the processor takes
the least recently used of the two possible cache lines, that is the one which
was used for EAX, and fills it with the data from [ESI + 20*4096] to
[ESI + 20*4096 + 31] and reads ECX. Next, when reading EAX, you find that the
cache line that held the value for EAX has now been discarded, so you take the
least recently used line, which is the one holding the EBX value, and so on.
You have nothing but cache misses and the loop takes something like 60 clock
cycles. If the third line is changed to:
        MOV ECX, [ESI + 20*4096 + 32]
then we have crossed a 32 byte boundary, so that we do not have the same
set-value as in the first two lines, and there will be no problem assigning a
cache line to each of the three addresses. The loop now takes only 3 clock
cycles (except for the first time) - a very considerable improvement!
Reading a data item which is not in the level one cache causes an entire cache
line to be filled from the level two cache, which takes approximately 200 ns
(that is 20 clocks on a 100 MHz system or 40 clocks on a 200 MHz system), but
the bytes you ask for first are available already after 50-100 ns. If the data
item is not in the level two cache either, then you will get a delay of
something like 200-300 ns. This delay will be somewhat longer if you cross a
DRAM page boundary.
~~
Doesn't all this go completely against Smalltalk? I mean, I recently found
myself digging into the implementation of the CPU... what really ought to be
well designed is the CPU, and I shouldn't have to look inside it to see what's
going on, so the CPU is not being properly thought of as an object.
Andres.
More information about the Squeak-dev mailing list