Squeak Speed (Was: A few low-level Pentium II performance measurements)

sqrmax at cvtci.com.ar
Mon Feb 22 00:43:04 UTC 1999


Hi.

>It appears that the x86 gets hit even harder with branch target buffer
>misses.  From the numbers posted by Jan Bottorff:
>    branch instructions / instructions (retired)     = 1/5.33 = 0.188
>    branch target buffer misses / branch instruction  =          0.43
>    -> branch target buffer misses / instruction = 0.188 * 0.43 = 0.081
>This means one miss every ~12.4 instructions, which can also limit
>performance. 

This can actually grind performance to a halt. With good pairing and 
well-chosen instructions, those 12 instructions can be crunched in just 
6 cycles, whereas on the PII a single branch misprediction costs much 
more than 6 cycles. In Squeak, of all the references made to objects 
when sending them messages, how many turn out to be SmallIntegers? 
Snips (a C sketch of the effect follows the snippet):

~~
22.2.2 Misprediction penalty (PMMX, PPro and PII)
In the PMMX, the penalty for misprediction of a conditional jump is 4
clocks in the U-pipe, and 5 clocks if it is executed in the V-pipe. For
all other control transfer instructions it is 4 clocks.

In the PPro and PII, the misprediction penalty is very high due to the
long pipeline. A misprediction usually costs between 10 and 20 clock
cycles. It is therefore very important to be aware of poorly predictable
branches when running on PPro or PII.
~~
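
To see what that penalty means in practice, here is a minimal sketch in
C, my own illustration (not from the manual or from Squeak): the very
same branch is cheap when it is predictable and expensive when it is
not. Timing the two calls with any profiler shows the gap.

#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* Sum the elements >= 128. The if is taken about half the time on
   random data, so on a PPro/PII roughly every other iteration pays
   the 10-20 clock misprediction penalty quoted above. On sorted
   data the branch mispredicts only around the crossover point. */
static long sum_if_big(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        if (a[i] >= 128)
            sum += a[i];
    return sum;
}

static int cmp_int(const void *x, const void *y)
{
    return *(const int *)x - *(const int *)y;
}

int main(void)
{
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = rand() % 256;

    long s1 = sum_if_big(a, N);        /* unpredictable branch */
    qsort(a, N, sizeof a[0], cmp_int);
    long s2 = sum_if_big(a, N);        /* predictable branch */

    printf("%ld %ld\n", s1, s2);       /* same sums, very different times */
    return 0;
}

The Squeak-relevant point: a message send whose receiver is sometimes a
SmallInteger and sometimes not is exactly this kind of poorly
predictable branch.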

>Overall, it looks like cache misses aren't hurting squeak much.

Unless the object has a 3-word header (what Greg wrote below), in which 
case Intel processors read ahead but not back, causing up to 2 cache 
misses, which eat lots of time. Furthermore, because of the way Intel 
CPU caches are built, this scenario is even more costly: cache lines are 
32 bytes long. Say you access an object: bam, a 32-byte read into the 
cache. Then you determine you have to read one word behind that, which 
is perhaps also not in the cache. And then it can happen that no cache 
line with the right set-value is free to hold it (because the cache is 
set-associative), so what follows is an eviction followed by a 
cache-line reload... horrors... snips:

~~
The data cache consists of 256 or 512 lines of 32 bytes each. Each time
you read a data item which is not cached, the processor will read an
entire cache line from memory. The cache lines are always aligned to a
physical address divisible by 32. When you have read a byte at an
address divisible by 32, then the next 31 bytes can be read or written
at almost no extra cost.
~~
Hint: wonderful for contexts to be aligned at addresses that are 
multiples of 32.
Hint: long objects will perform much better on this hardware than small 
objects.
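
To make the first hint concrete, here is a minimal C sketch, assuming a
hypothetical 32-byte context record (the struct layout and field names
are made up for illustration, not Squeak's actual context format):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A hypothetical context record: 8 x 4-byte fields = 32 bytes,
   i.e. exactly one cache line on these processors. */
struct context {
    uint32_t sender, method, pc, stackp;
    uint32_t receiver, temp0, temp1, temp2;
};
_Static_assert(sizeof(struct context) == 32, "must fill one cache line");

struct context *new_context(void)
{
    /* Allocate on a 32-byte boundary so the record never straddles
       two cache lines: touching any field costs at most one fill. */
    struct context *c = aligned_alloc(32, sizeof *c);
    if (c != NULL)
        memset(c, 0, sizeof *c);
    return c;
}

An allocator that carves contexts out of a 32-byte-aligned arena would
get the same effect without a library call per object.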
~~

The cache is set-associative. This means that a cache line can not be
assigned to an arbitrary memory address. Each cache line has a 7-bit
set-value which must match bits 5 through 11 of the physical RAM address
(bits 0-4 define the 32 bytes within a cache line). The PPlain and PPro
have two cache lines for each of the 128 set-values, so there are two
possible cache lines to assign to any RAM address. The PMMX and PII
versions have four. The consequence of this is that the cache can hold
no more than two or four different data blocks which have the same value
in bits 5-11 of the address.

Let me illustrate this by the following piece of code, where ESI holds
an address divisible by 32:

AGAIN:  MOV  EAX, [ESI]
        MOV  EBX, [ESI + 13*4096 +  4]
        MOV  ECX, [ESI + 20*4096 + 28]
        DEC  EDX
        JNZ  AGAIN

The three addresses used here all have the same set-value because the
differences between the truncated addresses are multiples of 4096. This
loop will perform very poorly on the PPlain and PPro. At the time you
read ECX there is no free cache line with the proper set-value, so the
processor takes the least recently used of the two possible cache lines,
that is the one which was used for EAX, and fills it with the data from
[ESI + 20*4096] to [ESI + 20*4096 + 31] and reads ECX. Next, when
reading EAX, you find that the cache line that held the value for EAX
has now been discarded, so you take the least recently used line, which
is the one holding the EBX value, and so on... You have nothing but
cache misses, and the loop takes something like 60 clock cycles. If the
third line is changed to:

        MOV  ECX, [ESI + 20*4096 + 32]

then we have crossed a 32-byte boundary, so that we do not have the same
set-value as in the first two lines, and there will be no problem
assigning a cache line to each of the three addresses. The loop now
takes only 3 clock cycles (except for the first time) - a very
considerable improvement!

Reading a data item which is not in the level one cache causes an entire
cache line to be filled from the level two cache, which takes
approximately 200 ns (that is 20 clocks on a 100 MHz system or 40 clocks
on a 200 MHz system), but the bytes you ask for first are available
already after 50-100 ns. If the data item is not in the level two cache
either, then you will get a delay of something like 200-300 ns. This
delay will be somewhat longer if you cross a DRAM page boundary.
~~
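
As a sanity check on the arithmetic above, here is a small C sketch (my
own illustration; the base address is made up) that computes the
set-value for each of the four addresses in the manual's example:

#include <stdint.h>
#include <stdio.h>

/* Set-value on these chips: bits 0-4 select the byte within the
   32-byte line, bits 5-11 select one of the 128 sets. */
static unsigned cache_set(uintptr_t addr)
{
    return (unsigned)((addr >> 5) & 0x7F);
}

int main(void)
{
    uintptr_t esi = 0x100000;   /* stand-in for ESI, divisible by 32 */

    /* Strides that are multiples of 4096 leave bits 5-11 unchanged,
       so the three loads from the loop all fight over one set: */
    printf("%u\n", cache_set(esi));                  /* set 0 */
    printf("%u\n", cache_set(esi + 13*4096 + 4));    /* set 0 again */
    printf("%u\n", cache_set(esi + 20*4096 + 28));   /* set 0 again */

    /* Adding 32 instead crosses a line boundary, flips bit 5, and
       lands in the next set, ending the three-way conflict: */
    printf("%u\n", cache_set(esi + 20*4096 + 32));   /* set 1 */
    return 0;
}

The same arithmetic applies to Squeak's heap: two objects whose
addresses differ by a multiple of 4096 compete for the same two (or
four) cache lines.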

Doesn't all this go completely against Smalltalk? I mean, I recently 
found myself digging into the implementation of the CPU... what really 
ought to be well designed is the CPU, and I shouldn't have to look 
inside it to see what's going on, so the CPU is not being thought of as 
a proper object.

Andres.




