Squeak Speed (Was: A few low-level Pentium II performance measurements)

Sat Feb 20 19:17:21 UTC 1999

Hi,

I am just catching up on my squeak e-mail, and it has been interesting
to read the threads about memory bandwith and Pentium II (and PowerPC)
performance measurements.  My day job has been working on Performance
Validation for one of the new Intel processors, and I have been working
on speeding up Squeak a little on the side.  (Unfortunately, I don't
have as much time for that as I would like.)

>From the performance numbers given by J Chapman, it appears that squeak 
is more limited by branch mispredictions that memory access.  
The PowerPC numbers indicate about 1.3 million cycles per second 
loading the cache, and about the same number of branch mispredictions.*  
Since branch mispredictions are likely to cost more than one cycle, 
it is spending more cycles on branch mispredictions.

The PowerPC 750 pipeline is so short that branch mispredictions aren't 
likely to have too bad of an effect.  There are only 4 stages:
fetch, decode, execute, and writeback.**  Branches should get resolved
in execute, only 2 cycles after fetch.

The Pentium II has a much longer pipeline, consisting of about 10
stages to execute.***  This much longer pipeline is required to decode 
x86 instructions.  (It also seems that Intel is targeting higher
clock speeds.)  But, because of the longer pipeline, branch mispredictions
are considerably more costly.  On a full misprediction, the results
from execute must be fed back into fetch.

It appears that the x86 gets hit even harder with branch table buffer
misses.  From the numbers posted by Jan Bottorff
    branch instructions / instructions (retired) = 1/5.33 = 0.188
    branch table buffer misses / branch instruction =       0.43
    -> branch table buffer miss / instruction       =       0.08  = 12.4
This means one miss every 12 instructions, which can also limit
performance. 

Overall, it looks like cache misses aren't hurting squeak much.
Some posters have mentioned that Smalltalk code tends to have very
random accesses.  However, the compacting garbage collector
probably helps.  And, modern L2 caches are getting quite large
and reasonably low latency.  However, branch misprediction penalty
may hurt squeak even more in the future.  The longer the pipeline
is, the worse branch misprediction hurts.  And, long pipelines
are getting more common.   The P6 (Pentium Pro/II/III) and Alpha 
already have long pipelines.  The K7 will have a similar one.  

            ------------------ My project -----------------

I have been wanting to work on a high-performance version of Smalltalk
for some time, so when Squeak came along, I dedided to see what
I could do about optimizing it.

I am trying to work from the bottom up.  My approach is to think
"what is the minimum number of cycles a certain operation (such as 
a message dispatch) can be executed in" and try to approximate that.

For example, there is an interesting paper titled  
"Message Dispatch on Modern Computer Architectures" that desribes 
various dispatch machanisms.  Among other things, it describes
the inline caches that are sometimes used in some versions 
of Smalltalk.  The paper describes an method dispatch using
only 5 instructions, including a load, constand load,
predictable indirect call, compare, and predictable branch.

Currently, squeak is far, far away from this minimum.
Just how close is it possible, and practical, to get?

Although I am trying to think from the bottom up, I don't
have time to do an implementation from scratch, so I am
working from the top down, taking small steps.
My first project is to simplify determining an object's class, 
reducing the number of instructions and branch mispredictions 
required to do so. 

Currently, squeak uses an object header that varies from
one to three words.  In order to determine the class of an object,
Squeak must:
  (1) Test to see if it is an integer
  (2) Test the header type to see where the class field is located
  (3a) If it is compact get the class from the compact table
        - this probably means you need to load the compact table base
  (3b) If it isn't compact get the class pointer from the previous word.
This can be a costly process.  Since integer, short header objects,
and longer objects are all common, you can get up to two branch
mispredictions.

Also, the garbage collector frequently needs to test for the size
of an object.  The different header types also make this a significant
amount of work.

I have been working on creating and testing out a newer header type
that should reduce the number of branch misprediction.
Instead of one to three header words, all objects would have 2 header words.
Word 1
  - Similar to current basic header, except no compact classes.
    The bits used for compact classes is used for the size instead. 
    The size field is a lookup into a table for the actual allocated size.
Word 2
  - Class pointer.
All variable sized objects would need an additional word for the actual
size used.

Other than determining if the object was an integer or not, this would 
eliminate all of the decisions in determining an objects class, and
for GC purposes, its size.  Its disadvantage is that the majority
of headers go from one to two words.  I haven't started on the "size" portion
yet, but incrasing the minimum size header to two words increases
the small image I am using by only 10%.  I hope that it will increase
speed by more than 10%.

I would love to work with anyone else who is working on speeding
up Squeak.  Smalltalk's performance has been a personal interest for
me for some time, and it is interesting to actually be able to work
on it.

Sincerely,

Greg Gritton
gritton at ibm.net

* It might be worth double-checking the numbers.
  It seems odd that the number of branch mispredictions is less
  than the number of bytecodes executed.

** From the PowerPC 750 Technical Summary
   at: http://www.chips.ibm.com/techlib/products/ppc/techsum.html

*** From the Pentium II Processor Developer's Manual
    at: http://developer.intel.com/design/PentiumII/manuals