There's memory bandwidth and there's memory transaction throughput

David Chase chase at world.std.com
Wed Feb 10 18:24:59 UTC 1999


>Jan Bottorff wrote:
>>I suspect all processors with paged virtual memory have these issues. Some
>>processors do have much larger caches (direct connection with processor
>>price?). I also suspect the processor designers tend to run processor
>>simulations of typical C/C++ programs, and it would be a real eye opener
>>for them to see the access patterns of a Smalltalk system. Designers of 12
>>pipeline stage processors (like the Pentium II) have obviously not
>>optimized for execution environments that get a branch prediction miss
>>every bytecode (flushing the execution pipeline every 5-10 instructions).

At 08:02 PM 2/9/99 -0600, Tim Olson wrote:
>Back when I was designing PowerPC processors at Apple, we paid great 
>attention to the "ugly" code that made up much of the typical MacOS stuff 
>(including 68K emulation).  We took multi-megabyte traces of Applications 
>and OS code, and analyzed them.  This stuff had branches an average of 1 
>every 4-5 instructions, and deep pipelines w/ branch prediction didn't 
>help much.

For what it is worth, measurements taken on Sparc processors
years ago (of SPEC 89 and 92 benchmarks, written in C and Fortran and
compiled with relatively competent compilers) yielded similar numbers,
especially in hot inner loops.  Be very careful when someone quotes
you "average" numbers -- find out whether they mean the median, the
mode, or the mean.  The median and mode are similar, about 4 or 5
instructions, often in nasty data-dependent blocks of the form:

  data-dependent load that might be "illegal"
  arithmetic based on result of load
  comparison with some quantity in a register
  conditional branch to a similar block.

Smalltalk has no monopoly on this sort of glop.
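Ordinary C++ pointer chasing compiles down to exactly this shape.  A
minimal sketch (the Node type and find function here are hypothetical,
purely for illustration):

```cpp
#include <cstddef>

// A hypothetical singly linked list node -- nothing Smalltalk-specific.
struct Node {
    int   key;
    Node* next;
};

// Searching the list is one short data-dependent block repeated:
// a load through a pointer that might be "illegal" (n->key), a
// comparison with a quantity in a register (key), and a conditional
// branch back to a similar block.
Node* find(Node* n, int key) {
    while (n != nullptr) {
        if (n->key == key)   // data-dependent load + compare
            return n;
        n = n->next;         // the next load depends on this one
    }
    return nullptr;
}
```

Each iteration is only a handful of instructions, and the branch
outcome depends on data just loaded, which is what defeats deep
pipelines.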

There were changes made in Sparc version 9 that were intended
to help compilers with these problems, but they only go so far.

Note that if you are talking about the "mean" block length
among "hot blocks", the Fortran codes tend to skew the average
towards huge numbers; one benchmark, fppp[p?], has an inner loop
with thousands of instructions in a single basic block.  Fortran
codes are also often very amenable to latency-hiding or
latency-avoiding optimizations that are rarely applied to C or C++.
These optimizations aren't that useful in Smalltalk or Java
either, except perhaps within the garbage collector.
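To make the median-versus-mean point concrete, here is a toy
calculation with made-up block lengths (four short branchy blocks plus
one huge Fortran-style basic block -- the numbers are invented, not
measured):

```cpp
#include <algorithm>
#include <vector>

// Made-up block lengths: four short branchy blocks and one huge
// fpppp-style basic block.  The outlier drags the mean far away from
// the median, which is exactly why "average" needs qualifying.
double mean(std::vector<int> v) {
    double sum = 0.0;
    for (int x : v) sum += x;
    return sum / v.size();
}

int median(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return v[v.size() / 2];   // odd-length vector: the middle element
}
// For {4, 4, 5, 5, 2000}: the median is 5, the mean is 403.6.
```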

Pat Caudill wrote:
> There was a paper about this at ISMM '98 which preceded the last OOPSLA. 
> See the paper by Chilimbi and Larus. It worked well for C++ but not as 
> well with Java because the larger object headers filled the cache lines 
> as I remember.

Given that Java object headers need only be two words long for many
systems, how does this compare with C++, which often pays both a
malloc overhead (depends on the allocator, it can be zero, but it
is often two words) and a VTBL pointer overhead (at least one word,
more if there is tricky inheritance going on)?
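A quick way to see the VTBL-pointer part of that cost (object layout
is implementation-defined, so this is only a sketch of typical
behavior, with hypothetical class names):

```cpp
#include <cstddef>

// Two hypothetical objects with identical user data.  On typical
// implementations, Virt carries one hidden vtable pointer that Plain
// does not -- the "at least one word" of C++ overhead.  Any malloc
// bookkeeping (often two more words per block) sits outside the
// object and never shows up in sizeof at all.
struct Plain {
    int x;
};

struct Virt {
    int x;
    virtual ~Virt() {}
};

// Hidden per-object cost of virtual dispatch, in bytes (may include
// alignment padding as well as the vtable pointer itself).
const std::size_t vptr_overhead = sizeof(Virt) - sizeof(Plain);
```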

People talk about implementation characteristics as if they were
measuring the gravitational constant or the speed of light, and
that's just plain wrong.

David Chase
NaturalBridge LLC





More information about the Squeak-dev mailing list