There's memory bandwidth and there's memory transaction thruput

Jan Bottorff janb at
Wed Feb 10 01:38:51 UTC 1999

At 08:48 AM 2/9/99 -0800, Tim Rowledge wrote:
>memory system. I've always said that you need three things for high
>performance Smalltalk; memory bandwidth, memory bandwidth and err, yes
>that's it, memory bandwidth. Clever work in the design of the VM simply
>substitutes for bandwidth. Same with caches on the cpu - I'll take a

In the last few years it's become really apparent to me that there is a BIG
difference between sequential memory bandwidth and what I'll call memory
transaction thruput. For example, SDRAM has pretty good sequential
bandwidth (800 MBytes/sec) and not-so-good transaction thruput (about 12
Megatransactions/sec for 100 MHz SDRAM, if we assume 5-1-1-1 cache burst
timing).

This matters greatly because most processors access memory thru cache line
fills and flushes. I know lots more about Intel processors than others, so
I'll use an Intel processor example. When the processor touches memory, and
it's not in L1/L2 cache, somewhere between 1 and 6 memory transactions have
to occur before the processor can complete the access. The best case for a
cache miss is: the L1/L2 cache slot is read-only and can just be dumped,
and a single main memory transaction can fill the cache line. The worst
case is more like the processor touches a memory location, but first finds
the page translation lookaside buffer and L1/L2 cache don't contain the
desired page table line. It may then find the cache slot to store the
correct page table line is dirty, so it has to flush that line first. It can
then load the page table line into L1/L2 cache while loading the TLB with
the translation address to finally generate the correct physical address.
This address may also refer to a location that is not in cache and happens
to map to a dirty line, so another line write followed by a line read is
required. That's 4 transactions now, transferring a total of 128 bytes (32
bytes per line). It can actually get even worse than this, as the Intel
page tables have two levels, you may take cache misses on both levels of
the address translation, or 6 memory transactions just to touch a memory
location (192 bytes or about 4 Megatransactions/sec).

It's clear that cache misses that also miss in the TLB are just super slow.
How often that happens is a function of the randomness of access and also
of the ratio of cache size to total memory size (which controls the page
table cache hit ratio). I believe
Smalltalk tends to have much more random memory access patterns than more
mainstream languages like C (I don't have hard data to prove this). Objects
tend to be small and related objects will often be spread between different
memory pages (hint hint to garbage collector writers), causing significant
TLB thrashing. Your 800 MBytes/sec memory can get as slow as about 16
Mbytes/sec (4 Megavalues/sec) for accessing random 32-bit values. For
walking memory reference chains (like object references in Smalltalk) the
speed of memory transactions can become a much more real performance limit
than sequential memory bandwidth.

So the goal should be for OS, virtual machine, and hardware designers to
maximize these memory transaction rates. Some improvements include:

1) big fast memory caches - anybody have Squeak performance numbers
comparing a 512KB cache P2-400 with a 2 MByte Xeon-400?

2) RAMBUS memory - by pipelining the memory access latency with the actual
transfer phase, memory transaction rates can be doubled (or more). This may
also need improvements in processor memory access prediction (even
deeper out-of-order execution, cache prefetch instructions, smarter cache
controller designs that keep an otherwise idle memory pipeline busy with
dirty line flushes, for example)

3) you could design an OS that didn't use virtual memory paging, so all the
overhead of TLB thrashing would be gone. Writing a Smalltalk virtual
machine that ran in unprotected memory with no logical-to-physical mapping
is what we're talking about; implementing virtual memory via object swapping
might be a very viable option

4) arrange data structures so memory accesses will be "clustered" better,
getting better L1/L2 cache and TLB hit rates. I believe this is EXACTLY
the same issue as clustering objects for virtual object swapping; dynamic
rearranging of memory (old space) might produce performance gains in
Smalltalk systems too

I suspect all processors with paged virtual memory have these issues. Some
processors do have much larger caches (direct connection with processor
price?). I also suspect the processor designers tend to run processor
simulations of typical C/C++ programs, and it would be a real eye opener
for them to see the access patterns of a Smalltalk system. Designers of 12
pipeline stage processors (like the Pentium II) have obviously not
optimized for execution environments that get a branch prediction miss
every bytecode (flushing the execution pipeline every 5-10 instructions).
Much smarter calculated branch lookahead (calculating the predicted branch
target out of order, instead of just using instruction pointer history)
might do wonders for interpreter performance. One of the big gains from
dynamic native code translation is greatly improved branch prediction and
thus processor pipelining, as every inlined primitive or send essentially
gets a unique processor branch prediction slot. I'm not sure much of this
benefit will be received from threaded code, at least on current Intel
processors. It would be very interesting to compare Smalltalk bytecode
performance per processor Mhz for short and long pipeline processors.

- Jan

            Paradigm Matrix Inc., San Ramon California
   "video products and development services for Win32 platforms"
Internet: Jan Bottorff janb at
Phone: voice  (925) 803-9318
       fax    (925) 803-9397
PGP: public key  <>
     fingerprint  52 CB FF 60 91 25 F9 44  6F 87 23 C9 AB 5D 05 F6
