message dispatch (was: RISC42)

Fri Nov 17 23:56:42 UTC 2006

Alan Kay wrote on Fri, 17 Nov 2006 08:15:51 -0800
> >[vast table of all class/selector combinations]
> 
> We called this the "giant hash table scheme" at PARC and thought 
> about using some (future) VM hardware to look up this more useful 
> virtual address for methods.

In this case I have the full table, not a hash one. The address is
virtual in the sense that I have an object table, but otherwise it is
pretty much a physical one. When David Ungar saw my talk about this in
OOPSLA 2003, where I mentioned I had only gone this route because the
smallest memory I could buy cheaply was an 8MB SDRAM chip for $2, he
wondered if some papers being presented at that conference wouldn't
become irrelevant in the future due to brute force advances in hardware.
But I think this particular machine is very atypical and all the great
message dispatch theory of the past still applies. In fact, I would
recommend that people interested in this read this comparison paper:

http://www.cs.ucsb.edu/~urs/oocsb/papers/dispatch.html

But for this machine I had 512KB of permanent storage (90KB of that is
the bits for programming the FPGA) and 8MB of RAM. So though the system
is now Smalltalk it has to be as tiny as the Forth it was originally to
fit in the Flash, yet it can waste megabytes of RAM in helper tables to
make things a little faster. This isn't the future, just a small niche
(a certain kind of embedded applications).

Due to the 16 bit object pointers, this system has a very poor
reflective structure. Methods aren't objects, for example, but rather
all code for a class is lumped into a single vector. This starts out
with the corresponding row of the class/selector table and also contains
the rest of the methods that don't fit in two words. The selector times
2 directly gives you the initial value for the program counter (which is
an index into that vector) for the method.

Given how small the system is and how fast current machines are (even
this processor at only 54MHz) the implementation isn't very incremental.
When you add a new selector to the system all classes are regenerated to
expand their table part to make room for this. When the source for a
method is edited the whole class and subclasses (the system doesn't
actually have these, but these Smalltalk-80 terms give the rough idea)
are recompiled.

> I'd love to hear how this works out (and what kinds of HW would do it 
> better (like a specially programmed FPGA, etc.).

The 32 bit version will be more interesting since brute force solutions
don't work. I have just created a new page to explain the "PIC Mode"
(http://www.merlintec.com:8080/hardware/33) but I probably won't be able
to put any real content there until next month. The idea is very simple,
however:

The processor has a normal execution mode and a PIC mode. During normal
execution instructions are fetched from the location in the code cache
pointed to by the program counter. It doesn't fetch instructions
directly from methods - you have to copy the bytecodes (or translate, if
the processor doesn't directly execute bytecodes) to the code cache and
jump there. Inside the processor there is normally an instruction cache.

Some instructions (like "send") can switch the processor to PIC mode. In
that case instructions are fetched from a PIC cache instead of the code
cache. The instruction doing the switch must supply a "type" parameter
and the PIC cache is accessed with a 64 bit <PC,Type> address. The PC
part is the address of the instruction which switched to PIC mode and
the instructions are streamed from the PIC cache without changing the
PC. If the whole cache line is used then execution goes back to the
normal mode at PC+1. PIC mode is also exited if any branch or call
instructions are executed.

So if you execute a send instruction then the next instruction is
determined by the receiver's class (type). Suppose that this send
bytecode is at PC=16r213412 in the code cache and that the PIC cache has
five entries tagged as <16r213412,X> with five different values of X. If
the receiver type doesn't match any of them we call the copier (or
translator) to create a new entry. If the receiver type is one of those
five, then several instructions (up to the size of a cache line) are
executed as if there were inlined between 16r213412 and 16r213413. A
very common case is for these instructions to simply call a given
method, but for simple and short methods they could be the body of the
method itself.

Some software has to manage all the related entries in the PIC cache and
this is the source of type information when a method is being recompiled
to generate more optimized code.

David Faught wrote:
> Wouldn't Content Addressable Memory (CAM) work for this?  I can
> remember reading about this years ago, and I know that some current
> network routers and switches use this to speed up their table lookups.

CAMs are the way to go at least for the caches inside the chip itself.
For larger second level caches that live in main memory you wan't to use
some kind of hashing solution. Unfortunately FPGAs are very bad at
implementing CAMs, so I am forced to using hashing for the first level
caches as well. On a custom chip version of the design I would do as you
suggested.

-- Jecel