[Hardware] Compiler optimizations on OO hardware

Tue Jul 3 19:12:11 UTC 2007

Matthew Fulmer wrote:
> My professor showed me this paper on compiler optimizations that
> can be applied to make asynchronous message passing fast on
> current multicore processors:
> http://veda.eas.asu.edu/~vrudhula/OO-multiproc.pdf

I wouldn't call these compiler optimizations but rather a very clever
run time system design. And the AP1000 was a distributed memory machine
with an interesting communication network and so was rather different
from current multicore processors, though many of their ideas can be
used with those.

The key idea of this work is that the number of active threads in a
machine should roughly match the number of processors if you want the
best possible performance. So if you have 512 processors, a software
system which spawns 16K threads will run very poorly as will another
that only creates 8 threads. This paper adapts the actual semantics of
message sends to the run time conditions which results in a roughly
constant number of active threads.

In http://www.lsi.usp.br/~jecel/jabs1.html I extended this work using
the Self adaptive compilation technologies to not only transform many of
the asynchronous mesasges into synchronous ones as needed but to
actually inline the latter away. This was never implemented, however.

> I wonder exactly what hardware features could make this even
> faster.
> 
> My first guess is tagging; by tagging I mean that some objects
> could be stored un-boxed, and so preventing an additional memory
> lookup to determine the class of the object.  Tagging seems to
> mean different things to different people; what I mean is that
> every word would have a few extra bits that the CPU would
> generally ignore, and so were free to be used by the programmer
> any way they want. One use would be to use the extra bits to
> store some type information; for example, one could use one code
> to represent pointers, another for integers, another for future
> objects, etc. Some classes could be fully represented with just
> a one-word pointer, like Smalltalk does with SmallIntegers.

The big overhead that they are trying to deal with is that you have to
find out the state of the receiving object before you can decide what
sending a message to it means. Their trick was to allow each object to
switch between a small number of "vtables" (classes, maps) so that not
only their "type" was encoded in that pointer in the object's header but
its state as well. This saves a lot of extra tests but you still need to
read a word from the object and that is costly.

If you could encode the state in some tag then you would save a memory
read, but that would only work for objects that don't change state (it
doesn't make sense to consider a SmallInteger to be busy, for example).

An alternative model is to group objects and have one thread per group.
Let's call these groups "islands" since that seems to be a popular term
now. In this model individual objects wouldn't be busy or idle but
instead the semantics of a message send would depend on whether the
sender and receiver belong to the same island or not. And we could also
have "read-only islands" which wouldn't have their own threads but
instead of use the sender's thread to get things done. We could have
some tags in each object reference indicating to which island it belongs
and whether that island is read-only, and then we could decide how to
execute a message send without any memory accesses. Some of the ideas in
the paper would still apply for this model but the ones in my own paper
would no longer work.

-- Jecel