My professor showed me this paper on compiler optimizations that can be applied to make asynchronous message passing fast on current multicore processors: http://veda.eas.asu.edu/~vrudhula/OO-multiproc.pdf
I wonder exactly what hardware features could make this even faster.
My first guess is tagging. By tagging I mean that some objects could be stored un-boxed, avoiding an additional memory lookup to determine the class of the object. Tagging seems to mean different things to different people; what I mean is that every word would have a few extra bits that the CPU would generally ignore and that would therefore be free for the programmer to use in any way they want. One use would be to store some type information in those bits; for example, one code could represent pointers, another integers, another future objects, etc. Some classes could then be fully represented with just a one-word pointer, as Smalltalk does with SmallIntegers.
What other ways could this be better?
On Mon, 02 Jul 2007 00:07:33 +0200, Matthew Fulmer wrote:
> My professor showed me this paper on compiler optimizations that can be applied to make asynchronous message passing fast on current multicore processors: http://veda.eas.asu.edu/~vrudhula/OO-multiproc.pdf
:)
> I wonder exactly what hardware features could make this even faster.
I have known memory tags since the B5x00 and have used them:
- http://en.wikipedia.org/wiki/Burroughs_large_systems#Tag-based_architecture
In my experience, memory tags make sense when the hardware knows about them (of course, software must be able to write them, and more rarely also to read them).
What other hardware feature could make message passing faster? I think that parallel, disjoint instruction streams per processor could be useful (as an alternative to multicore processors).
Optimizing compilers attempt to generate code sequences which under "all circumstances" do not stall the processor. Instead, another instruction stream (which need not be related to the "main" instruction stream, and could perhaps run at lower priority) could fill those gaps. This would free architects and compilers from having to take into account every detail of a processor's "optimum" instruction sequences, and at the same time would reward every pair of concurrently running programs.
An example: while sending a message, another message could be received (or prepared) simultaneously. Compare that to two cores of a multicore processor, which have no means of avoiding stalls (other than asking the architects and compilers for better-optimized code and cores).
Cheers Klaus
Matthew Fulmer wrote:
> My professor showed me this paper on compiler optimizations that can be applied to make asynchronous message passing fast on current multicore processors: http://veda.eas.asu.edu/~vrudhula/OO-multiproc.pdf
I wouldn't call these compiler optimizations but rather a very clever run-time system design. And the AP1000 was a distributed-memory machine with an interesting communication network, so it was rather different from current multicore processors, though many of their ideas can be used with those.
The key idea of this work is that the number of active threads in a machine should roughly match the number of processors if you want the best possible performance. So if you have 512 processors, a software system which spawns 16K threads will run very poorly, as will another that creates only 8 threads. This paper adapts the actual semantics of message sends to the run-time conditions, which results in a roughly constant number of active threads.
In http://www.lsi.usp.br/~jecel/jabs1.html I extended this work using Self's adaptive compilation technology not only to transform many of the asynchronous messages into synchronous ones as needed, but to actually inline the latter away. This was never implemented, however.
> I wonder exactly what hardware features could make this even faster.
> My first guess is tagging. By tagging I mean that some objects could be stored un-boxed, avoiding an additional memory lookup to determine the class of the object. Tagging seems to mean different things to different people; what I mean is that every word would have a few extra bits that the CPU would generally ignore and that would therefore be free for the programmer to use in any way they want. One use would be to store some type information in those bits; for example, one code could represent pointers, another integers, another future objects, etc. Some classes could then be fully represented with just a one-word pointer, as Smalltalk does with SmallIntegers.
The big overhead that they are trying to deal with is that you have to find out the state of the receiving object before you can decide what sending a message to it means. Their trick was to allow each object to switch between a small number of "vtables" (classes, maps), so that not only the object's "type" but also its state was encoded in that pointer in its header. This saves a lot of extra tests, but you still need to read a word from the object, and that is costly.
If you could encode the state in some tag then you would save a memory read, but that would only work for objects that don't change state (it doesn't make sense to consider a SmallInteger to be busy, for example).
An alternative model is to group objects and have one thread per group. Let's call these groups "islands", since that seems to be a popular term now. In this model individual objects wouldn't be busy or idle; instead, the semantics of a message send would depend on whether the sender and receiver belong to the same island or not. And we could also have "read-only islands" which wouldn't have their own threads but would instead use the sender's thread to get things done. We could have some tags in each object reference indicating which island it belongs to and whether that island is read-only, and then we could decide how to execute a message send without any memory accesses. Some of the ideas in the paper would still apply to this model, but the ones in my own paper would no longer work.
-- Jecel
hardware@lists.squeakfoundation.org