My professor showed me this paper on compiler optimizations that can be applied to make asynchronous message passing fast on current multicore processors: http://veda.eas.asu.edu/~vrudhula/OO-multiproc.pdf
I wonder exactly what hardware features could make this even faster.
My first guess is tagging. By tagging I mean that some objects could be stored unboxed, which avoids an additional memory lookup to determine the object's class. Tagging seems to mean different things to different people; what I mean is that every word would have a few extra bits that the CPU would generally ignore, leaving them free for the programmer to use however they want. One use would be to store some type information in those bits; for example, one code could represent pointers, another integers, another future objects, and so on. Instances of some classes could then be represented entirely within the one-word reference itself, with no separate heap object, as Smalltalk does with SmallIntegers.
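To make the idea concrete, here is a minimal sketch of how tagging is commonly emulated in software today, assuming heap objects are at least 4-byte aligned so the low two bits of a pointer are always zero. This is not the hardware scheme I have in mind (there the extra bits would sit outside the normal word and the CPU would skip them for free); it only shows what the tag buys you. The tag names, the two-bit width, and all the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tag codes; names and values are assumptions for illustration. */
enum tag { TAG_PTR = 0, TAG_INT = 1, TAG_FUTURE = 2 };

#define TAG_BITS 2
#define TAG_MASK ((uintptr_t)((1u << TAG_BITS) - 1))

/* One machine word that holds either an immediate value or a reference. */
typedef uintptr_t word_t;

/* Read the type code without touching memory. */
static enum tag word_tag(word_t w) { return (enum tag)(w & TAG_MASK); }

/* Store a small integer directly in the word (unboxed).
 * The top TAG_BITS bits of v are lost, so the usable range shrinks. */
static word_t box_int(intptr_t v) {
    return ((uintptr_t)v << TAG_BITS) | TAG_INT;
}

/* Recover the integer. Clearing the tag and dividing avoids relying on
 * right-shifting a negative value; the cast assumes two's complement. */
static intptr_t unbox_int(word_t w) {
    return (intptr_t)(w & ~TAG_MASK) / (intptr_t)(1 << TAG_BITS);
}

/* Tag an ordinary object reference; p must be suitably aligned. */
static word_t box_ptr(void *p) {
    assert(((uintptr_t)p & TAG_MASK) == 0);
    return (uintptr_t)p | TAG_PTR;
}

/* Tag a reference to a future (a result that may not be ready yet). */
static word_t box_future(void *p) {
    assert(((uintptr_t)p & TAG_MASK) == 0);
    return (uintptr_t)p | TAG_FUTURE;
}

/* Strip the tag to get the address back, for either kind of reference. */
static void *unbox_ref(word_t w) { return (void *)(w & ~TAG_MASK); }
```

With this, `word_tag(box_int(42))` yields `TAG_INT` and `unbox_int` gives back 42 without any load from the heap, which is the lookup I would like to avoid. The cost in software is the masking and shifting on every access, plus the lost integer bits; with hardware tag bits that the CPU ignores, neither would be needed.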
What other hardware features could make asynchronous message passing faster?