Hi David,

The bytecode benchmark is a prime number sieve; it uses #at: and #at:put:. The send benchmark is a simple recursive Fibonacci function. Both just measure how quickly the code executes; neither counts the actual bytecodes or sends performed. They are the old tinyBenchmarks, so I'd guess everyone ran the same code for these benchmarks.
I 100% agree that inlining is the right way to optimise common sends and block execution. I'd just rather finish debugging Exupery and get it fully working without inlining, then add inlining. Inlining will add another case to think about when debugging. Debugging full method inlining (1) will be much easier if the compiler is bug-free first.
My rough long term plan is:

  1.0: The minimum necessary to be useful.
  2.0: Inlining.
  3.0: SSA optimisation.
A strong reason for not doing inlining in 1.0 is that it reduces scope creep: if inlining is not in 1.0, then finishing 1.0 becomes the more important goal.
I'd also not be surprised if Strongtalk is faster than Exupery for bytecode performance. I'm guessing that Strongtalk's integer arithmetic and #at: performance are better. Squeak uses 1 for its integer tag, so in general it takes 3 instructions to detag then retag, with 2 clocks of latency (this can often be optimised to 1 instruction and 1 clock of latency). I'm guessing Strongtalk uses 0 for its integer tag.
Squeak uses a remembered set for its write barrier, which requires checking whether the object is already in the remembered set and checking whether the value being stored is in new-space before adding it. Strongtalk might be using a card marking table, which requires just a single store.
Squeak stores the size of an object in one of two places, so to get the size for a range check you first need to figure out where it's stored. I'm guessing that in Strongtalk the size of an array is stored at a fixed location.
My assumptions about Strongtalk's object memory are based on reading the papers from the Self project.
None of these things really matters to Squeak while it's running as an interpreter, because most of the time is spent recovering from branch mispredicts or waiting for memory, leaving plenty of time to hide the inefficiencies above.
One way to get around a slow compiler would be to save the code cache beside the image. All relocation is done in Smalltalk, so doing this shouldn't be too hard. But figuring out how to get around a slow compiler can wait until after the compiler has become useful. There are several answers, including writing a faster register allocator (2) or being the third compiler.
Bryce
(1) Exupery can already inline primitives; it uses primitive inlining to optimise #at: and #at:put:. This is one reason why Exupery has PICs: they are a way to get type information for primitive calls.
(2) Having a coalescing register allocator makes unnecessary moves free. This is helpful for hiding the two-operand nature of the machine from the compiler front end. There may be some work needed to make Exupery perform well without its register allocator.