[Newcompiler] Optimizing compiled methods for performance [was: A day optimizing]

bryce at kampjes.demon.co.uk bryce at kampjes.demon.co.uk
Wed May 30 20:15:45 UTC 2007


Nice performance results for Nile.

Klaus D. Witzel writes:
 > When discussing which construct saved what bytecode, Stef asked, would it  
 > be possible that having a smarter compiler we avoid to write shit code like
 > 
 > 	position := position + (size := 1)
 > 
 > but write it in a readable way and the compiler does its job?

It would be more efficient for a JIT to write:

   size := 1.
   position := position + 1.

The reason is that tree re-writing (which is both simple and quick) can
then remove the tag checking and detagging of the 1 done by the +
bytecode. Exupery has done this optimization for at least a year. I
think Ian's later JITTERs were also going to do this optimization.
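To make the tag-check point concrete, here is a toy sketch in Python.
This is not Exupery's actual code: the node and instruction names are
invented, and it assumes the common SmallInteger tagging scheme of
value * 2 + 1.

```python
# Toy sketch of the tree rewrite a JIT can apply when one operand of
# #+ is a literal SmallInteger. Not Exupery's code; names are invented.

TAG_BITS = 1  # assume SmallIntegers are tagged as (value << 1) | 1


def lower_add(lhs, rhs):
    """Lower `lhs + rhs` to pseudo-instructions.

    Operands are ('const', n) or ('var', name). When rhs is a literal,
    the detag/retag of the constant folds away at compile time:
    tagged(a) + (n << 1) == (2a + 1) + 2n == tagged(a + n),
    so no tag check or detag of the constant needs to be emitted.
    """
    if rhs[0] == 'const':
        n = rhs[1]
        # Only overflow still needs checking; the constant's tag
        # handling is gone entirely.
        return [('add_immediate', lhs, n << TAG_BITS)]
    # General case: both operands must be tag-checked and detagged.
    return [('check_tag', lhs), ('check_tag', rhs),
            ('detag', lhs), ('detag', rhs),
            ('add', lhs, rhs), ('retag', 'result')]


print(lower_add(('var', 'position'), ('const', 1)))
# -> [('add_immediate', ('var', 'position'), 2)]
```

With the literal written directly, the whole send collapses to a single
immediate add plus an overflow check; with a variable operand, six
pseudo-instructions remain.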

To optimize:

   position := position + (size := 1)

would require global analysis to do constant propagation, effectively
producing the two-statement version. A fast inline JIT is unlikely to
do any optimization that requires data flow analysis or SSA, as data
flow analysis is n^2 worst case, though normally only n log n (1).
Exupery does not yet do constant propagation and probably will not for
a few more years.

Rewriting to "position := position + (size := 1)" is optimizing for an
interpreter but making life harder for a compiler. You're doing
reverse constant propagation.


My feeling is bytecode should be treated as an efficient compact
encoding of the abstract syntax tree with a very few important
optimizations such as whileTrue: and ifTrue: inlining. I'm not
suggesting that we shouldn't provide better facilities for bytecode
manipulation (ByteSurgeon or debugger probes), just that when no
manipulation has been performed the translation is simple and
reversible. The reason is that I've just spent all my Squeak time for
the last month chasing a bug through the interpreter. It should be
simple to look at the system from different angles. Being able to look
at source, bytecode, and interpreter behavior is useful when doing
low-level debugging.

In five years' time, it would be nice if the interpreter ran 10% slower
but was simpler, provided more uniform semantics, and was easier to
modify. Especially if a JIT project delivers, so that interpreter
performance becomes less important.


Bryce

1) Iterated data flow analysis using a decent ordering iterates over
the graph once for each nested backwards jump (loop), which is roughly
n log n. It is possible to avoid iterating the entire graph each time,
but that is more work. Exupery doesn't even do the iteration in the
right order, which is one of the current performance bugs.

Klaus sent me this link which is a decent fast overview of
optimization:

  http://www.cs.rice.edu/~keith/512/Lectures/04GCSE-4up.pdf

It does include a neat trick I didn't know of that will be useful:
reverse pre-order is reverse reverse post-order. Liveness analysis is
calculated backwards, so that's a neat way of computing the visit
order correctly.
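A toy rendering of the ordering point in Python, with an invented CFG
and invented use/def sets: liveness flows backwards, so the blocks are
visited in the reverse of a reverse post-order, and the loop counts how
many passes it takes to converge.

```python
# Toy iterated liveness analysis. CFG, use sets, and def sets below
# are invented for illustration.


def reverse_postorder(succ, entry):
    """Reverse post-order of the CFG: each block before its successors
    (loop back-edges aside)."""
    order, seen = [], set()

    def dfs(b):
        seen.add(b)
        for s in succ[b]:
            if s not in seen:
                dfs(s)
        order.append(b)

    dfs(entry)
    order.reverse()
    return order


def liveness(succ, use, defs, entry):
    """Backward problem, so visit blocks in reversed reverse post-order.
    Returns live-in sets and the number of passes to convergence."""
    visit = list(reversed(reverse_postorder(succ, entry)))
    live_in = {b: set() for b in succ}
    live_out = {b: set() for b in succ}
    passes, changed = 0, True
    while changed:
        changed, passes = False, passes + 1
        for b in visit:
            out = set().union(*(live_in[s] for s in succ[b])) if succ[b] else set()
            new_in = use[b] | (out - defs[b])
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, passes


# One loop: entry -> loop -> loop | exit
succ = {'entry': ['loop'], 'loop': ['loop', 'exit'], 'exit': []}
use = {'entry': set(), 'loop': {'i'}, 'exit': {'r'}}
defs = {'entry': {'i'}, 'loop': {'r'}, 'exit': set()}
live_in, passes = liveness(succ, use, defs, 'entry')
print(live_in, passes)
```

With one loop, a good ordering converges in one extra pass for the
back-edge plus one verification pass; a bad ordering can need a pass per
block, which is the kind of performance bug mentioned above.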

SSA is probably going to be faster than using data flow analysis if
you do several optimizations, as it removes the need for a separate
data flow pass for many of them: you pay the conversion cost once but
get to use it many times.
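A minimal sketch of the renaming step for straight-line code, using the
same invented statement encoding as above, shows why SSA pays off:
after renaming, every use names exactly one definition, so later
optimizations can follow def-use links directly instead of re-running a
data flow analysis.

```python
# Toy SSA renaming for straight-line code (no control flow, so no phi
# nodes needed). The encoding is invented for illustration: statements
# are (target, expr); exprs are ('const', n), ('var', v), ('add', a, b).


def to_ssa(stmts):
    """Rename each assignment to v_1, v_2, ... and rewrite each use to
    the current version; variables live on entry are version 0."""
    version = {}

    def rename_use(expr):
        if expr[0] == 'var':
            v = expr[1]
            return ('var', f'{v}_{version.get(v, 0)}')
        if expr[0] == 'add':
            return ('add', rename_use(expr[1]), rename_use(expr[2]))
        return expr

    out = []
    for target, expr in stmts:
        new_expr = rename_use(expr)  # uses see the *old* version
        version[target] = version.get(target, 0) + 1
        out.append((f'{target}_{version[target]}', new_expr))
    return out


prog = [('size', ('const', 1)),
        ('position', ('add', ('var', 'position'), ('var', 'size')))]
print(to_ssa(prog))
# -> [('size_1', ('const', 1)),
#     ('position_1', ('add', ('var', 'position_0'), ('var', 'size_1')))]
```

In the renamed form, the single definition reaching `size_1` is visibly
the constant 1; a sparse constant propagation can substitute it without
any iteration over the graph.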

