Squeak build problem...

Fri Aug 16 21:32:42 UTC 2002

On Fri, 16 Aug 2002, Tommy Thorn wrote:

> Alan Grimes wrote:
> 
> >It seems that the register allocation hypothesis holds.

Nope.  I just looked at the code and register vars are not being spilled.

> >The -O0 setting
> >produced a 1,072,720 byte executable that is within striking distance of
> >the Gcc 2.95? numbers. 

Hmmm.  I wouldn't call 20-or-so% "striking distance". ;)

> Why don't you just look at the code generated?

What's killing performance in 3.1.1 is over-aggressive CSE.

A side-effect of the early Jitter work with threaded code was that
John Maloney (or maybe Dan, but I think it was John) applied the
"fetch next insn and dispatch without looping" idea from
direct-threaded code to the bytecode Interpreter.  Each bytecode
fetches the next bytecode as early as possible in its own body, ready
for the next dispatch round the loop.  Gnuification
finishes this process by transforming the final "break;" in each
bytecode into an indirect jump through the bytecode table to the body
of the next bytecode, making fetch-dispatch as efficient as possible
(it cuts out the dispatch loop entirely).  However, gcc3.1.1 is
ELIMINATING *ALL* of these independent fetch-dispatch sequences and
folding them into a single, shared dispatch at the head of the loop.
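
A minimal sketch of the two dispatch styles, assuming GCC's
labels-as-values extension.  This is a toy, not Squeak's interpreter:
the opcodes and the names run_threaded and run_switch are invented for
illustration.  run_threaded gives every bytecode its own early fetch
and its own indirect jump, which is the shape gnuification aims for;
run_switch has the single shared dispatch that the merged code
described above roughly amounts to.

```c
#include <assert.h>

/* Hypothetical toy bytecode set (not Squeak's). */
enum { OP_PUSH1, OP_ADD, OP_DUP, OP_HALT };

/* Threaded style: each bytecode body prefetches the next bytecode and
   ends with its own indirect jump through the table.  Assumes a
   well-formed program that ends with OP_HALT. */
int run_threaded(const unsigned char *ip)
{
    /* Table order must match the enum above. */
    static void *const jumpTable[] = { &&push1, &&add, &&dup, &&halt };
    int stack[32];
    int *sp = stack;                 /* points at the next free slot */
    unsigned char next;

    goto *jumpTable[*ip];            /* the one initial dispatch */

push1:
    next = *++ip;                    /* fetch next bytecode early */
    assert(sp < stack + 32);
    *sp++ = 1;
    goto *jumpTable[next];           /* private dispatch, no loop head */
add:
    next = *++ip;
    assert(sp - stack >= 2);
    sp[-2] += sp[-1];
    sp--;
    goto *jumpTable[next];
dup:
    next = *++ip;
    assert(sp > stack && sp < stack + 32);
    *sp = sp[-1];
    sp++;
    goto *jumpTable[next];
halt:
    assert(sp > stack);
    return sp[-1];                   /* result is the top of stack */
}

/* Shared-dispatch style: every bytecode returns to one fetch and one
   dispatch at the head of the loop. */
int run_switch(const unsigned char *ip)
{
    int stack[32];
    int *sp = stack;

    for (;;) {
        switch (*ip++) {             /* the single, shared dispatch */
        case OP_PUSH1: *sp++ = 1;                break;
        case OP_ADD:   sp[-2] += sp[-1]; sp--;   break;
        case OP_DUP:   *sp = sp[-1]; sp++;       break;
        case OP_HALT:  return sp[-1];
        default:       return -1;    /* unknown opcode */
        }
    }
}
```

Both functions compute the same result for a given program; for the
sequence PUSH1 PUSH1 ADD DUP ADD HALT each returns 4.  The difference
is only in where the indirect jumps live: one per bytecode body in the
first, exactly one in the second.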

But... it gets better (groan).  Here's the code from the head (or
rather the tail, since the head is only executed once) of the
interpret() dispatch loop (esi is localIP and edi is localSP):

# ... jump table

.L1136:
.LM1924:
        movl    receiver, %eax          # cse: [1] ==>
.LM1925:
        incl    %esi                    # cse: localIP++
.LM1926:
        addl    $4, %edi                # cse: pop()
.LM1927:
        movzbl  (%esi), %ebx            # cse: fetchByte()
.LM1928:
        movl    4(%eax), %eax           # cse: obj = instVarAt(1) <==[1]
.L2361: <--------------- all bytecodes that write the stack finish
                         by jumping here
.L2015:
        movl    %eax, (%edi)            # cse: popThenPush(obj)
.L2337: <--- bytecodes that don't write the stack end by jumping here
        movl    trueObj, %ebp           # code motion: movl/cmpl &true,x
                                          out of the loop (*POINTLESS*)
.L2369: <--- only ONE bytecode arrives here (it doesn't clobber ebp)
        movl    falseObj, %ecx          # code motion: movl/cmpl &false,x
                                          out of the loop (*POINTLESS*)
.L2338: <--- only ONE bytecode arrives here (it doesn't clobber ebp or ecx)
        jmp     *jumpTable.1(,%ebx,4)   # dispatch next bytecode

# bytecodes ...

(this is the ONLY dispatch through jumpTable [other than the single
dispatch after the initial load of currentBytecode]).

If anything it's a worse disaster than occasionally spilling a
register variable.  The vast majority of bytecodes are reloading
trueObj and falseObj into registers on EVERY SINGLE DISPATCH.  While
one can see the advantages of what the compiler is doing for
`mostly-linear' programs, *any* interpreter is going to suffer *badly*
from this kind of aggressive (and reckless) CSE.

(If anyone knows someone who recompiles other interpreted languages in
their spare time, I'd be very interested to hear about their
experiences.)

I've not found a way to turn CSE off yet, but I've only played with
-fno-{gcse,cse-follow-jumps,cse-skip-blocks}.  There are other
possibilities (such as setting a limit of zero on the amount of memory
and/or number of passes the compiler is allowed to devote to cse).  If
I manage to persuade the compiler to act in a more responsible manner
I'll post the relevant incantation.

Ian (who can hardly believe his eyes)

PS: FWIW, gcc3.0 produces noticeably better code than 2.95 with
    identical optimisation options:

        gcc2.95.3  100470957 bytecodes/sec; 3025428 sends/sec
        gcc3.0.3   103392568 bytecodes/sec; 3351243 sends/sec