[VM] Should probably add -fno-gcse to the GCC flags
John M McIntosh
johnmci at smalltalkconsulting.com
Fri Aug 8 21:36:35 UTC 2003
I'm not sure this is doable:
http://gcc.gnu.org/onlinedocs/gcc-3.3/gcc/Function-Attributes.html#Function%20Attributes
Now, last night I was looking at the tinyBenchmark numbers and noticed
the 3.6.x numbers were slower than 3.5.0. A lot changed between 3.5.x
and 3.6.x: mostly Andreas's VM change to localize the bytecode methods'
variables (versus the older usage of unique variables in each bytecode
block), plus a switch to gcc 3.3 versus 2.95.x. But it pointed to
Andreas's change as being the culprit.
In looking at some assembler I noticed something wrong, so let's pick
bytecode 0.
interpret()
  CASE(0)
    currentBytecode = byteAt(++localIP);
    longAtput(localSP += 4,
        longAt(((((char *) foo->receiver)) + BaseHeaderSize) + ((0 & 15) << 2)));
    BREAK;
Back in a Mac VM 3.5.0 compiled with gcc 2.95.x, this resolved to 9
instructions:
L1560:
    lwz r9,80(r24)     (foo structure)   "loads foo->receiver"
    addi r27,r27,4     (localSP)         "does the localSP += 4"
    lbzu r28,1(r26)    (currentBytecode & localIP) "increments localIP & gets the byte into currentBytecode"
    lwz r0,4(r9)                         "adds BaseHeaderSize to foo->receiver to get the final address"
    stw r0,0(r27)      (localSP)         "store; the longAtPut logic"
    slwi r9,r28,2      (currentBytecode) "calculates the gnuified jumptable index from currentBytecode"
    lwzx r9,r9,r25     (jumptable)       "loads the goto address"
    mtctr r9                             "moves it to the count register"
    bctr                                 "and does the jump"
# define FOO_REG asm("24")
# define JP_REG asm("25")
# define IP_REG asm("26")
# define SP_REG asm("27")
# define CB_REG asm("28")
Now, with the gcc 3.3 compiler, this is where the global common
subexpression logic has an issue. You see, quite a few routines jump to
commonSend/normalSend, which finds the method based on the method
cache. It seems the compiler looks at this and, because it doesn't know
where the branch is going (it's a computed goto) but lots of the
targets do go to that method lookup, it decides it should pre-load the
methodCache address, just because that avoids a stall condition later.
So our nice assembler becomes the listing below, where you'll note the
loading of r8 and the storage of r8 into r1+64. The problem is that
these two instructions get added to the instructions for almost every
bytecode! I can appreciate that the x86 decision could be different,
and it would be interesting to hear what it does. However, we now have
a two-instruction overhead for lots of bytecodes (bad), which, by the
way, is a read & write to slow memory.
L1578:
lwz r12,84(r24)
->>>addis r8,r31,ha16(_methodCache-"L00000000049$pb")
lbzu r28,1(r26)
lwz r11,4(r12)
slwi r10,r28,2
stwu r11,4(r27)
lwzx r9,r10,r25
->>> stw r8,64(r1)
mtctr r9
bctr
Now, after fiddling a bit and not getting anywhere, it occurred to me
that I should go back and look at the issue of putting the at,
rootTable, and methodCache arrays back in the foo structure, just to
see what happens. In the past I did that, but CodeWarrior did odd
things with the effective address calculations.
Now things look better; we get rid of the methodCache address loading:
L1578:
    lwz r11,84(r24)    (foo structure)   "loads foo->receiver"
    lbzu r28,1(r26)    (currentBytecode & localIP) "increments localIP & gets the byte into currentBytecode"
    lwz r10,4(r11)                       "adds BaseHeaderSize to get the final address"
    slwi r9,r28,2      (currentBytecode) "calculates the gnuified jumptable index from currentBytecode"
    stwu r10,4(r27)    (localSP)         "store; the longAtPut logic, note the auto-increment of localSP!"
    lwzx r8,r9,r25     (jumptable)       "loads the address"
    mtctr r8                             "moves it to the count register and does the jump"
    bctr
The interesting thing here is that gcc 3.3 has altered the ordering of
the instructions a bit to reduce stalls, and used a stwu to
auto-increment localSP, so the instruction count is 8 versus 9 in gcc
2.95.x, with one stall condition removed. This relates, of course, to
my earlier comments that gcc 3.3 produced faster code for interp.c.
Dropping the three arrays into the foo structure removes 2 instructions
each time the variable is referenced. Things like:
addis r28,r31,ha16(_atCache-"L00000000049$pb")
la r7,lo16(_atCache-"L00000000049$pb")(r28)
ori r6,r12,128
add r2,r6,r7
lwz r0,4(r2)
become
ori r4,r12,128
add r9,r4,r24
lwz r0,16644(r9)
I'll look at changing VMMaker to enable the storage of arrays in the
foo structure. I can't speak for Intel, but it would be interesting to
know what decision gcc 3.3 is making for bytecode 0, or where the gcse
is messing up.
On Wednesday, August 6, 2003, at 03:41 PM, Andreas Raab wrote:
>> Just re-compiling gnu-interp.c with -O2 -fno-gcse results in faster
>> tinyBenchmarks and slower macroBenchmarks:
>
> In this situation I recommend looking at GCC's __attribute__ pragmas.
> IIRC, GCC 3 actually allows "compiler flags" to be set on individual
> methods, so doing something like
>
> __attribute__((option("no-gcse")))
>
> (not sure about the syntax) in the right places may be a worthwhile
> little tweak (this could go into the gnuifier).
>
>> However, I wonder whether we've really got the best optimization
>> flags. I'm using gcc 3.3 now, and it looked like -O3 resulted in
>> worse tinyBenchmarks performance. I should try it with the
>> macroBenchmarks.
>
> Finding "the best" optimizations is a quite complex task here...
>
> Cheers,
> - Andreas
--
===========================================================================
John M. McIntosh <johnmci at smalltalkconsulting.com> 1-800-477-2659
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================
More information about the Squeak-dev mailing list