[VM] Should probably add -fno-gcse to the GCC flags

John M McIntosh johnmci at smalltalkconsulting.com
Fri Aug 8 21:36:35 UTC 2003


I'm not sure this is doable,
http://gcc.gnu.org/onlinedocs/gcc-3.3/gcc/Function-Attributes.html#Function%20Attributes

Last night I was looking at the tinyBenchmarks numbers and noticed that
the 3.6.x numbers were slower than 3.5.0. A lot changed between 3.5.x
and 3.6.x: mainly Andreas's VM change to localize the variables of the
bytecode methods (versus the older usage of unique variables in each
bytecode block), and a switch to gcc 3.3 from 2.95.x. The numbers
pointed to Andreas's change as being the culprit.

Looking at some of the assembler I noticed something wrong, so let's
pick bytecode 0.

interpret()
		CASE(0)
			currentBytecode = byteAt(++localIP);
			longAtput(localSP += 4, longAt(((((char *) foo->receiver)) + BaseHeaderSize) + ((0 & 15) << 2)));
		BREAK;

Back in a Mac VM 3.5.0 compile with gcc 2.95.x this resolved to 9
instructions:

L1560:
	lwz r9,80(r24)   (foo structure)  "loads foo->receiver"
	addi r27,r27,4   (localSP)  "does the localSP += 4"
	lbzu r28,1(r26)  (currentBytecode & localIP)  "increments localIP & gets the byte into currentBytecode"
	lwz r0,4(r9)     "adds BaseHeaderSize to foo->receiver to get the final address"
	stw r0,0(r27)    (localSP)  "the store for the longAtput logic"
	slwi r9,r28,2    (currentBytecode)  "calculate the gnuified jumptable index using currentBytecode"
	lwzx r9,r9,r25   (jumptable)  "load the goto address"
	mtctr r9         "move to the count register and do the jump"
	bctr

# define FOO_REG asm("24")
# define JP_REG asm("25")
# define IP_REG asm("26")
# define SP_REG asm("27")
# define CB_REG asm("28")
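
For anyone who hasn't looked at the gnuified interpreter, here is a
minimal sketch of the dispatch pattern the assembler above implements.
It is not the real interp.c: the three opcodes, the names run_sketch and
fields, and the stack size are all made up for illustration, and it
relies on the GCC "labels as values" extension.

```c
/* Hypothetical three-opcode bytecode set, for illustration only. */
enum { OP_PUSH_FIELD0 = 0, OP_ADD = 1, OP_HALT = 2 };

/* Runs `code` against an object whose fields are `fields` and returns
   the value left on top of the stack when OP_HALT is reached. */
long run_sketch(const unsigned char *code, const long *fields)
{
    /* Jump table indexed by bytecode (GCC "labels as values"). */
    static void *const jumpTable[] = { &&pushField0, &&add, &&halt };

    long stack[16] = {0};
    long *localSP = stack;               /* points at the current top of stack */
    const unsigned char *localIP = code; /* points at the current bytecode */
    unsigned char currentBytecode = *localIP;

    goto *jumpTable[currentBytecode];

pushField0:
    currentBytecode = *++localIP;  /* fetch the next bytecode early (the lbzu) */
    *++localSP = fields[0];        /* push the first field of the receiver */
    goto *jumpTable[currentBytecode];  /* slwi / lwzx / mtctr / bctr */

add:
    currentBytecode = *++localIP;
    localSP[-1] += localSP[0];     /* add the top two stack entries */
    --localSP;
    goto *jumpTable[currentBytecode];

halt:
    return *localSP;
}
```

Running the program { OP_PUSH_FIELD0, OP_PUSH_FIELD0, OP_ADD, OP_HALT }
against an object whose first field is 21 leaves 42 on top of the stack.
The point is that every bytecode body ends in its own indirect jump, so
the compiler cannot see which body runs next.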

Now, with the GCC 3.3 compiler, this is where the global common
subexpression elimination logic has an issue. Quite a few routines jump
to commonSend/normalSend, which finds the method based on the method
cache. It seems the compiler looks at this and, because the branch is a
computed goto, it doesn't know where the branch is going; but since lots
of them do go to that method lookup, it decides it should pre-load the
methodCache address, just because that avoids a stall condition later.

So our nice assembler becomes the code below, where you'll note the
loading of r8 and the storage of r8 into r1+64. The problem is that
these two instructions get added to the instructions for almost every
bytecode! I can appreciate that the decision on Intel x86 could be
different, and it would be interesting to hear what it does. However, we
now have a two-instruction overhead for lots of bytecodes (bad), which
btw is a read & write to slow memory.

L1578:
	lwz r12,84(r24)
	->>>addis r8,r31,ha16(_methodCache-"L00000000049$pb")
	lbzu r28,1(r26)
	lwz r11,4(r12)
	slwi r10,r28,2
	stwu r11,4(r27)
	lwzx r9,r10,r25
	->>> stw r8,64(r1)
	mtctr r9
	bctr

After fiddling with it a bit and not getting anywhere, it occurred to me
that I should go back and look at the issue of putting the atCache,
rootTable, and methodCache arrays back in the foo structure, just to see
what happens. In the past I did that, but CodeWarrior did odd things
with the effective address calculations.

Now things look better; we get rid of the methodCache address loading:

L1578:
	lwz r11,84(r24)  (foo structure)  "loads foo->receiver"
	lbzu r28,1(r26)  (currentBytecode & localIP)  "increments localIP & gets the byte into currentBytecode"
	lwz r10,4(r11)   "adds BaseHeaderSize to get the final address"
	slwi r9,r28,2    (currentBytecode)  "calculate the gnuified jumptable index using currentBytecode"
	stwu r10,4(r27)  (localSP)  "the store for the longAtput logic; note the autoincrement of localSP!"
	lwzx r8,r9,r25   (jumptable)  "load the address"
	mtctr r8         "move to the count register and do the jump"
	bctr

The interesting thing here is that GCC 3.3 has reordered the
instructions a bit to reduce stalls and used a stwu to auto-increment
localSP, so the instruction count is 8, versus 9 with gcc 2.95.x, and
one stall condition is removed. This relates, of course, to my earlier
comments that gcc 3.3 produced faster code for interp.c.

Dropping the three arrays into the foo structure removes 2 instructions
each time one of them is referenced.
Things like
	addis r28,r31,ha16(_atCache-"L00000000049$pb")
	la r7,lo16(_atCache-"L00000000049$pb")(r28)
	ori r6,r12,128
	add r2,r6,r7
	lwz r0,4(r2)
become
	ori r4,r12,128
	add r9,r4,r24
	lwz r0,16644(r9)
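
At the C level the difference looks roughly like the sketch below. The
names foo_sketch, AT_CACHE_SIZE, read_global and read_embedded are
invented for illustration, and the sizes and field layout are not the
real VM's; whether the global form actually costs extra instructions
depends on the platform and on how the code is compiled.

```c
#include <stddef.h>

#define AT_CACHE_SIZE 64  /* hypothetical size, not the VM's real value */

/* Variant 1: the cache is a separate global, so on the PowerPC build
   above each access first has to materialize the array's address
   (the addis/la pair). */
long atCacheGlobal[AT_CACHE_SIZE];

/* Variant 2: the cache lives inside the interpreter state structure, so
   it sits at a constant offset from a pointer that is already held in a
   register (r24 in the listings above). */
struct foo_sketch {
    long receiver;
    long atCache[AT_CACHE_SIZE];
};

long read_global(size_t index)
{
    return atCacheGlobal[index];
}

long read_embedded(const struct foo_sketch *foo, size_t index)
{
    /* address = foo + offsetof(struct foo_sketch, atCache)
                     + index * sizeof(long) */
    return foo->atCache[index];
}
```

Both functions return the same data; only the address computation
differs, and that is where the two saved instructions come from.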

I'll look at changing VMMaker to enable the storage of arrays in the
foo structure. I can't speak for Intel, but it would be interesting to
know what decision gcc 3.3 is making for bytecode 0, or where the gcse
is messing up.

On Wednesday, August 6, 2003, at 03:41  PM, Andreas Raab wrote:

>> Just re-compiling gnu-interp.c with -O2 -fno-gcse results in faster
>> tinyBenchmarks and slower macroBenchmarks:
>
> In this situation I recommend looking at GCCs __attribute__ pragmas.  
> IIRC,
> then GCC 3 actually allows "compiler flags" to be set on individual  
> methods
> so doing something like
>
> __attribute__((option("no-gcse")))
>
> (not sure about the syntax) in the right places may be a worthwhile  
> little
> tweak (this could go into the gnuifier).
>
>> However, I wonder whether we've really got the best optimization
>> flags. I'm using gcc 3.3 now, and it looked like -O3 resulted in
>> worse tinyBenchmarks performance. I should try it with the
>> macroBenchmarks.
>
> Finding "the best" optimizations is a quite complex task here...
>
> Cheers,
>   - Andreas
>
>
>
>
--
===========================================================================
John M. McIntosh <johnmci at smalltalkconsulting.com> 1-800-477-2659
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================


