global variables as structure in VM.

John.Maloney at
Fri Apr 5 18:34:20 UTC 2002


These performance figures are quite interesting. I think it's possible
that this change would also improve performance on register-poor machines
such as the 68K, which never ran Squeak as fast as it should have
(based on the relative performance of a set of C benchmarks).

What does this change do for performance under OS 9? How about
Windows and *nix? I'm wondering if this could be made an option
at VM generation time to allow folks to do some experiments on other
platforms.

Good work discovering that global vars were accessed differently
under OS X, by the way! I guess it pays to look at the assembly code
every now and then...

	-- John

At 1:20 AM -0800 4/5/02, John M McIntosh wrote:
>It seems back in 1999 I coded up some changes to coerce the 
>definition of VM global variables into a structure and then of course 
>refer to the variables as foo->whatever. Mmm this does allow you to 
>build interp.c with just one global var foo that points to your 
>allocated structure. Someone might find that alone useful.
>But why do this?
>Well on the powerpc global variables are a bit odd. Under os-9 to 
>access one you must do something like
>lwz      r24,method(RTOC)  // load r24 with the address of method, based off RTOC (r2)
>lwz      r3,0(r24) // load the word addressed by method into r3.
>But on os-x it's a little different. Since you can't load a 32-bit
>address directly as immediate data under the ppc architecture, you must
>do it in high and low 16-bit chunks. The runtime architecture is also a
>bit different, and globals aren't found off register 2 (RTOC). So a
>somewhat dumb gcc powerpc compile does:
>	addis r9,r31,ha16(L_method$non_lazy_ptr-L96$pb)
>	lwz r9,lo16(L_method$non_lazy_ptr-L96$pb)(r9)
>	lwz r3,0(r9) //r9 is the address of method so load word into r3
>where r31 is set up to point to our storage, and the ha16 and lo16
>really are the high 16 and low 16 bits of the offset into the storage
>area.
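
A small C sketch of how the two 16-bit halves recombine may help here. It assumes the usual PowerPC assembler meaning of these operators: lo16 is the low half, which the load instruction sign-extends, and ha16 is the "high adjusted" half, which adds one when bit 15 of the low half is set so that the sign extension cancels out. The function names mirror the operators, but recombine() is a hypothetical helper written only for illustration.

```c
#include <stdint.h>

/* Low 16 bits of a 32-bit value. The load instruction sign-extends this
   displacement before adding it to the base register. */
uint16_t lo16(uint32_t x)
{
    return (uint16_t)(x & 0xFFFFu);
}

/* "High adjusted" 16 bits: the upper half, plus one when bit 15 of the
   low half is set, to cancel the sign extension of lo16. */
uint16_t ha16(uint32_t x)
{
    return (uint16_t)((x + 0x8000u) >> 16);
}

/* Recombine the halves the way an addis/lwz pair does:
   (ha16 << 16) + sign_extend(lo16), modulo 2^32.
   Hypothetical helper, not part of any toolchain. */
uint32_t recombine(uint32_t x)
{
    uint32_t high = (uint32_t)ha16(x) << 16;
    uint32_t low  = (uint32_t)(int32_t)(int16_t)lo16(x);
    return high + low;
}
```

For any 32-bit offset x, recombine(x) gives back x, which is why the addis/lwz pair in the listing reaches the right address even when the low half looks negative.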
>Usually the optimizing compiler can intermix instructions, and the
>multiple adders etc. will solve the address calculations in fewer
>cycles than you think.
>However by using a global structure you buy a few things. Under os-9 
>the compiler would preload global addresses like so
>		lwz      r25,instructionPointer(RTOC)
>		lwz      r15,theHomeContext(RTOC)
>		lwz      r27,stackPointer(RTOC)
>		lwz      r20,0(r25)
>		lwz      r23,messageSelector(RTOC)
>		lwz      r13,receiver(RTOC)
>		lwz      r24,method(RTOC)
>		lwz      r26,argumentCount(RTOC)
>		lwz      r28,specialObjectsOop(RTOC)
>		lwz      r29,successFlag(RTOC)
>		lwz      r21,0(r27)
>		lwz      r22,0(r15)
>but by having stuff in a structure we see this instead
>		lwz      r31,foo(RTOC)
>		lwz      r29,@4589(RTOC)
>		lwz      r26,244(r31)
>		lwz      r30,@4588(RTOC)
>		lwz      r27,532(r31)
>		lwz      r28,580(r31)
>		lbzu     r25,1(r26)
>7 registers used versus 12 (thus more registers available for other vars)
>plus fewer instructions being executed per routine on average. So it's
>For OS-X it's a bit more complicated. But typically we remove
>two instructions for every global variable reference!
>say this chunk of C
>		CASE(96)
>			/* storeAndPopReceiverVariableBytecode */
>			/* begin fetchNextBytecode */
>			currentBytecode = byteAt(++localIP);
>			t2 = receiver;
>			t1 = longAt(localSP);
>			if (t2 < youngStart) {
>				possibleRootStoreIntovalue(t2, t1);
>			}
>before is
>	addis r9,r31,ha16(L_receiver$non_lazy_ptr-L96$pb)
>	addis r11,r31,ha16(L_youngStart$non_lazy_ptr-L96$pb)
>	lwz r9,lo16(L_receiver$non_lazy_ptr-L96$pb)(r9)
>	lwz r11,lo16(L_youngStart$non_lazy_ptr-L96$pb)(r11)
>	lwz r30,0(r9)
>	lwz r0,0(r11)
>	lbzu r28,1(r26)
>	cmpw cr0,r30,r0
>	lwz r21,0(r27)
>	bc 4,0,L1973
>	mr r3,r30
>	mr r4,r21
>	bl L_possibleRootStoreIntovalue$stub
>after is
>	lwz r23,168(r21) // r21 points to the global structure
>	lwz r0,264(r21)
>	lbzu r28,1(r26)
>	cmpw cr0,r23,r0
>	lwz r20,0(r27)
>	bc 4,0,L1893
>	mr r3,r23
>	mr r4,r20
>	bl L_possibleRootStoreIntovalue$stub
>as you see we remove 4 instructions.
>But let's look at the numbers. Before the change:
>'44817927 bytecodes/sec; 1476556 sends/sec'
>'44943820 bytecodes/sec; 1481815 sends/sec'
>'44880785 bytecodes/sec; 1483136 sends/sec'
>'44537230 bytecodes/sec; 1468736 sends/sec'
>'44413601 bytecodes/sec; 1481815 sends/sec'
>'45133991 bytecodes/sec; 1487112 sends/sec'
>'44817927 bytecodes/sec; 1488442 sends/sec'
>'44568245 bytecodes/sec; 1480497 sends/sec'
>'44755244 bytecodes/sec; 1477867 sends/sec'
>'44975404 bytecodes/sec; 1485784 sends/sec'
>After, with the structure and sqGnu.h changes:
>'48706240 bytecodes/sec; 1707379 sends/sec'
>'49230769 bytecodes/sec; 1719372 sends/sec'
>'49042145 bytecodes/sec; 1717179 sends/sec'
>'48929663 bytecodes/sec; 1711720 sends/sec'
>'49192928 bytecodes/sec; 1709547 sends/sec'
>'49155145 bytecodes/sec; 1724879 sends/sec'
>'49117421 bytecodes/sec; 1721570 sends/sec'
>'48892284 bytecodes/sec; 1713900 sends/sec'
>'48780487 bytecodes/sec; 1704137 sends/sec'
>'49344641 bytecodes/sec; 1708462 sends/sec'
>Note the significant jump in sends/sec.
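
To put a number on that jump, the following C sketch averages the first ten runs (before the change) and the second ten (after the structure and sqGnu.h changes) and reports the percentage gain. The figures are copied from the lists above; the function and array names are hypothetical. Worked through by hand, the means come out to roughly 9.5% more bytecodes/sec and roughly 15.7% more sends/sec.

```c
#include <stddef.h>

/* Arithmetic mean of n samples. */
double mean(const double *samples, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Percentage gain of the "after" mean over the "before" mean. */
double percent_gain(const double *before, const double *after, size_t n)
{
    double b = mean(before, n);
    double a = mean(after, n);
    return (a - b) / b * 100.0;
}

/* Figures copied from the two sets of runs quoted in the message. */
const double bytecodes_before[10] = {
    44817927, 44943820, 44880785, 44537230, 44413601,
    45133991, 44817927, 44568245, 44755244, 44975404
};
const double bytecodes_after[10] = {
    48706240, 49230769, 49042145, 48929663, 49192928,
    49155145, 49117421, 48892284, 48780487, 49344641
};
const double sends_before[10] = {
    1476556, 1481815, 1483136, 1468736, 1481815,
    1487112, 1488442, 1480497, 1477867, 1485784
};
const double sends_after[10] = {
    1707379, 1719372, 1717179, 1711720, 1709547,
    1724879, 1721570, 1713900, 1704137, 1708462
};
```

Calling percent_gain(sends_before, sends_after, 10) gives the sends/sec improvement; the bytecode arrays work the same way.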
>I've some more thinking to do before releasing...
>But if someone wants to try on other platforms, please email for a change set.
>John M. McIntosh <johnmci at> 1-800-477-2659
>Corporate Smalltalk Consulting Ltd.
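
The technique described in the quoted message, gathering the VM globals into one structure reached through a single pointer named foo, can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the field names are borrowed from the variable names in the listings, but their types, order, and offsets are guesses, and init_globals() and needs_root_check() are hypothetical helpers rather than anything from the actual interp.c or change set.

```c
#include <stdlib.h>

/* Hypothetical subset of interpreter globals gathered into one structure.
   Field names come from the listings in the message; the types and layout
   are assumptions, not the real interp.c definitions. */
struct foo_globals {
    int *stackPointer;
    unsigned char *instructionPointer;
    int receiver;
    int method;
    int youngStart;
    int argumentCount;
    int successFlag;
};

/* The single remaining global: a pointer to the allocated structure. */
struct foo_globals *foo;

/* Allocate and zero the structure once at startup.
   Returns 0 on success, -1 if the allocation fails. */
int init_globals(void)
{
    foo = calloc(1, sizeof *foo);
    return foo ? 0 : -1;
}

/* The comparison from the CASE(96) excerpt, written against the structure:
   both operands are fixed offsets from one base pointer, so the compiler
   can keep that pointer in a register and use plain displacement loads. */
int needs_root_check(void)
{
    return foo->receiver < foo->youngStart;
}
```

With this shape, every former global becomes foo->name, which matches the single-load "after" listings: one register holds the base pointer and each access is a load at a constant offset.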
