global variables as structure in VM.

Swan, Dean Dean_Swan at Mitel.COM
Fri Apr 5 18:11:01 UTC 2002


	This would be *VERY* useful to me.  One of the odd limitations
of the EPOC OS (on various Psion machines, and the Nokia 9200
communicators) is that fully compliant applications must not use
global variables (other than one pointer to a heap structure), so
this is *exactly* what is necessary to get Squeak going on these machines.

	There has been some discussion about doing this, but nobody
has had the time to actually tackle it, so if you can make this
available, I might allocate some time to finally getting Squeak
running on my Series 5mx (36 MHz ARM7).

					-Dean Swan

-----Original Message-----
From: John M McIntosh [mailto:johnmci at]
Sent: Friday, April 05, 2002 4:21 AM
To: squeak-dev at
Subject: global variables as structure in VM.

It seems back in 1999 I coded up some changes to coerce the 
definition of VM global variables into a structure and then of course 
refer to the variables as foo->whatever. Mmm this does allow you to 
build interp.c with just one global var foo that points to your 
allocated structure. Someone might find that alone useful.

But why do this?

Well, on the PowerPC global variables are a bit odd. Under OS 9, to 
access one you must do something like

	lwz      r24,method(RTOC)  // load r24 with the address of method, based off RTOC (r2)
	lwz      r3,0(r24)         // load the word addressed by method into r3

But on OS X it's a little different. Under the PPC architecture you 
can't load a 32-bit address directly as immediate data, so you must do 
it in high and low 16-bit chunks. The runtime architecture is also a 
bit different: globals aren't found off register 2 (RTOC). So a 
somewhat dumb gcc PowerPC compile does:

	addis r9,r31,ha16(L_method$non_lazy_ptr-L96$pb)
	lwz r9,lo16(L_method$non_lazy_ptr-L96$pb)(r9)
	lwz r3,0(r9) //r9 is the address of method so load word into r3

where r31 is set up to point to our storage, and ha16 and lo16 are the 
high 16 bits (adjusted for the sign extension of the low half) and the 
low 16 bits of the offset into the storage area.

Usually the optimizing compiler can intermix instructions, and the 
multiple adders etc. will resolve the address calculations in fewer 
cycles than you think.

However, by using a global structure you buy a few things. Under OS 9 
the compiler would preload global addresses like so

		lwz      r25,instructionPointer(RTOC)
		lwz      r15,theHomeContext(RTOC)
		lwz      r27,stackPointer(RTOC)
		lwz      r20,0(r25)
		lwz      r23,messageSelector(RTOC)
		lwz      r13,receiver(RTOC)
		lwz      r24,method(RTOC)
		lwz      r26,argumentCount(RTOC)
		lwz      r28,specialObjectsOop(RTOC)
		lwz      r29,successFlag(RTOC)
		lwz      r21,0(r27)
		lwz      r22,0(r15)

but by having stuff in a structure we see this instead

		lwz      r31,foo(RTOC)
		lwz      r29,@4589(RTOC)
		lwz      r26,244(r31)
		lwz      r30,@4588(RTOC)
		lwz      r27,532(r31)
		lwz      r28,580(r31)
		lbzu     r25,1(r26)

7 registers used versus 12 (thus more registers available for other vars)
plus fewer instructions being executed per routine on average. So it's

For OS X it's a bit more complicated, but typically we remove
two instructions for every global variable reference!

Say, this chunk of C:

			/* storeAndPopReceiverVariableBytecode */
			/* begin fetchNextBytecode */
			currentBytecode = byteAt(++localIP);
			t2 = receiver;
			t1 = longAt(localSP);
			if (t2 < youngStart) {
				possibleRootStoreIntovalue(t2, t1);

before is

	addis r9,r31,ha16(L_receiver$non_lazy_ptr-L96$pb)
	addis r11,r31,ha16(L_youngStart$non_lazy_ptr-L96$pb)
	lwz r9,lo16(L_receiver$non_lazy_ptr-L96$pb)(r9)
	lwz r11,lo16(L_youngStart$non_lazy_ptr-L96$pb)(r11)
	lwz r30,0(r9)
	lwz r0,0(r11)
	lbzu r28,1(r26)
	cmpw cr0,r30,r0
	lwz r21,0(r27)
	bc 4,0,L1973
	mr r3,r30
	mr r4,r21
	bl L_possibleRootStoreIntovalue$stub

after is

	lwz r23,168(r21) // r21 points to the global structure
	lwz r0,264(r21)
	lbzu r28,1(r26)
	cmpw cr0,r23,r0
	lwz r20,0(r27)
	bc 4,0,L1893
	mr r3,r23
	mr r4,r20
	bl L_possibleRootStoreIntovalue$stub

as you see we remove 4 instructions.

But let's look at the numbers. Before:
'44817927 bytecodes/sec; 1476556 sends/sec'
'44943820 bytecodes/sec; 1481815 sends/sec'
'44880785 bytecodes/sec; 1483136 sends/sec'
'44537230 bytecodes/sec; 1468736 sends/sec'
'44413601 bytecodes/sec; 1481815 sends/sec'
'45133991 bytecodes/sec; 1487112 sends/sec'
'44817927 bytecodes/sec; 1488442 sends/sec'
'44568245 bytecodes/sec; 1480497 sends/sec'
'44755244 bytecodes/sec; 1477867 sends/sec'
'44975404 bytecodes/sec; 1485784 sends/sec'

After, with the structure and sqGnu.h changes:
'48706240 bytecodes/sec; 1707379 sends/sec'
'49230769 bytecodes/sec; 1719372 sends/sec'
'49042145 bytecodes/sec; 1717179 sends/sec'
'48929663 bytecodes/sec; 1711720 sends/sec'
'49192928 bytecodes/sec; 1709547 sends/sec'
'49155145 bytecodes/sec; 1724879 sends/sec'
'49117421 bytecodes/sec; 1721570 sends/sec'
'48892284 bytecodes/sec; 1713900 sends/sec'
'48780487 bytecodes/sec; 1704137 sends/sec'
'49344641 bytecodes/sec; 1708462 sends/sec'

Note the significant jump in sends/sec: from roughly 1.48 million to 
1.71 million, about 16%, while bytecodes/sec rises by about 9.5%.
I've some more thinking to do before releasing...
But if someone wants to try on other platforms, please email for a change set.
John M. McIntosh <johnmci at> 1-800-477-2659
Corporate Smalltalk Consulting Ltd.
