[ENH][VM] Improved code generation (hopefully ;)

John M McIntosh johnmci at mac.com
Wed Jul 9 01:17:35 UTC 2003


> From:   "Andreas Raab" < andreas.raab at g... >
> Date:   Tue Jul 8, 2003  10:55 pm
> Subject:   RE: [ENH][VM] Improved code generation (hopefully ;)
>
>
>
> Hi,
>
> > I've been looking at commonReturn and started to wonder why
> > the usage of
> >
> > localCntx _ localReturnContext.
> > localVal _ localReturnValue.
>
> Mostly because I wasn't sure if all compilers will be able to recognize
> that these values are in fact used read-only for the branch. E.g., a
> "lesser compiler" might not look at the branch individually and just say
> "well that gets read and written all over the places so I won't even try
> to optimize it into registers". If you got a smart compiler then the
> first thing it'll do is to recognize that this really *is* an alias and
> no bad things will happen as a result. Or at least that's what I think ;-)
>
> In addition, I wanted to be able to play around with making these
> variables global and that may hurt PPC and others as long as there aren't
> any explicit hints given that these values should be kept in registers
> (see all those "keep in register" comments spread out through
> interpreter).

Well, they don't get made global because they are only used in one
routine. That doesn't hurt the PPC code, since we have lots of registers
to work with. Making them truly global would hurt things on PowerPC.

>
> > You see with the current VMMaker any sole usage of a instance
> > variable in Interpreter will get folded
> > into a local variable versus a global if all references to that
> > instance variable inline into a single C procedure.
>
> WHAT??? Are you trying to tell me that if I add an instance variable to
> class Interpreter which is only used in a single method (most likely
> interpret()) it will automatically become a temp of that method? Where
> can I turn this off? I don't want it - it's bad, bad, bad! Global
> variables are cheap, stack relative addressing is expensive if your
> registers are already cramped up with localIP,SP, etc.

If you have an accessor for the global, then it won't become localized,
because then two or more procedures access the global. The reason for this
code was to optimize the garbage collector, which is spread across multiple
methods, but mostly each phase (markAndTrace/Sweep) becomes completely
inlined, so the working variables are quite localized.
For the PowerPC this makes a significant improvement, upwards of 50%.

See the C Code Generator localizeGlobalVariables methods. Maybe you can
turn it off for Intel, but ensure you don't greatly impact GC benchmarks
before doing this.
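
As a rough illustration of what that localization buys, here is a minimal C sketch. The names markCountGlobal, sweepWithGlobal and sweepWithLocal are made up for this example and do not come from the generated interpreter source; the point is only the difference between a file-level variable and a local that the compiler is free to keep in a register.

```c
#include <stddef.h>

/* Hypothetical working variable, emitted at file level as it would be
   when several C functions reference it. */
static int markCountGlobal;

/* Updates go through the global; the compiler can only keep it in a
   register if it can prove nothing else observes it in between. */
int sweepWithGlobal(const int *heap, size_t n) {
    markCountGlobal = 0;
    for (size_t i = 0; i < n; i++) {
        if (heap[i] != 0) {
            markCountGlobal++;
        }
    }
    return markCountGlobal;
}

/* Same loop after "localization": the variable is used in this one
   function only, so it becomes a plain local. */
int sweepWithLocal(const int *heap, size_t n) {
    int markCount = 0;
    for (size_t i = 0; i < n; i++) {
        if (heap[i] != 0) {
            markCount++;
        }
    }
    return markCount;
}
```

Both functions compute the same count; whether the local version is actually faster depends on the compiler and on how many registers the target has to spare.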

As for your interpret() question, mine is defined as
int interpret(void) {
#ifdef FOO_REG
     register struct foo * foo FOO_REG = &fum;
#endif
     int localReturnValue;
     int localReturnContext;
     int localHomeContext;
     register char* localSP SP_REG;
     register char* localIP IP_REG;
     register int currentBytecode CB_REG;

which I think are all there on purpose? If you were to add a global
foobarCounter and only reference it in an inlined procedure in
interpret(), then it would become a local variable, unless you make an
accessor for it.


>
> > However I'll point out that GCC and codewarrior with
> > optimization decides we are idiots and ignores
> > the localCntx/Val constructs because they are read only and
> > fold back to the localReturnContext/Value
>
> Which is exactly what a good compiler _should_ do ;-)
>
> > I did create a change set to do this, but then in doing this
> > I reviewed the usage of externalizeIPandSP and decided one of
> > the main usage is to avoid issues with positive32BitIntegerFor:
>
> I don't understand you. What issues are you talking about and why is it
> the "main" usage? Both, externalizeIPandSP as well as internalizeIPandSP
> are used to transfer state between interpret() and the rest of the system
> so that whenever we get out of interpret() we can still refer to the
> instruction pointer and the stack pointer. I don't see what this has to
> do with #positive32BitIntegerFor:.

I believe the reason for invoking externalizeIPandSP around 'most' calls,
for example in the bit math routines, is to guard against
positive32BitIntegerFor: allocating a LargeInteger. I don't consider the
floating-point routines, because they usually allocate a Float object.
But I'll think more about whether it's worth doing.
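
To make the pattern concrete, here is a minimal C sketch of the externalize/call/internalize sequence. Everything in it is a stand-in: allocateLargeInteger, ipSeenByAllocator and bitAndSketch are hypothetical names, and the "allocator" only records the global instruction pointer it sees rather than allocating anything.

```c
#include <stddef.h>

/* Interpreter state as seen by the rest of the VM (hypothetical). */
static char *instructionPointer;
static char *stackPointer;

/* Stand-in for a routine that may allocate, and so may run code that
   reads the global pointers. It records the IP it observed. */
static char *ipSeenByAllocator;

static int allocateLargeInteger(unsigned int value) {
    ipSeenByAllocator = instructionPointer;
    return (int)(value & 0x7fffffffu);
}

int bitAndSketch(char *ipStart, char *spStart, unsigned int a, unsigned int b) {
    char *localIP = ipStart;  /* fast local copies used inside interpret() */
    char *localSP = spStart;
    int result;

    localIP += 1;             /* pretend one bytecode was fetched */

    /* externalize: publish the local copies before leaving the fast path */
    instructionPointer = localIP;
    stackPointer = localSP;

    result = allocateLargeInteger(a & b);

    /* internalize: reload in case the callee changed the globals */
    localIP = instructionPointer;
    localSP = stackPointer;
    (void)localIP;
    (void)localSP;

    return result;
}
```

Without the externalize step, the callee would see whatever stale value the globals held before the bytecode loop advanced its local copies.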

>
> > Which points to considering inlining the bit bytecode
> > primitives, and primitivePointX & primitivePointY
>
> Do you have any benchmarks that show the effect? I hate adding complexity
> without actually improving anything.

Well, primitivePointX and primitivePointY are simple.
| p a b v |
p _ Point x: 1 y: 2.
v _ Time millisecondsToRun: [10000000 timesRepeat: [a _ p x. b _ p y]].
^v

gives 8572 & 9151 ms before,
8140 & 7964 ms after.

primitivePointX
	| rcvr |
	self inline: false.
	rcvr _ self popStack.
	self assertClassOf: rcvr is: (self splObj: ClassPoint).
	successFlag
		ifTrue: [self push: (self fetchPointer: XIndex ofObject: rcvr)]
		ifFalse: [self unPop: 1]

becomes

primitivePointX
	| rcvr |
	self inline: true.
	rcvr _ self internalStackTop.
	self internalPop: 1.	
	self assertClassOf: rcvr is: (self splObj: ClassPoint).
	successFlag
		ifTrue: [self internalPush: (self fetchPointer: XIndex ofObject: rcvr)]
		ifFalse: [self internalUnPop: 1]


>
> > Also this brought back another memory, I'm sure Anthony
> > (years?) back pointed out that usage of
> > instantiateSmallClasssizeInBytesfill results in filling the
> > allocated object with 0 or nil, but then we just fill the object with
> > data right away, this is silly. Think we could follow thru on
> > his idea?
>
> Well, I think we may want to consider a new primitive which is capable of
> allocating bit objects without initialization (how about
> #primitiveDirtyNew ;-) The code could be used in those places where we
> care on the VM-level such as float or large integer allocation. BTW, I'm
> no big fan of making this the default for bit objects - it gives you the
> ability to read arbitrary memory from Squeak and that's a big security
> risk (think about a situation in which we just entered a password and
> afterwards we simply allocate a huge chunk of memory and search for this
> password). If we add this ability it should be the exceptional case and
> not the default.

No problem there.

>
> And again, before doing anything like that I want to see benchmark results.
>
> > And why does signed32BitIntegerFor: use instantiateClass, versus
> > instantiateSmallClasssizeInBytesfill?
>
> At the time I wrote this, it had (literally) no users. Certainly not in
> any critical places. If you can show me some improvement in any
> benchmarks you devise we can certainly change it ;-)

Well, it's just a consistency thing, because positive32BitIntegerFor:,
signed64BitIntegerFor: and positive64BitIntegerFor: all use
instantiateSmallClasssizeInBytesfill, and I was wondering if there was
some technical reason for not using it.

instantiateSmallClasssizeInBytesfill is used by
floatObjectOf:
makePointwithxValue:yValue:
positive32BitIntegerFor:
positive64BitIntegerFor:
signed64BitIntegerFor:

I'll note that LargeIntegersPlugin is a heavy user of instantiateClass,
and there I really wonder about filling with zeros, then for the most
part refilling with the actual data.

So, given that we've already got the allocator set up to do no fill,
because that's how contexts are allocated, I made an
instantiateSmallClass: classPointer sizeInBytes: sizeInBytes

and for
| v |
v _ Time millisecondsToRun: [10000000 timesRepeat: [1 asFloat]].
^v

Before 13081,14205,12756
After 12679,12899,12470

The gain is slight. Smarter tools showed that the fill in
instantiateSmallClasssizeInBytesfill: took about 15% of the cycles, so
it is significant for the execution of the routine.
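
The redundant work can be pictured with a small C sketch. This is not the VM allocator: allocFilled, allocUnfilled and boxDouble are hypothetical names, and malloc stands in for the object memory. It only shows a caller that overwrites the whole body immediately, so the fill pass does nothing useful for it.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Allocate and fill every byte, as a filling allocator would. */
unsigned char *allocFilled(size_t sizeInBytes, unsigned char fill) {
    unsigned char *obj = malloc(sizeInBytes);
    if (obj != NULL) {
        memset(obj, fill, sizeInBytes);
    }
    return obj;
}

/* Allocate without touching the body; contents are indeterminate
   until the caller writes them. */
unsigned char *allocUnfilled(size_t sizeInBytes) {
    return malloc(sizeInBytes);
}

/* A caller that stores its data right away, the way boxing a float
   would. With skipFill set, the body is written exactly once. */
unsigned char *boxDouble(double value, int skipFill) {
    unsigned char *obj = skipFill ? allocUnfilled(sizeof value)
                                  : allocFilled(sizeof value, 0);
    if (obj != NULL) {
        memcpy(obj, &value, sizeof value);
    }
    return obj;
}
```

The no-fill path is only safe when the caller really does overwrite every byte before the object becomes visible, which is the security concern raised earlier in the thread.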

If I try
| p v |
p _ InputSensor new.
v _ Time millisecondsToRun: [10000000 timesRepeat: [p primMousePt]].

then times like
before 32688
after 30763
are likely.

So I'll wrap up a change set later tonight (4-6 hours out).

I'll note that Point x:y: doesn't invoke the makePoint bytecode (or
primitive?), or am I missing something here?
x: xInteger y: yInteger
	"Answer an instance of me with coordinates xInteger and yInteger."

	^self new setX: xInteger setY: yInteger

Ah, yes, in Number, @ falls back to Point x:y: on failure.

10,000,000 iterations of Point x: 1 y: 2 take 22.815 seconds, but 1 @ 2
takes 12.033 seconds, which is really quite significant.
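
For a per-call view of those two timings, a small C helper is enough. The function names are made up for this note, and it assumes the second figure means 12,033 milliseconds, that is, 12.033 seconds.

```c
/* Cost of one iteration in microseconds, given a total wall time in
   seconds and an iteration count. */
double microsPerCall(double totalSeconds, double iterations) {
    return totalSeconds * 1e6 / iterations;
}

/* How many times faster the second timing is than the first. */
double speedup(double slowSeconds, double fastSeconds) {
    return slowSeconds / fastSeconds;
}
```

Called as microsPerCall(22.815, 1e7) and microsPerCall(12.033, 1e7), this gives roughly 2.3 microseconds per iteration for Point x:y: against roughly 1.2 for @, a bit under a factor of two. Loop overhead is included in both figures.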

But then, are the other uses of Point x:y: in the image valid?

Say, in POVertex or BlobMorph?


--
===========================================================================
John M. McIntosh <johnmci at smalltalkconsulting.com> 1-800-477-2659
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================


