[Vm-dev] [NB] NativeBoost meets JIT

Fri Sep 21 20:49:44 UTC 2012

On 21 September 2012 19:50, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>
> Hi Igor,
>
>     great news!
>
> On Fri, Sep 21, 2012 at 7:59 AM, Igor Stasenko <siguctua at gmail.com> wrote:
>>
>>
>> Hello there,
>>
>> so, we've entered a new area, where native code generated from the
>> image side can be run directly by the JIT.
>> This feature was one of the first things I wanted to try once
>> Eliot released Cog :)
>>
>> The way we do that: when the VM decides to JIT a specific method,
>> we copy the native code (from the method trailer)
>> directly into the method's generated code.
>> All you need to do is use a special primitive for that, 220
>> (#primitiveVoltage)
>>
>> So, the first question we wanted answered is how much faster it is to
>> run native code via the JIT,
>> compared to running native code via the NativeBoost primitive, which is
>> #primitiveNativeCall.
>>
>> Here are methods which just answer 42:
>>
>> This one uses #primitiveNativeCall:
>>
>> nbFoo2
>>         <primitive: #primitiveNativeCall module: #NativeBoostPlugin error: errorCode>
>>
>>         ^ NBNativeCodeGen methodAssembly: [:gen :proxy :asm |
>>                 asm noStackFrame.
>>                 asm
>>                         mov: (42 << 1) + 1 to: asm EAX;
>>                         ret.
>>         ]
>>
>> And this one uses JIT:
>>
>> nbFoo
>>         <primitive: 220 error: errorCode>
>>
>>         [ errorCode = ErrRunningViaInterpreter  ] whileTrue: [ ^ self nbFoo ].
>>
>>         ^ NBNativeCodeGen jitMethodAssembly: [:gen :proxy :asm |
>>                 asm noStackFrame.
>>                 asm
>>                         mov: (42 << 1) + 1 to: asm EDX;
>>                         ret: 4 asUImm.
>>         ]
>>
>> And this one is code which the JIT can compile by itself:
>>
>> nbFoo42
>>         ^ 42
>>
>> So, here are the numbers (timeToRun answers milliseconds):
>>
>> Time to run via #primitiveNativeCall :
>>
>> [100000000 timesRepeat: [ MyClass nbFoo2  ] ] timeToRun
>>  6995
>>
>> Time to run via JIT:
>>
>> [100000000 timesRepeat: [ MyClass nbFoo  ] ] timeToRun
>> 897
>>
>> Time to run JITed method:
>>
>> [100000000 timesRepeat: [ MyClass nbFoo42  ] ] timeToRun
>> 899
>>
>> so, as you can see, the JITed method and our custom generated code are
>> on par (which is logical ;).
>>
>> Time to run an empty loop:
>>
>> [100000000 timesRepeat: [  ] ] timeToRun 679
>>
>>
>> So, here is the result: if we subtract the loop overhead, we can see
>> the difference between
>> calling our native code via the JIT vs via #primitiveNativeCall:
>>
>> (6995 - 679) / (897 - 679) asFloat   28.972477064220183
>>
>> About 29 times faster!!!!
>>
>> So, with this new feature, we can now make our generated code run
>> with unmatched speed,
>> without the overhead related to #primitiveNativeCall.
>> This is especially useful for implementing primitives which involve
>> heavy number crunching.
>>
>> I would release this code to the public, but there's one little
>> discrepancy I need to deal with first
>> (one little problem, which I hope Eliot can help to solve):
>>
>> it looks like primitivePerform: never enters JIT mode, but always
>> executes the method via the interpreter.
>
>
> I'll take a look.  This is all very detailed so I'll need a little time.
>

Heh.. it took me a while (more than a year) before I was able to
understand how I can hook in.. sure, I did not spend the whole year
working on that ;) , but anyway,
I am not expecting an immediate answer from you :)

>> This is why you see this code:
>>         [ errorCode = ErrRunningViaInterpreter  ] whileTrue: [ ^ self nbFoo ].
>>
>> because if I do it inside NBNativeCodeGen>>jitMethodAssembly:,
>> which checks for the same error and retries the send using the perform
>> primitive, it never enters JIT mode,
>> resulting in an endless loop :(
>>
>> This is despite the fact that the method is JITed, because we force the
>> JITing of that method during error handling:
>>
>>         lastError = ErrRunningViaInterpreter ifTrue: [
>>                 "a method contains native code, but executed by interpreter "
>>                 method forceJIT ifFalse: [ self error: 'Failed to JIT the compiled method. Try reducing its size' ].
>>                 ^ self retrySend: aContext
>>                 ].
>>
>> #forceJIT is the primitive, which I implemented as follows:
>>
>> primitiveForceJIT
>>
>>         <export: true >
>>
>>         | val result |
>>
>>         val := self stackTop.
>>
>>         (self isIntegerObject: val) ifTrue: [ ^ self primitiveFail ].
>>         (self isCompiledMethod: val) ifFalse: [ ^ self primitiveFail ].
>>
>>         (self methodHasCogMethod: val) ifFalse: [
>>                 cogit cog: val selector: objectMemory nilObject ].
>>
>>         result := (self methodHasCogMethod: val)
>>                 ifTrue: [ objectMemory trueObject ]
>>                 ifFalse: [ objectMemory falseObject ].
>>
>>         ^ self pop: 1 thenPush: result.
>>
>> As you can see from its usage, if the VM for some reason fails to JIT
>> the method, the primitive will answer false,
>> and we will stop with an error.. which apparently never happens.
>> Still, #primitivePerform seems to ignore that the method
>> contains machine code and always runs it interpreted :(
>>
>> I do not like the idea that users will be forced to manually put such
>> loops in every method they write..
>> Any ideas/suggestions on how to overcome that?
>
>
> Yes.  The JIT should be told that methods that have NB code should be jitted.  But right now I don't understand enough of how NB code is generated and methods marked that they have NB code etc to know exactly how to do this.  I need to play around a bit.
>

Let me explain some internal bits, to make it clear:
It does not really matter how the code is generated.. From the VM's
point of view it is simple:
it takes the bytes from the compiled method's trailer, and copies them
into the JITed method during code generation.

The hook for that is the 220-voltage ;) primitive, which I put
into #initializePrimitiveTableForSqueakV3,
so that, when Cog JITs the method, it calls the 'code generator' for
that primitive - #genPrimitiveNBNativeCall,
which does nothing but copy the bytes from the method's trailer
directly into the generated code,
or fails if there are none:

-------------------
genPrimitiveNBNativeCall
	| len trailer codeOffset instr |
	len := (objectMemory lengthOf: methodObj).

	trailer := (coInterpreter byteAt: methodObj + BaseHeaderSize + len-1 ).
	(trailer bitAnd: 2r11111100) = 40 " Native code trailer id "
		ifFalse: [ ^ -1"... fail somehow " ].

	"the next two bytes should be an offset for a native code start"
	codeOffset := (self byteAt: methodObj + BaseHeaderSize + len-4 ) +
		((self byteAt: methodObj + BaseHeaderSize + len-5 ) << 8).

	"entry point address is method oop + header + len - codeOffset"

	instr := (self cCoerce: (objectMemory firstFixedField: methodObj) to: 'sqInt')
		+ len - codeOffset.

	"copy generated code"	
	[ instr < (methodObj + len - 5) ] whileTrue: [
		self Fill32: (objectMemory longAt: instr ).
		instr := instr + 4.
	].

	^ 0
---------------
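
To make the trailer layout concrete, here is a small Python model of the
lookup that the listing above performs. It is only a sketch: the byte
positions are read off the listing (kind byte at len-1, offset bytes at
len-4 and len-5, code ending at len-5), the names NATIVE_TRAILER_ID and
extract_native_code are made up for illustration, and unlike the VM code
it slices bytes instead of copying 32-bit words.

```python
NATIVE_TRAILER_ID = 40  # trailer kind value tested in the listing above


def extract_native_code(body: bytes):
    """Return the native code stored before the trailer, or None.

    `body` models the method's bytes after the object header
    (a hypothetical stand-in for the real compiled method).
    """
    n = len(body)
    if n < 5:
        return None
    # The low two bits of the last byte are masked off before the comparison.
    if (body[n - 1] & 0b11111100) != NATIVE_TRAILER_ID:
        return None
    # Two-byte offset: low byte at n - 4, high byte at n - 5.
    code_offset = body[n - 4] | (body[n - 5] << 8)
    start = n - code_offset  # offset is counted back from the end of the method
    end = n - 5              # the code stops where the offset bytes begin
    if start < 0 or start > end:
        return None
    return body[start:end]


# Example: 3 bytes of "bytecode", 4 bytes of native code, 5 trailer bytes.
native = bytes([0xBA, 0x55, 0x00, 0x00])
offset = len(native) + 5  # distance from the end of the method to the code start
trailer = bytes([offset >> 8, offset & 0xFF, 0, 0, NATIVE_TRAILER_ID])
method_body = bytes([1, 2, 3]) + native + trailer
```

Calling extract_native_code(method_body) gives back the four native bytes,
and a body whose last byte does not carry the native-code id gives None,
which corresponds to the "fail somehow" branch.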

This way, the produced JITed method will contain the native code in
place of its primitive code.
The code for the method's bytecode is still generated as usual,
because the native code might want to fail the prim (and then it should
enter the method's body).
But as I said before, on failure in native code I'd rather switch back
to the interpreter and run the method's body interpreted, because the
body can often contain a lot of assembler code (since you are providing
its implementation in assembler), and JITing that code makes no sense
at all,
because it runs just once and, if JITed, will simply waste space.

Initially, the primitive itself (220) was not implemented at all
(so if you executed the method in the interpreter, it would simply fail
and enter the method's body), but then I added an implementation
which also always fails, but reports different error codes depending
on whether the executed method has native code in its trailer or not.

In future, I could make a simple change in #methodShouldBeCogged: to
check whether the method
uses primitive 220 and already has native code in its trailer, and if
so flag that method to be cogged,
regardless of anything else.
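
The check described above could look roughly like this, written as a
Python model rather than actual Cog code. The Method record, its field
names and the literal-count limit standing in for the existing policy are
all assumptions made for illustration; only primitive number 220 and the
trailer id 40 come from this mail.

```python
from dataclasses import dataclass

NB_JIT_PRIMITIVE = 220   # the primitive number used for JITed native code
NATIVE_TRAILER_ID = 40   # trailer kind tested in genPrimitiveNBNativeCall
MAX_LITERALS = 60        # hypothetical stand-in for the VM's existing limit


@dataclass
class Method:            # hypothetical model of a compiled method
    primitive: int
    literal_count: int
    last_byte: int       # final trailer byte


def has_native_trailer(m: Method) -> bool:
    # Same masked comparison as in the code generator listing.
    return (m.last_byte & 0b11111100) == NATIVE_TRAILER_ID


def method_should_be_cogged(m: Method) -> bool:
    # Proposed rule: primitive 220 plus native code in the trailer always wins.
    if m.primitive == NB_JIT_PRIMITIVE and has_native_trailer(m):
        return True
    # Otherwise fall back to the (assumed) existing policy.
    return m.literal_count <= MAX_LITERALS
```

With this rule a method such as Method(220, 500, 40) would be cogged even
though it is over the assumed limit, while the same method without a
native trailer would not.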

But as I said, for #primitivePerform it looks like it doesn't matter
whether the method is cogged or not,
it always executes it interpreted..

I also found that I am unable to force a JITed method to run from doits, i.e. if I do:

MyClass foo

despite the #foo method being already JITed (guaranteed), it is always
run interpreted.

But if I do:

(1 to: 10) collect: [:i | ([ MyClass foo ] on: NBNativeCodeError do: [] ) ]

it yields the following result:
 #(nil 42 42 42 42 42 42 42 42 42)

which shows that it starts using the JITed version of the method only
after the outer method (the doit itself) is JITed.

Another thing I suspect: since I am using 'thisContext
sender' to get the method
and its arguments, in order to retry the very same message send, this
might cause deoptimizations on the stack, I suppose, which in turn
makes that piece of code impossible to run via the JIT.

Since the NB code generation is performed once per method, and after
installing the method's native code it never enters the method's body
(unless the native code fails the prim), I don't really care how
fast/slow the code generation is, or whether it runs deoptimized or not.
What I care about is that it should be able to retry the same message
send after it is done generating code,
so that it works seamlessly and users don't need to write any additional
code to handle it.

-- 
Best regards,
Igor Stasenko.