[Vm-dev] Changing CogSimulator abi

Wed Jan 30 00:15:59 UTC 2013

On Tue, Jan 29, 2013 at 1:59 PM, Lars <lars.wassermann at googlemail.com> wrote:

>  Of those four calling conventions, I was only aware of the first and the
> third. The explanation about the difference between (3) and (4) clarifies
> your remarks about the changes to the compileAbort method.
>
> But the point of the initial mail stands as a question:
>     If the interpreter does only understand platform C calling convention
> (1) (and virtual St-St(4)), don't we have to change the simulator to be
> able to use it with the ARM-Trampolines?
>

No, I don't think so.  The trick is in
Cogit>>handleCallOrJumpSimulationTrap: and each processor
alien's simulateCallOf:nextpc:memory: method.  In the simulator, when a
trampoline executes a call instruction that calls an interpreter routine, it
calls an illegal address.  That causes a ProcessorSimulationTrap
exception, which is caught by Cogit>>simulateCogCodeAt:, which defers to
handleSimulationTrap:, which defers via handleCallOrJumpSimulationTrap:
to simulateCallOf:nextpc:memory:.  In the case of BochsIA32Alien this
pushes the next pc (i.e. the return address), builds a stub frame and
sets the instruction pointer to the illegal address, i.e. it simulates the
call of the interpreter routine.  In GdbARMAlien the frame-building version
isn't implemented yet; it simply sets the link register and the pc.  But it
should really construct a frame that looks like a simple C frame, so it
should read something like:

simulateCallOf: address nextpc: nextpc memory: aMemory
	"Simulate a frame-building call of address.  Build a frame since
	 a) this is used for calls into the run-time which are unlikely to be
	    leaf-calls, and
	 b) stack alignment needs to be realistic for assert checking for
	    platforms such as Mac OS X.
	 N.B. r11 is typically the platform's frame pointer, if it uses one."
	self pushWord: nextpc in: aMemory.
	self pushWord: self r11 in: aMemory.
	self r11: self sp.
	self pc: address

I don't know the details (e.g. whether frames do save a frame pointer, and
whether r11 is used for the frame pointer).  Copy what the platform's C
compiler does.

Now simulateCallOf:nextpc:memory: pairs with simulateReturnIn:.
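
To make the pairing concrete, here is a toy model in Python (not VMMaker
code) of a frame-building call and a matching return.  The names ToyCpu,
push_word, pop_word, simulate_call and simulate_return are made up for
illustration, and fp merely stands in for r11.  The body of the return is an
assumption: it is simply the inverse of the call sketched above, since the
real simulateReturnIn: is not shown in this thread.

```python
class ToyCpu:
    """Toy register file plus a downward-growing word stack (illustrative only)."""

    def __init__(self, sp=0x1000):
        self.memory = {}   # address -> word
        self.sp = sp
        self.fp = 0        # stands in for r11 (assumed frame pointer)
        self.pc = 0

    def push_word(self, value):
        self.sp -= 4
        self.memory[self.sp] = value

    def pop_word(self):
        value = self.memory[self.sp]
        self.sp += 4
        return value

    def simulate_call(self, address, nextpc):
        # Same order as the Smalltalk sketch: push the return address,
        # push the old frame pointer, point fp at the new frame, jump.
        self.push_word(nextpc)
        self.push_word(self.fp)
        self.fp = self.sp
        self.pc = address

    def simulate_return(self):
        # Assumed inverse of simulate_call: drop anything the callee pushed,
        # restore the caller's frame pointer, then pop the return address.
        self.sp = self.fp
        self.fp = self.pop_word()
        self.pc = self.pop_word()


cpu = ToyCpu()
cpu.fp = 0x2000                                   # caller's frame pointer
cpu.simulate_call(address=0xDEAD0000, nextpc=0x400)
cpu.push_word(123)                                # callee scratch data
cpu.simulate_return()
print(hex(cpu.pc), hex(cpu.sp), hex(cpu.fp))      # 0x400 0x1000 0x2000
```

The point of the round trip is that sp and the frame pointer come back
exactly as they were before the call, whatever the callee pushed in between.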

Further, simulateLeafCallOf:nextpc:memory: pairs with simulateLeafReturnIn:
and these are easy; you've already implemented the first:

simulateLeafCallOf: address nextpc: nextpc memory: aMemory
	self lr: nextpc.
	self pc: address

and that should pair with:

simulateLeafReturnIn: aMemory
	self pc: self lr
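
As a rough illustration of why the leaf pair is "easy" and why calls into the
run-time want a frame instead, here is a small Python toy (again not VMMaker
code; ToyArmCpu and its method names are hypothetical).  A leaf call touches
only lr and pc, so the stack is left alone, but a second call made before
returning overwrites lr and loses the first return address.

```python
class ToyArmCpu:
    """Toy model of a link-register machine (illustrative only)."""

    def __init__(self, sp=0x1000):
        self.sp = sp
        self.lr = 0
        self.pc = 0

    def simulate_leaf_call(self, address, nextpc):
        # Mirrors simulateLeafCallOf:nextpc:memory: above.
        self.lr = nextpc
        self.pc = address

    def simulate_leaf_return(self):
        # Mirrors simulateLeafReturnIn: above.
        self.pc = self.lr


# A single leaf call and return: the stack pointer never moves.
cpu = ToyArmCpu()
cpu.simulate_leaf_call(address=0xDEAD0000, nextpc=0x400)
cpu.simulate_leaf_return()
print(hex(cpu.pc), hex(cpu.sp))    # 0x400 0x1000

# A nested call without saving lr: the outer return address is lost.
cpu = ToyArmCpu()
cpu.simulate_leaf_call(address=0xDEAD0000, nextpc=0x400)
cpu.simulate_leaf_call(address=0xBEEF0000, nextpc=0x800)
cpu.simulate_leaf_return()         # back to 0x800, as intended
cpu.simulate_leaf_return()         # still 0x800; 0x400 was overwritten
print(hex(cpu.pc))                 # 0x800
```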

Can I cc this to vm-dev?

2013/01/29 9:48 pm Eliot Miranda <eliot.miranda at gmail.com>:
>
>
>
> On Tue, Jan 29, 2013 at 1:01 AM, Lars <lars.wassermann at googlemail.com> wrote:
>
>>  Hi Eliot,
>> it seems I am not solving the right problem. As far as I understood, we
>> have to support ARM abi, because the (gcc compiled) interpreter is expected
>> to be called that way. What we do within the (JIT compiled) machine code is
>> up to us.
>>
>> But how I understood your email is the opposite: The translated
>> interpreter will always adhere to IA32 abi, and only within machine code,
>> we want to push the LinkReg, etc.
>> How is that possible? Are there flags when compiling the c-code for ARM
>> to use IA32 abi instead?
>>
>> Or is my mental model still off?
>>
>
>  yes, but only a little :).  there are four calling conventions to think
> about, one virtual.  In no particular order they are:
>
>  One is the platform's C calling convention.  This is defined by the
> platform and not something we can decide.  It must be used whenever we call
> a function in the interpreter, be it a run-time routine or a primitive.
>  Most of the run-time routines are called through trampolines and these
> trampolines must convert their input arguments into a valid call on the
> relevant interpreter routine according to the platform ABI.
>
>  Two is the trampoline calling convention(s).  This is purely
> register-based, and is used for generated machine-code to call the
> interpreter.  These are defined by the call instruction used to invoke
> them.  On X86 the return address will be passed on the stack.  On ARM it
> will be passed in the linkRegister (I think) and pushed there-in (in those
> trampolines that need to return back).
>
>  Three is the Smalltalk-to-Smalltalk calling convention used in sends.
> Here, like two, on X86 the return address will be passed on the stack.  On
> ARM it will be passed in the linkRegister and pushed in frame-building
> code.  This convention is register-based for 0 and 1 argument sends, and
> both register-and-stack-based for > 1 argument sends (with the receiver and
> the class/selector always passed in a register).
>
>  Four is the virtual form of three, which is observed by the interpreter
> at various send failure points.  In this calling convention the return
> address is always passed on the stack, and is used by the interpreter to
> find the method or PIC in which a send has failed, and beneath that is the
> return address of the failing send call, which the interpreter uses to
> locate the send site that may be modified to maintain the inline cache.
>
>  So calling convention one is defined by the platform and we must adhere
> to it.
>
>  Calling convention two is a simple fast limited calling convention for a
> limited set of calls into the interpreter that insulates the generated
> machine code from the platform's calling conventions.
>
> Calling convention three is a simple fast calling convention used for
> machine-code Smalltalk-to-Smalltalk calls.
>
>  Calling convention four is a virtualization of calling convention three
> that insulates the interpreter from the processor's implementation of call
> and return instructions.
>
>  Does this resolve things?
>
>   Best, Lars
>>
>
>  cheers!
>
>
>>
>> 2013/01/28 10:28 pm Eliot Miranda <eliot.miranda at gmail.com>:
>>
>> Hi Lars,
>>
>> On Sat, Jan 26, 2013 at 1:16 PM, Lars <lars.wassermann at googlemail.com> wrote:
>>
>>> Hello Eliot, hello vm-dev,
>>>
>>> @vm-dev: I'm still sometimes working on cog ARM, but due to my studies I
>>> have little time. The problem I'm working on is that IA32 has a different
>>> function call ABI than ARM. While on IA32, you need to push the return
>>> address, on ARM, you load it into the LR-register.
>>>
>>> A design decision to accommodate this difference in the ARM JIT was to
>>> use IA32 ABI within all cog code, even when running on ARM. Only when
>>> calling the (compiled) interpreter, we use ARM ABI. The hope was, that this
>>> way we need to change little of the existing code.
>>>
>>> @all: In the last days of working (spread across several months), I
>>> implemented the Call opcode (which is used by cogit whenever a function is
>>> called) by pushing the return address before branching to the target (IA32
>>> ABI).
>>> Also, I changed the trampoline generation to ask the compiler for the
>>> appropriate call opcode for the ABI (so far not committed), which is either
>>> Call in case of IA32 or BL in case of ARM. I'm not happy with that location
>>> for this behavior, but I don't know whether there exists a better place.
>>> Also, #hasLinkRegister is implemented on the compiler.
>>>
>>> Now, that calling the interpreter has changed, I run into the problem,
>>> that the simulator is expecting the stack pointer to point to the return
>>> address. The simulator is assuming IA32 ABI.
>>>
>>> How best to account for the changed ABI in the simulator?
>>>     Subclass the simulator? On which level, VMSimulator or
>>> VMSimulatorLSB? That change would be orthogonal to the LSB subclass (if
>>> there ever will be a MSB subclass).
>>>     Or introduce two classes which do know the ABI and are responsible
>>> for all places where ABI is used? Also the eventual changes to trampoline
>>> and enilopmart generation? Which problems might arise from this design
>>> decision with respect to the C-translation?
>>>
>>
>>
>>  I would take the same approach that Peter Deutsch took in HPS, the
>> VisualWorks VM.  The idea is to keep the Interpreter side of things
>> unchanged and change the glue code and/or the generated method prologue
>> code to keep the stack the same from the Interpreter's point of view.  So
>> when an ARM machine code method calls another ARM machine code method the
>> link register is in use, and the frame building code in a frame-building
>> non-leaf method pushes the link register as part of building the frame (as
>> one would expect), and a frameless method may be able to return through the
>> link register if it contains no runtime calls, but would have to push it if it does
>> (*).  But if a machine-code method calls the run-time through glue it would
>> push the link register at some point before the glue call, leaving the
>> stack in the same state as it would be in the IA32 version at the same
>> point in execution.
>>
>>  For example, here's the prolog for a normal method, expressed in the
>> VM's assembler:
>>
>>  LstackOverflow:
>>  MoveCq: 0 R: ReceiverResultReg
>> LsendMiss:
>>  Call: ceMethodAbortTrampoline
>>  AlignmentNops: (BytesPerWord max: 8)
>> Lentry:
>>  objectRepresentation getInlineCacheClassTagFrom: ReceiverResultReg
>> into: TempReg
>>  CmpR: ClassReg R: TempReg
>>  JumpNonZero: LsendMiss
>> LnoCheckEntry:
>>  ... frame building code ...
>>  MoveAw: coInterpreter stackLimitAddress R: TempReg
>>  CmpR: TempReg R: SPReg
>>  JumpBelow: LstackOverflow
>>
>>  The ceMethodAbort handles both the send miss when the inline cache
>> fails, and stack overflow at the end of a stack page or to check for
>> events.  The link register definitely needs to be pushed for the send miss.
>>  It doesn't need to be pushed for the stack overflow (since frame build
>> code has already saved it in the return pc slot in the frame), but pushing
>> it unnecessarily can be undone by the glue for ceMethodAbortTrampoline.
>>
>>  So the abort code would become
>>
>>  LstackOverflow:
>>  MoveCq: 0 R: ReceiverResultReg
>> LsendMiss:
>>  Push: LinkReg
>>  Call: ceMethodAbortTrampoline
>>  AlignmentNops: (BytesPerWord max: 8)
>>  ...
>>
>>  and in ceMethodAbortTrampoline there would be a test
>> on ReceiverResultReg so that if ReceiverResultReg is 0 (the stack overflow
>> case) the link register is written to the same stack slot as it was pushed
>> to, so that the top of stack is the return address for
>> the ceMethodAbortTrampoline call, and if ReceiverResultReg is non-zero (the
>> send miss case), the link register is pushed, so that the inner return
>> address on top of stack is the return address for
>> the ceMethodAbortTrampoline call and the outer return address is that for
>> the send call that missed.  The return addresses are used to identify the
>> method (whose selector is the selector of the send) and the call site at
>> which the send missed.
>>
>>  So with a little modification in the right places the Interpreter sees
>> exactly the same stack with ARM machine code as it does on IA32.  In fact
>> we can construct tests to ensure this is the case by running two VMs side
>> by side, running some test image that exercises the send machinery etc.
>>
>>  As far as the code goes it might look something like:
>>
>>  Cogit methods for compile abstract instructions
>>  compileAbort
>>  "The start of a CogMethod has a call to a run-time abort routine that either
>>   handles an in-line cache failure or a stack overflow.  The routine selects the
>>   path depending on ReceiverResultReg; if zero it takes the stack overflow
>>   path; if nonzero the in-line cache miss path.  Neither of these paths returns.
>>   The abort routine must be called;  In the callee the method is located by
>>   adding the relevant offset to the return address of the call."
>>  stackOverflowCall := self MoveCq: 0 R: ReceiverResultReg.
>>  backEnd hasLinkRegister ifTrue:
>>  [self PushR: LinkReg].
>>  sendMissCall := self Call: (self methodAbortTrampolineFor: methodOrBlockNumArgs)
>>
>>  StackToRegisterMappingCogit methods for initialization
>>  genMethodAbortTrampolineFor: numArgs
>>  "Generate the abort for a method.  This abort performs either a call of ceSICMiss:
>>   to handle a single-in-line cache miss or a call of ceStackOverflow: to handle a
>>   stack overflow.  It distinguishes the two by testing ReceiverResultReg.  If the
>>   register is zero then this is a stack-overflow because a) the receiver has already
>>   been pushed and so can be set to zero before calling the abort, and b) the
>>   receiver must always contain an object (and hence be non-zero) on SIC miss."
>>  | jumpSICMiss |
>>  <var: #jumpSICMiss type: #'AbstractInstruction *'>
>>  opcodeIndex := 0.
>>  self CmpCq: 0 R: ReceiverResultReg.
>>  jumpSICMiss := self JumpNonZero: 0.
>>  backEnd hasLinkRegister ifTrue:
>>  [self MoveR: LinkReg Mw: 0 r: SPReg]. "overwrite send ret address with ceMethodAbortTrampoline call ret address"
>>  self compileTrampolineFor: #ceStackOverflow:
>>  callJumpBar: true
>>  numArgs: 1
>>  arg: SendNumArgsReg
>>  arg: nil
>>  arg: nil
>>  arg: nil
>>  saveRegs: false
>>  resultReg: nil.
>>  jumpSICMiss jmpTarget: self Label.
>>  backEnd hasLinkRegister ifTrue:
>>  [self PushR: LinkReg]. "push ret address for ceMethodAbortTrampoline call"
>>   ...
>>
>>
>>  The same goes for the aborts in closed and open PICs.  Does this make
>> sense?
>>
>> (*) I'm not sure without looking at the code carefully whether any
>> frameless methods can make calls on the runtime.  If not, then this issue
>> is moot.  If so, then one solution is to not compile the method frameless
>> if it makes use of the run-time.  Another approach would be to build a
>> simple frame (just push the link register).
>>
>>
>>> All the best,
>>> Lars
>>>
>>>
>>>
>>
>>
>>  --
>> best,
>> Eliot
>>
>>
>>
>
>
>  --
> best,
> Eliot
>
>
>

-- 
best,
Eliot
