[Vm-dev] Changing CogSimulator abi

Wed Jan 30 01:11:28 UTC 2013

On Tue, Jan 29, 2013 at 4:59 PM, Jeremy Kajikawa
<jeremy.kajikawa at gmail.com> wrote:

>
>
> Are these not SysV register calling conventions?
>

Only convention one.

> Is the ARM ABI anything like the PowerPC ABI for register usage?
>
Yes.

> And are there multiple-value returns from functions?
>
Not in C, nor in Smalltalk.

> Or is everything limited to single result returns only?
>
Yes.

> Jeremy
> On Jan 30, 2013 9:16 AM, "Eliot Miranda" <eliot.miranda at gmail.com> wrote:
>
>>
>> On Tue, Jan 29, 2013 at 1:59 PM, Lars <lars.wassermann at googlemail.com> wrote:
>>
>>>  Of those four calling conventions, I was only aware of the first and
>>> the third. The explanation about the difference between (3) and (4)
>>> clarifies your remarks about the changes to the compileAbort method.
>>>
>>> But the point of the initial mail stands as a question:
>>>     If the interpreter only understands the platform C calling
>>> convention (1) (and the virtual St-St convention (4)), don't we have to
>>> change the simulator to be able to use it with the ARM trampolines?
>>>
>>
>> No, I don't think so.  The trick is in
>> Cogit>>handleCallOrJumpSimulationTrap: and each processor
>> alien's simulateCallOf:nextpc:memory: method.  In the simulator, when a
>> trampoline executes a call instruction that calls an interpreter routine, it
>> calls an illegal address, and that causes a ProcessorSimulationTrap
>> exception, which is caught by Cogit>>simulateCogCodeAt:, which defers to
>> handleSimulationTrap:, which in turn defers via handleCallOrJumpSimulationTrap:
>> to simulateCallOf:nextpc:memory:.  In the case of BochsIA32Alien this
>> pushes the next pc (the return address), builds a stub frame and
>> sets the instruction pointer to the illegal address, i.e. it simulates the
>> call of the interpreter routine.  In GdbARMAlien it isn't properly
>> implemented yet: it simply sets the link register and the pc.  But it
>> should really construct a frame that looks like a simple C frame, so it
>> should read something like:
>>
>> simulateCallOf: address nextpc: nextpc memory: aMemory
>>     "Simulate a frame-building call of address.  Build a frame since
>>      a) this is used for calls into the run-time which are unlikely to be
>>         leaf-calls, and
>>      b) stack alignment needs to be realistic for assert checking for
>>         platforms such as Mac OS X.
>>      N.B. r11 is typically the platform's frame pointer, if it uses one."
>>     self pushWord: nextpc in: aMemory.
>>     self pushWord: self r11 in: aMemory.
>>     self r11: self sp.
>>     self pc: address
>>
>>
>> I don't know the details (e.g. whether frames do save a frame pointer,
>> and whether r11 is used for the frame pointer).  Copy what the platform's C
>> compiler does.
>>
>> Now simulateCallOf:nextpc:memory: pairs with simulateReturnIn:.
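A toy Python model may help show why the two methods must pair up. It is not the Cog simulator: the class ToyARM, its 4-byte word, the full-descending stack and the use of r11 as the frame pointer are all assumptions made for illustration (the last one following the guess in the method comment above). The return mirrors the call in reverse: reset sp to the frame pointer, pop the caller's frame pointer, then pop the return address into the pc.

```python
class ToyARM:
    """Hypothetical register file and stack; not the real GdbARMAlien."""

    WORD = 4  # assumed 32-bit words

    def __init__(self, sp=0x1000):
        self.sp = sp       # stack grows towards lower addresses
        self.r11 = 0       # assumed frame pointer register
        self.pc = 0
        self.memory = {}   # address -> word

    def push_word(self, value):
        self.sp -= self.WORD
        self.memory[self.sp] = value

    def pop_word(self):
        value = self.memory[self.sp]
        self.sp += self.WORD
        return value

    def simulate_call_of(self, address, nextpc):
        """Counterpart of simulateCallOf:nextpc:memory: (frame-building call)."""
        self.push_word(nextpc)    # return address
        self.push_word(self.r11)  # caller's frame pointer
        self.r11 = self.sp        # new frame pointer
        self.pc = address

    def simulate_return_in(self):
        """Counterpart of simulateReturnIn: (undoes simulate_call_of)."""
        self.sp = self.r11           # drop anything the callee left on the stack
        self.r11 = self.pop_word()   # restore the caller's frame pointer
        self.pc = self.pop_word()    # resume at the return address


cpu = ToyARM()
cpu.r11 = 0xCAFE
cpu.simulate_call_of(0x40, nextpc=0x104)
cpu.push_word(99)            # some callee scratch value
cpu.simulate_return_in()
print(hex(cpu.pc), hex(cpu.r11), hex(cpu.sp))  # 0x104 0xcafe 0x1000
```

If simulateReturnIn: did not undo exactly what simulateCallOf:nextpc:memory: built, the simulated stack would drift on every run-time call.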
>>
>> Further, simulateLeafCallOf:nextpc:memory: pairs
>> with simulateLeafReturnIn: and these are easy; you've already implemented
>> the first:
>>
>> simulateLeafCallOf: address nextpc: nextpc memory: aMemory
>>     self lr: nextpc.
>>     self pc: address
>>
>> and that should pair with:
>>
>> simulateLeafReturnIn: aMemory
>>     self pc: self lr
>>
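The leaf pair touches only the link register and the pc, never the stack, which is why it is the easy case. A minimal Python sketch under the same toy assumptions (ToyLeafCPU and its method names are hypothetical):

```python
class ToyLeafCPU:
    """Hypothetical model: a leaf call needs only lr and pc."""

    def __init__(self):
        self.lr = 0
        self.pc = 0
        self.sp = 0x1000  # untouched by leaf calls and returns

    def simulate_leaf_call_of(self, address, nextpc):
        """Counterpart of simulateLeafCallOf:nextpc:memory: (like ARM BL)."""
        self.lr = nextpc
        self.pc = address

    def simulate_leaf_return_in(self):
        """Counterpart of simulateLeafReturnIn: (return through lr)."""
        self.pc = self.lr


cpu = ToyLeafCPU()
cpu.simulate_leaf_call_of(0x200, nextpc=0x108)
assert cpu.pc == 0x200 and cpu.lr == 0x108
cpu.simulate_leaf_return_in()
assert cpu.pc == 0x108 and cpu.sp == 0x1000  # the stack was never touched
```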
>> Can I cc this to vm-dev?
>>
>>
>> 2013/01/29 9:48 pm Eliot Miranda <eliot.miranda at gmail.com>:
>>>
>>>
>>>
>>> On Tue, Jan 29, 2013 at 1:01 AM, Lars <lars.wassermann at googlemail.com> wrote:
>>>
>>>>  Hi Eliot,
>>>> it seems I am not solving the right problem. As far as I understood, we
>>>> have to support the ARM ABI, because the (gcc-compiled) interpreter is
>>>> expected to be called that way. What we do within the (JIT-compiled)
>>>> machine code is up to us.
>>>>
>>>> But the way I understood your email is the opposite: the translated
>>>> interpreter will always adhere to the IA32 ABI, and only within machine
>>>> code do we want to push the LinkReg, etc.
>>>> How is that possible? Are there flags when compiling the C code for ARM
>>>> to use the IA32 ABI instead?
>>>>
>>>> Or is my mental model still off?
>>>>
>>>
>>>  Yes, but only a little :).  There are four calling conventions to
>>> think about, one of them virtual.  In no particular order they are:
>>>
>>>  One is the platform's C calling convention.  This is defined by the
>>> platform and not something we can decide.  It must be used whenever we call
>>> a function in the interpreter, be it a run-time routine or a primitive.
>>>  Most of the run-time routines are called through trampolines and these
>>> trampolines must convert their input arguments into a valid call on the
>>> relevant interpreter routine according to the platform ABI.
>>>
>>>  Two is the trampoline calling convention(s).  This is purely
>>> register-based, and is used by generated machine code to call the
>>> interpreter.  These conventions are defined by the call instruction used
>>> to invoke them.  On X86 the return address will be passed on the stack.
>>> On ARM it will be passed in the link register (I think) and pushed onto
>>> the stack in those trampolines that need to return back.
>>>
>>>  Three is the Smalltalk-to-Smalltalk calling convention used in sends.
>>> Here, as with two, on X86 the return address will be passed on the stack.
>>> On ARM it will be passed in the link register and pushed in frame-building
>>> code.  This convention is register-based for 0- and 1-argument sends, and
>>> both register- and stack-based for sends with more than one argument (with
>>> the receiver and the class/selector always passed in a register).
>>>
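As a rough illustration of where the operands of a send end up under convention three, here is a toy Python function. Only ReceiverResultReg and ClassReg come from the code in this thread; Arg0Reg and the "stack[i]" labels are hypothetical, and real Cog back ends may differ in detail.

```python
def place_send_operands(num_args):
    """Toy operand placement for convention three (register names partly hypothetical)."""
    placement = {
        "receiver": "ReceiverResultReg",   # always in a register
        "class/selector": "ClassReg",      # always in a register
    }
    if num_args == 1:
        placement["arg0"] = "Arg0Reg"      # hypothetical argument register
    elif num_args > 1:
        # sends with more than one argument also pass arguments on the stack
        for i in range(num_args):
            placement[f"arg{i}"] = f"stack[{i}]"
    return placement


for n in (0, 1, 3):
    print(n, place_send_operands(n))
```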
>>>  Four is the virtual form of three, which is observed by the
>>> interpreter at various send failure points.  In this calling convention the
>>> return address is always passed on the stack, and is used by the
>>> interpreter to find the method or PIC in which a send has failed, and
>>> beneath that is the return address of the failing send call, which the
>>> interpreter uses to locate the send site that may be modified to maintain
>>> the inline cache.
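The two return addresses that convention four guarantees are enough for the interpreter to recover what it needs at a send failure. The sketch below is a toy Python model, not Cog's actual lookup: the code zone is a hypothetical sorted list of (start, end, name) ranges, and the top of the stack is the last list element.

```python
from bisect import bisect_right


def find_code(code_zone, address):
    """Return the name of the method/PIC whose [start, end) range holds address."""
    starts = [start for start, _, _ in code_zone]
    i = bisect_right(starts, address) - 1
    if i >= 0:
        start, end, name = code_zone[i]
        if start <= address < end:
            return name
    raise LookupError(f"no code at {address:#x}")


def locate_send_failure(stack, code_zone):
    """Decode the virtual convention-four stack at a send failure point."""
    abort_return = stack[-1]  # return address into the method/PIC whose send failed
    send_return = stack[-2]   # return address of the failing send call
    failed_in = find_code(code_zone, abort_return)
    send_site_owner = find_code(code_zone, send_return)
    # send_return itself identifies the send site whose inline cache may be patched
    return failed_in, send_site_owner, send_return


zone = [(0x100, 0x180, "Foo>>bar"), (0x200, 0x300, "Baz>>qux")]
print(locate_send_failure([0x250, 0x110], zone))
# ('Foo>>bar', 'Baz>>qux', 592)
```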
>>>
>>>  So calling convention one is defined by the platform and we must
>>> adhere to it.
>>>
>>>  Calling convention two is a simple, fast, restricted calling convention
>>> for a limited set of calls into the interpreter that insulates the
>>> generated machine code from the platform's calling conventions.
>>>
>>> Calling convention three is a simple, fast calling convention used for
>>> machine-code Smalltalk-to-Smalltalk calls.
>>>
>>>  Calling convention four is a virtualization of calling convention
>>> three that insulates the interpreter from the processor's implementation of
>>> call and return instructions.
>>>
>>>  Does this resolve things?
>>>
>>>> Best, Lars
>>>>
>>>
>>>  cheers!
>>>
>>>
>>>>
>>>> 2013/01/28 10:28 pm Eliot Miranda <eliot.miranda at gmail.com>:
>>>>
>>>> Hi Lars,
>>>>
>>>> On Sat, Jan 26, 2013 at 1:16 PM, Lars <lars.wassermann at googlemail.com> wrote:
>>>>
>>>>> Hello Eliot, hello vm-dev,
>>>>>
>>>>> @vm-dev: I'm still working on Cog ARM now and then, but due to my
>>>>> studies I have little time. The problem I'm working on is that IA32 has a
>>>>> different function call ABI than ARM. While on IA32 the return address is
>>>>> pushed onto the stack, on ARM it is loaded into the link register (LR).
>>>>>
>>>>> A design decision to accommodate this difference in the ARM JIT was to
>>>>> use the IA32 ABI within all Cog code, even when running on ARM. Only when
>>>>> calling the (compiled) interpreter do we use the ARM ABI. The hope was
>>>>> that this way we would need to change little of the existing code.
>>>>>
>>>>> @all: In my last few days of work (spread across several months), I
>>>>> implemented the Call opcode (which the Cogit uses whenever a function is
>>>>> called) by pushing the return address before branching to the target
>>>>> (IA32 ABI).
>>>>> I also changed the trampoline generation to ask the compiler for the
>>>>> appropriate call opcode for the ABI (not yet committed), which is either
>>>>> Call in the case of IA32 or BL in the case of ARM. I'm not happy with that
>>>>> location for this behavior, but I don't know whether a better place exists.
>>>>> Also, #hasLinkRegister is implemented on the compiler.
>>>>>
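The difference Lars describes can be pictured with a toy Python model (ToyCPU, ia32_style_call and arm_bl are hypothetical names, not Cog API): an IA32-style call leaves the return address on top of the stack, which is what the simulator expects, whereas ARM's BL leaves it in the link register and the stack untouched.

```python
class ToyCPU:
    """Hypothetical CPU state: a full-descending stack of 4-byte words."""

    WORD = 4

    def __init__(self):
        self.sp, self.pc, self.lr = 0x1000, 0, 0
        self.memory = {}

    def push(self, value):
        self.sp -= self.WORD
        self.memory[self.sp] = value


def ia32_style_call(cpu, target, return_address):
    """IA32 CALL semantics, as the ARM Call opcode emulates: push, then branch."""
    cpu.push(return_address)
    cpu.pc = target


def arm_bl(cpu, target, return_address):
    """ARM BL semantics: return address into LR; the stack is untouched."""
    cpu.lr = return_address
    cpu.pc = target


a, b = ToyCPU(), ToyCPU()
ia32_style_call(a, 0x500, 0x10C)
arm_bl(b, 0x500, 0x10C)
print(hex(a.memory[a.sp]), hex(b.lr), hex(b.sp))  # 0x10c 0x10c 0x1000
```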
>>>>> Now that calling the interpreter has changed, I run into the problem
>>>>> that the simulator expects the stack pointer to point to the return
>>>>> address. The simulator assumes the IA32 ABI.
>>>>>
>>>>> How best to account for the changed ABI in the simulator?
>>>>>     Subclass the simulator? On which level, VMSimulator or
>>>>> VMSimulatorLSB? That change would be orthogonal to the LSB subclass (if
>>>>> there is ever an MSB subclass).
>>>>>     Or introduce two classes that know the ABI and are responsible
>>>>> for all places where the ABI is used, including the eventual changes to
>>>>> trampoline and enilopmart generation? Which problems might arise from
>>>>> this design decision with respect to the C translation?
>>>>>
>>>>
>>>>
>>>>  I would take the same approach that Peter Deutsch took in HPS, the
>>>> VisualWorks VM.  The idea is to keep the Interpreter side of things
>>>> unchanged and change the glue code and/or the generated method prologue
>>>> code to keep the stack the same from the Interpreter's point of view.  So
>>>> when an ARM machine code method calls another ARM machine code method the
>>>> link register is in use, and the frame-building code in a non-leaf method
>>>> pushes the link register as part of building the frame (as one would
>>>> expect).  A frameless method may be able to return through the link
>>>> register if it contains no run-time calls, but would have to push it if
>>>> it does (*).  But if a machine-code method calls the run-time through
>>>> glue, it would push the link register at some point before the glue call,
>>>> leaving the stack in the same state as it would be in the IA32 version at
>>>> the same point in execution.
>>>>
>>>>  For example, here's the prolog for a normal method, expressed in the
>>>> VM's assembler:
>>>>
>>>> LstackOverflow:
>>>>     MoveCq: 0 R: ReceiverResultReg
>>>> LsendMiss:
>>>>     Call: ceMethodAbortTrampoline
>>>>     AlignmentNops: (BytesPerWord max: 8)
>>>> Lentry:
>>>>     objectRepresentation getInlineCacheClassTagFrom: ReceiverResultReg into: TempReg
>>>>     CmpR: ClassReg R: TempReg
>>>>     JumpNonZero: LsendMiss
>>>> LnoCheckEntry:
>>>>     ... frame building code ...
>>>>     MoveAw: coInterpreter stackLimitAddress R: TempReg
>>>>     CmpR: TempReg R: SPReg
>>>>     JumpBelow: LstackOverflow
>>>>
>>>>  The ceMethodAbortTrampoline handles both the send miss when the inline
>>>> cache fails, and stack overflow at the end of a stack page or to check for
>>>> events.  The link register definitely needs to be pushed for the send miss.
>>>>  It doesn't need to be pushed for the stack overflow (since the
>>>> frame-building code has already saved it in the return pc slot in the
>>>> frame), but pushing it unnecessarily can be undone by the glue for
>>>> ceMethodAbortTrampoline.
>>>>
>>>>  So the abort code would become
>>>>
>>>> LstackOverflow:
>>>>     MoveCq: 0 R: ReceiverResultReg
>>>> LsendMiss:
>>>>     Push: LinkReg
>>>>     Call: ceMethodAbortTrampoline
>>>>     AlignmentNops: (BytesPerWord max: 8)
>>>>     ...
>>>>
>>>>  and in ceMethodAbortTrampoline there would be a test on
>>>> ReceiverResultReg.  If ReceiverResultReg is 0 (the stack overflow case),
>>>> the link register is written to the same stack slot it was pushed to, so
>>>> that the top of stack is the return address for the
>>>> ceMethodAbortTrampoline call.  If ReceiverResultReg is non-zero (the send
>>>> miss case), the link register is pushed, so that the inner return address
>>>> on top of stack is the return address for the ceMethodAbortTrampoline call
>>>> and the outer return address is that for the send call that missed.  The
>>>> return addresses are used to identify the method (whose selector is the
>>>> selector of the send) and the call site at which the send missed.
>>>>
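Eliot's claim that the Interpreter sees the same stack can be checked with a toy Python model that builds the stack as it stands when ceMethodAbortTrampoline is entered, on IA32 and on ARM, for both the send-miss and stack-overflow paths. The addresses are arbitrary, a built frame is reduced to the saved return pc plus one stand-in word, and the functions are hypothetical; the only logic taken from the thread is the Push: LinkReg before the call and the ReceiverResultReg test in the trampoline.

```python
SEND_RET = 0x1234   # return address of the send call (arbitrary)
ABORT_RET = 0x2008  # return address of the ceMethodAbortTrampoline call (arbitrary)
SAVED_FP = 0x0F00   # stand-in for the rest of a built frame


def ia32_abort_stack(stack_overflow):
    stack = [SEND_RET]               # the send's CALL pushed its return address
    if stack_overflow:
        stack.append(SAVED_FP)       # frame-building code ran before the limit check
    stack.append(ABORT_RET)          # Call: ceMethodAbortTrampoline
    return stack


def arm_abort_stack(stack_overflow):
    stack, lr = [], SEND_RET         # the send's BL left its return address in LR
    if stack_overflow:
        stack += [lr, SAVED_FP]      # frame-building code saves LR in the frame
        receiver_result_reg = 0      # LstackOverflow: MoveCq: 0 R: ReceiverResultReg
    else:
        receiver_result_reg = 1      # stand-in for the (non-zero) receiver
    stack.append(lr)                 # LsendMiss: Push: LinkReg
    lr = ABORT_RET                   # BL to ceMethodAbortTrampoline
    # Inside the trampoline: test ReceiverResultReg.
    if receiver_result_reg == 0:
        stack[-1] = lr               # overwrite the unneeded push
    else:
        stack.append(lr)             # keep the send's return address beneath
    return stack


for overflow in (False, True):
    print(overflow, ia32_abort_stack(overflow) == arm_abort_stack(overflow))
```

Both lines print True: in the send-miss case the top two words are the abort and send return addresses, and in the stack-overflow case the top word is the abort return address sitting directly on the frame.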
>>>>  So with a little modification in the right places the Interpreter
>>>> sees exactly the same stack with ARM machine code as it does on IA32.  In
>>>> fact we can construct tests to ensure this is the case by running two VMs
>>>> side by side, running some test image that exercises the send machinery etc.
>>>>
>>>>  As far as the code goes, it might look something like:
>>>>
>>>>  Cogit methods for compile abstract instructions
>>>>  compileAbort
>>>>     "The start of a CogMethod has a call to a run-time abort routine that
>>>>      either handles an in-line cache failure or a stack overflow.  The
>>>>      routine selects the path depending on ReceiverResultReg; if zero it
>>>>      takes the stack overflow path; if nonzero the in-line cache miss path.
>>>>      Neither of these paths returns.
>>>>      The abort routine must be called;  In the callee the method is located
>>>>      by adding the relevant offset to the return address of the call."
>>>>     stackOverflowCall := self MoveCq: 0 R: ReceiverResultReg.
>>>>     backEnd hasLinkRegister ifTrue:
>>>>         [self PushR: LinkReg].
>>>>     sendMissCall := self Call: (self methodAbortTrampolineFor: methodOrBlockNumArgs)
>>>>
>>>>  StackToRegisterMappingCogit methods for initialization
>>>>  genMethodAbortTrampolineFor: numArgs
>>>>     "Generate the abort for a method.  This abort performs either a call of
>>>>      ceSICMiss: to handle a single-in-line cache miss or a call of
>>>>      ceStackOverflow: to handle a stack overflow.  It distinguishes the two
>>>>      by testing ReceiverResultReg.  If the register is zero then this is a
>>>>      stack-overflow because a) the receiver has already been pushed and so
>>>>      can be set to zero before calling the abort, and b) the receiver must
>>>>      always contain an object (and hence be non-zero) on SIC miss."
>>>>     | jumpSICMiss |
>>>>     <var: #jumpSICMiss type: #'AbstractInstruction *'>
>>>>     opcodeIndex := 0.
>>>>     self CmpCq: 0 R: ReceiverResultReg.
>>>>     jumpSICMiss := self JumpNonZero: 0.
>>>>     backEnd hasLinkRegister ifTrue:
>>>>         [self MoveR: LinkReg Mw: 0 r: SPReg]. "overwrite send ret address
>>>>                  with ceMethodAbortTrampoline call ret address"
>>>>     self compileTrampolineFor: #ceStackOverflow:
>>>>         callJumpBar: true
>>>>         numArgs: 1
>>>>         arg: SendNumArgsReg
>>>>         arg: nil
>>>>         arg: nil
>>>>         arg: nil
>>>>         saveRegs: false
>>>>         resultReg: nil.
>>>>     jumpSICMiss jmpTarget: self Label.
>>>>     backEnd hasLinkRegister ifTrue:
>>>>         [self PushR: LinkReg]. "push ret address for ceMethodAbortTrampoline call"
>>>>     ...
>>>>
>>>>
>>>>  The same goes for the aborts in closed and open PICs.  Does this make
>>>> sense?
>>>> (*) I'm not sure without looking at the code carefully whether any
>>>> frameless methods can make calls on the runtime.  If not, then this issue
>>>> is moot.  If so, then one solution is to not compile the method frameless
>>>> if it makes use of the run-time.  Another approach would be to build a
>>>> simple frame (just push the link register).
>>>>
>>>>
>>>>> All the best,
>>>>> Lars
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>  --
>>>> best,
>>>> Eliot
>>>>
>>>>
>>>>
>>>
>>>
>>>  --
>>> best,
>>> Eliot
>>>
>>>
>>>
>>
>>
>> --
>> best,
>> Eliot
>>
>>
>>
>>
>

-- 
best,
Eliot