[Vm-dev] Re: ARM Cog progress
eliot.miranda at gmail.com
Sat Jun 6 17:18:15 UTC 2015
On Sat, Jun 6, 2015 at 9:33 AM, tim Rowledge <tim at rowledge.org> wrote:
> On 06-06-2015, at 8:15 AM, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> > so yesterday I finally switched on the Raspberry Pi Doug gave me as
> > an xmas present, built the Spur ARM Cog VM and ... we definitely have a
> > working VM.
> It’s really nice to get to this. There are still some ‘exciting’ parts to
> get working though… floating point for example.
> > I was able to update a Spur image from mid February all the way to tip
> > and run tests. 3751 run, 3628 passes, 24 expected failures, 89 failures,
> > 10 errors, 0 unexpected passes
> Did this include the FloatMathPluginTests? Because on my Pi2 that
> segfaults in all versions of the vm - interpreter, stack, cog. Then again
> my Pi2 is segfaulting on any vm compiled with -O2 right now whereas Eliot’s
> PiB is just fine with that. Good old GCC strikes again.
> > Fun! So I want to revisit the literal load question.
> > In ARMv6T2 and later, MOV can load any 16-bit number, giving a range of
> > 0x0-0xFFFF (0-65535).
> > The following table shows the range of 8-bit values that can be loaded
> > in a single ARM MOV or MVN instruction (for data processing operations).
> > The value to load must be a multiple of the value shown in the Step column.
> Sadly the Pi B/+ are NOT 6T2 cpus. I checked this with Eben a while back.
> One of the side-effects of the flexibility ARM provides to actual
> manufacturers is a fairly complex range of possible features within any
> particular architecture level.
> That doesn’t mean we can’t do tricks to make the Pi_2_ use the nice v7
> features whilst using out of line data loads on the older machines. In the
> best case, where the data is already in the cache (we can use PLD to help
> with that) a LDR takes 2 cycles as opposed to the 4 currently used by our
> mov/orr^3 unit. Using the v7 MOVT/H is also two instructions but *always*
> two cycles with no possibility of an out-of-cache delay, so I still think it
> is probably better.
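
To make the two inline alternatives above concrete, here is a minimal sketch in Python (not VM code) of how a 32-bit literal could be split for a mov/orr/orr/orr sequence and for a MOVW/MOVT pair. The function names and the example value are hypothetical, chosen only for illustration. The one architectural fact it relies on is that an ARM data-processing immediate is an 8-bit value rotated right by an even amount.

```python
MASK32 = 0xFFFFFFFF

def is_arm_immediate(value):
    """True if value is an 8-bit quantity rotated right by an even amount."""
    value &= MASK32
    for rot in range(0, 32, 2):
        # Rotating left by rot undoes a rotate-right by rot.
        rotated = ((value << rot) | (value >> (32 - rot))) & MASK32
        if rotated < 0x100:
            return True
    return False

def split_for_mov_orr(value):
    """Four byte-aligned pieces: one for the MOV, three for the ORRs."""
    value &= MASK32
    return [value & (0xFF << shift) for shift in (0, 8, 16, 24)]

def split_for_movw_movt(value):
    """Low and high 16-bit halves: MOVW sets the low half, MOVT the high half."""
    value &= MASK32
    return value & 0xFFFF, value >> 16

literal = 0x12345678  # arbitrary example value, not taken from the VM
pieces = split_for_mov_orr(literal)
low, high = split_for_movw_movt(literal)
print([hex(p) for p in pieces], hex(low), hex(high))
```

Each byte-aligned piece is always encodable as an immediate, which is why four instructions are enough for any 32-bit value; the out-of-line LDR replaces all four with a single instruction plus a data word.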
Ha! Turns out that at least for sends we're in the clear for out-of-line
literal load. i.e. from
*Looking at the ARM1176jzf-s TRM, section "Cycle timings and interlock
behaviour" we see that:*
*MOV Rn, x -> 1 cycle*
*MVN Rn, x -> 1 cycle*
*LDR Rn, [PC, #constant] -> 1 cycle, with a latency of 3 cycles on Rn*
And the send sequence would look like
LDR Rclass, [PC, #constant]
with the entry code being
00001828: ands r0, r0, #1
0000182c: b 0x00001844
00001830: ands r0, r7, #3
00001834: bne 0x00001828
00001838: ldr r0, [r7]
0000183c: mvn ip, #0
00001840: ands r0, r0, ip, lsr #10
00001844: cmp r0, Rclass
00001848: bne 0x00001820
i.e. we don't actually access the register loaded in the LDR for at least 7
cycles. So it should work a lot better; 11 cycles vs 14 cycles for the
send sequence. In fact the only code that should be impacted by the
latency is a conditional branch of a method result (we subtract true or
false from the result) or a constant assign. Most of the time a literal
will be passed as an argument and there will be quite a few cycles before
it is used.
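
The latency argument can be checked with a rough sketch like the one below, which counts how many instructions are issued after the literal load before the loaded register is first read. It is a simplified model in Python, not a cycle-accurate one: it assumes every instruction takes at least one cycle, so the instruction distance is only a lower bound on the cycle distance. The leading `bl` in the send path and the register name `Rlit` are assumptions, since the call instruction and the branch-on-result code are not shown in this message.

```python
LOAD_LATENCY = 3  # result latency for LDR Rn, [PC, #constant], from the TRM excerpt

def first_use_distance(path, reg):
    """Instructions issued after the load, up to and including the first reader of reg."""
    for distance, (_, reads) in enumerate(path, start=1):
        if reg in reads:
            return distance
    return None

def load_is_hidden(path, reg):
    """True if the first reader is issued no earlier than LOAD_LATENCY instructions later."""
    distance = first_use_distance(path, reg)
    return distance is None or distance >= LOAD_LATENCY

# Non-immediate receiver path through the entry code quoted above.
send_path = [
    ("bl <entry>", ()),                       # assumed call; not shown in the message
    ("ands r0, r7, #3", ("r7",)),
    ("bne 0x00001828", ()),                   # not taken on this path
    ("ldr r0, [r7]", ("r7",)),
    ("mvn ip, #0", ()),
    ("ands r0, r0, ip, lsr #10", ("r0", "ip")),
    ("cmp r0, Rclass", ("r0", "Rclass")),
]

# Hypothetical branch-on-result case: the literal is consumed immediately.
branch_path = [
    ("sub r0, r0, Rlit", ("r0", "Rlit")),
]

print(first_use_distance(send_path, "Rclass"), load_is_hidden(send_path, "Rclass"))
print(first_use_distance(branch_path, "Rlit"), load_is_hidden(branch_path, "Rlit"))
```

Under these assumptions the send path has plenty of slack, while the branch-on-result case reads the literal straight away and would pay the load latency.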
OK, so that implies doing the out-of-line literal load, with the advantage
that there's a single VM, and the same approach is used for the 64-bit ARM
> tim Rowledge; tim at rowledge.org; http://www.rowledge.org/tim
> Strange OpCodes: EIV: Erase IPL Volume