[Vm-dev] Re: ARM Cog progress

Sat Jun 6 17:18:15 UTC 2015

On Sat, Jun 6, 2015 at 9:33 AM, tim Rowledge <tim at rowledge.org> wrote:

>
> On 06-06-2015, at 8:15 AM, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> >     so yesterday I finally switched on the Raspberry Pi Doug gave me as
> an xmas present, built the Spur ARM Cog VM and ... we definitely have a
> working VM.
>
> It’s really nice to get to this. There are still some ‘exciting’ parts to
> get working though… floating point for example.
>
> >  I was able to update a Spur image from mid February all the way to tip
> and run tests.  3751 run, 3628 passes, 24 expected failures, 89 failures,
> 10 errors, 0 unexpected passes
>
> Did this include the FloatMathPluginTests? Because on my Pi2 that
> segfaults in all versions of the vm - interpreter, stack, cog. Then again
> my Pi2 is segfaulting on any vm compiled with -O2 right now whereas Eliot’s
> PiB is just fine with that. Good old GCC strikes again.
>
> > Fun!  So I want to revisit the literal load question.
> > In ARMv6T2 and later, MOV can load any 16-bit number, giving a range of
> 0x0-0xFFFF (0-65535).
> > The following table shows the range of 8-bit values that can be loaded
> in a single ARM MOV or MVN instruction (for data processing operations).
> The value to load must be a multiple of the value shown in the Step column.
> >
> Sadly the Pi B/+ are NOT 6T2 cpus. I checked this with Eben a while back.
> One of the side-effects of the flexibility ARM provides to actual
> manufacturers is a fairly complex range of possible features within any
> particular architecture level.
>
> That doesn’t mean we can’t do tricks to make the Pi_2_ use the nice v7
> features whilst using out of line data loads on the older machines. In the
> best case, where the data is already in the cache (we can use  PLD to help
> with that) a LDR takes 2 cycles as opposed to the 4 currently used by our
> mov/orr^3 unit. Using the v7 MOVW/MOVT pair is also two instructions but
> *always* two cycles, with no possibility of an out-of-cache delay, so I
> still think it is probably better.
>

Ha!  Turns out that at least for sends we're in the clear for out-of-line
literal load.  i.e. from
https://www.raspberrypi.org/forums/viewtopic.php?f=72&t=78090

*Looking at the ARM1176jzf-s TRM, section "Cycle timings and interlock
behaviour" we see that:*
*MOV Rn, x -> 1 cycle*
*MVN Rn, x -> 1 cycle*
*LDR Rn, [PC, #constant] -> 1 cycle, with a latency of 3 cycles on Rn*

And the send sequence would look like

    LDR Rclass, [PC, #constant]
    BLX method.entry

with the entry code being

00001828: ands r0, r0, #1
0000182c: b 0x00001844
entry:
00001830: ands r0, r7, #3
00001834: bne 0x00001828
00001838: ldr r0, [r7]
0000183c: mvn ip, #0
00001840: ands r0, r0, ip, lsr #10
00001844: cmp r0, Rclass
00001848: bne 0x00001820
noCheckEntry:

i.e. we don't actually access the register loaded in the LDR for at least 7
cycles.  So it should work a lot better; 11 cycles vs 14 cycles for the
send sequence.  In fact the only code that should be impacted by the
latency is a conditional branch on a method result (we subtract true or
false from the result) or a constant assignment.  Most of the time a literal
will be passed as an argument and there will be quite a few cycles before
it is used.
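
To make the latency argument concrete, here is a rough sketch of it in
Python.  It assumes a simplified model in which every instruction issues in
one cycle and branch costs are ignored, so it is not a real ARM1176 timing
model.  SEND_PATH and stall_cycles are names made up for illustration; only
the 3-cycle LDR latency, the 4-cycle mov/orr^3 figure and the 11 vs 14
totals come from the messages above.

```python
# Simplified model: one cycle per instruction, no branch penalties.
# This is an assumption for illustration, not the real ARM1176 pipeline.
LDR_RESULT_LATENCY = 3   # cycles, from the quoted TRM figure
INLINE_LOAD_CYCLES = 4   # mov/orr^3, as stated earlier in the thread
LDR_ISSUE_CYCLES = 1     # LDR Rn, [PC, #constant]

# Instructions executed after the LDR on the non-immediate receiver path,
# each paired with the registers it reads.
SEND_PATH = [
    ("blx",  []),                # call method.entry
    ("ands", ["r7"]),            # entry: tag check on the receiver
    ("bne",  []),
    ("ldr",  ["r7"]),            # fetch the object header
    ("mvn",  []),
    ("ands", ["r0", "ip"]),      # mask out the class index
    ("cmp",  ["r0", "Rclass"]),  # first read of the loaded literal
]

def stall_cycles(path, reg, latency):
    """Stall before the first read of reg, under the one-cycle model."""
    for distance, (_, reads) in enumerate(path, start=1):
        if reg in reads:
            return max(0, latency - distance)
    return 0  # never read on this path, so nothing waits for it

saving = INLINE_LOAD_CYCLES - LDR_ISSUE_CYCLES
print(stall_cycles(SEND_PATH, "Rclass", LDR_RESULT_LATENCY), saving)
```

Under these assumptions the cmp is the seventh instruction after the LDR, so
the 3-cycle latency is fully hidden and the saving is the 3 cycles between
14 and 11.  A consumer placed immediately after the LDR would stall in this
model, which is the conditional branch and constant assignment case.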

OK, so that implies doing the out-of-line literal load, with the advantage
that there's a single VM, and the same approach is used for the 64-bit ARM
system.
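
For comparison, the in-line alternative being dropped can be sketched as
follows.  This is only an illustration of why an arbitrary 32-bit literal
needs a mov plus three orrs: a classic ARM data-processing immediate is an
8-bit value rotated right by an even amount.  The function names are
hypothetical, and splitting into byte lanes is an assumption about how such
a sequence could be generated, not a description of the Cog code generator.

```python
MASK32 = 0xFFFFFFFF

def is_arm_immediate(value):
    """True if value is an 8-bit constant rotated right by an even amount."""
    value &= MASK32
    for rot in range(0, 32, 2):
        # Rotating left by rot undoes a rotate-right by rot.
        rotated = ((value << rot) | (value >> (32 - rot))) & MASK32
        if rotated < 256:
            return True
    return False

def byte_lanes(value):
    """Split a 32-bit literal into four pieces, one per byte lane."""
    value &= MASK32
    return [value & (0xFF << shift) for shift in (0, 8, 16, 24)]

literal = 0x12345678
pieces = byte_lanes(literal)
# Each piece fits a single mov or orr immediate; OR-ing them rebuilds the literal.
print(is_arm_immediate(literal), [hex(p) for p in pieces])
```

A literal that already fits one rotated immediate needs a single MOV; any
other literal costs the full four-instruction sequence in this scheme,
whereas the out-of-line load is one LDR whatever the value.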

>
> tim
> --
> tim Rowledge; tim at rowledge.org; http://www.rowledge.org/tim
> Strange OpCodes: EIV: Erase IPL Volume
>
>
>

-- 
best,
Eliot