We can now run the ARM cog/spur system on a Pi 2; it’s debug compiled and lots of asserts fail, and it exits quite aggressively when you upset it, but - actual real compiled code, running on an actual ARM machine, using the actual morphic ui to do actual stuff.
3+4 does indeed = elephant. 100 factorial is a very long number. 1 tinyBenchmarks is utterly meaningless (gcc debug settings + lots of expensive runtime asserts) but still reports 120mbc/s and 7m sends/s or about 4x the stack vm.
Happy-happy.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Useful Latin Phrases:- Quantum materiae materietur marmota monax si marmota monax materiam possit materiari? = How much wood would a woodchuck chuck if a woodchuck could chuck wood?
That's fantastic news. Gotta fix 3+4 though, that should answer #cowTools.
--C
P.S.
Suddenly wondering if anyone has ever confused one of our symbols for a hashtag...
On May 18, 2015, at 5:43 PM, tim Rowledge tim@rowledge.org wrote:
On 19.05.2015, at 03:19, Casey Ransberger casey.obrien.r@gmail.com wrote:
#Smalltalk. Hashtagging for 35 years.
- Bert -
On Mon, 18 May 2015 17:43:17 -0700 tim Rowledge tim@rowledge.org wrote:
Congrats, Tim!
Looking forward to running it on my Samsung ARM Chromebook!
Thanks for all the great work!
-KenD
Does the Pi1 work, too? Or are you using code specific to the newer cpu?
-- View this message in context: http://forum.world.st/ARM-Cog-progress-tp4827195p4827779.html Sent from the Squeak VM mailing list archive at Nabble.com.
Hi Tim,
On May 21, 2015, at 12:47 AM, timfelgentreff timfelgentreff@gmail.com wrote:
Does the Pi1 work, too? Or are you using code specific to the newer cpu?
TimR and I were talking about this yesterday. The current code generator targets ARMv5, and so works on Pi1.
Pi2 uses ARMv7 which, so TimR tells me, has a 16-bit literal load instruction, which means a 32-bit literal can be generated using two 32-bit instructions. ARMv5 either requires 4 32-bit instructions, or 1 32-bit instruction to access 1 32-bit literal out-of-line using PC-relative addressing. I'd like to know what the situation is for ARMv8 (the 64-bit ISA).
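The two schemes can be modelled in plain C. This is a sketch to illustrate the instruction counts only; the function names are made up and the comments show the ARM instructions each step stands in for, not actual Cogit output.

```c
#include <stdint.h>

/* ARMv5: one MOV plus three ORRs, one byte-sized immediate per
   instruction -- four 32-bit instructions in total. */
uint32_t synthesize_literal_armv5(uint32_t v)
{
    uint32_t r = v & 0xFFu;        /* mov r0, #byte0           */
    r |= v & 0xFF00u;              /* orr r0, r0, #byte1 << 8  */
    r |= v & 0xFF0000u;            /* orr r0, r0, #byte2 << 16 */
    r |= v & 0xFF000000u;          /* orr r0, r0, #byte3 << 24 */
    return r;
}

/* ARMv7: MOVW writes the low 16 bits and zeroes the rest; MOVT
   writes the high 16 bits and leaves the low half alone -- two
   instructions in total. */
uint32_t synthesize_literal_armv7(uint32_t v)
{
    uint32_t r = v & 0xFFFFu;                 /* movw r0, #low16  */
    r = (r & 0xFFFFu) | (v & 0xFFFF0000u);    /* movt r0, #high16 */
    return r;
}
```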
The temptation is to move to ARMv7 to get that more compact and faster literal generation. But it would mean either dropping Pi1 or two VMs. I'm not afraid of two VMs but it is more stuff, with all the headaches for newbies that entails. Another alternative might be to have the JIT test whether the system is v7 or not and generate the appropriate code, but that is problematic; the JIT will bloat and scanning machine code for object references will slow down.
Knowing what ARMv8 does for 64-bit literal synthesis would help me make up my mind. Whether the JIT should support out-of-line literal load is a somewhat significant issue; it's not something to write unless it's necessary.
Eliot (phone)
On Thu, 21 May 2015 05:58:41 -0700 Eliot Miranda eliot.miranda@gmail.com wrote:
Pi2 uses ARMv7 which, so TimR tells me, has a 16-bit literal load instruction, which means a 32-bit literal can be generated using two 32-bit instructions. ARMv5 either requires 4 32-bit instructions, or 1 32-bit instruction to access 1 32-bit literal out-of-line using PC-relative addressing. I'd like to know what the situation is for ARMv8 (the 64-bit ISA).
The temptation is to move to ARMv7 to get that more compact and faster literal generation. But it would mean either dropping Pi1 or two VMs. I'm not afraid of two VMs but it is more stuff, with all the headaches for newbies that entails. Another alternative might be to have the JIT test whether the system is v7 or not and generate the appropriate code, but that is problematic; the JIT will bloat and scanning machine code for object references will slow down.
..but the test for ARM5/7/8/.. should happen once, and the codegen could be specialized at that time -- after which the ARM specialization code itself is no longer needed, so no bloat.
Dynamic specialization does work, right? ;^)
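The one-time specialization could be as simple as installing a function pointer at startup. A minimal sketch, assuming a hypothetical CPU-probe result is passed in; none of these names are real Cogit APIs.

```c
#include <stdbool.h>

/* Sketch of probe-once, specialize-once dispatch: pick the literal-load
   generator at VM startup, after which every compile goes through a
   single function pointer with no per-compile architecture test. */
typedef void (*gen_literal_fn)(unsigned value);

static void gen_literal_armv5(unsigned value) { (void)value; /* would emit mov + 3 x orr */ }
static void gen_literal_armv7(unsigned value) { (void)value; /* would emit movw + movt   */ }

static gen_literal_fn gen_literal; /* installed exactly once */

void init_codegen(bool cpu_has_movw_movt)
{
    gen_literal = cpu_has_movw_movt ? gen_literal_armv7
                                    : gen_literal_armv5;
}
```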
On Thu, May 21, 2015 at 5:58 AM, Eliot Miranda eliot.miranda@gmail.com wrote:
The temptation is to move to ARMv7 to get that more compact and faster literal generation. But it would mean either dropping Pi1 or two VMs. I'm not afraid of two VMs but it is more stuff, with all the headaches for newbies that entails. Another alternative might be to have the JIT test whether the system is v7 or not and generate the appropriate code, but that is problematic; the JIT will bloat and scanning machine code for object references will slow down.
Dart puts all object references off into a pool to avoid this.
Knowing what ARMv8 does for 64-bit literal synthesis would help me make up
my mind. Whether the JIT should support out-of-line literal load is a somewhat significant issue; it's not something to write unless it's necessary.
Four 32-bit instructions loading 16-bit pieces, or one pc-relative load.
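In other words (a plain-C model of the AArch64 MOVZ/MOVK sequence, to make the count concrete; a sketch, not VM code):

```c
#include <stdint.h>

/* Model of AArch64 64-bit literal synthesis: MOVZ writes one 16-bit
   piece and zeroes the rest, then three MOVKs patch in the remaining
   pieces -- four instructions, matching the count above. */
uint64_t synthesize_literal_a64(uint64_t v)
{
    uint64_t r = v & 0xFFFFull;          /* movz x0, #p0          */
    r |= v & 0xFFFF0000ull;              /* movk x0, #p1, lsl #16 */
    r |= v & 0xFFFF00000000ull;          /* movk x0, #p2, lsl #32 */
    r |= v & 0xFFFF000000000000ull;      /* movk x0, #p3, lsl #48 */
    return r;
}
```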
Hi Ryan, Hi Tim,
On May 21, 2015, at 8:55 AM, Ryan Macnak rmacnak@gmail.com wrote:
The temptation is to move to ARMv7 to get that more compact and faster literal generation. But it would mean either dropping Pi1 or two VMs. I'm not afraid of two VMs but it is more stuff, with all the headaches for newbies that entails. Another alternative might be to have the JIT test whether the system is v7 or not and generate the appropriate code, but that is problematic; the JIT will bloat and scanning machine code for object references will slow down.
Dart puts all object references off into a pool to avoid this.
Knowing what ARMv8 does for 64-bit literal synthesis would help me make up my mind. Whether the JIT should support out-of-line literal load is a somewhat significant issue; it's not something to write unless it's necessary.
Four 32-bit instructions loading 16-bit pieces, or one pc-relative load.
So out-of-line = 12 bytes vs in-line = 16 bytes. For me, given that ARM has always supported out-of-line, and it should have good performance, I'd go for out-of-line. But its performance could be much worse. Anyone have any numbers?
Eliot (phone)
On 21-05-2015, at 10:19 AM, Eliot Miranda eliot.miranda@gmail.com wrote:
Four 32-bit instructions loading 16-bit pieces, or one pc-relative load.
So out-of-line = 12 bytes vs in-line = 16 bytes. For me, given that ARM has always supported out-of-line, and it should have good performance, I'd go for out-of-line. But its performance could be much worse. Anyone have any numbers?
Also no absolute need to do 64bit oops with AArch64. It will be quite happy to do 32 bit oops. So 2x 16bit chunks would be fine for both that and v7. So far as I can work out all the operations can work in 32bit quantities, even rotations/shifts/compare.
And there is the hilarious concept of conditional comparisons - if the condition flags match a vector of condition flags, then do a compare of some sort and if that is true, set the flags as appropriate, otherwise set the flags to another vector. I’d love to see the logic that persuaded them to do that.
I swear I spotted a WTF instruction in there somewhere.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim State-of-the-art: What we could do with enough money.
2015-05-21 19:42 GMT+02:00 tim Rowledge tim@rowledge.org:
And there is the hilarious concept of conditional comparisons - if the condition flags match a vector of condition flags, then do a compare of some sort and if that is true, set the flags as appropriate, otherwise set the flags to another vector. I’d love to see the logic that persuaded them to do that.
I swear I spotted a WTF instruction in there somewhere.
Also a ROTFL. I think it just matches the conversion from SmallDouble to native double that Eliot naively coded with many more instructions.
tim
-- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim State-of-the-art: What we could do with enough money.
On May 21, 2015, at 10:42 AM, tim Rowledge tim@rowledge.org wrote:
I swear I spotted a WTF instruction in there somewhere.
Every ISA should have a WTF instruction. It should ideally be a noop, but more likely does something really ambitious completely wrong.
Hah.
--C
Hi Doug, Hi Tim, Hi All,
so yesterday I finally switched on the Raspberry Pi Doug gave me as an xmas present, built the Spur ARM Cog VM and ... we definitely have a working VM. I was able to update a Spur image from mid February all the way to tip and run tests:

3751 run, 3628 passes, 24 expected failures, 89 failures, 10 errors, 0 unexpected passes

Fun! So I want to revisit the literal load question.
Doug got me a Pi 1 B. cat /proc/cpuinfo reveals

processor       : 0
model name      : ARMv6-compatible processor rev 7 (v6l)
Features        : swp half thumb fastmult vfp edsp java tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xb76
CPU revision    : 7
Hardware        : BCM2708
Revision        : 000e
Serial          : 00000000fe7b08eb
And the ARM Assembler User Guide http://www.keil.com/support/man/docs/armasm/armasm_dom1359731146222.htm says (emphasis added):

4.4 Load immediate values using MOV and MVN

The MOV and MVN instructions can write a range of immediate values to a register. In ARM state:

- MOV can load any 8-bit immediate value, giving a range of 0x0-0xFF (0-255). It can also rotate these values by any even number. These values are also available as immediate operands in many data processing operations, without being loaded in a separate instruction.
- MVN can load the bitwise complements of these values. The numerical values are -(n+1), where n is the value available in MOV.
- *In ARMv6T2 and later, MOV can load any 16-bit number, giving a range of 0x0-0xFFFF (0-65535).*

The following table shows the range of 8-bit values that can be loaded in a single ARM MOV or MVN instruction (for data processing operations). The value to load must be a multiple of the value shown in the Step column.
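The 8-bit-rotated rule quoted above is easy to check programmatically. A sketch (mine, not from the manual): a value is a valid ARM data-processing immediate iff it is an 8-bit value rotated right by an even amount, i.e. iff rotating the value left by some even amount leaves at most 8 low bits set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Check whether v can be encoded as an ARM data-processing immediate:
   an 8-bit value rotated right by an even amount (0, 2, ..., 30).
   Rotating v LEFT by the same even amount must then fit in 8 bits. */
bool arm_encodable_imm(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        uint32_t rotl = (v << rot) | (rot ? v >> (32 - rot) : 0);
        if (rotl <= 0xFFu)
            return true;
    }
    return false;
}
```

So 0xFF0000 encodes in one MOV, while a constant like 0x1A2B3C4D does not, which is why it takes the mov/orr sequence or a literal-pool load.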
So it looks to me that the right approach is to add an ARMv6 subclass to CogARMInstruction that uses the 16-bit literal load instructions and use that as our standard 32-bit ARM code generator. But I'm ignorant as to the processor versions used in the Raspberry Pi. Are all RPis ARMv6? What exactly is ARMv6T2? I'm guessing that T2 refers to Thumb2, is that correct? And on the specific question, can anyone think of a good reason /not/ to use the 16-bit literal load approach?
On 06-06-2015, at 8:15 AM, Eliot Miranda eliot.miranda@gmail.com wrote:
so yesterday I finally switched on the Raspberry Pi Doug gave me as an xmas present, built the Spur ARM Cog VM and ... we definitely have a working VM.
It’s really nice to get to this. There are still some ‘exciting’ parts to get working though… floating point for example.
I was able to update a Spur image from mid February all the way to tip and run tests. 3751 run, 3628 passes, 24 expected failures, 89 failures, 10 errors, 0 unexpected passes
Did this include the FloatMathPluginTests? Because on my Pi2 that segfaults in all versions of the vm - interpreter, stack, cog. Then again my Pi2 is segfaulting on any vm compiled with -O2 right now whereas Eliot’s PiB is just fine with that. Good old GCC strikes again.
Fun! So I want to revisit the literal load question. In ARMv6T2 and later, MOV can load any 16-bit number, giving a range of 0x0-0xFFFF (0-65535). The following table shows the range of 8-bit values that can be loaded in a single ARM MOV or MVN instruction (for data processing operations). The value to load must be a multiple of the value shown in the Step column.
Sadly the Pi B/+ are NOT 6T2 cpus. I checked this with Eben a while back. One of the side-effects of the flexibility ARM provides to actual manufacturers is a fairly complex range of possible features within any particular architecture level.
That doesn’t mean we can’t do tricks to make the Pi_2_ use the nice v7 features whilst using out of line data loads on the older machines. In the best case, where the data is already in the cache (we can use PLD to help with that) a LDR takes 2 cycles as opposed to the 4 currently used by our mov/orr^3 unit. Using the v7 MOVT/H is also two instructions but *always* two cycles with possibility of an out-of-cache delay, so I still think it is probably better.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Strange OpCodes: EIV: Erase IPL Volume
On Sat, Jun 6, 2015 at 9:33 AM, tim Rowledge tim@rowledge.org wrote:
Sadly the Pi B/+ are NOT 6T2 cpus. I checked this with Eben a while back. One of the side-effects of the flexibility ARM provides to actual manufacturers is a fairly complex range of possible features within any particular architecture level.
Damn, you're right. gcc with the -march=armv6t2 option will generate 16-bit literal loads, e.g.
long it() { return 0x1A2B3C4D; }
=>
    .arch armv6t2
    ...
    .text
    .align  2
    .global it
    .type   it, %function
it:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    movw    r0, #15437
    movt    r0, 6699
    bx      lr
but compiling, linking and running does indeed signal Illegal instruction. That's /my/ weekend ruined ;-)
That doesn’t mean we can’t do tricks to make the Pi_2_ use the nice v7
features whilst using out of line data loads on the older machines. In the best case, where the data is already in the cache (we can use PLD to help with that) a LDR takes 2 cycles as opposed to the 4 currently used by our mov/orr^3 unit. Using the v7 MOVT/H is also two instructions but *always* two cycles with possibility of an out-of-cache delay, so I still think it is probably better.
Except that in 64-bits don't we end up with 6 cycles (2 x MOVT/H plus a shift and an add, or maybe 5 cycles if MOVT/H leave other bits undisturbed) vs 2 for the out-of-line literal load? In which case, the out-of-line is a clear win for 64-bits and that's likely our most important target, given the ubiquity of smart phones.
On 06-06-2015, at 10:03 AM, Eliot Miranda eliot.miranda@gmail.com wrote:
Except that in 64-bits don't we end up with 6 cycles (2 x MOVT/H plus a shift and an add, or maybe 5 cycles if MOVT/H leave other bits undisturbed) vs 2 for the out-of-line literal load? In which case, the out-of-line is a clear win for 64-bits and that's likely our most important target, given the ubiquity of smart phones.
Let’s not forget that the v8 ARMs can (apparently) happily do 32bit data stuff; even the rotates and shifts can behave correctly for 32 bit. So we could use the same 32 bit image format and save some space, which may have some value for small machines like phones.
I can’t believe I’m referring to things with quadcore 64 bit cpus and 1/2/4Gb ram as small machines...
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Strange OpCodes: CMN: Convert to Mayan Numerals
On Sat, Jun 6, 2015 at 9:33 AM, tim Rowledge tim@rowledge.org wrote:
That doesn’t mean we can’t do tricks to make the Pi_2_ use the nice v7 features whilst using out of line data loads on the older machines. In the best case, where the data is already in the cache (we can use PLD to help with that) a LDR takes 2 cycles as opposed to the 4 currently used by our mov/orr^3 unit. Using the v7 MOVT/H is also two instructions but *always* two cycles with possibility of an out-of-cache delay, so I still think it is probably better.
Ha! Turns out that at least for sends we're in the clear for out-of-line literal load. i.e. from https://www.raspberrypi.org/forums/viewtopic.php?f=72&t=78090
Looking at the ARM1176JZF-S TRM, section "Cycle timings and interlock behaviour", we see that:

MOV Rn, x -> 1 cycle
MVN Rn, x -> 1 cycle
LDR Rn, [PC, #constant] -> 1 cycle, with a latency of 3 cycles on Rn
And the send sequence would look like
LDR Rclass, [PC, #constant]
BLX method.entry
with the entry code being
00001828: ands r0, r0, #1
0000182c: b 0x00001844
entry:
00001830: ands r0, r7, #3
00001834: bne 0x00001828
00001838: ldr r0, [r7]
0000183c: mvn ip, #0
00001840: ands r0, r0, ip, lsr #10
00001844: cmp r0, Rclass
00001848: bne 0x00001820
noCheckEntry:
i.e. we don't actually access the register loaded in the LDR for at least 7 cycles. So it should work a lot better; 11 cycles vs 14 cycles for the send sequence. In fact the only code that should be impacted by the latency is a conditional branch on a method result (we subtract true or false from the result) or a constant assign. Most of the time a literal will be passed as an argument and there will be quite a few cycles before it is used.
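For reference, here is what that entry sequence computes, rendered as C. This is my reading of the disassembly above, hedged accordingly: the mvn/ands pair masks the header word with ~0 >> 10, i.e. down to a 22-bit class index.

```c
#include <stdint.h>

/* C rendering of the entry-code check above (a sketch, not VM source).
   The caller has done LDR Rclass, [PC, #constant]; the entry code
   derives a class id from the receiver (in r7) for comparison with
   Rclass. */
uint32_t receiver_class_id(uint32_t receiver, const uint32_t *header)
{
    if (receiver & 3)            /* ands r0, r7, #3; bne ...           */
        return receiver & 1;     /* immediate path: ands r0, r0, #1    */
    /* pointer object: ldr r0, [r7]; mvn ip, #0; ands r0, r0, ip, lsr #10
       i.e. mask the header word down to the 22-bit class index */
    return *header & (0xFFFFFFFFu >> 10);
}
```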
OK, so that implies doing the out-of-line literal load, with the advantage that there's a single VM, and the same approach is used for the 64-bit ARM system.
On 06-06-2015, at 10:18 AM, Eliot Miranda eliot.miranda@gmail.com wrote:
Ha! Turns out that at least for sends we're in the clear for out-of-line literal load. i.e. from https://www.raspberrypi.org/forums/viewtopic.php?f=72&t=78090
Excellent. So the only ‘fun’ is creating, managing and accessing the pools of out of line constants. Where should we place the pool though? My first thought was just in front of the ‘entry’ address but that would screw the assorted entry/nocheck offsets we have as constants. At the end, just before the metadata?
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim The Static Typing Philosophy: Make it fast. Make it right. Make it run.
GCC is such fun.
Cog VM built on my Pi2, like all the ones built whilst developing this thing, with -O2. Segfaults very early on Pi2; trying to run under gdb segfaults at pc=0, which really is clever and remarkably effectively obfuscates all the information you might hope to glean. But copy that executable to an old Pi B+ and it runs perfectly happily.
Mind you, we currently compile the cogit file with no errors nor even warnings, so perhaps it’s simple revenge?
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim How come it's 'Java One' every year? Aren't they making any progress?
On 07 Jun 2015, at 01:25, tim Rowledge tim@rowledge.org wrote:
GCC is such fun.
given the amount of warnings emitted during compilation, have you considered that it might be the input given to gcc that is the issue here? ;)
On 07-06-2015, at 12:32 AM, Holger Freyther holger@freyther.de wrote:
given the amount of warnings emitted during compilation, have you considered that it might be the input given to gcc that is the issue here? ;)
I dunno; I get a total of 59 warnings when compiling the ARM Cog VM, none in the core vm code. I’d love to see it be 0. I rather suspect the world would come to an end if that happened.
On the other hand, I can’t see how it is acceptable for a compiler to produce code that blows up at one level of optimisation but not at another. Come to that I’m not sure why there are different levels; I can sort of see asking to optimise in different ways - the NorCroft compiler for ARM can be asked to optimise for runtime speed or executable size, for example.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim In /dev/null no one can hear you scream
On 07 Jun 2015, at 17:58, tim Rowledge tim@rowledge.org wrote:
On the other hand, I can’t see how it is acceptable for a compiler to produce code that blows up at one level of optimisation but not at another. Come to that I’m not sure why there are different levels; I can sort of see asking to optimise in different ways - the NorCroft compiler for ARM can be asked to optimise for runtime speed or executable size, for example.
Well, maybe look at the documentation. -O2 will enable certain optimization passes that are legal according to the C specification but still are a trade-off (e.g. between codesize and speed, being able to debug and speed, etc).
E.g. code like the below:
int a, b;
b = 10;
/* a = b; */
a += 2;
return a;
might produce the result '12' in some optimizations and not in others. It might be twelve because the register allocator allocates the same register for both 'a' and 'b', because liveness analysis has shown that the uses of 'a' and 'b' are not in the same basic blocks. Now complaining that the compiler doesn't produce '12' all the time would not be a smart thing to do.
GCC and other compilers (like any piece of software) have issues, but you seem to be jumping the gun. Identify the "file"/code that gets "miscompiled" and then see whether the code is valid C or it is a compiler issue. If it is a compiler issue one can report a bug to the GCC folks (in recent years they have a good track record of caring).
holger
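For the record, the well-defined variant of that fragment, completed so it compiles: restoring the commented-out assignment makes the result 12 at every optimization level, while leaving `a` uninitialized makes any result legal.

```c
/* Holger's fragment made self-contained. With `a = b` present the
   function is well-defined and returns 12 regardless of -O level;
   with it commented out, the read of `a` is undefined behaviour,
   so -O0 and -O2 may legitimately disagree. */
int example(void)
{
    int a, b;
    b = 10;
    a = b;     /* the line the original commented out */
    a += 2;
    return a;
}
```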
On 07.06.2015, at 17:58, tim Rowledge tim@rowledge.org wrote:
On the other hand, I can’t see how it is acceptable for a compiler to produce code that blows up at one level of optimisation but not at another. Come to that I’m not sure why there are different levels; I can sort of see asking to optimise in different ways - the NorCroft compiler for ARM can be asked to optimise for runtime speed or executable size, for example.
not so uncommon:
gcc:
-O0    no opt
-O1    'Optimize' (= -O)
-O2    'Optimize even more'
-O3    'Optimize yet more'
-Os    'Optimize for size' (like -O2 minus size-increasing opts)
-Ofast 'Disregard strict standards compliance'
clang: as gcc, plus
-Oz    'Like -Os (and thus -O2), but reduces code size further'
(-O4   'currently like -O3')
-flto 'Generate output files in LLVM formats, suitable for link time optimization'
Also: a lot of -f… flags
Best regards -Tobias
On 07 Jun 2015, at 18:35, Tobias Pape Das.Linux@gmx.de wrote:
-Og ‘Optimize for debugging’ combined with -ggdb3
On Sun, Jun 7, 2015 at 8:58 AM, tim Rowledge tim@rowledge.org wrote:
On the other hand, I can’t see how it is acceptable for a compiler to produce code that blows up at one level of optimisation but not at another. Come to that I’m not sure why there are different levels; I can sort of see asking to optimise in different ways - the NorCroft compiler for ARM can be asked to optimise for runtime speed or executable size, for example.
I'm inclined to believe Cog is relying on undefined behavior somewhere, and this isn't gcc's fault. I've not built functional Newspeak VMs on modern compilers since ~1317, but this overlaps with the VM being broken on old compilers for other reasons so I haven't figured out what change is responsible.
On Mon, Jun 8, 2015 at 7:54 PM, Ryan Macnak rmacnak@gmail.com wrote:
I'm inclined to believe Cog is relying on undefined behavior somewhere, and this isn't gcc's fault. I've not built functional Newspeak VMs on modern compilers since ~1317, but this overlaps with the VM being broken on old compilers for other reasons so I haven't figured out what change is responsible.
Hi Ryan,
IMO the likely issue is register usage in trampoline calls. The JIT tries to reduce register saving and restoring across trampoline calls by using a notion of the ABI's caller-saved registers. Callee-saved registers shouldn't be an issue because either a run-time call returns to the same trampoline that invoked it, hence restoring callee-saved registers, or enters machine code via an enilopmart, which assumes no registers are live and restores any and all registers as appropriate. But there could be bugs here, and certainly gcc could change over versions, perhaps becoming more aggressive in register saving, and surfacing previously undetected bugs here.
One thing to do is compare a StackInterpreter VM against Cog, at least to locate the blame. Then, if the finger does point at the Cogit, to locate the issue after setting up a reproducible case, run with some kind of tracing (e.g. each message selector, but the Cogit could straight-forwardly add tracing to the trampolines) to see what the VM is doing immediately before the crash.
On Tue, Jun 9, 2015 at 8:36 AM, Eliot Miranda eliot.miranda@gmail.com wrote:
Hi Ryan,
IMO the likely issue is register usage in trampoline calls. The JIT
tries to reduce register saving and restoring across trampoline calls by using a notion of the ABI's caller-saved registers. Callee-saved registers shouldn't be an issue because either a run-time call returns to the same trampoline that invoked it, hence restoring callee-saved registers, or enters machine code via an enlopmart which assumes no registers are live and restores any and all registers as appropriate. But there could be bugs here, and certainly gcc could change over versions, perhaps becoming more aggressive in register saving, and surfacing previously undetected bugs here.
I've been using the same compiler versions though, did this change recently in Cog?
The runtime entries should be marked "extern". Compare a VM that works at -O3. https://github.com/dart-lang/sdk/blob/6542a451c38c650a5ce9323e474982384a5daf31/runtime/vm/runtime_entry.h#L67
Some quick and dirty sed hacking and this partially fixes NSVM on clang 3.4 (can now complete the test suite without crashing about half of the time).
find ns*src -name '*.c' -o -name '*.h' -print0 | xargs -0 sed -i '' -e 's/void ce/extern void ce/'
find ns*src -name '*.c' -o -name '*.h' -print0 | xargs -0 sed -i '' -e 's/void (*ce/extern void (*ce/'
find ns*src -name '*.c' -o -name '*.h' -print0 | xargs -0 sed -i '' -e 's/sqInt ce/extern sqInt ce/'
find ns*src -name '*.c' -o -name '*.h' -print0 | xargs -0 sed -i '' -e 's/VM_EXPORT extern/extern/'
find ns*src -name '*.c' -o -name '*.h' -print0 | xargs -0 sed -i '' -e 's/static extern/static/'
find ns*src -name '*.c' -o -name '*.h' -print0 | xargs -0 sed -i '' -e 's/extern sqInt cesoRetAddr/sqInt cesoRetAddr/'
One thing to do is compare a StackInterpreter VM against Cog, at least to locate the blame.
The stack VM is fine.
Hi Ryan,
On Tue, Jun 9, 2015 at 10:58 PM, Ryan Macnak rmacnak@gmail.com wrote:
OK, so the compiler is interpreting "void foo(decl)" differently to "extern void foo(decl); void foo(decl)". Back in the day these were equivalent and C compilers only treated "static" as meaningful, e.g. The C Programming Language, 2nd ed., sec A11.2 Linkage, p 228: "As discussed in §A10.2, the first external declaration for an identifier gives the identifier internal linkage if the static specifier is used, external linkage otherwise." But I guess providing optimized intra-compilation-unit linkage produces some benefits. Anyway, see VMMaker.oscog-eem.1349, which now spits out extern or static for each function declaration. I'll commit C source soon.
Hi Holger,
On Sun, Jun 7, 2015 at 12:32 AM, Holger Freyther holger@freyther.de wrote:
On 07 Jun 2015, at 01:25, tim Rowledge tim@rowledge.org wrote:
GCC is such fun.
given the amount of warnings emitted during compilation, have you considered that it might be the input given to gcc that is the issue here? ;)
You're very welcome to make changes to plugins to reduce warnings. I'm focussed on the core VM, and as Tim has said there are almost no warnings from that code. There is one warning from the Cogit that is inappropriate and I refuse to waste the code that would avoid it, and there are, I think, 8 warnings from the 32-bit Stack/CoInterpreter which are in integer conversion code which is conditionally compiled code not used in 32-bits. So in the code I have responsibility for I have eliminated all but the minimum of warnings. I am not the author of the plugins and don't presume to understand them all well enough to fix them. I would appreciate your help, rather than your criticism.
2015-06-07 20:24 GMT+02:00 Eliot Miranda eliot.miranda@gmail.com:
I'd like to help in this domain too. Unfortunately, eliminating warnings is a first step, but not enough (even with -Wall -Wextra ...). Many UB conditions are not detected (or the compiler would really be pedantic with false alarms). For example, testing for overflow in a post-condition is wrong, since overflow is UB and anything could happen... see bytecodePrimeMultiply http://smallissimo.blogspot.fr/2015/04/the-more-or-less-defined-behavior-we....
I think clang has some optional analyzer (like the one used in Xcode, but there might be more capabilities...)
Nicolas
Hi Nicolas,
On Sun, Jun 7, 2015 at 11:50 AM, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
You're quite right, and one area that is crucial with a JIT is calls from machine code into the C runtime, which must obey the ABI rules. None of the mistakes in this area will show up as C compiler warnings because they are to do with the code the JIT generates. Another issue is instruction cache flushing, and I suspect that's where the difference lies between the 900MHz Pi 2 that Tim is using, which is crashing, and the apparently reliable 700MHz B+ that I'm using.
-- best, Eliot
On 07 Jun 2015, at 20:24, Eliot Miranda eliot.miranda@gmail.com wrote:
Dear Eliot,
Tim is of the belief that the C compiler is wrong. From my experience, in most cases (yes, I have seen miscompilation on ARM as well) it is not the compiler. The right way is to understand what is actually wrong and then point fingers. This is why I challenged Tim.
In terms of compiler warnings. It is great that you have improved the situation and I very much appreciate it as a user of the VM.
holger
On 07-06-2015, at 11:51 AM, Holger Freyther holger@freyther.de wrote:
I’ve a bit of experience in using C & gcc on ARM to do Smalltalk things; something around 30 years at a guess, and pretty much none of it has endeared me to it. The NorCroft C compiler as used on RISC OS hasn’t ever messed me around in the way gcc frequently does. YMMV.
But yes, there is almost certainly something in the source code that has triggered this but since it runs on x86 ok and the *same* executable runs on a Pi B, I’m definitely tending to the ‘damn gcc bites me again’ direction.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Programmers do it bit by bit.
On 07-06-2015, at 11:24 AM, Eliot Miranda eliot.miranda@gmail.com wrote:
I'm focussed on the core VM, and as Tim has said there are almost no warnings from that code.
Better than ‘almost no’ - a single warning about sigsetjmp.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim "Bother" said Pooh, and deleted his message base
If I build on the pi2 with -O1 instead of -O2 the vm works apparently as well as it did early last week at -O2. At least this means I can keep working!
Once I have a running version I can run the SUnit test suite except for the FloatMathPluginTests which blow up several alternate Earths every time. The interesting observation this evening is that after starting the tests on one of my Pi B+’s and watching for five minutes or so to see if it would be happy, I started the tests on one of my Pi2’s. Within a few minutes (yes, I‘m being oh so precise, I know) the Pi2 had caught up and passed the number of tests the B+ had got to. A few minutes later it was waaaay ahead.
This is interesting because in raw low-level benchmarks the Pi2 isn’t very much faster (~20-30%) than the B+ units. So I’ve just run Eliot’s favourite benchmarks, the ShootoutTests in the CogBenchmark package. Comparing B+ to 2 we see

nbody    415  168  2.5X
bintree   88   46  1.9X
chredux  313   86  3.6X
thread   335  111  3X
Remember, this is the same vm, same image.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Useful random insult:- An 8086 in a StrongARM environment.
On 06-06-2015, at 9:13 PM, tim Rowledge tim@rowledge.org wrote:
This is interesting because in raw low-level benchmarks the Pi2 isn’t very much faster (~20-30%) than the B+ units. So I’ve just run Eliot’s favourite benchmarks, the ShootoutTests in the CogBenchmark package. Comparing B+ to 2 we see
Extending this to list the ‘plain interpreter vm’ which on the Pi is actually a fairly old 4.10.2-2793 vm, and running a recent-ish trunk image -
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Programmer: One who is too lacking in people skills to be a software engineer.
Err, there is supposed to be a pdf in that last email and it may display or may not…
On 07-06-2015, at 5:27 PM, tim Rowledge tim@rowledge.org wrote:
<Pi Performance tests.pdf>
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Useful random insult:- Got his brains as a stocking stuffer.
On Sun, Jun 07, 2015 at 05:29:25PM -0700, tim Rowledge wrote:
Err, there is supposed to be a pdf in that last email and it may display or may not…
Yes it came through fine.
On Sat, Jun 6, 2015 at 10:38 AM, tim Rowledge tim@rowledge.org wrote:
On 06-06-2015, at 10:18 AM, Eliot Miranda eliot.miranda@gmail.com wrote:
Ha! Turns out that at least for sends we're in the clear for out-of-line literal load. i.e. from https://www.raspberrypi.org/forums/viewtopic.php?f=72&t=78090
Excellent. So the only ‘fun’ is creating, managing and accessing the pools of out of line constants. Where should we place the pool though? My first thought was just in front of the ‘entry’ address but that would screw the assorted entry/nocheck offsets we have as constants. At the end, just before the metadata?
The idea is to dump them somewhere inconspicuous. The natural place is at the head of an else block, i.e. after an unconditional forward branch, or immediately following a return. Here's Object>>printOn: as an example:
41 <70> self
42 <C7> send: class
43 <D0> send: name
44 <69> popIntoTemp: 1
45 <10> pushTemp: 0
46 <88> dup
47 <11> pushTemp: 1
48 <D5> send: first
49 <D4> send: isVowel
50 <99> jumpFalse: 53
51 <23> pushConstant: 'an '
52 <90> jumpTo: 54
**put literals here**
53 <22> pushConstant: 'a '
54 <E1> send: nextPutAll:
55 <87> pop
56 <11> pushTemp: 1
57 <E1> send: nextPutAll:
58 <87> pop
59 <78> returnSelf
**and put literals here**
The pushConstant: 'a ' at 53 is only reached from the jump at 50. So dumping literals after the jump at 52 is good, as is after the final return.
However, if code is jump-less and return-less and refers to lots of literals a jump past a run of literals can be inserted as an emergency measure.
On 21-05-2015, at 12:47 AM, timfelgentreff timfelgentreff@gmail.com wrote:
Does the Pi1 work, too? Or are you using code specific to the newer cpu?
Yes, and no. Yet.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Useful random insult:- A prime candidate for natural deselection.
vm-dev@lists.squeakfoundation.org