[Vm-dev] Aarch64 number bug
Tobias Pape
Das.Linux at gmx.de
Mon Jul 5 18:32:10 UTC 2021
Hi Eliot
> On 2. Jul 2021, at 16:07, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>
> […]
>
> From what I see, tho, is that the MUL and the SMULH instruction order has been reversed.
>
> Indeed. I created a regression. Apologies for all this effort over my mistake.
This happens :)
>
>>
>> (Also, the arm blog uses an different interesting method of detecting overflow,
>> by missusing the argument shitfter of CMP: https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/detecting-overflow-from-mul
>> Although this is written for older arm, it seems to be applicable to the 64-variant too)
>>
> So here's how I decided to do it on AArch64. Maybe you can find something better.
>
> AArch64 implement 64 x 64 => 128 bit multiply by providing a 64x64 => low 64 bits and a 64x64 => high 64 bits multiply instruction. So if a 64x64=>128 multiply has not overflowed the high 64 bits will be an extension of the sign bit of the 64x64=>low 64 instruction.
Yes.
> i.e. high 64 will either be all 0s or all 1s (all 1s = -1).
Yes
> So if one subtracts the sign bit from the high 64 result one gets 0 if there is no overflow. Here's the instruction sequence:
>
> smulh x9/RISCTemp, x4/Class, x8/Arg1
> mul x4/Class, x4/Class, x8/Arg1
> lsr x1, x4/Class, #63
Ok, cases are:
OVF x9 x4[63] x1
1. N ff.. 1 1
2. Y ff.. 0 0
3. Y 00.. 1 1
4. N 00.. 0 0
5. Y 1-7f.. 1 1
6. Y 1-7f.. 0 0
x1 is either 0 or 1, because LSR shifts in 0.
> adds x9/RISCTemp, x9/RISCTemp, x1
OVF x9 x4[63] x1 x9' V Z N
1. N ff.. 1 1 0..0 1 1 0
2. Y ff.. 0 0 ff.. 0 0 1
3. Y 00.. 1 1 0..1 0 0 0
4. N 00.. 0 0 0..0 0 1 0
5a. Y 1-7f.e 1 1 2-7f.f 0 0 0
5b. Y 7f..f 1 1 80..0 0 0 1
6. Y 1-7f.f 0 0 1-7f.f 0 0 0
> b.ne 0x44b8 // b.any = *@88 (to overflow)
B.NE branches on Z = 0
== So, yes, this does what you want. ==
Let me also try to adapt what the blog post suggested:
smulh x9/RISCTemp, x4/Class, x8/Arg1
mul x4/Class, x4/Class, x8/Arg1
cmp x9/RISCTemp, x4/Class, ASR #63
b.ne // on overflow
With the ASR modifier on x4 effectively propagating the signbit all through the LSB just for this operation
OVF x9 x4[63] x4 ASR intern Z
1. N ff.. 1 ff..f 00.. 1
2. Y ff.. 0 00..0 ff.. 0
3. Y 00.. 1 ff..f 00..01 0
4. N 00.. 0 00..0 00.. 1
5. Y 1-7f.. 1 ff..f 2-80..0 0
6. Y 1-7f.. 0 00..0 1-7ff.. 0
You would generate that as:
CMP(ASR) =
2r1101011100 << 22
+ (reg1 << 16) "x4/Class"
+ (63 << 10) "ASR shift amount"
+ (reg2 << 5) "x9/RISCTemp"
+ XZR "This differentiates CMP from SUBS"
This has the following properties:
- its just 3 concrete ARM instruction, so no need to split it.
- After the sequence:
- x4/Class has (x4/Class * x8/Arg1)
- x9/RISCTemp is either all 00..00 or all ff..ff
- Z == 1 on no overflow, Z == 0 on overflow.
- It "speaks" "compare all x9 bits to bit 63 of x4"
=====
Both things get the job done.
I think the original lsr+adds can be improved, tho.
First, you can push the lsr op into the shifter-arg of adds.
Second, you can use xzr as dest reg, which is equivalent to cmn (compare neg). [1]
But whether to compare against the (right-)_propagated_ sign bit or compare-negative against the right-_shifted_ sign bit
makes no difference.
That was fun!
:D
Best regards
-Tobias
[1] Beware, just recently, Raymond Chen pointed out the "lie in the CMN": https://devblogs.microsoft.com/oldnewthing/20210607-00/?p=105288
>
> In the JIT's assembler back end for ARMv8 this is done by two methods:
>
> concretizeMulOverflowRRR
> "ARMv8 has no multiply overflow detection. Instead it is synthesized from the two halves of
> a 64x64=>128 bit multiply. The upper 64-bits are tested. The sequence is
> high64 := SMULH a,b
> a/low64 := MUL a,b
> signBit := low64/a >> 63
> high64 := high64 + signBit
> If high64 is zero after this sequence then the multiply has not overflowed,
> since high64 is an extension of signBit if no overflow (either 0 or -1) and
> both -1 + 1 = 0 and 0 + 0 = 0. Note that because we restrict ourselves to
> three concrete ARMv8 instructions per abstract instruction the last operation
> of the sequence is generated in concretizeMulOverflowJump.
>
> C6.2.196 MUL C6-1111
> C6.2.242 SMULH C6-1184
> C6.2.180 LSR (immediate) C6-1081 110100110 (1)"
>
> <inline: true>
> | reg1 reg2 reg3 |
> reg1 := operands at: 0.
> reg2 := operands at: 1.
> reg3 := operands at: 2.
> "RISCTempReg := high(reg1 * reg2); must precede destructive MUL"
> machineCode
> at: 0
> put: 2r1001101101 << 22
> + (reg1 << 16)
> + (XZR << 10)
> + (reg2 << 5)
> + RISCTempReg.
> "reg3 := reg1 * reg2"
> machineCode
> at: 1
> put: 2r10011011 << 24
> + (reg1 << 16)
> + (XZR << 10)
> + (reg2 << 5)
> + reg3.
> "CArg1Reg := sign(reg3)"
> machineCode
> at: 2
> put: 2r1101001101 << 22
> + (63 << 16) "constant to shift by"
> + (63 << 10)
> + (reg3 << 5)
> + CArg1Reg. "cuz CArg0Reg == TempReg"
> "RISCTempReg := RISCTempReg + CArg1Reg/sign
> is in concretizeMulOverflowJump"
> "cogit processor disassembleInstructionAt: 0 In: machineCode object"
> "cogit processor disassembleInstructionAt: 4 In: machineCode object"
> "cogit processor disassembleInstructionAt: 8 In: machineCode object"
> ^12
>
> concretizeMulOverflowJump
> "Sizing/generating jumps.
> Jump targets can be to absolute addresses or other abstract instructions.
> Generating initial trampolines instructions may have no maxSize and be to absolute addresses.
> Otherwise instructions must have a machineCodeSize which must be kept to."
> <inline: false>
> | offset |
> offset := self computeJumpTargetOffset - 4. "-4 because the jump is from the second word..."
> self assert: (offset ~= 0 and: [self isInImmediateBranchRange: offset]).
> "See concretizeMulOverflowRRR
> RISCTempReg := RISCTempReg + CArg1Reg/sign.
> JumpZero/NonZero"
> machineCode
> at: 0
> put: 2r10001011 << 24
> + (ArithmeticAddS << 29)
> + (CArg1Reg << 16)
> + (RISCTempReg << 5)
> + RISCTempReg;
> at: 1
> put: (self
> cond: (opcode = JumpMulOverflow ifTrue: [NE] ifFalse: [EQ])
> offset: offset). "B offset"
> ^8
>
> Best regards
> -Tobias
>
>
>
> > cheers
> >
> > bruce
> >
> > On 2021-06-20T22:44:36.000+02:00, Bruce O'Neel <bruce.oneel at pckswarms.ch> wrote:
> > Playing with Git bisect gets us to the commit below as the first bad one.
> >
> > I started the git bisect with the two commits below. The first one is bad, the second is the good one.
> >
> > a783502b249c4a4fedc88b6e07837d405feab144 - zero9 - builds, zero error.
> >
> > 9f73148b8da4bc00278b83faa8da6b1c418fa54f - zero10 - builds works
> >
> > Now this is not 100% guaranteed because one of the commits that git bisect chose had a build error so I had to mark that one as bad. But the commit found does sound possible.
> >
> > f632ee2888014ee88330ee994e13c9c609b57b5f is the first bad commit
> > commit f632ee2888014ee88330ee994e13c9c609b57b5f
> > Author: Eliot Miranda <eliot.miranda at gmail.com>
> > Date: Wed Sep 2 10:45:27 2020 -0700
> >
> > CogVM source as per VMMaker.oscog-eem.2799/ClosedVMMaker-eem.98
> >
> > Ha! I am *STUPID*.
> > Integer overflow is not only determined by the upper 64-bits of a 64x64=>128 bit
> > multiply being either all zero or all ones (0 or -1), but by the upper 64-bits
> > being an extension of the most significant bit of the lower 64 bits!!
> >
> > :040000 040000 ffb8fbcd8ab5e2ca1936429b2daaade62909c178 a5ac89d8b1d4980c5329835cdd3d4a2387b1fac5 M nsspur64src
> > :040000 040000 e606dac14ee1db4ac59b0949bb03cd8e657d7aa7 be666051cd1d52d0055e319432b66a0fba61063e M spur64src
> > :040000 040000 1dea4faf99821c60e2c2461076bfe8b99d4dea9b afbef904e8cbc7e4b80e1606420dd6293e12f3e9 M spurlowcode64src
> > :040000 040000 f66c681004806d5880af7c0a3115038f4f2f3361 0e8d76bedcf22b6ad7497ea2e3dc44f3c36c2f3a M spursista64src
> >
> > On 2021-06-20T21:48:49.000+02:00, Craig Latta <craig at blackpagedigital.com> wrote:
> > Hi Ken--
> >
> > I brought up a development environment on a Raspberry Pi 4.
> >
> > > Unfortunately, my attempt to build a VM simulator on aarch64 fails due
> > > to a divide by zero bug.
> > >
> > > Perhaps because
> > > 16r8000000000000000 printStringHex. ==> '0'
> >
> > update-eem-463.mcm has a postload action which turns on Sista,
> > which recompiles the system, which tries to recompile
> > FloatArrayTest>>testFloatArrayPluginPrimitiveAtPut, which exercises this
> > very bug during scanning, when trying to create the float 1e-127.
> >
> > For now I've rewritten testFloatArrayPluginPrimitiveAtPut and
> > testFloatArrayPluginPrimitiveAt on my system to do nothing. When
> > building GdbARMv8Plugin, there's some other kind of
> > header-include-ordering mayhem, similar to the problem with Linux
> > features.h when building the VM. (At least two levels of it, between
> > bfd.h and the aarch64 simulator's config.h, and config.h and other
> > system headers.) I expect no one has successfully built GdbARMv8Plugin
> > on anything but a M1 macOS machine so far.
> >
> >
> > -C
More information about the Vm-dev
mailing list