[Vm-dev] Aarch64 number bug

Mon Jul 5 18:32:10 UTC 2021

Hi Eliot

> On 2. Jul 2021, at 16:07, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> 
> […]
> 
> From what I see, tho, is that the MUL and the SMULH instruction order has been reversed.
> 
> Indeed.  I created a regression.  Apologies for all this effort over my mistake.

This happens :)

> 
>> 
>> (Also, the arm blog uses an different interesting method of detecting overflow,
>> by missusing the argument shitfter of CMP: https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/detecting-overflow-from-mul
>> Although this is written for older arm, it seems to be applicable to the 64-variant too)
>> 
> So here's how I decided to do it on AArch64.  Maybe you can find something better.
> 
> AArch64 implement 64 x 64 => 128 bit multiply by providing a 64x64 => low 64 bits and a 64x64 => high 64 bits multiply instruction.  So if a 64x64=>128 multiply has not overflowed the high 64 bits will be an extension of the sign bit of the 64x64=>low 64 instruction.  

Yes.

> i.e. high 64 will either be all 0s or all 1s (all 1s = -1).

Yes

> So if one subtracts the sign bit from the high 64 result one gets 0 if there is no overflow.  Here's the instruction sequence:
> 
> 	smulh	x9/RISCTemp, x4/Class, x8/Arg1
> 	mul		x4/Class, x4/Class, x8/Arg1
> 	lsr		x1, x4/Class, #63

Ok, cases are:

		OVF	x9	x4[63]	x1
	1.	N	ff..	1     	1
	2.	Y	ff..	0     	0
	3.	Y	00..	1     	1
	4.	N	00..	0     	0
	5.	Y	1-7f..	1     	1
	6.	Y	1-7f..	0     	0

x1 is either 0 or 1, because LSR shifts in 0.

> 	adds	x9/RISCTemp, x9/RISCTemp, x1

		OVF	x9 	x4[63]	x1	x9'   	V Z N
	1.	N	ff..  	1     	1	0..0  	1 1 0
	2.	Y	ff..  	0     	0	ff..  	0 0 1
	3.	Y	00..  	1     	1	0..1  	0 0 0
	4.	N	00..  	0     	0	0..0  	0 1 0
	5a.	Y	1-7f.e	1     	1	2-7f.f	0 0 0
	5b.	Y	7f..f	1     	1	80..0 	0 0 1
	6.	Y	1-7f.f	0     	0	1-7f.f	0 0 0

> 	b.ne		0x44b8  // b.any = *@88 (to overflow)

B.NE branches on Z = 0

== So, yes, this does what you want. ==

Let me also try to adapt what the blog post suggested:

	smulh	x9/RISCTemp, x4/Class, x8/Arg1
	mul  	x4/Class, x4/Class, x8/Arg1
	cmp  	x9/RISCTemp, x4/Class, ASR #63
	b.ne	// on overflow

With the ASR modifier on x4 effectively propagating the signbit all through the LSB just for this operation

		OVF	x9	x4[63]	x4 ASR	intern	Z
	1.	N	ff..	1     	ff..f	00..	1
	2.	Y	ff..	0     	00..0	ff..	0
	3.	Y	00..	1     	ff..f	00..01	0
	4.	N	00..	0     	00..0	00..	1
	5.	Y	1-7f..	1     	ff..f	2-80..0	0
	6.	Y	1-7f..	0     	00..0	1-7ff..	0

You would generate that as:

CMP(ASR) = 
	2r1101011100 << 22
	+ (reg1 << 16) "x4/Class"
	+ (63 << 10) "ASR shift amount"
	+ (reg2 << 5) "x9/RISCTemp"
	+ XZR "This differentiates CMP from SUBS"

This has the following properties:

- its just 3 concrete ARM instruction, so no need to split it.
- After the sequence:
	- x4/Class has (x4/Class * x8/Arg1)
	- x9/RISCTemp is either all 00..00 or all ff..ff
	- Z == 1 on no overflow, Z == 0 on overflow.
- It "speaks" "compare all x9 bits to bit 63 of x4"

=====

Both things get the job done.

I think the original lsr+adds can be improved, tho.
First, you can push the lsr op into the shifter-arg of adds.
Second, you can use xzr as dest reg, which is equivalent to cmn (compare neg). [1]

But whether to compare against the (right-)_propagated_ sign bit or compare-negative against the right-_shifted_ sign bit 
makes no difference.

That was fun!

:D

Best regards
	-Tobias

[1] Beware, just recently, Raymond Chen pointed out the "lie in the CMN": https://devblogs.microsoft.com/oldnewthing/20210607-00/?p=105288

> 
> In the JIT's assembler back end for ARMv8 this is done by two methods:
> 
> concretizeMulOverflowRRR
> 	"ARMv8 has no multiply overflow detection.  Instead it is synthesized from the two halves of
> 	 a 64x64=>128 bit multiply. The upper 64-bits are tested.  The sequence is
> 		high64 := SMULH a,b
> 		a/low64 := MUL a,b
> 		signBit := low64/a >> 63
> 		high64 := high64 + signBit
> 	 If high64 is zero after this sequence then the multiply has not overflowed,
> 	 since high64 is an extension of signBit if no overflow (either 0 or -1) and
> 	 both -1 + 1 = 0 and 0 + 0 = 0. Note that because we restrict ourselves to
> 	 three concrete ARMv8 instructions per abstract instruction the last operation
> 	 of the sequence is generated in concretizeMulOverflowJump.
> 
> 	 C6.2.196	MUL				C6-1111
> 	 C6.2.242	SMULH				C6-1184
> 	 C6.2.180	LSR (immediate)	C6-1081	110100110 (1)"
> 
> 	<inline: true>
> 	| reg1 reg2 reg3 |
> 	reg1 := operands at: 0.
> 	reg2 := operands at: 1.
> 	reg3 := operands at: 2.
> 	"RISCTempReg := high(reg1 * reg2); must precede destructive MUL"
> 	machineCode
> 		at: 0
> 		put: 2r1001101101 << 22
> 			+ (reg1 << 16)
> 			+ (XZR << 10)
> 			+ (reg2 << 5)
> 			+ RISCTempReg.
> 	"reg3 := reg1 * reg2"
> 	machineCode
> 		at: 1
> 		put: 2r10011011 << 24
> 			+ (reg1 << 16)
> 			+ (XZR << 10)
> 			+ (reg2 << 5)
> 			+ reg3.
> 	"CArg1Reg := sign(reg3)"
> 	machineCode
> 		at: 2
> 		put: 2r1101001101 << 22
> 			+ (63 << 16) "constant to shift by"
> 			+ (63 << 10)
> 			+ (reg3 << 5)
> 			+ CArg1Reg. "cuz CArg0Reg == TempReg"
> 	"RISCTempReg := RISCTempReg + CArg1Reg/sign
> 	 is in concretizeMulOverflowJump"
> 	"cogit processor disassembleInstructionAt: 0 In: machineCode object"
> 	"cogit processor disassembleInstructionAt: 4 In: machineCode object"
> 	"cogit processor disassembleInstructionAt: 8 In: machineCode object"
> 	^12
> 
> concretizeMulOverflowJump
> 	"Sizing/generating jumps.
> 		Jump targets can be to absolute addresses or other abstract instructions.
> 		Generating initial trampolines instructions may have no maxSize and be to absolute addresses.
> 		Otherwise instructions must have a machineCodeSize which must be kept to."
> 	<inline: false>
> 	| offset |
> 	offset := self computeJumpTargetOffset - 4. "-4 because the jump is from the second word..."
>  	self assert: (offset ~= 0 and: [self isInImmediateBranchRange: offset]).
> 	"See concretizeMulOverflowRRR
> 	 RISCTempReg := RISCTempReg + CArg1Reg/sign.
> 	 JumpZero/NonZero"
> 	machineCode
> 		at: 0
> 		put: 2r10001011 << 24
> 			+ (ArithmeticAddS << 29)
> 			+ (CArg1Reg << 16)
> 			+ (RISCTempReg << 5)
> 			+ RISCTempReg;
> 		at: 1
> 		put: (self
> 				cond: (opcode = JumpMulOverflow ifTrue: [NE] ifFalse: [EQ])
> 				offset: offset). "B offset"
> 	^8
> 
> Best regards
>         -Tobias
> 
> 
> 
> > cheers
> > 
> > bruce
> > 
> > On 2021-06-20T22:44:36.000+02:00, Bruce O'Neel <bruce.oneel at pckswarms.ch> wrote:
> > Playing with Git bisect gets us to the commit below as the first bad one.
> > 
> > I started the git bisect with the two commits below.  The first one is bad, the second is the good one.
> > 
> > a783502b249c4a4fedc88b6e07837d405feab144 - zero9 - builds, zero error.
> > 
> > 9f73148b8da4bc00278b83faa8da6b1c418fa54f - zero10 - builds works
> > 
> > Now this is not 100% guaranteed because one of the commits that git bisect chose had a build error so I had to mark that one as bad.  But the commit found does sound possible.
> > 
> > f632ee2888014ee88330ee994e13c9c609b57b5f is the first bad commit
> > commit f632ee2888014ee88330ee994e13c9c609b57b5f
> > Author: Eliot Miranda <eliot.miranda at gmail.com>
> > Date:   Wed Sep 2 10:45:27 2020 -0700
> > 
> >     CogVM source as per VMMaker.oscog-eem.2799/ClosedVMMaker-eem.98
> >     
> >     Ha!  I am *STUPID*.
> >     Integer overflow is not only determined by the upper 64-bits of a 64x64=>128 bit
> >     multiply being either all zero or all ones (0 or -1), but by the upper 64-bits
> >     being an extension of the most significant bit of the lower 64 bits!!
> > 
> > :040000 040000 ffb8fbcd8ab5e2ca1936429b2daaade62909c178 a5ac89d8b1d4980c5329835cdd3d4a2387b1fac5 M      nsspur64src
> > :040000 040000 e606dac14ee1db4ac59b0949bb03cd8e657d7aa7 be666051cd1d52d0055e319432b66a0fba61063e M      spur64src
> > :040000 040000 1dea4faf99821c60e2c2461076bfe8b99d4dea9b afbef904e8cbc7e4b80e1606420dd6293e12f3e9 M      spurlowcode64src
> > :040000 040000 f66c681004806d5880af7c0a3115038f4f2f3361 0e8d76bedcf22b6ad7497ea2e3dc44f3c36c2f3a M      spursista64src
> > 
> > On 2021-06-20T21:48:49.000+02:00, Craig Latta <craig at blackpagedigital.com> wrote:
> > Hi Ken--
> > 
> > I brought up a development environment on a Raspberry Pi 4.
> > 
> > > Unfortunately, my attempt to build a VM simulator on aarch64 fails due
> > > to a divide by zero bug.
> > >
> > > Perhaps because
> > > 16r8000000000000000 printStringHex. ==> '0'
> > 
> > update-eem-463.mcm has a postload action which turns on Sista, 
> > which recompiles the system, which tries to recompile 
> > FloatArrayTest>>testFloatArrayPluginPrimitiveAtPut, which exercises this 
> > very bug during scanning, when trying to create the float 1e-127.
> > 
> > For now I've rewritten testFloatArrayPluginPrimitiveAtPut and 
> > testFloatArrayPluginPrimitiveAt on my system to do nothing. When 
> > building GdbARMv8Plugin, there's some other kind of 
> > header-include-ordering mayhem, similar to the problem with Linux 
> > features.h when building the VM. (At least two levels of it, between 
> > bfd.h and the aarch64 simulator's config.h, and config.h and other 
> > system headers.) I expect no one has successfully built GdbARMv8Plugin 
> > on anything but a M1 macOS machine so far.
> > 
> > 
> > -C