[Vm-dev] Debugging Spur32 genPrimitiveHighBit
Nicolas Cellier
nicolas.cellier.aka.nice at gmail.com
Wed Feb 19 01:25:49 UTC 2020
I think my confusion came from the order in which the operands are specified.
The instruction reference lists them as
mod operand1:reg operand2:r/m
https://www.felixcloutier.com/x86/lzcnt (same as the Intel manuals)
versus the selector order, mod: ModReg RM: r/m RO: reg
Intel's documentation uses the same kind of middle-endian mixture too...
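
To make the two orderings concrete, here is a minimal sketch in C of how the ModR/M byte is packed for the register-direct form of LZCNT. The helper names modrm and encode_lzcnt are hypothetical and are not the Cogit's own functions; the register numbers are the standard IA32 ones. The only point is that the destination goes in the reg field (bits 5..3) and the source in the r/m field (bits 2..0).

```c
#include <stdint.h>

/* IA32 register numbers as they appear in the ModR/M byte. */
enum { EAX = 0, ECX = 1, EDX = 2, EBX = 3 };

/* Hypothetical helper: pack mod (2 bits), reg (3 bits), r/m (3 bits). */
static uint8_t modrm(unsigned mod, unsigned reg, unsigned rm)
{
    return (uint8_t)(((mod & 3u) << 6) | ((reg & 7u) << 3) | (rm & 7u));
}

/* LZCNT r32, r/m32 is F3 0F BD /r: destination in reg, source in r/m.
   mod = 3 selects the register-direct form. */
static void encode_lzcnt(uint8_t out[4], unsigned dest, unsigned src)
{
    out[0] = 0xF3;
    out[1] = 0x0F;
    out[2] = 0xBD;
    out[3] = modrm(3, dest, src);
}
```

With these assumptions, encode_lzcnt(code, EAX, EDX) yields f3 0f bd c2 (eax := leading zeros of edx), while swapping the two registers yields f3 0f bd d0, which is the byte sequence in the faulty disassembly quoted further down.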
I have also fixed the rexw:r:x:b: prefix in the X64 encoding of LZCNT and BSR
so that R8-R15 are handled (but did not regenerate, that can wait)
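
For readers unfamiliar with the REX prefix, the following is a rough sketch in C of what the x64 encoding has to do for R8-R15. The helpers rex and encode_lzcnt64 are hypothetical names for illustration, not the Cogit's rexw:r:x:b: method. It assumes the register-direct form only, with registers numbered 0..15: the low three bits of each register go in the ModR/M byte, the fourth bit of the reg operand goes in REX.R, and the fourth bit of the r/m operand goes in REX.B.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: build a REX prefix, bit layout 0100WRXB. */
static uint8_t rex(unsigned w, unsigned r, unsigned x, unsigned b)
{
    return (uint8_t)(0x40u | ((w & 1u) << 3) | ((r & 1u) << 2)
                           | ((x & 1u) << 1) | (b & 1u));
}

/* Encode LZCNT r64, r/m64 (register-direct): F3 REX.W 0F BD /r.
   dest and src are register numbers 0..15; returns the byte count. */
static size_t encode_lzcnt64(uint8_t *out, unsigned dest, unsigned src)
{
    out[0] = 0xF3;                            /* mandatory prefix, precedes REX */
    out[1] = rex(1, dest >> 3, 0, src >> 3);  /* R extends reg, B extends r/m */
    out[2] = 0x0F;
    out[3] = 0xBD;
    out[4] = (uint8_t)(0xC0u | ((dest & 7u) << 3) | (src & 7u));
    return 5;
}
```

Under these assumptions, dest = 0 (rax) and src = 2 (rdx) gives f3 48 0f bd c2, and dest = 8 (r8) with src = 9 (r9) gives f3 4d 0f bd c1. The same REX rules apply to BSR, which has no F3 prefix.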
On Wed, 19 Feb 2020 at 00:51, Nicolas Cellier <
nicolas.cellier.aka.nice at gmail.com> wrote:
> ModR/M is a byte which encodes instruction operands
> https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf
> mod is the mod field (2 bits) RM is the r/m field (3 bits), RO is the reg
> (or opcode) field (3 bits).
> For LZCNT dest is in RO, and mask (source) is in r/m.
> Same for BSR.
> So I had it wrong... will fix ASAP
>
> On Tue, 18 Feb 2020 at 23:37, Nicolas Cellier <
> nicolas.cellier.aka.nice at gmail.com> wrote:
>
>> It seems that I misinterpreted the order of mod:RM:RO:, it seems like O
>> means output, not operand...
>>
>> On Tue, 18 Feb 2020 at 22:53, Nicolas Cellier <
>> nicolas.cellier.aka.nice at gmail.com> wrote:
>>
>>> Hi all,
>>> I confirm that generated code for CLZ (LZCNT) is incorrect on IA32 arch.
>>> The registers are swapped!
>>>
>>> Here is an extract:
>>>
>>> 0x5c26a9c: 83 e0 01 andl $0x1, %eax
>>> 0x5c26a9f: eb 11 jmp 0x5c26ab2
>>> 0x5c26aa1: 90 nop
>>> 0x5c26aa2: 90 nop
>>> 0x5c26aa3: 90 nop
>>> 0x5c26aa4: 89 d0 movl %edx, %eax
>>> 0x5c26aa6: 83 e0 03 andl $0x3, %eax
>>> 0x5c26aa9: 75 f1 jne 0x5c26a9c
>>> 0x5c26aab: 8b 02 movl (%edx), %eax
>>> 0x5c26aad: 25 ff ff 3f 00 andl $0x3fffff, %eax ; imm = 0x3FFFFF
>>> 0x5c26ab2: 39 c8 cmpl %ecx, %eax
>>> 0x5c26ab4: 75 e0 jne 0x5c26a96
>>> 0x5c26ab6: f3 0f bd d0 lzcntl %eax, %edx
>>> 0x5c26aba: 74 0d je 0x5c26ac9
>>> 0x5c26abc: 35 1f 00 00 00 xorl $0x1f, %eax
>>> 0x5c26ac1: 89 c2 movl %eax, %edx
>>> 0x5c26ac3: d1 e2 shll %edx
>>> 0x5c26ac5: 83 c2 01 addl $0x1, %edx
>>> 0x5c26ac8: c3 retl
>>>
>>> What happens is that we count the leading zeros in $eax (TempReg) and
>>> store the result in $edx (ReceiverResultReg) ...
>>>
>>> 0x5c26ab6: f3 0f bd d0 lzcntl %eax, %edx
>>>
>>> We want the opposite!
>>>
>>> $eax contains 1, presumably because we used it to check for SmallInteger
>>> tag bit:
>>>
>>> 0x5c26a9c: 83 e0 01 andl $0x1, %eax
>>>
>>> So we invariably get 31 leading zeroes in $edx (but we will later
>>> overwrite the contents of $edx).
>>>
>>> Then we interpret $eax as the result (thus 1 leading zero), bit-invert it
>>> (xor with 31), and get 30 as the result for highBit, store that in $edx
>>> (shifted and tagged), and we're done... Err!
>>>
>>> Obviously the code generation is wrong!
>>> It did work when I first wrote it, and still works on x64 because we use
>>> $eax (TempReg) as both source and dest reg...
>>>
>>> Though, I do not see what we are doing wrong:
>>>
>>> concretizeClzRR
>>>     <inline: true>
>>>     | maskReg dest |
>>>     maskReg := operands at: 0.
>>>     dest := operands at: 1.
>>>     machineCode
>>>         at: 0 put: 16rF3;
>>>         at: 1 put: 16r0F;
>>>         at: 2 put: 16rBD;
>>>         at: 3 put: (self mod: ModReg RM: dest RO: maskReg).
>>>     ^4
>>>
>>> and we invoke it like this:
>>> cogit ClzR: srcReg R: destReg.
>>>
>>> ClzR: reg1 R: reg2
>>>     "reg2 := reg1 countLeadingZeros"
>>>     <inline: true>
>>>     <returnTypeC: #'AbstractInstruction *'>
>>>     ^self gen: ClzRR operand: reg1 operand: reg2
>>>
>>> So it seems to me that all is in the correct order...
>>> cogitIA32 likewise seems perfectly correct:
>>>
>>> static sqInt
>>> genPrimitiveHighBit(void)
>>> {
>>>     AbstractInstruction *anInstruction11;
>>>     AbstractInstruction *anInstruction2;
>>>     AbstractInstruction *anInstruction4;
>>>     AbstractInstruction *jumpNegativeReceiver;
>>>     AbstractInstruction *jumpNegativeReceiver11;
>>>     AbstractInstruction *jumpNegativeReceiver3;
>>>     sqInt literal1;
>>>
>>>
>>>     /* remove excess tag bits from the receiver oop */
>>>
>>>     /* and use the abstract cogit facility for case of single tag-bit */
>>>     /* begin genHighBitIn:ofSmallIntegerOopWithSingleTagBit: */
>>>     if (((ceCheckLZCNT()) & (1U << 5)) != 0) {
>>>         /* begin genHighBitClzIn:ofSmallIntegerOopWithSingleTagBit: */
>>>         genoperandoperand(ClzRR, ReceiverResultReg, TempReg);
>>>         if (!(setsConditionCodesFor(lastOpcode(), JumpZero))) {
>>>             /* begin checkQuickConstant:forInstruction: */
>>>             anInstruction2 = genoperandoperand(CmpCqR, 0, TempReg);
>>>         }
>>>
>>>         /* Note the nice bit trick below:
>>>            highBit_1based_of_small_int_value = (BytesPerWord * 8) - leadingZeroCout_of_oop - 1 toAccountForTagBit.
>>>            This is like 2 complements (- reg - 1) on (BytesPerWord * 8) log2 bits, or exactly a bit invert operation... */
>>>         jumpNegativeReceiver3 = genConditionalBranchoperand(JumpZero, ((sqInt)0));
>>>         /* begin checkLiteral:forInstruction: */
>>>         literal1 = (BytesPerWord * 8) - 1;
>>>         anInstruction11 = genoperandoperand(XorCwR, (BytesPerWord * 8) - 1, TempReg);
>>>         jumpNegativeReceiver = jumpNegativeReceiver3;
>>>         goto l10;
>>>     }
>>>
>>> which is concretized as:
>>>
>>> case ClzRR:
>>>     /* begin concretizeClzRR */
>>>     maskReg = ((self_in_dispatchConcretize->operands))[0];
>>>     dest = ((self_in_dispatchConcretize->operands))[1];
>>>     ((self_in_dispatchConcretize->machineCode))[0] = 243;
>>>     ((self_in_dispatchConcretize->machineCode))[1] = 15;
>>>     ((self_in_dispatchConcretize->machineCode))[2] = 189;
>>>     ((self_in_dispatchConcretize->machineCode))[3] = (modRMRO(self_in_dispatchConcretize, ModReg, dest, maskReg));
>>>     return 4;
>>>
>>> The order seems correct all the way down...
>>> As a workaround, I could revert Eliot's optimization and force a
>>> cogit MoveR: ReceiverResultReg R: TempReg.
>>> But I would rather understand where the problem is...
>>> Another pair of eyes may help!
>>>
>>>
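
The arithmetic in the original report can be checked with a small sketch in C. It is not Cogit code: clz32 is a portable stand-in for LZCNT, the function names are hypothetical, and the value 1 left in $eax is the assumption stated in the report (the result of the tag-bit test). The negative-receiver branch (je after lzcnt) is not modelled. On 32 bits, 31 - n equals n xor 31 for any n in 0..31, which is the bit trick mentioned in the generated comment.

```c
#include <stdint.h>

/* Portable leading-zero count; returns 32 for 0, as LZCNT does. */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Intended sequence: eax := lzcnt(edx), where edx holds the tagged oop,
   then eax ^= 31. */
static unsigned high_bit_intended(uint32_t oop)
{
    return clz32(oop) ^ 31u;
}

/* Sequence as generated: edx := lzcnt(eax), then eax ^= 31.
   Assumption from the report: eax still holds 1 from the tag test. */
static unsigned high_bit_as_generated(uint32_t oop)
{
    uint32_t eax = 1u;
    uint32_t edx = clz32(eax);  /* always 31, later overwritten */
    (void)edx;
    (void)oop;                  /* the receiver is never examined */
    return eax ^ 31u;           /* always 30 */
}
```

For the SmallInteger 5, whose tagged oop is (5 << 1) | 1 = 11, the intended sequence gives 3 while the generated one gives 30, matching the behaviour described in the report.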
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Capture d’écran 2020-02-19 à 02.13.57.png
Type: image/png
Size: 22889 bytes
Desc: not available
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20200219/ba2c1b77/attachment-0003.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Capture d’écran 2020-02-19 à 02.17.06.png
Type: image/png
Size: 13897 bytes
Desc: not available
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20200219/ba2c1b77/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Capture d’écran 2020-02-19 à 02.22.21.png
Type: image/png
Size: 31517 bytes
Desc: not available
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20200219/ba2c1b77/attachment-0005.png>