[Vm-dev] Fixing crashes on delivery of SIGIO etc [Was Re: [Pharo-dev] Seg Fault Pharo 7.0.3, Was Difficult to debug VM crash with full blocks and Sista V1]

Eliot Miranda eliot.miranda at gmail.com
Wed Oct 9 21:24:04 UTC 2019


On Wed, Oct 9, 2019 at 2:00 PM Nicolas Cellier <
nicolas.cellier.aka.nice at gmail.com> wrote:

>
> Hi Eliot,
> Yesterday I thought I would opt for 4) because it's TSTTCPW.
> Of course it's very hackish and trampoline specific.
> But loading stack and frame pointer is very specific.
>

We are of the same mind.  When I look at the code, it is currently always
safe to do this.  So implementing #4 and documenting the assumption
properly is, I think, enough.  We can then revert the change that made
installing sigaltstack the default for a COGVM, and test properly.

P.S. Here's how I search for all occurrences:

| bindings |
"Browse every method that passes the FPReg or SPReg binding as the register
 argument of a MoveAw:R: send, i.e. that loads an absolute address into the
 stack or frame pointer."
bindings := { Cogit bindingOf: #FPReg. Cogit bindingOf: #SPReg }.
self systemNavigation browseAllSelect:
	[:m|
	m scanForInstructionSequence:
		[:a :b|
		a selector == #pushLiteralVariable: and: [(bindings includes: a argument)
			and: [b selector == #send:super:numArgs: and: [b argument == #MoveAw:R:]]]]]

Searching for MoveR:Aw: isn't nearly as easy :-(.



>
> On Wed, Oct 9, 2019 at 10:10 PM Eliot Miranda <eliot.miranda at gmail.com>
> wrote:
>
>>
>> Hi Nicolas, Clément, et al,
>>
>> On Tue, Oct 8, 2019 at 1:40 PM Nicolas Cellier <
>> nicolas.cellier.aka.nice at gmail.com> wrote:
>>
>>> More on the problem that Eliot is speaking about: it can happen under these
>>> conditions: SIGIO is delivered while executing any trampoline transiting from
>>> SmalltalkToCStackSwitch.
>>> What happens here?
>>> What happens here?
>>>
>>> We must load the 64-bit contents of memory (cStackPointerAddress) into the
>>> stack pointer register (SPReg = %rsp) - see genLoadCStackPointer.
>>> This is done using the CogRTLOpcode abstract instruction MoveAwR.
>>> But there is no matching instruction on x86-64... we can only load a
>>> 64-bit memory operand at an absolute address into %rax.
>>> So the idea is to generate this sequence
>>> (CogX64Compiler>>concretizeMoveAwR):
>>>     xchgq  %rsp, %rax
>>>     movabsq 0x10027c338, %rax ; cStackPointerAddress
>>>     xchgq  %rsp, %rax
>>> That's clever because it preserves %rax, which could be in use when we
>>> want to MoveAwR.
>>>
>>
>> Clever, but in the case of %rsp and %rbp, dead wrong :blush:.  The
>> constraint is that x86_64 only provides a 64-bit load from an absolute
>> address into %rax; no other register can be used.
>>
>>
>>> But it has an unfortunate side effect: the stack pointer temporarily
>>> holds the contents of %rax and can thus temporarily point anywhere.
>>>
>>
>> Right; which is unacceptable.  It's a tiny window, but one we hit all the
>> time.
>>
>>
>>> What happens when running the Squeak SocketTest is that we use some
>>> trampolines (I guess for invoking primitives, for example) and we generate
>>> SIGIO (for some reason there are a lot of SIGIOs generated on my
>>> particular macOS machine, so I can trigger the bug more easily than Eliot).
>>> We previously installed a handler for SIGIO via signal().  When we use
>>> signal(), the handler shares the stack pointer with the user program.
>>> If the signal is delivered in between the two xchgq instructions above,
>>> the signal handler will then run with a corrupted stack pointer pointing
>>> anywhere (depending on the contents of %rax); when the VM enters the
>>> signal handler function it uses the stack pointer to save some state, and
>>> so corrupts a memory zone, segfaults, or whatever.
>>> In the case described in the opensmalltalk vm-dev thread, %rax was
>>> pointing into the generated code zone (the JITted methodZone), so we
>>> corrupted the generated code and were soon punished for it.  But it is
>>> probable that there are other (rare) occurrences of this bug.
>>>
>>> I'm not sure if this is causing the bugs described by Sean, but it's
>>> important to apply Eliot's fix ASAP and retry.
>>> There might be other occurrences of signal(SIGIO, forceInterruptCheck) in
>>> the minheadless flavour; I did not check whether Eliot corrected those
>>> too.  If not, they should also be corrected ASAP, and every usage of
>>> signal() should be replaced by sigaction() with the appropriate flags to
>>> use sigaltstack() - see Eliot's commit for details.
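>>>
>>> For illustration, here is a minimal sketch of that replacement (this is
>>> not Eliot's actual commit; the handler body, stack size and error
>>> handling are hypothetical):
>>>
>>>     /* Deliver SIGIO on its own stack so the handler never runs on
>>>        whatever SPReg happens to hold in JITted code or a trampoline. */
>>>     #include <signal.h>
>>>     #include <stdio.h>
>>>     #include <stdlib.h>
>>>
>>>     static void handleSIGIO(int sig)
>>>     {
>>>         (void)sig;   /* the real VM's handler is forceInterruptCheck() */
>>>     }
>>>
>>>     static void installSIGIOHandler(void)
>>>     {
>>>         stack_t ss;
>>>         struct sigaction sa;
>>>
>>>         ss.ss_sp = malloc(SIGSTKSZ);           /* dedicated signal stack */
>>>         ss.ss_size = SIGSTKSZ;
>>>         ss.ss_flags = 0;
>>>         if (sigaltstack(&ss, NULL) < 0)
>>>             perror("sigaltstack");
>>>
>>>         sa.sa_handler = handleSIGIO;
>>>         sigemptyset(&sa.sa_mask);
>>>         sa.sa_flags = SA_ONSTACK | SA_RESTART; /* SA_ONSTACK is the key flag */
>>>         if (sigaction(SIGIO, &sa, NULL) < 0)
>>>             perror("sigaction");
>>>     }
>>>
>>> The essential pieces are sigaltstack() and SA_ONSTACK: with them the
>>> handler's stack pointer comes from the dedicated stack rather than from
>>> whatever the interrupted code left in SPReg.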
>>>
>>
>> I want to discuss potential fixes with Clément and Nicolas, and anyone
>> else interested. So...
>>
>> 1. the straightforward fix is to generate different code for setting
>> %rsp and %rbp.  I shall do this very soon.  On x86_64 we dedicate a
>> register for code generation purposes, called RISCTempReg, which is
>> either %r8 (SysV ABI (unix)) or %r11 (Windows).  RISCTempReg is never
>> assumed to be live except within a single instruction sequence.  We can
>> swap with this before assigning to %rsp, e.g.
>>
>>     xchgq  %r8, %rax
>>     movabsq 0x10027c338, %rax ; cStackPointerAddress
>>     xchgq  %r8, %rax
>>     movq %r8, %rsp
>>
>> this adds a couple of bytes, but is reliable.
>>
>> 2. one could imagine exchanging TempReg and RISCTempReg, i.e. TempReg
>> would be either %r8 or %r11, and RISCTempReg would be %rax.  That would
>> allow
>>
>>     movabsq 0x10027c338, %rax ; cStackPointerAddress
>>     movq %rax, %rsp
>>
>> this at least worked on Ryan's MIPS32 back end when it was in use (no one
>> has tested it in a while AFAIA), where TempReg is S5 (r21), RISCTempReg is
>> AT (r1) and CResultRegister is V0 (r2).  So this is worth investigating.
>> [I had tried something similar with HPS and it completely broke the code
>> generator, so my gut is trying to tell me this will never work; Ryan has
>> proved otherwise.]
>>
>> 3. I would much rather implement some form of tracking of whether a
>> particular register is live or not.  A trivial implementation would track
>> only TempReg, and would only track its being dead up until the first
>> assignment to TempReg.  A more sophisticated approach could deal with
>> control-flow branching and merging of the liveness info.  For now I like
>> the trivial approach: track TempReg, and only up until the first explicit
>> assignment to it.
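>>
>> As a rough sketch of the trivial scheme (hypothetical names, and in C
>> rather than the Cogit's Smalltalk, purely to illustrate the idea): clear a
>> "may be live" flag for TempReg at the start of each trampoline, set it at
>> the first write to TempReg, and let the concretizer pick the short
>> two-instruction form for a load into SPReg while the flag is still clear.
>>
>>     #include <stdbool.h>
>>
>>     typedef struct {
>>         bool tempRegMayBeLive;   /* conservatively true once TempReg has been written */
>>     } CodeGenState;
>>
>>     static void beginTrampoline(CodeGenState *s)
>>     {
>>         s->tempRegMayBeLive = false;      /* nothing is live on entry */
>>     }
>>
>>     static void noteTempRegWrite(CodeGenState *s)
>>     {
>>         s->tempRegMayBeLive = true;       /* first assignment ends the dead window */
>>     }
>>
>>     /* While TempReg (%rax) is still dead we may clobber it, so a load of
>>        an absolute address into SPReg can use the plain two-instruction
>>        form (movabsq addr, %rax; movq %rax, %rsp) instead of the xchgq
>>        dance that transiently corrupts %rsp. */
>>     static bool canClobberTempReg(const CodeGenState *s)
>>     {
>>         return !s->tempRegMayBeLive;
>>     }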
>>
>> 4. we could simply treat assignments to SPReg and FPReg specially,
>> knowing that these are done only in trampolines and enilopmarts (these are
>> the pieces of code that sit between JITted code and the rest of the
>> run-time (interpreter, memory manager, primitives, etc.), either to call
>> the run-time from JITted code or, as in this case, to enter machine code
>> from the run-time).  I suppose this is OK as long as we document it.  This
>> would allow us to generate what we would generate for #2 above, i.e.
>>
>>     movabsq 0x10027c338, %rax ; cStackPointerAddress
>>     movq %rax, %rsp
>>
>> N. are there better ways?
>>
>> Clément, Nicolas (et al), what do you think?
>>
>> On Sun, Oct 6, 2019 at 12:36 PM Eliot Miranda <eliot.miranda at gmail.com>
>> wrote:
>>
>>> Hi Sean, Hi All,
>>>>
>>>>     this may be because of the issue described here:
>>>> http://forum.world.st/Difficult-to-debug-VM-crash-with-full-blocks-and-Sista-V1-tt5103810.html
>>>>
>>>> This issue is characterized by the system crashing soon after startup
>>>> when some significant I/O is done, typically either to files or sockets.
>>>> It affects macOS only and may indeed affect only 64-bit systems.  We have
>>>> strong evidence that it is caused by the dynamic linker being invoked in
>>>> the signal handler for SIGIO when the signal is delivered while the VM is
>>>> executing JITted code.  The symptom that causes the crash is corruption of
>>>> a particular JITted method's machine code, e.g. Delay class>>#startEventLoop,
>>>> and we believe that the corruption is caused by the linker when it
>>>> misinterprets a JITted Smalltalk stack frame as an ABI-compliant stack
>>>> frame and attempts to scan code to link it.
>>>>
>>>> Our diagnosis is speculative; this is extremely hard to reproduce.
>>>> Typically in repeating a crashing run SIGIO may no longer be delivered at
>>>> the same point because any remote server has now woken up and delivers
>>>> results sooner, etc.  However, Nicolas Cellier and I are both confident
>>>> that we have correctly identified the bug.
>>>>
>>>> The fix is simple; SIGIO should be delivered on a dedicated signal
>>>> stack (see sigaltstack(2)).  I committed a fix yesterday evening and we
>>>> should see within a week or so if these crashes have disappeared.
>>>>
>>>> I encourage the Pharo VM maintainers to build and release VMs that
>>>> include
>>>> https://github.com/OpenSmalltalk/opensmalltalk-vm/commit/c24970eb2859a474065c6f69060c0324aef2b211
>>>> ASAP.
>>>>
>>>>
>>>> Cheers,
>>>> Eliot
>>>> _,,,^..^,,,_ (phone)
>>>>
>>>> On Oct 3, 2019, at 1:24 PM, Sean P. DeNigris <sean at clipperadams.com>
>>>> wrote:
>>>>
>>>> Segmentation fault Thu Oct  3 15:52:33 2019
>>>>
>>>>
>>>> VM: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
>>>> Date: Sat Jan 5 20:00:11 2019 CommitHash: 7a3c6b6
>>>> Plugins: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
>>>>
>>>> C stack backtrace & registers:
>>>>    rax 0x0000000124380000 rbx 0x00007ffeebd00050 rcx 0x0000000000468260 rdx 0x0000000000dd6800
>>>>    rdi 0x0000000124cee5a0 rsi 0x0000000124cee5a0 rbp 0x00007ffeebcffe50 rsp 0x00007ffeebcffe50
>>>>    r8  0x00007fff3f2cefe5 r9  0x0000000000000b00 r10 0x0000000000006000 r11 0xfffffffffcd8d5a0
>>>>    r12 0x0000000000000002 r13 0x0000000035800000 r14 0x00007ffeebd00064 r15 0x0000000000002800
>>>>    rip 0x00007fff630f7d09
>>>> 0   libsystem_platform.dylib            0x00007fff630f7d09 _platform_memmove$VARIANT$Haswell + 41
>>>> 1   Pharo                               0x0000000103f52642 reportStackState + 952
>>>> 2   Pharo                               0x0000000103f52987 sigsegv + 174
>>>> 3   libsystem_platform.dylib            0x00007fff630fab3d _sigtramp + 29
>>>> 4   ???                                 0x0000058900000a00 0x0 + 6085968660992
>>>> 5   libGLImage.dylib                    0x00007fff3f2ce29e glgProcessPixelsWithProcessor + 2149
>>>> 6   AMDRadeonX5000GLDriver              0x000000010db16db1 glrATIStoreLevels + 1600
>>>> 7   AMDRadeonX5000GLDriver              0x000000010db52c83 glrAMD_GFX9_LoadSysTextureStandard + 45
>>>> 8   AMDRadeonX5000GLDriver              0x000000010db519bb glrUpdateTexture + 1346
>>>> 9   libGPUSupportMercury.dylib          0x00007fff5181279d gpusLoadCurrentTextures + 591
>>>> 10  AMDRadeonX5000GLDriver              0x000000010db5a099 gldUpdateDispatch + 397
>>>> 11  GLEngine                            0x00007fff3ff72078 gleDoDrawDispatchCore + 629
>>>> 12  GLEngine                            0x00007fff3ff16369 glDrawArraysInstanced_STD_Exec + 264
>>>> 13  GLEngine                            0x00007fff3ff1625a glDrawArrays_UnpackThread + 40
>>>> 14  GLEngine                            0x00007fff3ff6dce1 gleCmdProcessor + 77
>>>> 15  libdispatch.dylib                   0x00007fff62ec2dcf _dispatch_client_callout + 8
>>>> 16  libdispatch.dylib                   0x00007fff62ecea2c _dispatch_lane_barrier_sync_invoke_and_complete + 60
>>>> 17  GLEngine                            0x00007fff3fec4b85 glFlush_ExecThread + 15
>>>> 18  Pharo                               0x0000000103f4cc62 -[sqSqueakOSXOpenGLView drawRect:flush:] + 314
>>>> 19  Pharo                               0x0000000103f4cb22 -
>>>> ...
>>>>
>>>> Smalltalk stack dump:
>>>>    0x7ffeebd14238 M DelaySemaphoreScheduler>unscheduleAtTimingPriority 0x10fab3ad0: a(n) DelaySemaphoreScheduler
>>>>    0x7ffeebd14270 M [] in DelaySemaphoreScheduler(DelayBasicScheduler)>runBackendLoopAtTimingPriority 0x10fab3ad0: a(n) DelaySemaphoreScheduler
>>>>       0x1125923f8 s BlockClosure>ensure:
>>>>       0x111e88d30 s DelaySemaphoreScheduler(DelayBasicScheduler)>runBackendLoopAtTimingPriority
>>>>       0x112590a50 s [] in DelaySemaphoreScheduler(DelayBasicScheduler)>startTimerEventLoopPriority:
>>>>       0x111e88e08 s [] in BlockClosure>newProcess
>>>>
>>>> Most recent primitives
>>>> @
>>>> actualScreenSize
>>>> millisecondClockValue
>>>> tempAt:
>>>>
>>>>
>>>>
>>>> -----
>>>> Cheers,
>>>> Sean
>>>> --
>>>> Sent from:
>>>> http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html
>>>>
>>>>
>>
>> --
>> _,,,^..^,,,_
>> best, Eliot
>>
>

-- 
_,,,^..^,,,_
best, Eliot