[Vm-dev] Fixing crashes on delivery of SIGIO etc [Was Re: [Pharo-dev] Seg Fault Pharo 7.0.3, Was Difficult to debug VM crash with full blocks and Sista V1]
Nicolas Cellier
nicolas.cellier.aka.nice at gmail.com
Wed Oct 9 20:59:53 UTC 2019
Hi Eliot,
Yesterday I thought I would opt for 4) because it's TSTTCPW.
Of course it's very hackish and trampoline specific.
But loading stack and frame pointer is very specific.
Le mer. 9 oct. 2019 à 22:10, Eliot Miranda <eliot.miranda at gmail.com> a
écrit :
>
> Hi Nicolas, Clément, et al,
>
> On Tue, Oct 8, 2019 at 1:40 PM Nicolas Cellier <
> nicolas.cellier.aka.nice at gmail.com> wrote:
>
>> More on the problem that Eliot is speaking about: it can happen in these
>> conditions:
>> if is SIGIO is delivered while executing any trampoline transiting from
>> SmalltalkToCStackSwitch.
>> What happens here?
>>
>> We must load 64bits contents of memory (cStackPointerAddress) to the
>> stack pointer register (SPReg = $rsp) - See genLoadCStackPointer.
>> This is done by using the CogRTLOpcode abstract instruction MoveAwR
>> But we have no matching instruction in IA-32 X64... We can only load a
>> 64bits memory content in $rax
>> So the idea is to generate this sequence
>> (CogX64Compiler>>concretizeMoveAwR):
>> xchgq %rsp, %rax
>> movabsq 0x10027c338, %rax ; cStackPointerAddress
>> xchgq %rsp, %rax
>> That's clever because it preserves $rax which could be in use when we
>> want to MoveAwR.
>>
>
> Clever, but in the case of %rsp and %rbp, dead wrong :blush:. The
> constraint is that x86_64 only provides a 64-bit load from an absolute
> address into %rax; no other register can be used.
>
>
>> But it has an unfortunate side effect: the stack pointer temporarily gets
>> the contents of $rax and can thus temporarily point anywhere.
>>
>
> Right; which is unacceptable. It's a tiny window, but one we hit all the
> time.
>
>
>> What happens when performing Squeak SocketTest is that we use some
>> trampolines (I guess for invoking primitives for example) and we generate
>> SIGIO (for some reasons, there are a lot of SIGIO generated on my
>> particular macos machine, so I can trigger the bug more easily than Eliot).
>> We previously installed a handler for SIGIO via signal(). When we use
>> signal(), the handler shares the stack pointer with user program.
>> If the event is delivered in between the two xchgq instructions above,
>> the signal handler will then use a corrupted stack pointer pointing
>> anywhere (depending no contents of $rax) when the VM enter the signal
>> handler function, it uses stack pointer to save some states, and corrupt a
>> memory zone, segfault or whatever.
>> In the case described by the opensmalltalk vm-dev thread, $rax was
>> pointing to the generated code zone (jitted methodZone), so we corrupted
>> the generated code and soon get punished for that. But it's probable that
>> there might be other (rare) occurrences of this bug.
>>
>> Not sure if it is causing the bugs described by Sean, but it's important
>> to use the fix from Eliot ASAP and retry.
>> There might be other occurrence of signal(SIGIO,forceInterruptCheck) in
>> minheadless flavour, I did not check if Eliot also corrected it, if not it
>> should also be corrected ASAP, as should every usage of signal() be
>> replaced by ussage of sigaction() with appropriate flags to use
>> sigaltstack() - see Eliot's commit details.
>>
>
> I want to discuss potential fixes with Clément and Nicolas, and anyone
> else interested. So...
>
> 1. the straight forward fix is to generate different code for setting %rsp
> and %rbp. I shall do this very soon. On x86_64 we dedicate a register for
> code generation purposes, this is called RISCTempReg, and is either %r8
> (SysV ABI (unix)) or %r11 (Windows). RISCTempReg is never assumed to be
> live except within a single instruction sequence. We can swap with this
> before assigning to %rsp, e.g.
>
> xchgq %r8, %rax
> movabsq 0x10027c338, %rax ; cStackPointerAddress
> xchgq %r8, %rax
> movq %r8, %rsp
>
> this adds a couple of bytes, but is reliable.
>
> 2. one could imagine exchanging TempReg and RISCTempReg, i.e. TempReg
> would be either %r8 or %r11, and RISCTemptReg would be %rax. That would
> allow
>
> movabsq 0x10027c338, %rax ; cStackPointerAddress
> movq %rax, %rsp
>
> this at least worked on Ryan's MIPS32 back end when it was in use (no one
> has tested it in as while AFAIA) where TempReg is S5 (r21), RISCTempReg is
> AT (r1) and CResultRegister is V0 (r2). So this is worth investigating.
> [I had tried something similar with HPS and it completely broke the code
> generator so my gut is trying to tell me this will never work; Ryan has
> proved otherwise].
>
> 3. I would much rather implement some form of tracking whether a
> particular register is live or not. A trivial implementation would only
> track TempReg and only track being not live up until the first assignment
> to TempReg. A more sophisticated approach could deal with control flow
> branching and merging of the liveness info. I like the trivial approach
> for now. That would simply track TempReg and only up until the first
> explicit assignment to TempReg.
>
> 4. we could simply treat assignments to SPReg and FPReg specially, knowing
> that these are done only in trampolines and enilopmarts (these are the
> pieces of code that sit b between JITted code and the rest of the run-time
> (interpreter, memory manager, primitives, etc), either to call the runtime
> from JITted code or, as in this case, to enter machine code from the
> run-time. I suppose this is OK as long as we document it. This would
> allow us to generate what we would generate for #2 above, i.e.
>
> movabsq 0x10027c338, %rax ; cStackPointerAddress
> movq %rax, %rsp
>
> N. are there better ways?
>
> Clément, Nicolas (et al), what do you think?
>
> Le dim. 6 oct. 2019 à 12:36, Eliot Miranda <eliot.miranda at gmail.com> a
> écrit :
>
>> Hi Sean, Hi All,
>>>
>>> this may be because of the issue described here:
>>> http://forum.world.st/Difficult-to-debug-VM-crash-with-full-blocks-and-Sista-V1-tt5103810.html
>>>
>>> This issue is characterized by the system crashing soon after start up
>>> when some significant i/o is done, typically either to files or sockets.
>>> It affects macOS only and may indeed affect only 64-bits. We have strong
>>> evidence that it is caused by the dynamic linker being invoked in the
>>> signal handler for SIGIO when the signal is delivered while the VM is
>>> executing JITted code. The symptom that causes the crash is corruption of
>>> a particular jitted method’s machine code, eg Delay class>>#startEventLoop,
>>> and we believe that the corruption is caused by the linker when it
>>> misinterprets a jitted Smalltalk stack frame as an ABI-compliant stack
>>> frame and attempts to scan code to link it.
>>>
>>> Our diagnosis is speculative; this is extremely hard to reproduce.
>>> Typically in repeating a crashing run SIGIO may no longer be delivered at
>>> the same point because any remote server has now woken up and delivers
>>> results sooner, etc. However, Nicolas Cellier and I are both confident
>>> that we have correctly identified the bug.
>>>
>>> The fix is simple; SIGIO should be delivered on a dedicated signal stack
>>> (see sigaltstack(2)). I committed a fix yesterday evening and we should
>>> see within a week or so if these crashes have disappeared.
>>>
>>> I encourage the Pharo vm maintainers to build and release vms that
>>> include
>>> https://github.com/OpenSmalltalk/opensmalltalk-vm/commit/c24970eb2859a474065c6f69060c0324aef2b211
>>> asap.
>>>
>>>
>>> Cheers,
>>> Eliot
>>> _,,,^..^,,,_ (phone)
>>>
>>> On Oct 3, 2019, at 1:24 PM, Sean P. DeNigris <sean at clipperadams.com>
>>> wrote:
>>>
>>> Segmentation fault Thu Oct 3 15:52:33 2019
>>>
>>>
>>> VM: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
>>> Date: Sat Jan 5 20:00:11 2019 CommitHash: 7a3c6b6
>>> Plugins: 201901051900
>>> https://github.com/OpenSmalltalk/opensmalltalk-vm.git
>>>
>>> C stack backtrace & registers:
>>> rax 0x0000000124380000 rbx 0x00007ffeebd00050 rcx 0x0000000000468260
>>> rdx
>>> 0x0000000000dd6800
>>> rdi 0x0000000124cee5a0 rsi 0x0000000124cee5a0 rbp 0x00007ffeebcffe50
>>> rsp
>>> 0x00007ffeebcffe50
>>> r8 0x00007fff3f2cefe5 r9 0x0000000000000b00 r10 0x0000000000006000
>>> r11
>>> 0xfffffffffcd8d5a0
>>> r12 0x0000000000000002 r13 0x0000000035800000 r14 0x00007ffeebd00064
>>> r15
>>> 0x0000000000002800
>>> rip 0x00007fff630f7d09
>>> 0 libsystem_platform.dylib 0x00007fff630f7d09
>>> _platform_memmove$VARIANT$Haswell + 41
>>> 1 Pharo 0x0000000103f52642
>>> reportStackState
>>> + 952
>>> 2 Pharo 0x0000000103f52987 sigsegv + 174
>>> 3 libsystem_platform.dylib 0x00007fff630fab3d _sigtramp + 29
>>> 4 ??? 0x0000058900000a00 0x0 +
>>> 6085968660992
>>> 5 libGLImage.dylib 0x00007fff3f2ce29e
>>> glgProcessPixelsWithProcessor + 2149
>>> 6 AMDRadeonX5000GLDriver 0x000000010db16db1
>>> glrATIStoreLevels
>>> + 1600
>>> 7 AMDRadeonX5000GLDriver 0x000000010db52c83
>>> glrAMD_GFX9_LoadSysTextureStandard + 45
>>> 8 AMDRadeonX5000GLDriver 0x000000010db519bb
>>> glrUpdateTexture
>>> + 1346
>>> 9 libGPUSupportMercury.dylib 0x00007fff5181279d
>>> gpusLoadCurrentTextures + 591
>>> 10 AMDRadeonX5000GLDriver 0x000000010db5a099
>>> gldUpdateDispatch
>>> + 397
>>> 11 GLEngine 0x00007fff3ff72078
>>> gleDoDrawDispatchCore + 629
>>> 12 GLEngine 0x00007fff3ff16369
>>> glDrawArraysInstanced_STD_Exec + 264
>>> 13 GLEngine 0x00007fff3ff1625a
>>> glDrawArrays_UnpackThread + 40
>>> 14 GLEngine 0x00007fff3ff6dce1
>>> gleCmdProcessor +
>>> 77
>>> 15 libdispatch.dylib 0x00007fff62ec2dcf
>>> _dispatch_client_callout + 8
>>> 16 libdispatch.dylib 0x00007fff62ecea2c
>>> _dispatch_lane_barrier_sync_invoke_and_complete + 60
>>> 17 GLEngine 0x00007fff3fec4b85
>>> glFlush_ExecThread + 15
>>> 18 Pharo 0x0000000103f4cc62
>>> -[sqSqueakOSXOpenGLView drawRect:flush:] + 314
>>> 19 Pharo 0x0000000103f4cb22 -
>>> ...
>>>
>>> Smalltalk stack dump:
>>> 0x7ffeebd14238 M DelaySemaphoreScheduler>unscheduleAtTimingPriority
>>> 0x10fab3ad0: a(n) DelaySemaphoreScheduler
>>> 0x7ffeebd14270 M [] in
>>>
>>> DelaySemaphoreScheduler(DelayBasicScheduler)>runBackendLoopAtTimingPriority
>>> 0x10fab3ad0: a(n) DelaySemaphoreScheduler
>>> 0x1125923f8 s BlockClosure>ensure:
>>> 0x111e88d30 s
>>>
>>> DelaySemaphoreScheduler(DelayBasicScheduler)>runBackendLoopAtTimingPriority
>>> 0x112590a50 s [] in
>>> DelaySemaphoreScheduler(DelayBasicScheduler)>startTimerEventLoopPriority:
>>> 0x111e88e08 s [] in BlockClosure>newProcess
>>>
>>> Most recent primitives
>>> @
>>> actualScreenSize
>>> millisecondClockValue
>>> tempAt:
>>>
>>>
>>>
>>> -----
>>> Cheers,
>>> Sean
>>> --
>>> Sent from:
>>> http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html
>>>
>>>
>
> --
> _,,,^..^,,,_
> best, Eliot
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20191009/30276040/attachment-0001.html>
More information about the Vm-dev
mailing list