[Vm-dev] Fixing crashes on delivery of SIGIO etc [Was Re: [Pharo-dev] Seg Fault Pharo 7.0.3, Was Difficult to debug VM crash with full blocks and Sista V1]

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Wed Oct 9 20:59:53 UTC 2019


Hi Eliot,
Yesterday I thought I would opt for 4) because it's TSTTCPW.
Of course it's very hackish and trampoline-specific.
But then loading the stack and frame pointers is itself a very specific operation.

On Wed, Oct 9, 2019 at 22:10, Eliot Miranda <eliot.miranda at gmail.com>
wrote:

>
> Hi Nicolas, Clément, et al,
>
> On Tue, Oct 8, 2019 at 1:40 PM Nicolas Cellier <
> nicolas.cellier.aka.nice at gmail.com> wrote:
>
>> More on the problem that Eliot is describing: it can happen when SIGIO
>> is delivered while executing any trampoline transiting from
>> SmalltalkToCStackSwitch.
>> What happens here?
>>
>> We must load the 64-bit contents of memory (cStackPointerAddress) into the
>> stack pointer register (SPReg = $rsp) - see genLoadCStackPointer.
>> This is done with the CogRTLOpcode abstract instruction MoveAwR.
>> But there is no matching instruction on x86_64: a 64-bit load from an
>> absolute address can only target $rax.
>> So the idea is to generate this sequence
>> (CogX64Compiler>>concretizeMoveAwR):
>>     xchgq  %rsp, %rax
>>     movabsq 0x10027c338, %rax ; cStackPointerAddress
>>     xchgq  %rsp, %rax
>> That's clever because it preserves $rax which could be in use when we
>> want to MoveAwR.
>>
>
> Clever, but in the case of %rsp and %rbp, dead wrong :blush:.  The
> constraint is that x86_64 only provides a 64-bit load from an absolute
> address into %rax; no other register can be used.
>
>
>> But it has an unfortunate side effect: the stack pointer temporarily gets
>> the contents of $rax and can thus temporarily point anywhere.
>>
>
> Right; which is unacceptable.  It's a tiny window, but one we hit all the
> time.
>
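To make the window concrete, here is a minimal C sketch that models only
%rax and %rsp and steps through the three-instruction sequence quoted above.
It counts the instruction boundaries at which %rsp holds neither the old nor
the new stack pointer. The Regs struct and count_unsafe_boundaries are
illustrative names I made up, not VM code, and this is a toy model rather
than an emulator.

```c
#include <assert.h>
#include <stdint.h>

/* Toy register file: only the two registers touched by the sequence. */
typedef struct { uint64_t rax, rsp; } Regs;

static void xchg(uint64_t *a, uint64_t *b) { uint64_t t = *a; *a = *b; *b = t; }

/* Steps through xchgq/movabsq/xchgq and counts the instruction boundaries
 * at which rsp is neither the old nor the new stack pointer. A signal
 * delivered at such a boundary runs its handler on whatever rsp holds. */
int count_unsafe_boundaries(Regs *r, uint64_t c_stack_pointer)
{
    const uint64_t old_sp = r->rsp;
    int unsafe = 0;

    /* The model is only meaningful if rax is not already a stack address. */
    assert(r->rax != old_sp && r->rax != c_stack_pointer);

    xchg(&r->rsp, &r->rax);        /* xchgq %rsp, %rax */
    if (r->rsp != old_sp && r->rsp != c_stack_pointer) unsafe++;

    r->rax = c_stack_pointer;      /* movabsq cStackPointerAddress, %rax */
    if (r->rsp != old_sp && r->rsp != c_stack_pointer) unsafe++;

    xchg(&r->rsp, &r->rax);        /* xchgq %rsp, %rax */
    if (r->rsp != old_sp && r->rsp != c_stack_pointer) unsafe++;

    return unsafe;
}
```

In this model the sequence does end with %rsp loaded and %rax restored, but
both boundaries between the two xchgq instructions are unsafe.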
>
>> What happens when performing Squeak SocketTest is that we use some
>> trampolines (I guess for invoking primitives, for example) and we generate
>> SIGIO (for some reason, there are a lot of SIGIO generated on my
>> particular macos machine, so I can trigger the bug more easily than Eliot).
>> We previously installed a handler for SIGIO via signal(). When we use
>> signal(), the handler shares the stack pointer with the user program.
>> If the signal is delivered between the two xchgq instructions above,
>> the signal handler runs with a corrupted stack pointer that can point
>> anywhere (depending on the contents of $rax). When the VM enters the signal
>> handler function, it uses the stack pointer to save some state, and so it
>> corrupts a memory zone, segfaults, or whatever.
>> In the case described by the opensmalltalk vm-dev thread, $rax was
>> pointing to the generated code zone (jitted methodZone), so we corrupted
>> the generated code and were soon punished for that. But it's probable that
>> there are other (rare) occurrences of this bug.
>>
>> Not sure if it is causing the bugs described by Sean, but it's important
>> to use the fix from Eliot ASAP and retry.
>> There might be other occurrences of signal(SIGIO,forceInterruptCheck) in
>> the minheadless flavour. I did not check whether Eliot also corrected them;
>> if not, they should also be corrected ASAP, and every use of signal()
>> should be replaced by sigaction() with the appropriate flags to use
>> sigaltstack() - see Eliot's commit details.
>>
>
> I want to discuss potential fixes with Clément and Nicolas, and anyone
> else interested. So...
>
> 1. the straightforward fix is to generate different code for setting %rsp
> and %rbp.  I shall do this very soon.  On x86_64 we dedicate a register for
> code generation purposes, this is called RISCTempReg, and is either %r8
> (SysV ABI (unix)) or %r11 (Windows).  RISCTempReg is never assumed to be
> live except within a single instruction sequence.  We can swap with this
> before assigning to %rsp, e.g.
>
>     xchgq  %r8, %rax
>     movabsq 0x10027c338, %rax ; cStackPointerAddress
>     xchgq  %r8, %rax
>     movq %r8, %rsp
>
> this adds a couple of bytes, but is reliable.
>
> 2. one could imagine exchanging TempReg and RISCTempReg, i.e. TempReg
> would be either %r8 or %r11, and RISCTempReg would be %rax.  That would
> allow
>
>     movabsq 0x10027c338, %rax ; cStackPointerAddress
>     movq %rax, %rsp
>
> this at least worked on Ryan's MIPS32 back end when it was in use (no one
> has tested it in a while AFAIA) where TempReg is S5 (r21), RISCTempReg is
> AT (r1) and CResultRegister is V0 (r2).  So this is worth investigating.
>  [I had tried something similar with HPS and it completely broke the code
> generator so my gut is trying to tell me this will never work; Ryan has
> proved otherwise].
>
> 3. I would much rather implement some form of tracking whether a
> particular register is live or not.  A trivial implementation would only
> track TempReg and only track being not live up until the first assignment
> to TempReg.  A more sophisticated approach could deal with control flow
> branching and merging of the liveness info.  I like the trivial approach
> for now.  That would simply track TempReg and only up until the first
> explicit assignment to TempReg.
>
> 4. we could simply treat assignments to SPReg and FPReg specially, knowing
> that these are done only in trampolines and enilopmarts (these are the
> pieces of code that sit between JITted code and the rest of the run-time
> (interpreter, memory manager, primitives, etc), either to call the runtime
> from JITted code or, as in this case, to enter machine code from the
> run-time).  I suppose this is OK as long as we document it.  This would
> allow us to generate what we would generate for #2 above, i.e.
>
>     movabsq 0x10027c338, %rax ; cStackPointerAddress
>     movq %rax, %rsp
>
> N. are there better ways?
>
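To compare the sequence from 1. with the short one from 2./4., here is a
small C model that checks two properties after every instruction: %rsp only
ever holds the old or the new stack pointer, and %rax is (or is not)
preserved at the end. The Regs and Outcome types and the function names are
illustrative assumptions; %r8 plays the role of RISCTempReg as on the SysV
ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t rax, rsp, r8; } Regs;   /* r8 stands for RISCTempReg */
typedef struct { bool sp_always_valid, rax_preserved; } Outcome;

static void xchg(uint64_t *a, uint64_t *b) { uint64_t t = *a; *a = *b; *b = t; }

static bool sp_ok(const Regs *r, uint64_t old_sp, uint64_t new_sp)
{
    return r->rsp == old_sp || r->rsp == new_sp;
}

/* Option 1: go through the scratch register, assign rsp exactly once. */
Outcome load_sp_via_scratch(Regs *r, uint64_t c_sp)
{
    const uint64_t old_sp = r->rsp, old_rax = r->rax;
    Outcome o;
    bool ok = true;
    xchg(&r->r8, &r->rax);  ok = ok && sp_ok(r, old_sp, c_sp);  /* xchgq %r8, %rax */
    r->rax = c_sp;          ok = ok && sp_ok(r, old_sp, c_sp);  /* movabsq addr, %rax */
    xchg(&r->r8, &r->rax);  ok = ok && sp_ok(r, old_sp, c_sp);  /* xchgq %r8, %rax */
    r->rsp = r->r8;         ok = ok && sp_ok(r, old_sp, c_sp);  /* movq %r8, %rsp */
    o.sp_always_valid = ok;
    o.rax_preserved = (r->rax == old_rax);
    return o;
}

/* Options 2 and 4: clobber rax, assign rsp exactly once. */
Outcome load_sp_clobbering_rax(Regs *r, uint64_t c_sp)
{
    const uint64_t old_sp = r->rsp, old_rax = r->rax;
    Outcome o;
    bool ok = true;
    r->rax = c_sp;          ok = ok && sp_ok(r, old_sp, c_sp);  /* movabsq addr, %rax */
    r->rsp = r->rax;        ok = ok && sp_ok(r, old_sp, c_sp);  /* movq %rax, %rsp */
    o.sp_always_valid = ok;
    o.rax_preserved = (r->rax == old_rax);
    return o;
}
```

In this model both variants keep %rsp valid at every boundary; they differ
only in whether %rax survives, which is why 4. is acceptable only where we
know %rax is not live.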
> Clément, Nicolas (et al), what do you think?
>
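For what it's worth, the trivial form of 3. could look roughly like the
sketch below: a single flag that says whether TempReg has been assigned yet,
consulted when choosing how to load SPReg or FPReg. Everything here
(TempRegTracker, the function names, the two-valued LoadSequence) is a
hypothetical illustration in C, not Cogit code, and it assumes TempReg is
known dead at the start of the sequence being generated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which instruction sequence to emit when loading %rsp or %rbp from memory. */
typedef enum {
    LOAD_VIA_RAX_DIRECT,   /* movabsq addr, %rax ; movq %rax, %rsp */
    LOAD_VIA_SCRATCH       /* longer sequence that goes through RISCTempReg */
} LoadSequence;

/* Hypothetical tracker: TempReg is known dead until its first assignment. */
typedef struct {
    bool temp_reg_assigned;
} TempRegTracker;

/* Call at the start of each generated code sequence. */
void tracker_begin(TempRegTracker *t)
{
    assert(t != NULL);
    t->temp_reg_assigned = false;
}

/* Call whenever the generator emits an explicit assignment to TempReg. */
void tracker_note_temp_reg_assignment(TempRegTracker *t)
{
    assert(t != NULL);
    t->temp_reg_assigned = true;
}

/* After the first assignment we no longer know whether TempReg is live,
 * so conservatively assume it is and pick the sequence that preserves it. */
LoadSequence choose_load_sequence(const TempRegTracker *t)
{
    assert(t != NULL);
    return t->temp_reg_assigned ? LOAD_VIA_SCRATCH : LOAD_VIA_RAX_DIRECT;
}
```

Handling control flow would need the branching and merging of liveness info
that Eliot mentions; the flag above deliberately ignores that.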
> On Sun, Oct 6, 2019 at 12:36, Eliot Miranda <eliot.miranda at gmail.com>
> wrote:
>
>> Hi Sean, Hi All,
>>>
>>>     this may be because of the issue described here:
>>> http://forum.world.st/Difficult-to-debug-VM-crash-with-full-blocks-and-Sista-V1-tt5103810.html
>>>
>>> This issue is characterized by the system crashing soon after start up
>>> when some significant i/o is done, typically either to files or sockets.
>>> It affects macOS only and may indeed affect only 64-bits.  We have strong
>>> evidence that it is caused by the dynamic linker being invoked in the
>>> signal handler for SIGIO when the signal is delivered while the VM is
>>> executing JITted code.  The symptom that causes the crash is corruption of
>>> a particular jitted method’s machine code, eg Delay class>>#startEventLoop,
>>> and we believe that the corruption is caused by the linker when it
>>> misinterprets a jitted Smalltalk stack frame as an ABI-compliant stack
>>> frame and attempts to scan code to link it.
>>>
>>> Our diagnosis is speculative; this is extremely hard to reproduce.
>>> Typically in repeating a crashing run SIGIO may no longer be delivered at
>>> the same point because any remote server has now woken up and delivers
>>> results sooner, etc.  However, Nicolas Cellier and I are both confident
>>> that we have correctly identified the bug.
>>>
>>> The fix is simple; SIGIO should be delivered on a dedicated signal stack
>>> (see sigaltstack(2)).  I committed a fix yesterday evening and we should
>>> see within a week or so if these crashes have disappeared.
>>>
>>> I encourage the Pharo vm maintainers to build and release vms that
>>> include
>>> https://github.com/OpenSmalltalk/opensmalltalk-vm/commit/c24970eb2859a474065c6f69060c0324aef2b211
>>>  asap.
>>>
>>>
>>> Cheers,
>>> Eliot
>>> _,,,^..^,,,_ (phone)
>>>
>>> On Oct 3, 2019, at 1:24 PM, Sean P. DeNigris <sean at clipperadams.com>
>>> wrote:
>>>
>>> Segmentation fault Thu Oct  3 15:52:33 2019
>>>
>>>
>>> VM: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
>>> Date: Sat Jan 5 20:00:11 2019 CommitHash: 7a3c6b6
>>> Plugins: 201901051900
>>> https://github.com/OpenSmalltalk/opensmalltalk-vm.git
>>>
>>> C stack backtrace & registers:
>>>    rax 0x0000000124380000 rbx 0x00007ffeebd00050 rcx 0x0000000000468260 rdx 0x0000000000dd6800
>>>    rdi 0x0000000124cee5a0 rsi 0x0000000124cee5a0 rbp 0x00007ffeebcffe50 rsp 0x00007ffeebcffe50
>>>    r8  0x00007fff3f2cefe5 r9  0x0000000000000b00 r10 0x0000000000006000 r11 0xfffffffffcd8d5a0
>>>    r12 0x0000000000000002 r13 0x0000000035800000 r14 0x00007ffeebd00064 r15 0x0000000000002800
>>>    rip 0x00007fff630f7d09
>>> 0   libsystem_platform.dylib            0x00007fff630f7d09 _platform_memmove$VARIANT$Haswell + 41
>>> 1   Pharo                               0x0000000103f52642 reportStackState + 952
>>> 2   Pharo                               0x0000000103f52987 sigsegv + 174
>>> 3   libsystem_platform.dylib            0x00007fff630fab3d _sigtramp + 29
>>> 4   ???                                 0x0000058900000a00 0x0 + 6085968660992
>>> 5   libGLImage.dylib                    0x00007fff3f2ce29e glgProcessPixelsWithProcessor + 2149
>>> 6   AMDRadeonX5000GLDriver              0x000000010db16db1 glrATIStoreLevels + 1600
>>> 7   AMDRadeonX5000GLDriver              0x000000010db52c83 glrAMD_GFX9_LoadSysTextureStandard + 45
>>> 8   AMDRadeonX5000GLDriver              0x000000010db519bb glrUpdateTexture + 1346
>>> 9   libGPUSupportMercury.dylib          0x00007fff5181279d gpusLoadCurrentTextures + 591
>>> 10  AMDRadeonX5000GLDriver              0x000000010db5a099 gldUpdateDispatch + 397
>>> 11  GLEngine                            0x00007fff3ff72078 gleDoDrawDispatchCore + 629
>>> 12  GLEngine                            0x00007fff3ff16369 glDrawArraysInstanced_STD_Exec + 264
>>> 13  GLEngine                            0x00007fff3ff1625a glDrawArrays_UnpackThread + 40
>>> 14  GLEngine                            0x00007fff3ff6dce1 gleCmdProcessor + 77
>>> 15  libdispatch.dylib                   0x00007fff62ec2dcf _dispatch_client_callout + 8
>>> 16  libdispatch.dylib                   0x00007fff62ecea2c _dispatch_lane_barrier_sync_invoke_and_complete + 60
>>> 17  GLEngine                            0x00007fff3fec4b85 glFlush_ExecThread + 15
>>> 18  Pharo                               0x0000000103f4cc62 -[sqSqueakOSXOpenGLView drawRect:flush:] + 314
>>> 19  Pharo                               0x0000000103f4cb22 -
>>> ...
>>>
>>> Smalltalk stack dump:
>>>    0x7ffeebd14238 M DelaySemaphoreScheduler>unscheduleAtTimingPriority 0x10fab3ad0: a(n) DelaySemaphoreScheduler
>>>    0x7ffeebd14270 M [] in DelaySemaphoreScheduler(DelayBasicScheduler)>runBackendLoopAtTimingPriority 0x10fab3ad0: a(n) DelaySemaphoreScheduler
>>>       0x1125923f8 s BlockClosure>ensure:
>>>       0x111e88d30 s DelaySemaphoreScheduler(DelayBasicScheduler)>runBackendLoopAtTimingPriority
>>>       0x112590a50 s [] in DelaySemaphoreScheduler(DelayBasicScheduler)>startTimerEventLoopPriority:
>>>       0x111e88e08 s [] in BlockClosure>newProcess
>>>
>>> Most recent primitives
>>> @
>>> actualScreenSize
>>> millisecondClockValue
>>> tempAt:
>>>
>>>
>>>
>>> -----
>>> Cheers,
>>> Sean
>>> --
>>> Sent from:
>>> http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html
>>>
>>>
>
> --
> _,,,^..^,,,_
> best, Eliot
>