[Vm-dev] corruption of PC in context objects or not (?)

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Tue Sep 22 17:18:25 UTC 2020


If you are on linux, you could try using the recording debugger (rr)...
Though, debugging a -O2 or -O3 is hard... No obvious mapping of source code
with assembly instructions. It's better if you can reproduce on debug
version...

Le mar. 22 sept. 2020 à 14:21, Andrei Chis <chisvasileandrei at gmail.com> a
écrit :

>
> Hi Eliot,
>
> Unfortunately with the assert vm with memory leak checks it seems I cannot
> reproduce the problem.
> I’ll try to run it a few more times, maybe I get lucky and get a crash (a
> run takes a few hours).
>
> For our case, calling `self pc` before copying a context really helped.
> Before almost 50% of our builds were failing; now we get no more failures
> at all.
> Thanks for your insights!
>
> Are there other ways to get relevant/helpful details from a crash with a
> normal vm?
>
> Cheers,
> Andrei
>
>
> On 15 Sep 2020, at 01:27, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>
> Hi Andrei,
>
>
> On Sep 14, 2020, at 3:22 PM, Andrei Chis <chisvasileandrei at gmail.com>
> wrote:
>
> Hi Eliot,
>
> The setup in GT is a bit customised (some changes in the headless vm, some
> custom plugins, custom rendering) so I first thought it will be impossible
> to reproduce the bug in a more standard manner.
> However turns out it is possible. If I use the following script after
> running the tests a few times in lldb I get the crash starting from a plain
> Pharo 8 image.
>
> $ curl https://get.pharo.org/64/80+vm | bash
> $ curl -L
> https://raw.githubusercontent.com/feenkcom/gtoolkit/master/scripts/localbuild/loadgt.st
>  -o loadgt.st
> $ ./pharo Pharo.image st --quit  <http://loadgt.st/>
>
> $ lldb ./pharo-vm/Pharo.app/Contents/MacOS/Pharo
> (lldb) run --headless Pharo.image examples --junit-xml-output
> 'GToolkit-.*' 'GT4SmaCC-.*' 'DeepTraverser-.*' Brick 'Brick-.*' Bloc
> 'Bloc-.*' 'Sparta-.*'
>
>
> I also tried to compile the vm myself on Mac (Catalina 10.15.6). I build a
> normal and assert for https://github.com/OpenSmalltalk/opensmalltalk-vm
>  and https://github.com/pharo-project/opensmalltalk-vm from the cog
> branch.
> In both cases I get an issue related to pixman 0.34.0 [1] but that’s easy
> to workaround. For https://github.com/OpenSmalltalk/opensmalltalk-vm I
> got an extra problem related to Cairo [2] and had to change libpng
> from libpng16 to libpng12 to get it to work.
>
> With both the normal VMs I could reproduce the bug and got stacks with the
> Context>copyTo: messages.
>
> With the assert VMs I only got a crash for now with the assert vm from
> https://github.com/pharo-project/opensmalltalk-vm. However there is no
> Context>copyTo: and the memory seems quite corrupted.
> I suspect the crash also appears in
> https://github.com/OpenSmalltalk/opensmalltalk-vm  but seems that with
> the assert vm it is much harder to reproduce. Had to run the tests 20 times
> and got one crash; running the tests once take 20-30 minutes.
>
>
> This is from only crash until now with the assert vm. Not sure if they are
> helpful or not, or actually related to the problem.
>
> validInstructionPointerinFrame(GIV(instructionPointer), GIV(framePointer))
> 18471
> Pharo was compiled with optimization - stepping may behave oddly;
> variables may not be available.
> Process 73731 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS
> (code=2, address=0x157800000)
>     frame #0: 0x0000000100015837 Pharo`longAtPointerput(ptr="????",
> val=5513312480) at sqMemoryAccess.h:142:84 [opt]
>    139    static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((
> int *)ptr)); }
>    140    static inline sqInt intAtPointerput(char *ptr, int val) { return
>  (sqInt)(*((int *)ptr)= val); }
>    141    static inline sqInt longAtPointer(char *ptr) { return *(sqInt
> *)ptr; }
> -> 142    static inline sqInt longAtPointerput(char *ptr, sqInt val) {
> return *(sqInt *)ptr= val; }
>    143    static inline sqLong long64AtPointer(char *ptr) { return *(sqLong
> *)ptr; }
>    144    static inline sqLong long64AtPointerput(char *ptr, sqLong val) {
>  return *(sqLong *)ptr= val; }
>    145    static inline float singleFloatAtPointer(char *ptr) { return *(
> float *)ptr; }
> Target 0: (Pharo) stopped.
>
>
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS
> (code=2, address=0x157800000)
>   * frame #0: 0x0000000100015837 Pharo`longAtPointerput(ptr="????",
> val=5513312480) at sqMemoryAccess.h:142:84 [opt]
>     frame #1: 0x00000001000161cf Pharo`marryFrameSP(theFP=<unavailable>,
> theSP=0x0000000000000000) at gcc3x-cointerp.c:68120:3[opt]
>     frame #2: 0x000000010001f5ac Pharo`ceContextinstVar(maybeContext=5510359872,
> slotIndex=0) at gcc3x-cointerp.c:15221:12 [opt]
>     frame #3: 0x00000001480017d6
>     frame #4: 0x00000001000022be Pharo`interpret at gcc3x-cointerp.c:2755:
> 3 [opt]
>     frame #5: 0x00000001000bc244 Pharo`-[sqSqueakMainApplication
> runSqueak](self=0x0000000101c76dc0, _cmd=<unavailable>) at
> sqSqueakMainApplication.m:201:2 [opt]
>     frame #6: 0x00007fff3326729b Foundation`__NSFirePerformWithOrder + 360
>     frame #7: 0x00007fff30ad3335 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__
> + 23
>     frame #8: 0x00007fff30ad3267 CoreFoundation`__CFRunLoopDoObservers +
> 457
>     frame #9: 0x00007fff30ad2805 CoreFoundation`__CFRunLoopRun + 874
>     frame #10: 0x00007fff30ad1e3e CoreFoundation`CFRunLoopRunSpecific +
> 462
>     frame #11: 0x00007fff2f6feabd HIToolbox`RunCurrentEventLoopInMode +
> 292
>     frame #12: 0x00007fff2f6fe6f4 HIToolbox`ReceiveNextEventCommon + 359
>     frame #13: 0x00007fff2f6fe579 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter
> + 64
>     frame #14: 0x00007fff2dd44039 AppKit`_DPSNextEvent + 883
>     frame #15: 0x00007fff2dd42880 AppKit`-[NSApplication(NSEvent)
> _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 1352
>     frame #16: 0x00007fff2dd3458e AppKit`-[NSApplication run] + 658
>     frame #17: 0x00007fff2dd06396 AppKit`NSApplicationMain + 777
>     frame #18: 0x00007fff6ab3ecc9 libdyld.dylib`start + 1
>
>
> (lldb) call printCallStack()
>     0x7ffeefbe3920 M INVALID RECEIVER>(nil) 0x148716b40: a(n) bad class
>     0x7ffeefbe3968 M [] in INVALID RECEIVER>(nil)
> Context(Object)>>doesNotUnderstand: #bounds
>  0x194648118: a(n) bad class
>     0x7ffeefbe39a8 M INVALID RECEIVER>(nil) 0x1489fcec0: a(n) bad class
>     0x7ffeefbe39e8 M INVALID RECEIVER>(nil)  0x1489fcec0: a(n) bad class
>     0x7ffeefbe3a30 I INVALID RECEIVER>(nil)  0x194648118: a(n) bad class
>     0x7ffeefbe3a80 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
> class
>     0x7ffeefbe3ab8 M INVALID RECEIVER>(nil)  0x148163cd0: a(n) bad class
>     0x7ffeefbe3b08 I INVALID RECEIVER>(nil)  0x148163c18: a(n) bad class
>     0x7ffeefbe3b40 M INVALID RECEIVER>(nil)  0x148163c18: a(n) bad class
>     0x7ffeefbe3b78 M INVALID RECEIVER>(nil)  0x1481634e0: a(n) bad class
>     0x7ffeefbe3bc0 I INVALID RECEIVER>(nil)  0x148716a38: a(n) bad class
>     0x7ffeefbe3c10 I INVALID RECEIVER>(nil)  0x14d0338e8: a(n) bad class
>     0x7ffeefbe3c40 M INVALID RECEIVER>(nil)  0x14d0338e8: a(n) bad class
>     0x7ffeefbe3c78 M INVALID RECEIVER>(nil)  0x14d0338e8: a(n) bad class
>     0x7ffeefbe3cc0 M INVALID RECEIVER>(nil)  0x14d0337f0: a(n) bad class
>     0x7ffeefbe3d08 M INVALID RECEIVER>(nil)  0x14d033738: a(n) bad class
>     0x7ffeefbe3d50 M INVALID RECEIVER>(nil)  0x14d033680: a(n) bad class
>     0x7ffeefbe3d98 M INVALID RECEIVER>(nil)  0x1946493f0: a(n) bad class
>     0x7ffeefbe3de0 M INVALID RECEIVER>(nil)  0x194649338: a(n) bad class
>     0x7ffeefbe3e28 M INVALID RECEIVER>(nil)  0x194649280: a(n) bad class
>     0x7ffeefbe3e70 M INVALID RECEIVER>(nil)  0x1946491c8: a(n) bad class
>     0x7ffeefbe3eb8 M INVALID RECEIVER>(nil)  0x194649110: a(n) bad class
>     0x7ffeefbec768 M INVALID RECEIVER>(nil)  0x194649038: a(n) bad class
>     0x7ffeefbec7b0 M INVALID RECEIVER>(nil)  0x194648f60: a(n) bad class
>     0x7ffeefbec7f8 M INVALID RECEIVER>(nil)  0x194648e88: a(n) bad class
>     0x7ffeefbec840 M INVALID RECEIVER>(nil)  0x194648dd0: a(n) bad class
>     0x7ffeefbec888 M INVALID RECEIVER>(nil)  0x194648d18: a(n) bad class
>     0x7ffeefbec8d0 M INVALID RECEIVER>(nil)  0x194648c60: a(n) bad class
>     0x7ffeefbec918 M INVALID RECEIVER>(nil)  0x194648b88: a(n) bad class
>     0x7ffeefbec960 M INVALID RECEIVER>(nil)  0x194648ad0: a(n) bad class
>     0x7ffeefbec9a8 M INVALID RECEIVER>(nil)  0x194648a18: a(n) bad class
>     0x7ffeefbec9f0 M INVALID RECEIVER>(nil)  0x194648960: a(n) bad class
>     0x7ffeefbeca38 M INVALID RECEIVER>(nil)  0x1946488a8: a(n) bad class
>     0x7ffeefbeca80 M INVALID RECEIVER>(nil)  0x1946487f0: a(n) bad class
>     0x7ffeefbecac8 M INVALID RECEIVER>(nil)  0x194648708: a(n) bad class
>     0x7ffeefbecb10 M INVALID RECEIVER>(nil)  0x194648620: a(n) bad class
>     0x7ffeefbecb58 M INVALID RECEIVER>(nil)  0x194648508: a(n) bad class
>     0x7ffeefbecba0 M INVALID RECEIVER>(nil)  0x194648450: a(n) bad class
>     0x7ffeefbecbe8 M INVALID RECEIVER>(nil)  0x1481641a8: a(n) bad class
>     0x7ffeefbecc30 M INVALID RECEIVER>(nil)  0x1481640f0: a(n) bad class
>     0x7ffeefbecc78 M INVALID RECEIVER>(nil)  0x148164038: a(n) bad class
>     0x7ffeefbeccc0 M INVALID RECEIVER>(nil)  0x148163f80: a(n) bad class
>     0x7ffeefbecd08 M INVALID RECEIVER>(nil)  0x148163ec8: a(n) bad class
>     0x7ffeefbecd50 M INVALID RECEIVER>(nil)  0x148163e10: a(n) bad class
>     0x7ffeefbecd98 M INVALID RECEIVER>(nil)  0x148163d28: a(n) bad class
>     0x7ffeefbecde0 M INVALID RECEIVER>(nil)  0x148163c18: a(n) bad class
>     0x7ffeefbece28 M INVALID RECEIVER>(nil)  0x148163b38: a(n) bad class
>     0x7ffeefbece70 M INVALID RECEIVER>(nil)  0x148163a80: a(n) bad class
>     0x7ffeefbeceb8 M INVALID RECEIVER>(nil)  0x1481639c8: a(n) bad class
>     0x7ffeefbe7758 M INVALID RECEIVER>(nil)  0x1481638e0: a(n) bad class
>     0x7ffeefbe77a0 M INVALID RECEIVER>(nil)  0x148163808: a(n) bad class
>     0x7ffeefbe77e8 M INVALID RECEIVER>(nil)  0x148163750: a(n) bad class
>     0x7ffeefbe7830 M INVALID RECEIVER>(nil)  0x148163698: a(n) bad class
>     0x7ffeefbe7878 M INVALID RECEIVER>(nil)  0x1481635c0: a(n) bad class
>     0x7ffeefbe78c0 M INVALID RECEIVER>(nil)  0x1481634e0: a(n) bad class
>     0x7ffeefbe7908 M INVALID RECEIVER>(nil)  0x148163408: a(n) bad class
>     0x7ffeefbe7950 M INVALID RECEIVER>(nil)  0x148163350: a(n) bad class
>     0x7ffeefbe7998 M INVALID RECEIVER>(nil)  0x148163298: a(n) bad class
>     0x7ffeefbe79e0 M INVALID RECEIVER>(nil)  0x148163188: a(n) bad class
>     0x7ffeefbe7a28 M INVALID RECEIVER>(nil)  0x148163098: a(n) bad class
>     0x7ffeefbe7a70 M INVALID RECEIVER>(nil)  0x148162fa0: a(n) bad class
>     0x7ffeefbe7ab8 M INVALID RECEIVER>(nil)  0x148162ec8: a(n) bad class
>     0x7ffeefbe7b00 M INVALID RECEIVER>(nil)  0x148162e10: a(n) bad class
>     0x7ffeefbe7b48 M INVALID RECEIVER>(nil)  0x148712a08: a(n) bad class
>     0x7ffeefbe7b90 M INVALID RECEIVER>(nil)  0x148712950: a(n) bad class
>     0x7ffeefbe7bd8 M INVALID RECEIVER>(nil)  0x148712898: a(n) bad class
>     0x7ffeefbe7c20 M INVALID RECEIVER>(nil)  0x148713cc0: a(n) bad class
>     0x7ffeefbe7c68 M INVALID RECEIVER>(nil)  0x148713018: a(n) bad class
>     0x7ffeefbe7cb0 M INVALID RECEIVER>(nil)  0x148713480: a(n) bad class
>     0x7ffeefbe7cf8 M INVALID RECEIVER>(nil)  0x148713140: a(n) bad class
>     0x7ffeefbe7d40 M INVALID RECEIVER>(nil)  0x148713928: a(n) bad class
>     0x7ffeefbe7d88 M INVALID RECEIVER>(nil)  0x1487133c8: a(n) bad class
>     0x7ffeefbe7de0 I INVALID RECEIVER>(nil)  0x148713238: a(n) bad class
>     0x7ffeefbe7e28 I INVALID RECEIVER>(nil)  0x1487131f8: a(n) bad class
>     0x7ffeefbe7e70 I INVALID RECEIVER>(nil)  0x1487131f8: a(n) bad class
>     0x7ffeefbe7eb8 I INVALID RECEIVER>(nil)  0x1487123d8: a(n) bad class
>     0x7ffeefbeea68 I INVALID RECEIVER>(nil)  0x1487123d8: a(n) bad class
>     0x7ffeefbeeac8 M [] in INVALID RECEIVER>(nil)  0x1487123d8: a(n) bad
> class
>     0x7ffeefbeeb00 M INVALID RECEIVER>(nil)  0x148713108: a(n) bad class
>     0x7ffeefbeeb50 I INVALID RECEIVER>(nil)  0x148713480: a(n) bad class
>     0x7ffeefbeeb98 I INVALID RECEIVER>(nil)  0x148713480: a(n) bad class
>     0x7ffeefbeebe0 I INVALID RECEIVER>(nil)  0x1487131f8: a(n) bad class
>     0x7ffeefbeec10 M INVALID RECEIVER>(nil) 0x9=1
>     0x7ffeefbeec48 M INVALID RECEIVER>(nil)  0x1487123c8: a(n) bad class
>     0x7ffeefbeeca0 M [] in INVALID RECEIVER>(nil)  0x1487123d8: a(n) bad
> class
>     0x7ffeefbeecd0 M INVALID RECEIVER>(nil)  0x1487130d0: a(n) bad class
>     0x7ffeefbeed10 M INVALID RECEIVER>(nil)  0x1487123d8: a(n) bad class
>     0x7ffeefbeed58 M INVALID RECEIVER>(nil)  0x1487123d8: a(n) bad class
>     0x7ffeefbeedb0 I INVALID RECEIVER>(nil)  0x1487123c8: a(n) bad class
>     0x7ffeefbeedf0 I INVALID RECEIVER>(nil)  0x1487123c8: a(n) bad class
>     0x7ffeefbeee20 M [] in INVALID RECEIVER>(nil)  0x148162ce8: a(n) bad
> class
>     0x7ffeefbeee78 M INVALID RECEIVER>(nil)  0x148162df0: a(n) bad class
>     0x7ffeefbeeec0 M INVALID RECEIVER>(nil)  0x148162ce8: a(n) bad class
>     0x7ffeefbe5978 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>     0x7ffeefbe59d8 M [] in INVALID RECEIVER>(nil)  0x148162f80: a(n) bad
> class
>     0x7ffeefbe5a18 M INVALID RECEIVER>(nil)  0x148163150: a(n) bad class
>     0x7ffeefbe5a68 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>     0x7ffeefbe5aa0 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>     0x7ffeefbe5ad8 M [] in INVALID RECEIVER>(nil)  0x148162f80: a(n) bad
> class
>     0x7ffeefbe5b08 M INVALID RECEIVER>(nil)  0x1481634c0: a(n) bad class
>     0x7ffeefbe5b48 M INVALID RECEIVER>(nil)  0x14c403ca8: a(n) bad class
>     0x7ffeefbe5b88 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>     0x7ffeefbe5bc0 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>     0x7ffeefbe5bf0 M [] in INVALID RECEIVER>(nil)  0x148162f80: a(n) bad
> class
>     0x7ffeefbe5c20 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>     0x7ffeefbe5c68 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>     0x7ffeefbe5c98 M INVALID RECEIVER>(nil)  0x194656468: a(n) bad class
>     0x7ffeefbe5cd0 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
> class
>     0x7ffeefbe5d00 M INVALID RECEIVER>(nil)  0x148163bf0: a(n) bad class
>     0x7ffeefbe5d50 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
> class
>     0x7ffeefbe5d88 M INVALID RECEIVER>(nil)  0x1489fcff0: a(n) bad class
>     0x7ffeefbe5dc0 M INVALID RECEIVER>(nil)  0x1489fcff0: a(n) bad class
>     0x7ffeefbe5e00 M INVALID RECEIVER>(nil)  0x148163de0: a(n) bad class
>     0x7ffeefbe5e38 M INVALID RECEIVER>(nil)  0x194648118: a(n) bad class
>     0x7ffeefbe5e80 M INVALID RECEIVER>(nil)  0x194648118: a(n) bad class
>     0x7ffeefbe5eb8 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
> class
>     0x7ffeefbdf9d8 M INVALID RECEIVER>(nil)  0x194648430: a(n) bad class
>     0x7ffeefbdfa10 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
> class
>     0x7ffeefbdfa50 M [] in INVALID RECEIVER>(nil)  0x148a02930: a(n) bad
> class
>     0x7ffeefbdfa90 M INVALID RECEIVER>(nil)  0x1946486d8: a(n) bad class
>     0x7ffeefbdfad0 M INVALID RECEIVER>(nil)  0x148a02930: a(n) bad class
>     0x7ffeefbdfb10 M INVALID RECEIVER>(nil)  0x1946485e0: a(n) bad class
>     0x7ffeefbdfb48 M INVALID RECEIVER>(nil)  0x1489f7da8: a(n) bad class
>     0x7ffeefbdfb80 M INVALID RECEIVER>(nil)  0x148a02930: a(n) bad class
>     0x7ffeefbdfbb8 M INVALID RECEIVER>(nil)  0x194648118: a(n) bad class
>     0x7ffeefbdfbe8 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
> class
>     0x7ffeefbdfc20 M INVALID RECEIVER>(nil)  0x1489fcff0: a(n) bad class
>     0x7ffeefbdfc58 M INVALID RECEIVER>(nil)  0x1489fcff0: a(n) bad class
>     0x7ffeefbdfc98 M INVALID RECEIVER>(nil)  0x194648c40: a(n) bad class
>     0x7ffeefbdfcc8 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
> class
>     0x7ffeefbdfd08 M INVALID RECEIVER>(nil)  0x194648f40: a(n) bad class
>     0x7ffeefbdfd40 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
> class
>     0x7ffeefbdfd70 M INVALID RECEIVER>(nil)  0x14902a730: a(n) bad class
>     0x7ffeefbdfdb0 M INVALID RECEIVER>(nil)  0x194648118: a(n) bad class
>     0x7ffeefbdfde8 M INVALID RECEIVER>(nil)  0x14c4033c0: a(n) bad class
>     0x7ffeefbdfe20 M [] in INVALID RECEIVER>(nil)  0x14c4033c0: a(n) bad
> class
>     0x7ffeefbdfe70 M [] in INVALID RECEIVER>(nil)  0x14d032f98: a(n) bad
> class
>     0x7ffeefbdfeb8 M INVALID RECEIVER>(nil)  0x14d032fe0: a(n) bad class
>
> (callerContextOrNil == (nilObject())) || (isContext(callerContextOrNil))
> 72783
>        0x14d033738 is not a context
>
>
> OK, interesting.  Both the assert failure and the badly corrupted stack
> trace lead me to believe that the issue happens long before the crash and
> is probably a stack corruption, either by a primitive cutting back the
> stack incorrectly, or some other hot riot ion (for example are all those
> nils in INVALID RECEIVER>(nil) real or an artifact of attempting to print
> an invalid value?).
>
> So the next step is to run the asset vm with leak checking turned on.  Use
>
> myvm —leakcheck 3 to check after every GC
>
> We can add, eg leak checking after an FFI call, in an afternoon
>
> A more realistic setup would be to run GT with an assert headless vm. But
> until now I did not figure out how to build an assert vm for the
> gt-headless branch from https://github.com/feenkcom/opensmalltalk-vm.
>
> Cheers,
> Andrei
>
>
> [1] https://github.com/OpenSmalltalk/opensmalltalk-vm/issues/258
>
> [2] checking for cairo's PNG functions feature...
> configure: WARNING: Could not find libpng in the pkg-config search path
> checking whether cairo's PNG functions feature could be enabled... no
> configure: error: recommended PNG functions feature could not be enabled
>
> On 14 Sep 2020, at 17:32, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>
> Hi Andrei,
>
>
> On Sep 14, 2020, at 7:15 AM, Andrei Chis <chisvasileandrei at gmail.com>
> wrote:
>
> Hi Eliot,
>
> On 12 Sep 2020, at 01:42, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>
> Hi Andrei,
>
> On Fri, Sep 11, 2020 at 11:48 AM Andrei Chis <chisvasileandrei at gmail.com>
> wrote:
>
>>
>> Hi Eliot,
>>
>> Thanks for the answer. That helps to understand what is going on and it
>> can explain why just adding a call to `self pc` makes the crash disappear.
>>
>> Just what was maybe not obvious in my previous email is that we get this
>> problem more or less randomly. We have tests for verifying that tools work
>> when various extensions raise exceptions (these tests copy the stack).
>> Sometimes they work correctly and sometimes they crash. These crashes
>> happen in various tests and until now the only common thing we noticed is
>> that the pc of the contexts where the crash happens looks off. Also the
>> contexts in which this happens are at the beginning of the stack so part of
>> a long computation (it gets copied multiple times).
>>
>> Initially we suspected that there is some memory corruption somewhere due
>> to external calls/memory. Just the fact that calling `self pc` before seems
>> to fix the issue reduces those chances. But who knows.
>>
>
> Well, it does look like a VM bug.  The VM is somehow failing to intercept
> some access, perhaps in shallow copy.  Weird.  I shall try and reproduce.
> Is there anything special about the process you copy using copyTo: ?
>
>
> I don’t think there is something special about that process. It is the
> process that we start to run tests [1]. The exception happens in the
> running process and the crash is when copying the stack of that running
> process.
>
>
> Ok, cool.  What I’d like to do is get a copy of your test setup and run it
> in an assert vm to try and get more information.  AFAICT the vm code is
> good do the bug is not obvious.  An assert vm may give more information
> before the crash.  Have you tried running the system on an assert vm yet?
>
> Checked some previous logs and we get these kinds of crashes on the CI
> server since at least two years. So it does not look like a new bug (but
> who knows).
>
>
> (see below)
>
> On Fri, Sep 11, 2020 at 6:36 PM Eliot Miranda <eliot.miranda at gmail.com>
>> wrote:
>>
>>>
>>> Hi Andrei,
>>>
>>> On Fri, Sep 11, 2020 at 8:58 AM Andrei Chis <chisvasileandrei at gmail.com>
>>> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> We are getting often crashes on our CI when calling `Context>copyTo:`
>>>> in a GT image and a vm build from
>>>> https://github.com/feenkcom/opensmalltalk-vm.
>>>>
>>>> To sum up during `Context>copyTo:`, `Object>>#copy` is called on a
>>>> context leading to a segmentation fault crash. Looking at that context in
>>>> lldb the pc looks off.  It has the value `0xfffffffffea7f6e1`.
>>>>
>>>>  (lldb) call (void *) printOop(0x1206b6990)
>>>>     0x1206b6990: a(n) Context
>>>>      0x1206b6a48 0xfffffffffea7f6e1                0x9        0x1146b2e08        0x1206b6b00
>>>>      0x1206b6b28        0x1206b6b50
>>>>
>>>>
>>>> Can this indicate some corruption or is it expected to have such
>>>> values? `CoInterpreter>>ensureContextHasBytecodePC:` has code that also
>>>> handles negative values for the pc which suggests that this might be
>>>> expected.
>>>>
>>>
>>> The issue is that that value is expected *inside* the VM.  It is the
>>> frame pointer for the context.  But above the Vm this value should be
>>> hidden. The VM should intercept all accesses to such fields in contexts and
>>> automatically map them back to the appropriate values that the image
>>> expects to see.  [The same thing is true for CompiledMethods; inside the VM
>>> methods may refer to their JITted code, but this is invisible from the
>>> image].  Intercepting access to Context state already happens with inst var
>>> access in methods, with the shallowCopy primitive, with instVarAt: et al,
>>> etc.
>>>
>>> So I expect the issue here is that copyTo: invokes some primitive which
>>> does not (yet) check for a context receiver and/or argument, and hence
>>> accidentally it reveals the hidden state to the image and a crash results.
>>> What I need to know are the definitions for copyTo: and copy, etc all the
>>> way down to primitives.
>>>
>>
>> Here is the source code:
>>
>
> Cool, nothing unusual here.  This should all work perfectly.  Tis a VM
> bug. However...
>
>
>> Context >> copyTo: aContext
>> "Copy self and my sender chain down to, but not including, aContext.  End
>> of copied chain will have nil sender."
>>     | copy |
>>     self == aContext ifTrue: [^ nil].
>>     copy := self copy.
>>     self sender ifNotNil: [
>>         copy privSender: (self sender copyTo: aContext)].
>>     ^ copy
>>
>
> Let me suggest
>
> Context >> copyTo: aContext
>    "Copy self and my sender chain down to, but not including, aContext.
> End of copied chain will have nil sender."
>     | copy |
>     self == aContext ifTrue: [^ nil].
>     copy := self copy.
>     self sender ifNotNil:
>         [:mySender| copy privSender: (mySender copyTo: aContext)].
>     ^ copy
>
>
> Nice!
>
> I also tried the non-recursive implementation of Context>>#copyTo: from
> Squeak and it also crashes.
>
> Not sure if related but now in the same image as before I got a different
> crash and printing the stack does not work. But this time the error seems
> to come from handleStackOverflow
>
> (lldb) call (void *)printCallStack()
> invalid frame pointer
> invalid frame pointer
> invalid frame pointer
> error: Execution was interrupted, reason: EXC_BAD_ACCESS
> (code=EXC_I386_GPFLT).
> The process has been returned to the state before expression evaluation.
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS
> (code=2, address=0x121e00000)
>   * frame #0: 0x0000000100162258 libGlamorousToolkitVMCore.dylib`marryFrameSP
> + 584
>     frame #1: 0x0000000100172982 libGlamorousToolkitVMCore.dylib`handleStackOverflow
> + 354
>     frame #2: 0x000000010016b025 libGlamorousToolkitVMCore.dylib`ceStackOverflow
> + 149
>     frame #3: 0x00000001100005b3
>     frame #4: 0x0000000100174d99 libGlamorousToolkitVMCore.dylib`ptEnterInterpreterFromCallback
> + 73
>
>
> Cheers,
> Andrei
>
> [1] ./GlamorousToolkit.app/Contents/MacOS/GlamorousToolkit  Pharo.image
> examples --junit-xml-output 'GToolkit-.*' 'GT4SmaCC-.*' 'DeepTraverser-.*'
> Brick 'Brick-.*' Bloc 'Bloc-.*' 'Sparta-.*'
>
>
>
> Object>>#copy
>>      ^self shallowCopy postCopy
>>
>> Object >> shallowCopy
>>     | class newObject index |
>>     <primitive: 148>
>>     class := self class.
>>     class isVariable
>>         ifTrue:
>>             [index := self basicSize.
>>             newObject := class basicNew: index.
>>             [index > 0]
>>                 whileTrue:
>>                     [newObject basicAt: index put: (self basicAt: index).
>>                     index := index - 1]]
>>         ifFalse: [newObject := class basicNew].
>>     index := class instSize.
>>     [index > 0]
>>         whileTrue:
>>             [newObject instVarAt: index put: (self instVarAt: index).
>>             index := index - 1].
>>     ^ newObject
>>
>> The code of the primitiveClone looks the same [1]
>>
>>
>>> Changing `Context>copyTo:` by adding a `self pc` before calling `self
>>>> copy` leads to no more crashes. Not sure if there is a reason for that or
>>>> just plain luck.
>>>>
>>>> A simple reduced stack is below (more details in this issue [1]). The
>>>> crash happens always with contexts reified as objects (in this case
>>>> 0x1206b6990 s [] in GtExamplesCommandLineHandler>runPackages).
>>>> Could this suggest some kind of issue in the vm when reifying contexts,
>>>> or just some other problem with memory corruption?
>>>>
>>>
>>> This looks like an oversight in some primitive.  Here for example is the
>>> implementation of the shallowCopy primitive, a.k.a. clone, and you can see
>>> where it explcitly intercepts access to a context.
>>>
>>> primitiveClone
>>> "Return a shallow copy of the receiver.
>>>  Special-case non-single contexts (because of context-to-stack mapping).
>>>  Can't fail for contexts cuz of image context instantiation code
>>> (sigh)."
>>>
>>> | rcvr newCopy |
>>> rcvr := self stackTop.
>>> (objectMemory isImmediate: rcvr)
>>> ifTrue:
>>> [newCopy := rcvr]
>>> ifFalse:
>>> [(objectMemory isContextNonImm: rcvr)
>>> ifTrue:
>>> [newCopy := self cloneContext: rcvr]
>>> ifFalse:
>>> [(argumentCount = 0
>>>   or: [(objectMemory isForwarded: rcvr) not])
>>> ifTrue: [newCopy := objectMemory clone: rcvr]
>>> ifFalse: [newCopy := 0]].
>>> newCopy = 0 ifTrue:
>>> [^self primitiveFailFor: PrimErrNoMemory]].
>>> self pop: argumentCount + 1 thenPush: newCopy
>>>
>>> But since Squeak doesn't have copyTo: I have no idea what primitive is
>>> being used.  I'm guessing 168 primitiveCopyObject, which seems to check for
>>> a Context receiver, but not for a CompiledCode receiver.  What does the
>>> primitive failure code look like?  Can you post the copyTo: implementations
>>> here please?
>>>
>>
>> The code is above. I also see Context>>#copyTo: in Squeak calling also
>> Object>>copy for contexts.
>>
>> When a crash happens we don't get the exact same error all the time. For
>> example we get most often on mac:
>>
>> Process 35690 stopped
>> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS
>> (code=EXC_I386_GPFLT)
>>     frame #0: 0x00000001100b1004
>> ->  0x1100b1004: inl    $0x4c, %eax
>>     0x1100b1006: leal   -0x5c(%rip), %eax
>>     0x1100b100c: pushq  %r8
>>     0x1100b100e: movabsq $0x1109e78e0, %r9         ; imm = 0x1109E78E0
>> Target 0: (GlamorousToolkit) stopped.
>>
>>
>> Process 29929 stopped
>> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BREAKPOINT
>> (code=EXC_I386_BPT, subcode=0x0)
>>     frame #0: 0x00000001100fe7ed
>> ->  0x1100fe7ed: int3
>>     0x1100fe7ee: int3
>>     0x1100fe7ef: int3
>>     0x1100fe7f0: int3
>> Target 0: (GlamorousToolkit) stopped.
>>
>>
>> [1]
>> https://github.com/feenkcom/opensmalltalk-vm/blob/5f7d49227c9599a35fcb93892b727c93a573482c/smalltalksrc/VMMaker/StackInterpreterPrimitives.class.st#L325
>>
>> Cheers,
>> Andrei
>>
>>
>>>
>>>  0x7ffeefbb4380 M Context(Object)>copy 0x1206b6990: a(n) Context
>>>>     0x7ffeefbb43b8 M Context>copyTo: 0x1206b6990: a(n) Context
>>>>     0x7ffeefbb4400 M Context>copyTo: 0x1206b5ae0: a(n) Context
>>>>   ...
>>>>     0x7ffeefba6078 M Context>copyTo: 0x110548b28: a(n) Context
>>>>     0x7ffeefba60d0 I Context>copyTo: 0x110548a70: a(n) Context
>>>>     0x7ffeefba6118 I MessageNotUnderstood(Exception)>freezeUpTo: 0x110548a20: a(n) MessageNotUnderstood
>>>>     0x7ffeefba6160 I MessageNotUnderstood(Exception)>freeze 0x110548a20: a(n) MessageNotUnderstood
>>>>     0x7ffeefba6190 M [] in GtExampleEvaluator>result 0x110544fb8: a(n) GtExampleEvaluator
>>>>     0x7ffeefba61c8 M BlockClosure>cull: 0x110545188: a(n) BlockClosure
>>>>     0x7ffeefba6208 M Context>evaluateSignal: 0x110548c98: a(n) Context
>>>>     0x7ffeefba6240 M Context>handleSignal: 0x110548c98: a(n) Context
>>>>     0x7ffeefba6278 M Context>handleSignal: 0x110548be0: a(n) Context
>>>>     0x7ffeefba62b0 M MessageNotUnderstood(Exception)>signal 0x110548a20: a(n) MessageNotUnderstood
>>>>     0x7ffeefba62f0 M GtDummyExamplesWithInheritanceSubclassB(Object)>doesNotUnderstand: exampleH 0x1105487d8: a(n) GtDummyExamplesWithInheritanceSubclassB
>>>>     0x7ffeefba6328 M GtExampleEvaluator>primitiveProcessExample:withEvaluationContext: 0x110544fb8: a(n) GtExampleEvaluator
>>>>  ...
>>>>     0x7ffeefbe64d0 M [] in GtExamplesHDReport class(HDReport class)>runPackages: 0x1145e41c8: a(n) GtExamplesHDReport class
>>>>     0x7ffeefbe6520 M [] in Set>collect: 0x1206b5ab0: a(n) Set
>>>>     0x7ffeefbe6568 M Array(SequenceableCollection)>do: 0x1206b5c50: a(n) Array
>>>>        0x1206b5b98 s Set>collect:
>>>>        0x1206b5ae0 s GtExamplesHDReport class(HDReport class)>runPackages:
>>>>        0x1206b6990 s [] in GtExamplesCommandLineHandler>runPackages
>>>>        0x1206b6a48 s BlockClosure>ensure:
>>>>        0x1206b6b68 s UIManager class>nonInteractiveDuring:
>>>>        0x1206b6c48 s GtExamplesCommandLineHandler>runPackages
>>>>        0x1206b6d98 s GtExamplesCommandLineHandler>activate
>>>>        0x1206b75d0 s GtExamplesCommandLineHandler class(CommandLineHandler class)>activateWith:
>>>>        0x1207d2f00 s [] in PharoCommandLineHandler(BasicCommandLineHandler)>activateSubCommand:
>>>>        0x1207e6620 s BlockClosure>on:do:
>>>>        0x1207f7ab8 s PharoCommandLineHandler(BasicCommandLineHandler)>activateSubCommand:
>>>>        0x120809d40 s PharoCommandLineHandler(BasicCommandLineHandler)>handleSubcommand
>>>>        0x12082ca60 s PharoCommandLineHandler(BasicCommandLineHandler)>handleArgument:
>>>>        0x120789938 s [] in PharoCommandLineHandler(BasicCommandLineHandler)>activate
>>>>        0x1207a83e0 s BlockClosure>on:do:
>>>>        0x1207b57a0 s [] in PharoCommandLineHandler(BasicCommandLineHandler)>activate
>>>>        0x1207bf830 s [] in BlockClosure>newProcess
>>>>
>>>> Cheers,
>>>> Andrei
>>>>
>>>>
>>>> [1] https://github.com/feenkcom/gtoolkit/issues/1440
>>>>
>>>>
>>>
>>> --
>>> _,,,^..^,,,_
>>> best, Eliot
>>>
>>
>
> --
> _,,,^..^,,,_
> best, Eliot
>
>
> _,,,^..^,,,_ (phone)
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20200922/59289c3b/attachment-0001.html>


More information about the Vm-dev mailing list