[Vm-dev] corruption of PC in context objects or not (?)

Eliot Miranda eliot.miranda at gmail.com
Sat Sep 26 15:04:45 UTC 2020


Hi Andrei,

   I can confirm that it does not reproduce with the assert VM and that it
does reproduce with the production VM.

What I've found so far is that the VM is in machine code executing the
Object>>copy/shallowCopy/postCopy code on a Context, and it has very
recently done a code compaction (the most recent entry in the prim trace
log is **CodeCompaction** preceded by many shallowCopy invocations, i.e.
the VM is looping copying contexts, the shallowCopy primitive needing to
call an interpreter primitive to copy a Context, and has just done a code
compaction).

In the following I use the terms cog method and machine code method
interchangeably.  The JIT creates machine code/cog methods for bytecode
methods that are executed more than once, and the Vm spends most of its
time executing code in these machine code/cog methods.

The VM is in machine code executing an illegal instruction which is
**within the header** of an unrelated machine code method (scanLitWord in
the Squeak Parser).  The instructionPointer (the variable used to hold
instruction pointers at suspension points) is pointing just past the call
of shallowCopy in Object>>copy, consistent with the VM having invoked the
shallowCopy primitive.  However, the call instruction preceding
instructionPointer is an unlinked send, calling the lookup trampoline
ceSend0Args, whereas it should be calling a machine code version of
Object>>#shallowCopy.  I can see that there *are* machine code methods for
#copy (the Object>>copy method itself and two closed PICs for it).

So my working hypothesis is that the method zone compaction code fails to
retain the method referred to by the instructionPointer, reclaiming it
itself.  At the start of code compaction the VM must determine what set of
methods are live, which it does by scanning stack frames.  And then it
reference counts methods accessible from the live method set. And then it
discards the least marked methods until it has freed about 1/3 of machine
code. It then plans the compaction, updating all instruction pointers in
stack frames to point to the compacted methods new locations, and then it
compacts.  But because shallowCopy is a primitive its machine code method
does not occur in any stack frame (it is invoked, returns a copy without
building a frame, and returns that copy).  And there is no reference to the
shallowCopy cog method in the Object>>copy cog method, because
pre-compaction, the call for #copy will be to an Open PIC, because this
send is highly polymorphic

At this point I'm confused as to what is happening and how the native pc
ends up executing bogus code.  The instructionPointer is correctly pointing
after the send of shallowCopy.  It is regrettable that the VM reclaimed the
shallowCopy PIC and the shallowCopy method, but not fatal.  Ah, unless code
compaction is undertaken *within* the shallowCopy primitive!!  That would
break things.

Right, got it.  So the context being copied has a machine code PC, but this
PC is referring to an instruction in a machine code method that has been
long ago reclaimed.  So to create the copy of the Context with a valid
bytecode pc, the innards of the shallowCopy primitive has to JIT the
context's method to be able to compute the correct bytecode PC from the
bytecode PC.  If there is no room in the code zone for another method then
a code compaction must be done.  This code compaction occurs in the middle
of the shallowCopy primitive and because the cog method is not noted by the
compaction machinery as being live it gets reclaimed from under the
shallowCopy interpreter primitive being called from that machine code.  So
when the shallowCopy interpreter primitive returns into machine code it
lands somewhere bogus (somewhere in the header of the scanLitWord cog
method).

Great.  I understand the bug.  I have a way of provoking it (I can force a
code compaction in the middle of shallowCopy on contexts).  So I should be
able to understand and fix the bug very soon.  The only thing I don't
understand yet is why the bug doesn't occur in the assert or debug VMs.

Anyay, thank you for your patient and informative responses Andrei.  I
should have it fixed in a few days.

On Fri, Sep 25, 2020 at 6:26 AM Eliot Miranda <eliot.miranda at gmail.com>
wrote:

> Hi Andrei,
>
>     I encountered the same bug when running the full test suite in
> squeak.  It’s something to do with code compaction.  I made a lot of
> changes earlier this year to accommodate the ARMv8 code generator.  One
> thing to do them would be to use a late 2019 vm and see if the problem goes
> away.
>
> I shall be trying to track it down in the background as I gave other
> priorities.  But my experience is that once a bug is reproducible it didn’t
> take too much effort to fix it.
>
> _,,,^..^,,,_ (phone)
>
> On Sep 23, 2020, at 5:29 AM, Andrei Chis <chisvasileandrei at gmail.com>
> wrote:
>
> 
>
> On 22 Sep 2020, at 19:18, Nicolas Cellier <
> nicolas.cellier.aka.nice at gmail.com> wrote:
>
> If you are on linux, you could try using the recording debugger (rr)...
> Though, debugging a -O2 or -O3 is hard... No obvious mapping of source code
> with assembly instructions. It's better if you can reproduce on debug
> version...
>
>
> Mostly I’m on mac but might try this if all else fails.
>
>
> Le mar. 22 sept. 2020 à 14:21, Andrei Chis <chisvasileandrei at gmail.com> a
> écrit :
>
>>
>> Hi Eliot,
>>
>> Unfortunately with the assert vm with memory leak checks it seems I
>> cannot reproduce the problem.
>> I’ll try to run it a few more times, maybe I get lucky and get a crash (a
>> run takes a few hours).
>>
>> For our case, calling `self pc` before copying a context really helped.
>> Before almost 50% of our builds were failing; now we get no more failures
>> at all.
>> Thanks for your insights!
>>
>> Are there other ways to get relevant/helpful details from a crash with a
>> normal vm?
>>
>> Cheers,
>> Andrei
>>
>>
>> On 15 Sep 2020, at 01:27, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>>
>> Hi Andrei,
>>
>>
>> On Sep 14, 2020, at 3:22 PM, Andrei Chis <chisvasileandrei at gmail.com>
>> wrote:
>>
>> Hi Eliot,
>>
>> The setup in GT is a bit customised (some changes in the headless vm,
>> some custom plugins, custom rendering) so I first thought it will be
>> impossible to reproduce the bug in a more standard manner.
>> However turns out it is possible. If I use the following script after
>> running the tests a few times in lldb I get the crash starting from a plain
>> Pharo 8 image.
>>
>> $ curl https://get.pharo.org/64/80+vm | bash
>> $ curl -L
>> https://raw.githubusercontent.com/feenkcom/gtoolkit/master/scripts/localbuild/loadgt.st
>>  -o loadgt.st
>> $ ./pharo Pharo.image st --quit  <http://loadgt.st/>
>>
>> $ lldb ./pharo-vm/Pharo.app/Contents/MacOS/Pharo
>> (lldb) run --headless Pharo.image examples --junit-xml-output
>> 'GToolkit-.*' 'GT4SmaCC-.*' 'DeepTraverser-.*' Brick 'Brick-.*' Bloc
>> 'Bloc-.*' 'Sparta-.*'
>>
>>
>> I also tried to compile the vm myself on Mac (Catalina 10.15.6). I build
>> a normal and assert for https://github.com/OpenSmalltalk/opensmalltalk-vm
>>  and https://github.com/pharo-project/opensmalltalk-vm from the cog
>> branch.
>> In both cases I get an issue related to pixman 0.34.0 [1] but that’s easy
>> to workaround. For https://github.com/OpenSmalltalk/opensmalltalk-vm I
>> got an extra problem related to Cairo [2] and had to change libpng
>> from libpng16 to libpng12 to get it to work.
>>
>> With both the normal VMs I could reproduce the bug and got stacks with
>> the Context>copyTo: messages.
>>
>> With the assert VMs I only got a crash for now with the assert vm from
>> https://github.com/pharo-project/opensmalltalk-vm. However there is no
>> Context>copyTo: and the memory seems quite corrupted.
>> I suspect the crash also appears in
>> https://github.com/OpenSmalltalk/opensmalltalk-vm  but seems that with
>> the assert vm it is much harder to reproduce. Had to run the tests 20 times
>> and got one crash; running the tests once take 20-30 minutes.
>>
>>
>> This is from only crash until now with the assert vm. Not sure if they
>> are helpful or not, or actually related to the problem.
>>
>> validInstructionPointerinFrame(GIV(instructionPointer),
>> GIV(framePointer)) 18471
>> Pharo was compiled with optimization - stepping may behave oddly;
>> variables may not be available.
>> Process 73731 stopped
>> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS
>> (code=2, address=0x157800000)
>>     frame #0: 0x0000000100015837 Pharo`longAtPointerput(ptr="????",
>> val=5513312480) at sqMemoryAccess.h:142:84 [opt]
>>    139    static inline sqInt intAtPointer(char *ptr) { return
>> (sqInt)(*((int *)ptr)); }
>>    140    static inline sqInt intAtPointerput(char *ptr, int val) {
>> return (sqInt)(*((int *)ptr)= val); }
>>    141    static inline sqInt longAtPointer(char *ptr) { return *(sqInt
>> *)ptr; }
>> -> 142    static inline sqInt longAtPointerput(char *ptr, sqInt val) {
>> return *(sqInt *)ptr= val; }
>>    143    static inline sqLong long64AtPointer(char *ptr) { return *(sqLong
>> *)ptr; }
>>    144    static inline sqLong long64AtPointerput(char *ptr, sqLong val)
>> { return *(sqLong *)ptr= val; }
>>    145    static inline float singleFloatAtPointer(char *ptr) { return *(
>> float *)ptr; }
>> Target 0: (Pharo) stopped.
>>
>>
>> (lldb) bt
>> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS
>> (code=2, address=0x157800000)
>>   * frame #0: 0x0000000100015837 Pharo`longAtPointerput(ptr="????",
>> val=5513312480) at sqMemoryAccess.h:142:84 [opt]
>>     frame #1: 0x00000001000161cf Pharo`marryFrameSP(theFP=<unavailable>,
>> theSP=0x0000000000000000) at gcc3x-cointerp.c:68120:3[opt]
>>     frame #2: 0x000000010001f5ac Pharo`ceContextinstVar(maybeContext=5510359872,
>> slotIndex=0) at gcc3x-cointerp.c:15221:12 [opt]
>>     frame #3: 0x00000001480017d6
>>     frame #4: 0x00000001000022be Pharo`interpret at gcc3x-cointerp.c:2755
>> :3 [opt]
>>     frame #5: 0x00000001000bc244 Pharo`-[sqSqueakMainApplication
>> runSqueak](self=0x0000000101c76dc0, _cmd=<unavailable>) at
>> sqSqueakMainApplication.m:201:2 [opt]
>>     frame #6: 0x00007fff3326729b Foundation`__NSFirePerformWithOrder +
>> 360
>>     frame #7: 0x00007fff30ad3335 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__
>> + 23
>>     frame #8: 0x00007fff30ad3267 CoreFoundation`__CFRunLoopDoObservers +
>> 457
>>     frame #9: 0x00007fff30ad2805 CoreFoundation`__CFRunLoopRun + 874
>>     frame #10: 0x00007fff30ad1e3e CoreFoundation`CFRunLoopRunSpecific +
>> 462
>>     frame #11: 0x00007fff2f6feabd HIToolbox`RunCurrentEventLoopInMode +
>> 292
>>     frame #12: 0x00007fff2f6fe6f4 HIToolbox`ReceiveNextEventCommon + 359
>>     frame #13: 0x00007fff2f6fe579 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter
>> + 64
>>     frame #14: 0x00007fff2dd44039 AppKit`_DPSNextEvent + 883
>>     frame #15: 0x00007fff2dd42880 AppKit`-[NSApplication(NSEvent)
>> _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 1352
>>     frame #16: 0x00007fff2dd3458e AppKit`-[NSApplication run] + 658
>>     frame #17: 0x00007fff2dd06396 AppKit`NSApplicationMain + 777
>>     frame #18: 0x00007fff6ab3ecc9 libdyld.dylib`start + 1
>>
>>
>> (lldb) call printCallStack()
>>     0x7ffeefbe3920 M INVALID RECEIVER>(nil) 0x148716b40: a(n) bad class
>>     0x7ffeefbe3968 M [] in INVALID RECEIVER>(nil)
>> Context(Object)>>doesNotUnderstand: #bounds
>>  0x194648118: a(n) bad class
>>     0x7ffeefbe39a8 M INVALID RECEIVER>(nil) 0x1489fcec0: a(n) bad class
>>     0x7ffeefbe39e8 M INVALID RECEIVER>(nil)  0x1489fcec0: a(n) bad class
>>     0x7ffeefbe3a30 I INVALID RECEIVER>(nil)  0x194648118: a(n) bad class
>>     0x7ffeefbe3a80 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
>> class
>>     0x7ffeefbe3ab8 M INVALID RECEIVER>(nil)  0x148163cd0: a(n) bad class
>>     0x7ffeefbe3b08 I INVALID RECEIVER>(nil)  0x148163c18: a(n) bad class
>>     0x7ffeefbe3b40 M INVALID RECEIVER>(nil)  0x148163c18: a(n) bad class
>>     0x7ffeefbe3b78 M INVALID RECEIVER>(nil)  0x1481634e0: a(n) bad class
>>     0x7ffeefbe3bc0 I INVALID RECEIVER>(nil)  0x148716a38: a(n) bad class
>>     0x7ffeefbe3c10 I INVALID RECEIVER>(nil)  0x14d0338e8: a(n) bad class
>>     0x7ffeefbe3c40 M INVALID RECEIVER>(nil)  0x14d0338e8: a(n) bad class
>>     0x7ffeefbe3c78 M INVALID RECEIVER>(nil)  0x14d0338e8: a(n) bad class
>>     0x7ffeefbe3cc0 M INVALID RECEIVER>(nil)  0x14d0337f0: a(n) bad class
>>     0x7ffeefbe3d08 M INVALID RECEIVER>(nil)  0x14d033738: a(n) bad class
>>     0x7ffeefbe3d50 M INVALID RECEIVER>(nil)  0x14d033680: a(n) bad class
>>     0x7ffeefbe3d98 M INVALID RECEIVER>(nil)  0x1946493f0: a(n) bad class
>>     0x7ffeefbe3de0 M INVALID RECEIVER>(nil)  0x194649338: a(n) bad class
>>     0x7ffeefbe3e28 M INVALID RECEIVER>(nil)  0x194649280: a(n) bad class
>>     0x7ffeefbe3e70 M INVALID RECEIVER>(nil)  0x1946491c8: a(n) bad class
>>     0x7ffeefbe3eb8 M INVALID RECEIVER>(nil)  0x194649110: a(n) bad class
>>     0x7ffeefbec768 M INVALID RECEIVER>(nil)  0x194649038: a(n) bad class
>>     0x7ffeefbec7b0 M INVALID RECEIVER>(nil)  0x194648f60: a(n) bad class
>>     0x7ffeefbec7f8 M INVALID RECEIVER>(nil)  0x194648e88: a(n) bad class
>>     0x7ffeefbec840 M INVALID RECEIVER>(nil)  0x194648dd0: a(n) bad class
>>     0x7ffeefbec888 M INVALID RECEIVER>(nil)  0x194648d18: a(n) bad class
>>     0x7ffeefbec8d0 M INVALID RECEIVER>(nil)  0x194648c60: a(n) bad class
>>     0x7ffeefbec918 M INVALID RECEIVER>(nil)  0x194648b88: a(n) bad class
>>     0x7ffeefbec960 M INVALID RECEIVER>(nil)  0x194648ad0: a(n) bad class
>>     0x7ffeefbec9a8 M INVALID RECEIVER>(nil)  0x194648a18: a(n) bad class
>>     0x7ffeefbec9f0 M INVALID RECEIVER>(nil)  0x194648960: a(n) bad class
>>     0x7ffeefbeca38 M INVALID RECEIVER>(nil)  0x1946488a8: a(n) bad class
>>     0x7ffeefbeca80 M INVALID RECEIVER>(nil)  0x1946487f0: a(n) bad class
>>     0x7ffeefbecac8 M INVALID RECEIVER>(nil)  0x194648708: a(n) bad class
>>     0x7ffeefbecb10 M INVALID RECEIVER>(nil)  0x194648620: a(n) bad class
>>     0x7ffeefbecb58 M INVALID RECEIVER>(nil)  0x194648508: a(n) bad class
>>     0x7ffeefbecba0 M INVALID RECEIVER>(nil)  0x194648450: a(n) bad class
>>     0x7ffeefbecbe8 M INVALID RECEIVER>(nil)  0x1481641a8: a(n) bad class
>>     0x7ffeefbecc30 M INVALID RECEIVER>(nil)  0x1481640f0: a(n) bad class
>>     0x7ffeefbecc78 M INVALID RECEIVER>(nil)  0x148164038: a(n) bad class
>>     0x7ffeefbeccc0 M INVALID RECEIVER>(nil)  0x148163f80: a(n) bad class
>>     0x7ffeefbecd08 M INVALID RECEIVER>(nil)  0x148163ec8: a(n) bad class
>>     0x7ffeefbecd50 M INVALID RECEIVER>(nil)  0x148163e10: a(n) bad class
>>     0x7ffeefbecd98 M INVALID RECEIVER>(nil)  0x148163d28: a(n) bad class
>>     0x7ffeefbecde0 M INVALID RECEIVER>(nil)  0x148163c18: a(n) bad class
>>     0x7ffeefbece28 M INVALID RECEIVER>(nil)  0x148163b38: a(n) bad class
>>     0x7ffeefbece70 M INVALID RECEIVER>(nil)  0x148163a80: a(n) bad class
>>     0x7ffeefbeceb8 M INVALID RECEIVER>(nil)  0x1481639c8: a(n) bad class
>>     0x7ffeefbe7758 M INVALID RECEIVER>(nil)  0x1481638e0: a(n) bad class
>>     0x7ffeefbe77a0 M INVALID RECEIVER>(nil)  0x148163808: a(n) bad class
>>     0x7ffeefbe77e8 M INVALID RECEIVER>(nil)  0x148163750: a(n) bad class
>>     0x7ffeefbe7830 M INVALID RECEIVER>(nil)  0x148163698: a(n) bad class
>>     0x7ffeefbe7878 M INVALID RECEIVER>(nil)  0x1481635c0: a(n) bad class
>>     0x7ffeefbe78c0 M INVALID RECEIVER>(nil)  0x1481634e0: a(n) bad class
>>     0x7ffeefbe7908 M INVALID RECEIVER>(nil)  0x148163408: a(n) bad class
>>     0x7ffeefbe7950 M INVALID RECEIVER>(nil)  0x148163350: a(n) bad class
>>     0x7ffeefbe7998 M INVALID RECEIVER>(nil)  0x148163298: a(n) bad class
>>     0x7ffeefbe79e0 M INVALID RECEIVER>(nil)  0x148163188: a(n) bad class
>>     0x7ffeefbe7a28 M INVALID RECEIVER>(nil)  0x148163098: a(n) bad class
>>     0x7ffeefbe7a70 M INVALID RECEIVER>(nil)  0x148162fa0: a(n) bad class
>>     0x7ffeefbe7ab8 M INVALID RECEIVER>(nil)  0x148162ec8: a(n) bad class
>>     0x7ffeefbe7b00 M INVALID RECEIVER>(nil)  0x148162e10: a(n) bad class
>>     0x7ffeefbe7b48 M INVALID RECEIVER>(nil)  0x148712a08: a(n) bad class
>>     0x7ffeefbe7b90 M INVALID RECEIVER>(nil)  0x148712950: a(n) bad class
>>     0x7ffeefbe7bd8 M INVALID RECEIVER>(nil)  0x148712898: a(n) bad class
>>     0x7ffeefbe7c20 M INVALID RECEIVER>(nil)  0x148713cc0: a(n) bad class
>>     0x7ffeefbe7c68 M INVALID RECEIVER>(nil)  0x148713018: a(n) bad class
>>     0x7ffeefbe7cb0 M INVALID RECEIVER>(nil)  0x148713480: a(n) bad class
>>     0x7ffeefbe7cf8 M INVALID RECEIVER>(nil)  0x148713140: a(n) bad class
>>     0x7ffeefbe7d40 M INVALID RECEIVER>(nil)  0x148713928: a(n) bad class
>>     0x7ffeefbe7d88 M INVALID RECEIVER>(nil)  0x1487133c8: a(n) bad class
>>     0x7ffeefbe7de0 I INVALID RECEIVER>(nil)  0x148713238: a(n) bad class
>>     0x7ffeefbe7e28 I INVALID RECEIVER>(nil)  0x1487131f8: a(n) bad class
>>     0x7ffeefbe7e70 I INVALID RECEIVER>(nil)  0x1487131f8: a(n) bad class
>>     0x7ffeefbe7eb8 I INVALID RECEIVER>(nil)  0x1487123d8: a(n) bad class
>>     0x7ffeefbeea68 I INVALID RECEIVER>(nil)  0x1487123d8: a(n) bad class
>>     0x7ffeefbeeac8 M [] in INVALID RECEIVER>(nil)  0x1487123d8: a(n) bad
>> class
>>     0x7ffeefbeeb00 M INVALID RECEIVER>(nil)  0x148713108: a(n) bad class
>>     0x7ffeefbeeb50 I INVALID RECEIVER>(nil)  0x148713480: a(n) bad class
>>     0x7ffeefbeeb98 I INVALID RECEIVER>(nil)  0x148713480: a(n) bad class
>>     0x7ffeefbeebe0 I INVALID RECEIVER>(nil)  0x1487131f8: a(n) bad class
>>     0x7ffeefbeec10 M INVALID RECEIVER>(nil) 0x9=1
>>     0x7ffeefbeec48 M INVALID RECEIVER>(nil)  0x1487123c8: a(n) bad class
>>     0x7ffeefbeeca0 M [] in INVALID RECEIVER>(nil)  0x1487123d8: a(n) bad
>> class
>>     0x7ffeefbeecd0 M INVALID RECEIVER>(nil)  0x1487130d0: a(n) bad class
>>     0x7ffeefbeed10 M INVALID RECEIVER>(nil)  0x1487123d8: a(n) bad class
>>     0x7ffeefbeed58 M INVALID RECEIVER>(nil)  0x1487123d8: a(n) bad class
>>     0x7ffeefbeedb0 I INVALID RECEIVER>(nil)  0x1487123c8: a(n) bad class
>>     0x7ffeefbeedf0 I INVALID RECEIVER>(nil)  0x1487123c8: a(n) bad class
>>     0x7ffeefbeee20 M [] in INVALID RECEIVER>(nil)  0x148162ce8: a(n) bad
>> class
>>     0x7ffeefbeee78 M INVALID RECEIVER>(nil)  0x148162df0: a(n) bad class
>>     0x7ffeefbeeec0 M INVALID RECEIVER>(nil)  0x148162ce8: a(n) bad class
>>     0x7ffeefbe5978 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>>     0x7ffeefbe59d8 M [] in INVALID RECEIVER>(nil)  0x148162f80: a(n) bad
>> class
>>     0x7ffeefbe5a18 M INVALID RECEIVER>(nil)  0x148163150: a(n) bad class
>>     0x7ffeefbe5a68 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>>     0x7ffeefbe5aa0 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>>     0x7ffeefbe5ad8 M [] in INVALID RECEIVER>(nil)  0x148162f80: a(n) bad
>> class
>>     0x7ffeefbe5b08 M INVALID RECEIVER>(nil)  0x1481634c0: a(n) bad class
>>     0x7ffeefbe5b48 M INVALID RECEIVER>(nil)  0x14c403ca8: a(n) bad class
>>     0x7ffeefbe5b88 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>>     0x7ffeefbe5bc0 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>>     0x7ffeefbe5bf0 M [] in INVALID RECEIVER>(nil)  0x148162f80: a(n) bad
>> class
>>     0x7ffeefbe5c20 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>>     0x7ffeefbe5c68 M INVALID RECEIVER>(nil)  0x148162f80: a(n) bad class
>>     0x7ffeefbe5c98 M INVALID RECEIVER>(nil)  0x194656468: a(n) bad class
>>     0x7ffeefbe5cd0 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
>> class
>>     0x7ffeefbe5d00 M INVALID RECEIVER>(nil)  0x148163bf0: a(n) bad class
>>     0x7ffeefbe5d50 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
>> class
>>     0x7ffeefbe5d88 M INVALID RECEIVER>(nil)  0x1489fcff0: a(n) bad class
>>     0x7ffeefbe5dc0 M INVALID RECEIVER>(nil)  0x1489fcff0: a(n) bad class
>>     0x7ffeefbe5e00 M INVALID RECEIVER>(nil)  0x148163de0: a(n) bad class
>>     0x7ffeefbe5e38 M INVALID RECEIVER>(nil)  0x194648118: a(n) bad class
>>     0x7ffeefbe5e80 M INVALID RECEIVER>(nil)  0x194648118: a(n) bad class
>>     0x7ffeefbe5eb8 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
>> class
>>     0x7ffeefbdf9d8 M INVALID RECEIVER>(nil)  0x194648430: a(n) bad class
>>     0x7ffeefbdfa10 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
>> class
>>     0x7ffeefbdfa50 M [] in INVALID RECEIVER>(nil)  0x148a02930: a(n) bad
>> class
>>     0x7ffeefbdfa90 M INVALID RECEIVER>(nil)  0x1946486d8: a(n) bad class
>>     0x7ffeefbdfad0 M INVALID RECEIVER>(nil)  0x148a02930: a(n) bad class
>>     0x7ffeefbdfb10 M INVALID RECEIVER>(nil)  0x1946485e0: a(n) bad class
>>     0x7ffeefbdfb48 M INVALID RECEIVER>(nil)  0x1489f7da8: a(n) bad class
>>     0x7ffeefbdfb80 M INVALID RECEIVER>(nil)  0x148a02930: a(n) bad class
>>     0x7ffeefbdfbb8 M INVALID RECEIVER>(nil)  0x194648118: a(n) bad class
>>     0x7ffeefbdfbe8 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
>> class
>>     0x7ffeefbdfc20 M INVALID RECEIVER>(nil)  0x1489fcff0: a(n) bad class
>>     0x7ffeefbdfc58 M INVALID RECEIVER>(nil)  0x1489fcff0: a(n) bad class
>>     0x7ffeefbdfc98 M INVALID RECEIVER>(nil)  0x194648c40: a(n) bad class
>>     0x7ffeefbdfcc8 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
>> class
>>     0x7ffeefbdfd08 M INVALID RECEIVER>(nil)  0x194648f40: a(n) bad class
>>     0x7ffeefbdfd40 M [] in INVALID RECEIVER>(nil)  0x194648118: a(n) bad
>> class
>>     0x7ffeefbdfd70 M INVALID RECEIVER>(nil)  0x14902a730: a(n) bad class
>>     0x7ffeefbdfdb0 M INVALID RECEIVER>(nil)  0x194648118: a(n) bad class
>>     0x7ffeefbdfde8 M INVALID RECEIVER>(nil)  0x14c4033c0: a(n) bad class
>>     0x7ffeefbdfe20 M [] in INVALID RECEIVER>(nil)  0x14c4033c0: a(n) bad
>> class
>>     0x7ffeefbdfe70 M [] in INVALID RECEIVER>(nil)  0x14d032f98: a(n) bad
>> class
>>     0x7ffeefbdfeb8 M INVALID RECEIVER>(nil)  0x14d032fe0: a(n) bad class
>>
>> (callerContextOrNil == (nilObject())) || (isContext(callerContextOrNil))
>> 72783
>>        0x14d033738 is not a context
>>
>>
>> OK, interesting.  Both the assert failure and the badly corrupted stack
>> trace lead me to believe that the issue happens long before the crash and
>> is probably a stack corruption, either by a primitive cutting back the
>> stack incorrectly, or some other hot riot ion (for example are all those
>> nils in INVALID RECEIVER>(nil) real or an artifact of attempting to
>> print an invalid value?).
>>
>> So the next step is to run the asset vm with leak checking turned on.  Use
>>
>> myvm —leakcheck 3 to check after every GC
>>
>> We can add, eg leak checking after an FFI call, in an afternoon
>>
>> A more realistic setup would be to run GT with an assert headless vm. But
>> until now I did not figure out how to build an assert vm for the
>> gt-headless branch from https://github.com/feenkcom/opensmalltalk-vm.
>>
>> Cheers,
>> Andrei
>>
>>
>> [1] https://github.com/OpenSmalltalk/opensmalltalk-vm/issues/258
>>
>> [2] checking for cairo's PNG functions feature...
>> configure: WARNING: Could not find libpng in the pkg-config search path
>> checking whether cairo's PNG functions feature could be enabled... no
>> configure: error: recommended PNG functions feature could not be enabled
>>
>> On 14 Sep 2020, at 17:32, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>>
>> Hi Andrei,
>>
>>
>> On Sep 14, 2020, at 7:15 AM, Andrei Chis <chisvasileandrei at gmail.com>
>> wrote:
>>
>> Hi Eliot,
>>
>> On 12 Sep 2020, at 01:42, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>>
>> Hi Andrei,
>>
>> On Fri, Sep 11, 2020 at 11:48 AM Andrei Chis <chisvasileandrei at gmail.com>
>> wrote:
>>
>>>
>>> Hi Eliot,
>>>
>>> Thanks for the answer. That helps to understand what is going on and it
>>> can explain why just adding a call to `self pc` makes the crash disappear.
>>>
>>> Just what was maybe not obvious in my previous email is that we get this
>>> problem more or less randomly. We have tests for verifying that tools work
>>> when various extensions raise exceptions (these tests copy the stack).
>>> Sometimes they work correctly and sometimes they crash. These crashes
>>> happen in various tests and until now the only common thing we noticed is
>>> that the pc of the contexts where the crash happens looks off. Also the
>>> contexts in which this happens are at the beginning of the stack so part of
>>> a long computation (it gets copied multiple times).
>>>
>>> Initially we suspected that there is some memory corruption somewhere
>>> due to external calls/memory. Just the fact that calling `self pc` before
>>> seems to fix the issue reduces those chances. But who knows.
>>>
>>
>> Well, it does look like a VM bug.  The VM is somehow failing to intercept
>> some access, perhaps in shallow copy.  Weird.  I shall try and reproduce.
>> Is there anything special about the process you copy using copyTo: ?
>>
>>
>> I don’t think there is something special about that process. It is the
>> process that we start to run tests [1]. The exception happens in the
>> running process and the crash is when copying the stack of that running
>> process.
>>
>>
>> Ok, cool.  What I’d like to do is get a copy of your test setup and run
>> it in an assert vm to try and get more information.  AFAICT the vm code is
>> good do the bug is not obvious.  An assert vm may give more information
>> before the crash.  Have you tried running the system on an assert vm yet?
>>
>> Checked some previous logs and we get these kinds of crashes on the CI
>> server since at least two years. So it does not look like a new bug (but
>> who knows).
>>
>>
>> (see below)
>>
>> On Fri, Sep 11, 2020 at 6:36 PM Eliot Miranda <eliot.miranda at gmail.com>
>>> wrote:
>>>
>>>>
>>>> Hi Andrei,
>>>>
>>>> On Fri, Sep 11, 2020 at 8:58 AM Andrei Chis <chisvasileandrei at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> We are getting often crashes on our CI when calling `Context>copyTo:`
>>>>> in a GT image and a vm build from
>>>>> https://github.com/feenkcom/opensmalltalk-vm.
>>>>>
>>>>> To sum up during `Context>copyTo:`, `Object>>#copy` is called on a
>>>>> context leading to a segmentation fault crash. Looking at that context in
>>>>> lldb the pc looks off.  It has the value `0xfffffffffea7f6e1`.
>>>>>
>>>>>  (lldb) call (void *) printOop(0x1206b6990)
>>>>>     0x1206b6990: a(n) Context
>>>>>      0x1206b6a48 0xfffffffffea7f6e1                0x9        0x1146b2e08        0x1206b6b00
>>>>>      0x1206b6b28        0x1206b6b50
>>>>>
>>>>>
>>>>> Can this indicate some corruption or is it expected to have such
>>>>> values? `CoInterpreter>>ensureContextHasBytecodePC:` has code that also
>>>>> handles negative values for the pc which suggests that this might be
>>>>> expected.
>>>>>
>>>>
>>>> The issue is that that value is expected *inside* the VM.  It is the
>>>> frame pointer for the context.  But above the Vm this value should be
>>>> hidden. The VM should intercept all accesses to such fields in contexts and
>>>> automatically map them back to the appropriate values that the image
>>>> expects to see.  [The same thing is true for CompiledMethods; inside the VM
>>>> methods may refer to their JITted code, but this is invisible from the
>>>> image].  Intercepting access to Context state already happens with inst var
>>>> access in methods, with the shallowCopy primitive, with instVarAt: et al,
>>>> etc.
>>>>
>>>> So I expect the issue here is that copyTo: invokes some primitive which
>>>> does not (yet) check for a context receiver and/or argument, and hence
>>>> accidentally it reveals the hidden state to the image and a crash results.
>>>> What I need to know are the definitions for copyTo: and copy, etc all the
>>>> way down to primitives.
>>>>
>>>
>>> Here is the source code:
>>>
>>
>> Cool, nothing unusual here.  This should all work perfectly.  Tis a VM
>> bug. However...
>>
>>
>>> Context >> copyTo: aContext
>>> "Copy self and my sender chain down to, but not including, aContext.
>>> End of copied chain will have nil sender."
>>>     | copy |
>>>     self == aContext ifTrue: [^ nil].
>>>     copy := self copy.
>>>     self sender ifNotNil: [
>>>         copy privSender: (self sender copyTo: aContext)].
>>>     ^ copy
>>>
>>
>> Let me suggest
>>
>> Context >> copyTo: aContext
>>    "Copy self and my sender chain down to, but not including, aContext.
>> End of copied chain will have nil sender."
>>     | copy |
>>     self == aContext ifTrue: [^ nil].
>>     copy := self copy.
>>     self sender ifNotNil:
>>         [:mySender| copy privSender: (mySender copyTo: aContext)].
>>     ^ copy
>>
>>
>> Nice!
>>
>> I also tried the non-recursive implementation of Context>>#copyTo: from
>> Squeak and it also crashes.
>>
>> Not sure if related but now in the same image as before I got a different
>> crash and printing the stack does not work. But this time the error seems
>> to come from handleStackOverflow
>>
>> (lldb) call (void *)printCallStack()
>> invalid frame pointer
>> invalid frame pointer
>> invalid frame pointer
>> error: Execution was interrupted, reason: EXC_BAD_ACCESS
>> (code=EXC_I386_GPFLT).
>> The process has been returned to the state before expression evaluation.
>> (lldb) bt
>> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS
>> (code=2, address=0x121e00000)
>>   * frame #0: 0x0000000100162258 libGlamorousToolkitVMCore.dylib`marryFrameSP
>> + 584
>>     frame #1: 0x0000000100172982 libGlamorousToolkitVMCore.dylib`handleStackOverflow
>> + 354
>>     frame #2: 0x000000010016b025 libGlamorousToolkitVMCore.dylib`ceStackOverflow
>> + 149
>>     frame #3: 0x00000001100005b3
>>     frame #4: 0x0000000100174d99 libGlamorousToolkitVMCore.dylib`ptEnterInterpreterFromCallback
>> + 73
>>
>>
>> Cheers,
>> Andrei
>>
>> [1] ./GlamorousToolkit.app/Contents/MacOS/GlamorousToolkit  Pharo.image
>> examples --junit-xml-output 'GToolkit-.*' 'GT4SmaCC-.*' 'DeepTraverser-.*'
>> Brick 'Brick-.*' Bloc 'Bloc-.*' 'Sparta-.*'
>>
>>
>>
>> Object>>#copy
>>>      ^self shallowCopy postCopy
>>>
>>> Object >> shallowCopy
>>>     | class newObject index |
>>>     <primitive: 148>
>>>     class := self class.
>>>     class isVariable
>>>         ifTrue:
>>>             [index := self basicSize.
>>>             newObject := class basicNew: index.
>>>             [index > 0]
>>>                 whileTrue:
>>>                     [newObject basicAt: index put: (self basicAt:
>>> index).
>>>                     index := index - 1]]
>>>         ifFalse: [newObject := class basicNew].
>>>     index := class instSize.
>>>     [index > 0]
>>>         whileTrue:
>>>             [newObject instVarAt: index put: (self instVarAt: index).
>>>             index := index - 1].
>>>     ^ newObject
>>>
>>> The code of the primitiveClone looks the same [1]
>>>
>>>
>>>> Changing `Context>copyTo:` by adding a `self pc` before calling `self
>>>>> copy` leads to no more crashes. Not sure if there is a reason for that or
>>>>> just plain luck.
>>>>>
>>>>> A simple reduced stack is below (more details in this issue [1]). The
>>>>> crash happens always with contexts reified as objects (in this case
>>>>> 0x1206b6990 s [] in GtExamplesCommandLineHandler>runPackages).
>>>>> Could this suggest some kind of issue in the vm when reifying
>>>>> contexts, or just some other problem with memory corruption?
>>>>>
>>>>
>>>> This looks like an oversight in some primitive.  Here for example is
>>>> the implementation of the shallowCopy primitive, a.k.a. clone, and you can
>>>> see where it explcitly intercepts access to a context.
>>>>
>>>> primitiveClone
>>>> "Return a shallow copy of the receiver.
>>>>  Special-case non-single contexts (because of context-to-stack
>>>> mapping).
>>>>  Can't fail for contexts cuz of image context instantiation code
>>>> (sigh)."
>>>>
>>>> | rcvr newCopy |
>>>> rcvr := self stackTop.
>>>> (objectMemory isImmediate: rcvr)
>>>> ifTrue:
>>>> [newCopy := rcvr]
>>>> ifFalse:
>>>> [(objectMemory isContextNonImm: rcvr)
>>>> ifTrue:
>>>> [newCopy := self cloneContext: rcvr]
>>>> ifFalse:
>>>> [(argumentCount = 0
>>>>   or: [(objectMemory isForwarded: rcvr) not])
>>>> ifTrue: [newCopy := objectMemory clone: rcvr]
>>>> ifFalse: [newCopy := 0]].
>>>> newCopy = 0 ifTrue:
>>>> [^self primitiveFailFor: PrimErrNoMemory]].
>>>> self pop: argumentCount + 1 thenPush: newCopy
>>>>
>>>> But since Squeak doesn't have copyTo: I have no idea what primitive is
>>>> being used.  I'm guessing 168 primitiveCopyObject, which seems to check for
>>>> a Context receiver, but not for a CompiledCode receiver.  What does the
>>>> primitive failure code look like?  Can you post the copyTo: implementations
>>>> here please?
>>>>
>>>
>>> The code is above. I also see Context>>#copyTo: in Squeak calling also
>>> Object>>copy for contexts.
>>>
>>> When a crash happens we don't get the exact same error all the time. For
>>> example we get most often on mac:
>>>
>>> Process 35690 stopped
>>> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS
>>> (code=EXC_I386_GPFLT)
>>>     frame #0: 0x00000001100b1004
>>> ->  0x1100b1004: inl    $0x4c, %eax
>>>     0x1100b1006: leal   -0x5c(%rip), %eax
>>>     0x1100b100c: pushq  %r8
>>>     0x1100b100e: movabsq $0x1109e78e0, %r9         ; imm = 0x1109E78E0
>>> Target 0: (GlamorousToolkit) stopped.
>>>
>>>
>>> Process 29929 stopped
>>> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BREAKPOINT
>>> (code=EXC_I386_BPT, subcode=0x0)
>>>     frame #0: 0x00000001100fe7ed
>>> ->  0x1100fe7ed: int3
>>>     0x1100fe7ee: int3
>>>     0x1100fe7ef: int3
>>>     0x1100fe7f0: int3
>>> Target 0: (GlamorousToolkit) stopped.
>>>
>>>
>>> [1]
>>> https://github.com/feenkcom/opensmalltalk-vm/blob/5f7d49227c9599a35fcb93892b727c93a573482c/smalltalksrc/VMMaker/StackInterpreterPrimitives.class.st#L325
>>>
>>> Cheers,
>>> Andrei
>>>
>>>
>>>>
>>>>  0x7ffeefbb4380 M Context(Object)>copy 0x1206b6990: a(n) Context
>>>>>     0x7ffeefbb43b8 M Context>copyTo: 0x1206b6990: a(n) Context
>>>>>     0x7ffeefbb4400 M Context>copyTo: 0x1206b5ae0: a(n) Context
>>>>>   ...
>>>>>     0x7ffeefba6078 M Context>copyTo: 0x110548b28: a(n) Context
>>>>>     0x7ffeefba60d0 I Context>copyTo: 0x110548a70: a(n) Context
>>>>>     0x7ffeefba6118 I MessageNotUnderstood(Exception)>freezeUpTo: 0x110548a20: a(n) MessageNotUnderstood
>>>>>     0x7ffeefba6160 I MessageNotUnderstood(Exception)>freeze 0x110548a20: a(n) MessageNotUnderstood
>>>>>     0x7ffeefba6190 M [] in GtExampleEvaluator>result 0x110544fb8: a(n) GtExampleEvaluator
>>>>>     0x7ffeefba61c8 M BlockClosure>cull: 0x110545188: a(n) BlockClosure
>>>>>     0x7ffeefba6208 M Context>evaluateSignal: 0x110548c98: a(n) Context
>>>>>     0x7ffeefba6240 M Context>handleSignal: 0x110548c98: a(n) Context
>>>>>     0x7ffeefba6278 M Context>handleSignal: 0x110548be0: a(n) Context
>>>>>     0x7ffeefba62b0 M MessageNotUnderstood(Exception)>signal 0x110548a20: a(n) MessageNotUnderstood
>>>>>     0x7ffeefba62f0 M GtDummyExamplesWithInheritanceSubclassB(Object)>doesNotUnderstand: exampleH 0x1105487d8: a(n) GtDummyExamplesWithInheritanceSubclassB
>>>>>     0x7ffeefba6328 M GtExampleEvaluator>primitiveProcessExample:withEvaluationContext: 0x110544fb8: a(n) GtExampleEvaluator
>>>>>  ...
>>>>>     0x7ffeefbe64d0 M [] in GtExamplesHDReport class(HDReport class)>runPackages: 0x1145e41c8: a(n) GtExamplesHDReport class
>>>>>     0x7ffeefbe6520 M [] in Set>collect: 0x1206b5ab0: a(n) Set
>>>>>     0x7ffeefbe6568 M Array(SequenceableCollection)>do: 0x1206b5c50: a(n) Array
>>>>>        0x1206b5b98 s Set>collect:
>>>>>        0x1206b5ae0 s GtExamplesHDReport class(HDReport class)>runPackages:
>>>>>        0x1206b6990 s [] in GtExamplesCommandLineHandler>runPackages
>>>>>        0x1206b6a48 s BlockClosure>ensure:
>>>>>        0x1206b6b68 s UIManager class>nonInteractiveDuring:
>>>>>        0x1206b6c48 s GtExamplesCommandLineHandler>runPackages
>>>>>        0x1206b6d98 s GtExamplesCommandLineHandler>activate
>>>>>        0x1206b75d0 s GtExamplesCommandLineHandler class(CommandLineHandler class)>activateWith:
>>>>>        0x1207d2f00 s [] in PharoCommandLineHandler(BasicCommandLineHandler)>activateSubCommand:
>>>>>        0x1207e6620 s BlockClosure>on:do:
>>>>>        0x1207f7ab8 s PharoCommandLineHandler(BasicCommandLineHandler)>activateSubCommand:
>>>>>        0x120809d40 s PharoCommandLineHandler(BasicCommandLineHandler)>handleSubcommand
>>>>>        0x12082ca60 s PharoCommandLineHandler(BasicCommandLineHandler)>handleArgument:
>>>>>        0x120789938 s [] in PharoCommandLineHandler(BasicCommandLineHandler)>activate
>>>>>        0x1207a83e0 s BlockClosure>on:do:
>>>>>        0x1207b57a0 s [] in PharoCommandLineHandler(BasicCommandLineHandler)>activate
>>>>>        0x1207bf830 s [] in BlockClosure>newProcess
>>>>>
>>>>> Cheers,
>>>>> Andrei
>>>>>
>>>>>
>>>>> [1] https://github.com/feenkcom/gtoolkit/issues/1440
>>>>>
>>>>>
>>>>
>>>> --
>>>> _,,,^..^,,,_
>>>> best, Eliot
>>>>
>>>
>>
>> --
>> _,,,^..^,,,_
>> best, Eliot
>>
>>
>> _,,,^..^,,,_ (phone)
>>
>>
>>
>

-- 
_,,,^..^,,,_
best, Eliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20200926/cb34723b/attachment-0001.html>


More information about the Vm-dev mailing list