[Vm-dev] corruption of PC in context objects or not (?)

Andrei Chis chisvasileandrei at gmail.com
Mon Sep 14 14:15:04 UTC 2020


Hi Eliot,

> On 12 Sep 2020, at 01:42, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> 
> Hi Andrei,
> 
> On Fri, Sep 11, 2020 at 11:48 AM Andrei Chis <chisvasileandrei at gmail.com> wrote:
>  
> Hi Eliot,
> 
> Thanks for the answer. That helps to understand what is going on and it can explain why just adding a call to `self pc` makes the crash disappear. 
> 
> Just what was maybe not obvious in my previous email is that we get this problem more or less randomly. We have tests for verifying that tools work when various extensions raise exceptions (these tests copy the stack). Sometimes they work correctly and sometimes they crash. These crashes happen in various tests and until now the only common thing we noticed is that the pc of the contexts where the crash happens looks off. Also the contexts in which this happens are at the beginning of the stack so part of a long computation (it gets copied multiple times).
> 
> Initially we suspected that there is some memory corruption somewhere due to external calls/memory. Just the fact that calling `self pc` before seems to fix the issue reduces those chances. But who knows.
> 
> Well, it does look like a VM bug.  The VM is somehow failing to intercept some access, perhaps in shallow copy.  Weird.  I shall try and reproduce.   Is there anything special about the process you copy using copyTo: ?

I don’t think there is anything special about that process. It is the process that we start to run tests [1]. The exception happens in the running process, and the crash occurs when copying the stack of that running process.

I checked some previous logs, and we have been getting these kinds of crashes on the CI server for at least two years. So it does not look like a new bug (but who knows).

> 
> (see below)
> 
> On Fri, Sep 11, 2020 at 6:36 PM Eliot Miranda <eliot.miranda at gmail.com> wrote:
>  
> Hi Andrei,
> 
> On Fri, Sep 11, 2020 at 8:58 AM Andrei Chis <chisvasileandrei at gmail.com> wrote:
>  
> Hi,
> 
> We often get crashes on our CI when calling `Context>copyTo:` in a GT image with a VM build from https://github.com/feenkcom/opensmalltalk-vm.
> 
> To sum up: during `Context>copyTo:`, `Object>>#copy` is called on a context, leading to a segmentation fault. Looking at that context in lldb, the pc looks off: it has the value `0xfffffffffea7f6e1`.
> 
>  (lldb) call (void *) printOop(0x1206b6990)
>     0x1206b6990: a(n) Context
>      0x1206b6a48 0xfffffffffea7f6e1                0x9        0x1146b2e08        0x1206b6b00 
>      0x1206b6b28        0x1206b6b50 
> 
> Can this indicate some corruption or is it expected to have such values? `CoInterpreter>>ensureContextHasBytecodePC:` has code that also handles negative values for the pc which suggests that this might be expected.
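
For reference, the raw pc slot printed above can be decoded by hand. This is only a sketch and it assumes the 64-bit Spur object format, in which a SmallInteger carries a 3-bit tag with value 1 in its low bits; the variable names are made up for illustration. Under that assumption the slot holds a negative SmallInteger rather than a pointer, which is the kind of value the negative-pc branch of `CoInterpreter>>ensureContextHasBytecodePC:` appears to deal with.

```smalltalk
"Playground sketch: decode the raw 64-bit word shown by printOop.
 Assumption: 64-bit Spur, SmallInteger tag = 1 in the low 3 bits."
| raw tag signed value |
raw := 16rFFFFFFFFFEA7F6E1.
tag := raw bitAnd: 7.                "low 3 tag bits; 1 means SmallInteger"
signed := raw - (1 bitShift: 64).    "reinterpret the word as signed two's complement"
value := (signed - tag) // 8.        "strip the tag and undo the 3-bit shift"
value                                "a negative SmallInteger: -2818340"
```

The third slot, `0x9`, decodes the same way to 1, so the layout is at least consistent with tagged SmallIntegers in those fields.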
> 
> The issue is that that value is expected *inside* the VM.  It is the frame pointer for the context.  But above the VM this value should be hidden. The VM should intercept all accesses to such fields in contexts and automatically map them back to the appropriate values that the image expects to see.  [The same thing is true for CompiledMethods; inside the VM methods may refer to their JITted code, but this is invisible from the image].  Intercepting access to Context state already happens with inst var access in methods, with the shallowCopy primitive, with instVarAt: et al, etc.
> 
> So I expect the issue here is that copyTo: invokes some primitive which does not (yet) check for a context receiver and/or argument, and hence accidentally it reveals the hidden state to the image and a crash results.  What I need to know are the definitions for copyTo: and copy, etc all the way down to primitives.
> 
> Here is the source code:
> 
> Cool, nothing unusual here.  This should all work perfectly.  Tis a VM bug. However...
>  
> Context >> copyTo: aContext 
> "Copy self and my sender chain down to, but not including, aContext.  End of copied chain will have nil sender."
>     | copy |
>     self == aContext ifTrue: [^ nil].
>     copy := self copy.
>     self sender ifNotNil: [
>         copy privSender: (self sender copyTo: aContext)].
>     ^ copy
> 
> Let me suggest
> 
> Context >> copyTo: aContext 
>    "Copy self and my sender chain down to, but not including, aContext.  End of copied chain will have nil sender."
>     | copy |
>     self == aContext ifTrue: [^ nil].
>     copy := self copy.
>     self sender ifNotNil:
>         [:mySender| copy privSender: (mySender copyTo: aContext)].
>     ^ copy 

Nice!

I also tried the non-recursive implementation of Context>>#copyTo: from Squeak and it also crashes.
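
To make clear what "non-recursive" means here, a loop-based variant could look roughly like the following. This is a sketch written for this mail, not the exact Squeak source, and the selector `sketchCopyTo:` is hypothetical; it relies only on `copy`, `sender` and `privSender:`, which the recursive version above already uses.

```smalltalk
Context >> sketchCopyTo: aContext
	"Hypothetical selector, for illustration only.
	 Copy self and my sender chain down to, but not including, aContext,
	 using a loop instead of recursion. End of copied chain has nil sender."
	| head tail next |
	self == aContext ifTrue: [ ^ nil ].
	head := tail := self copy.
	[ next := tail sender.
	  next isNil or: [ next == aContext ] ] whileFalse: [
		"Copy the next original context and link it behind the last copy."
		next := next copy.
		tail privSender: next.
		tail := next ].
	tail privSender: nil.
	^ head
```

Each context in the chain still goes through `Object>>copy`, so a variant like this would be expected to hit the same problem as the recursive one.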

Not sure if this is related, but now, in the same image as before, I got a different crash, and printing the stack does not work. This time the error seems to come from handleStackOverflow:

(lldb) call (void *)printCallStack()
invalid frame pointer
invalid frame pointer
invalid frame pointer
error: Execution was interrupted, reason: EXC_BAD_ACCESS (code=EXC_I386_GPFLT).
The process has been returned to the state before expression evaluation.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x121e00000)
  * frame #0: 0x0000000100162258 libGlamorousToolkitVMCore.dylib`marryFrameSP + 584
    frame #1: 0x0000000100172982 libGlamorousToolkitVMCore.dylib`handleStackOverflow + 354
    frame #2: 0x000000010016b025 libGlamorousToolkitVMCore.dylib`ceStackOverflow + 149
    frame #3: 0x00000001100005b3
    frame #4: 0x0000000100174d99 libGlamorousToolkitVMCore.dylib`ptEnterInterpreterFromCallback + 73


Cheers,
Andrei

[1] ./GlamorousToolkit.app/Contents/MacOS/GlamorousToolkit  Pharo.image examples --junit-xml-output 'GToolkit-.*' 'GT4SmaCC-.*' 'DeepTraverser-.*' Brick 'Brick-.*' Bloc 'Bloc-.*' 'Sparta-.*'


> 
> Object>>#copy
>      ^self shallowCopy postCopy
> 
> Object >> shallowCopy
>     | class newObject index |
>     <primitive: 148>
>     class := self class.
>     class isVariable
>         ifTrue: 
>             [index := self basicSize.
>             newObject := class basicNew: index.
>             [index > 0]
>                 whileTrue: 
>                     [newObject basicAt: index put: (self basicAt: index).
>                     index := index - 1]]
>         ifFalse: [newObject := class basicNew].
>     index := class instSize.
>     [index > 0]
>         whileTrue: 
>             [newObject instVarAt: index put: (self instVarAt: index).
>             index := index - 1].
>     ^ newObject
> 
> The code of the primitiveClone looks the same [1]
> 
> 
> Changing `Context>copyTo:` by adding a `self pc` before calling `self copy` leads to no more crashes. Not sure if there is a reason for that or just plain luck.
> 
> A simple reduced stack is below (more details in this issue [1]). The crash happens always with contexts reified as objects (in this case 0x1206b6990 s [] in GtExamplesCommandLineHandler>runPackages). 
> Could this suggest some kind of issue in the vm when reifying contexts, or just some other problem with memory corruption?
> 
> This looks like an oversight in some primitive.  Here for example is the implementation of the shallowCopy primitive, a.k.a. clone, and you can see where it explicitly intercepts access to a context.
> 
> primitiveClone
> 	"Return a shallow copy of the receiver.
> 	 Special-case non-single contexts (because of context-to-stack mapping).
> 	 Can't fail for contexts cuz of image context instantiation code (sigh)."
> 
> 	| rcvr newCopy |
> 	rcvr := self stackTop.
> 	(objectMemory isImmediate: rcvr)
> 		ifTrue:
> 			[newCopy := rcvr]
> 		ifFalse:
> 			[(objectMemory isContextNonImm: rcvr)
> 				ifTrue:
> 					[newCopy := self cloneContext: rcvr]
> 				ifFalse:
> 					[(argumentCount = 0
> 					  or: [(objectMemory isForwarded: rcvr) not])
> 						ifTrue: [newCopy := objectMemory clone: rcvr]
> 						ifFalse: [newCopy := 0]].
> 			newCopy = 0 ifTrue:
> 				[^self primitiveFailFor: PrimErrNoMemory]].
> 	self pop: argumentCount + 1 thenPush: newCopy
> 
> But since Squeak doesn't have copyTo: I have no idea what primitive is being used.  I'm guessing 168 primitiveCopyObject, which seems to check for a Context receiver, but not for a CompiledCode receiver.  What does the primitive failure code look like?  Can you post the copyTo: implementations here please?
> 
> The code is above. I also see Context>>#copyTo: in Squeak calling also Object>>copy for contexts.
> 
> When a crash happens we don't get the exact same error all the time. For example we get most often on mac:
> 
> Process 35690 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
>     frame #0: 0x00000001100b1004
> ->  0x1100b1004: inl    $0x4c, %eax
>     0x1100b1006: leal   -0x5c(%rip), %eax
>     0x1100b100c: pushq  %r8
>     0x1100b100e: movabsq $0x1109e78e0, %r9         ; imm = 0x1109E78E0 
> Target 0: (GlamorousToolkit) stopped.
> 
> 
> Process 29929 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BREAKPOINT (code=EXC_I386_BPT, subcode=0x0)
>     frame #0: 0x00000001100fe7ed
> ->  0x1100fe7ed: int3   
>     0x1100fe7ee: int3   
>     0x1100fe7ef: int3   
>     0x1100fe7f0: int3   
> Target 0: (GlamorousToolkit) stopped.
> 
> [1] https://github.com/feenkcom/opensmalltalk-vm/blob/5f7d49227c9599a35fcb93892b727c93a573482c/smalltalksrc/VMMaker/StackInterpreterPrimitives.class.st#L325
> 
> Cheers,
> Andrei
>  
> 
>  0x7ffeefbb4380 M Context(Object)>copy 0x1206b6990: a(n) Context
>     0x7ffeefbb43b8 M Context>copyTo: 0x1206b6990: a(n) Context
>     0x7ffeefbb4400 M Context>copyTo: 0x1206b5ae0: a(n) Context
>   ...
>     0x7ffeefba6078 M Context>copyTo: 0x110548b28: a(n) Context
>     0x7ffeefba60d0 I Context>copyTo: 0x110548a70: a(n) Context
>     0x7ffeefba6118 I MessageNotUnderstood(Exception)>freezeUpTo: 0x110548a20: a(n) MessageNotUnderstood
>     0x7ffeefba6160 I MessageNotUnderstood(Exception)>freeze 0x110548a20: a(n) MessageNotUnderstood
>     0x7ffeefba6190 M [] in GtExampleEvaluator>result 0x110544fb8: a(n) GtExampleEvaluator
>     0x7ffeefba61c8 M BlockClosure>cull: 0x110545188: a(n) BlockClosure
>     0x7ffeefba6208 M Context>evaluateSignal: 0x110548c98: a(n) Context
>     0x7ffeefba6240 M Context>handleSignal: 0x110548c98: a(n) Context
>     0x7ffeefba6278 M Context>handleSignal: 0x110548be0: a(n) Context
>     0x7ffeefba62b0 M MessageNotUnderstood(Exception)>signal 0x110548a20: a(n) MessageNotUnderstood
>     0x7ffeefba62f0 M GtDummyExamplesWithInheritanceSubclassB(Object)>doesNotUnderstand: exampleH 0x1105487d8: a(n) GtDummyExamplesWithInheritanceSubclassB
>     0x7ffeefba6328 M GtExampleEvaluator>primitiveProcessExample:withEvaluationContext: 0x110544fb8: a(n) GtExampleEvaluator
>  ...
>     0x7ffeefbe64d0 M [] in GtExamplesHDReport class(HDReport class)>runPackages: 0x1145e41c8: a(n) GtExamplesHDReport class
>     0x7ffeefbe6520 M [] in Set>collect: 0x1206b5ab0: a(n) Set
>     0x7ffeefbe6568 M Array(SequenceableCollection)>do: 0x1206b5c50: a(n) Array
>        0x1206b5b98 s Set>collect:
>        0x1206b5ae0 s GtExamplesHDReport class(HDReport class)>runPackages:
>        0x1206b6990 s [] in GtExamplesCommandLineHandler>runPackages
>        0x1206b6a48 s BlockClosure>ensure:
>        0x1206b6b68 s UIManager class>nonInteractiveDuring:
>        0x1206b6c48 s GtExamplesCommandLineHandler>runPackages
>        0x1206b6d98 s GtExamplesCommandLineHandler>activate
>        0x1206b75d0 s GtExamplesCommandLineHandler class(CommandLineHandler class)>activateWith:
>        0x1207d2f00 s [] in PharoCommandLineHandler(BasicCommandLineHandler)>activateSubCommand:
>        0x1207e6620 s BlockClosure>on:do:
>        0x1207f7ab8 s PharoCommandLineHandler(BasicCommandLineHandler)>activateSubCommand:
>        0x120809d40 s PharoCommandLineHandler(BasicCommandLineHandler)>handleSubcommand
>        0x12082ca60 s PharoCommandLineHandler(BasicCommandLineHandler)>handleArgument:
>        0x120789938 s [] in PharoCommandLineHandler(BasicCommandLineHandler)>activate
>        0x1207a83e0 s BlockClosure>on:do:
>        0x1207b57a0 s [] in PharoCommandLineHandler(BasicCommandLineHandler)>activate
>        0x1207bf830 s [] in BlockClosure>newProcess
> Cheers,
> Andrei
> 
> 
> [1] https://github.com/feenkcom/gtoolkit/issues/1440
> 
> 
> 
> -- 
> _,,,^..^,,,_
> best, Eliot
> 
> 
> -- 
> _,,,^..^,,,_
> best, Eliot
