[Vm-dev] corruption of PC in context objects or not (?)

Eliot Miranda eliot.miranda at gmail.com
Fri Sep 11 23:42:47 UTC 2020


Hi Andrei,

On Fri, Sep 11, 2020 at 11:48 AM Andrei Chis <chisvasileandrei at gmail.com>
wrote:

>
> Hi Eliot,
>
> Thanks for the answer. That helps to understand what is going on and it
> can explain why just adding a call to `self pc` makes the crash disappear.
>
> What was maybe not obvious in my previous email is that we get this
> problem more or less randomly. We have tests verifying that tools work
> when various extensions raise exceptions (these tests copy the stack).
> Sometimes they pass and sometimes they crash. The crashes happen in
> various tests, and so far the only common thing we have noticed is that
> the pc of the context where the crash happens looks off. Also, the
> contexts in which this happens are at the beginning of the stack, so they
> are part of a long computation (the stack gets copied multiple times).
>
> Initially we suspected that there is some memory corruption somewhere due
> to external calls/memory. The fact that calling `self pc` beforehand seems
> to fix the issue makes that less likely. But who knows.
>

Well, it does look like a VM bug.  The VM is somehow failing to intercept
some access, perhaps in shallow copy.  Weird.  I shall try and reproduce.
Is there anything special about the process you copy using copyTo: ?
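
For reference, here is a sketch of the workaround as I understand it from
your description: your copyTo: with a `self pc` send added before the copy.
The exact placement of the send and the comments are my assumptions; this
is a stopgap, not a fix.

```smalltalk
Context >> copyTo: aContext
    "Workaround sketch: same as the existing method, plus a pc read
     before the copy. Assumption: reading pc through the accessor lets
     the VM map the slot to an image-level value first."
    | copy |
    self == aContext ifTrue: [^ nil].
    self pc.    "result deliberately discarded"
    copy := self copy.
    self sender ifNotNil: [
        copy privSender: (self sender copyTo: aContext)].
    ^ copy
```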

(see below)

> On Fri, Sep 11, 2020 at 6:36 PM Eliot Miranda <eliot.miranda at gmail.com>
> wrote:
>
>>
>> Hi Andrei,
>>
>> On Fri, Sep 11, 2020 at 8:58 AM Andrei Chis <chisvasileandrei at gmail.com>
>> wrote:
>>
>>>
>>> Hi,
>>>
>>> We often get crashes on our CI when calling `Context>copyTo:` in
>>> a GT image with a VM built from
>>> https://github.com/feenkcom/opensmalltalk-vm.
>>>
>>> To sum up: during `Context>copyTo:`, `Object>>#copy` is called on a
>>> context, leading to a segmentation fault. Looking at that context in
>>> lldb, the pc looks off. It has the value `0xfffffffffea7f6e1`.
>>>
>>>  (lldb) call (void *) printOop(0x1206b6990)
>>>     0x1206b6990: a(n) Context
>>>      0x1206b6a48 0xfffffffffea7f6e1                0x9        0x1146b2e08        0x1206b6b00
>>>      0x1206b6b28        0x1206b6b50
>>>
>>>
>>> Can this indicate some corruption or is it expected to have such values?
>>> `CoInterpreter>>ensureContextHasBytecodePC:` has code that also handles
>>> negative values for the pc which suggests that this might be expected.
>>>
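
As a side note on reading that dump: assuming this is a 64-bit Spur image,
where the three low bits of a word are the tag and tag 1 marks a
SmallInteger, the slots can be decoded by hand. The workspace sketch below
is only an illustration; the `decode` block is hypothetical, not VM code.

```smalltalk
"Decode a 64-bit word as a tagged SmallInteger.
 Assumption: 64-bit Spur layout, three low tag bits, tag 1 = SmallInteger."
| decode |
decode := [:word | | signed |
    (word bitAnd: 7) = 1
        ifFalse: [nil]    "not a SmallInteger under this assumption"
        ifTrue: [
            "Reinterpret the unsigned word as 64-bit two's complement."
            signed := word >= (1 bitShift: 63)
                ifTrue: [word - (1 bitShift: 64)]
                ifFalse: [word].
            "An arithmetic right shift drops the tag bits."
            signed bitShift: -3]].
decode value: 16r9.                  "stackp slot from the dump: 1"
decode value: 16rFFFFFFFFFEA7F6E1.   "pc slot from the dump: -2818340"
```

Under those assumptions the pc slot holds a negative SmallInteger, which is
the case `ensureContextHasBytecodePC:` is described above as handling.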
>>
>> The issue is that that value is expected *inside* the VM.  It is the
>> frame pointer for the context.  But above the VM this value should be
>> hidden. The VM should intercept all accesses to such fields in contexts and
>> automatically map them back to the appropriate values that the image
>> expects to see.  [The same thing is true for CompiledMethods; inside the VM
>> methods may refer to their JITted code, but this is invisible from the
>> image].  Intercepting access to Context state already happens with inst var
>> access in methods, with the shallowCopy primitive, with instVarAt: et al,
>> etc.
>>
>> So I expect the issue here is that copyTo: invokes some primitive which
>> does not (yet) check for a context receiver and/or argument, and hence
>> accidentally reveals the hidden state to the image, and a crash results.
>> What I need to know are the definitions for copyTo:, copy, etc., all the
>> way down to primitives.
>>
>
> Here is the source code:
>

Cool, nothing unusual here.  This should all work perfectly.  Tis a VM bug.
However...


> Context >> copyTo: aContext
> "Copy self and my sender chain down to, but not including, aContext.  End
> of copied chain will have nil sender."
>     | copy |
>     self == aContext ifTrue: [^ nil].
>     copy := self copy.
>     self sender ifNotNil: [
>         copy privSender: (self sender copyTo: aContext)].
>     ^ copy
>

Let me suggest

Context >> copyTo: aContext
    "Copy self and my sender chain down to, but not including, aContext.
     End of copied chain will have nil sender."
    | copy |
    self == aContext ifTrue: [^ nil].
    copy := self copy.
    self sender ifNotNil:
        [:mySender| copy privSender: (mySender copyTo: aContext)].
    ^ copy
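
The only change is that the block now takes the sender as its argument, so
`self sender` is evaluated once rather than twice. A small workspace sketch
of the difference, using a hypothetical counting block (`fetchSender`) as a
stand-in for `self sender`:

```smalltalk
| calls fetchSender twice once |
calls := 0.
"Hypothetical stand-in for `self sender`; counts how often it is evaluated."
fetchSender := [calls := calls + 1. #aSender].

"Zero-argument block: the sender is fetched again inside the block."
fetchSender value ifNotNil: [fetchSender value].
twice := calls.

calls := 0.
"One-argument block: the value already fetched is passed in."
fetchSender value ifNotNil: [:mySender | mySender].
once := calls.

{twice. once}    "#(2 1)"
```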

> Object>>#copy
>      ^self shallowCopy postCopy
>
> Object >> shallowCopy
>     | class newObject index |
>     <primitive: 148>
>     class := self class.
>     class isVariable
>         ifTrue:
>             [index := self basicSize.
>             newObject := class basicNew: index.
>             [index > 0]
>                 whileTrue:
>                     [newObject basicAt: index put: (self basicAt: index).
>                     index := index - 1]]
>         ifFalse: [newObject := class basicNew].
>     index := class instSize.
>     [index > 0]
>         whileTrue:
>             [newObject instVarAt: index put: (self instVarAt: index).
>             index := index - 1].
>     ^ newObject
>
> The code of primitiveClone looks the same [1].
>
>
>>> Changing `Context>copyTo:` by adding a `self pc` before calling
>>> `self copy` makes the crashes go away. Not sure if there is a reason
>>> for that or if it is just plain luck.
>>>
>>> A simple reduced stack is below (more details in this issue [1]). The
>>> crash always happens with contexts reified as objects (in this case
>>> 0x1206b6990 s [] in GtExamplesCommandLineHandler>runPackages).
>>> Could this suggest some kind of issue in the VM when reifying contexts,
>>> or just some other problem with memory corruption?
>>>
>>
>> This looks like an oversight in some primitive.  Here for example is the
>> implementation of the shallowCopy primitive, a.k.a. clone, and you can see
>> where it explicitly intercepts access to a context.
>>
>> primitiveClone
>>     "Return a shallow copy of the receiver.
>>      Special-case non-single contexts (because of context-to-stack mapping).
>>      Can't fail for contexts cuz of image context instantiation code (sigh)."
>>
>>     | rcvr newCopy |
>>     rcvr := self stackTop.
>>     (objectMemory isImmediate: rcvr)
>>         ifTrue:
>>             [newCopy := rcvr]
>>         ifFalse:
>>             [(objectMemory isContextNonImm: rcvr)
>>                 ifTrue:
>>                     [newCopy := self cloneContext: rcvr]
>>                 ifFalse:
>>                     [(argumentCount = 0
>>                       or: [(objectMemory isForwarded: rcvr) not])
>>                         ifTrue: [newCopy := objectMemory clone: rcvr]
>>                         ifFalse: [newCopy := 0]].
>>             newCopy = 0 ifTrue:
>>                 [^self primitiveFailFor: PrimErrNoMemory]].
>>     self pop: argumentCount + 1 thenPush: newCopy
>>
>> But since Squeak doesn't have copyTo: I have no idea what primitive is
>> being used.  I'm guessing 168 primitiveCopyObject, which seems to check for
>> a Context receiver, but not for a CompiledCode receiver.  What does the
>> primitive failure code look like?  Can you post the copyTo: implementations
>> here please?
>>
>
> The code is above. I also see that Context>>#copyTo: in Squeak calls
> Object>>copy for contexts.
>
> When a crash happens we don't get the exact same error every time. For
> example, on Mac we most often get:
>
> Process 35690 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
>     frame #0: 0x00000001100b1004
> ->  0x1100b1004: inl    $0x4c, %eax
>     0x1100b1006: leal   -0x5c(%rip), %eax
>     0x1100b100c: pushq  %r8
>     0x1100b100e: movabsq $0x1109e78e0, %r9         ; imm = 0x1109E78E0
> Target 0: (GlamorousToolkit) stopped.
>
> Process 29929 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BREAKPOINT (code=EXC_I386_BPT, subcode=0x0)
>     frame #0: 0x00000001100fe7ed
> ->  0x1100fe7ed: int3
>     0x1100fe7ee: int3
>     0x1100fe7ef: int3
>     0x1100fe7f0: int3
> Target 0: (GlamorousToolkit) stopped.
>
>
> [1]
> https://github.com/feenkcom/opensmalltalk-vm/blob/5f7d49227c9599a35fcb93892b727c93a573482c/smalltalksrc/VMMaker/StackInterpreterPrimitives.class.st#L325
>
> Cheers,
> Andrei
>
>
>>
>>>     0x7ffeefbb4380 M Context(Object)>copy 0x1206b6990: a(n) Context
>>>     0x7ffeefbb43b8 M Context>copyTo: 0x1206b6990: a(n) Context
>>>     0x7ffeefbb4400 M Context>copyTo: 0x1206b5ae0: a(n) Context
>>>   ...
>>>     0x7ffeefba6078 M Context>copyTo: 0x110548b28: a(n) Context
>>>     0x7ffeefba60d0 I Context>copyTo: 0x110548a70: a(n) Context
>>>     0x7ffeefba6118 I MessageNotUnderstood(Exception)>freezeUpTo: 0x110548a20: a(n) MessageNotUnderstood
>>>     0x7ffeefba6160 I MessageNotUnderstood(Exception)>freeze 0x110548a20: a(n) MessageNotUnderstood
>>>     0x7ffeefba6190 M [] in GtExampleEvaluator>result 0x110544fb8: a(n) GtExampleEvaluator
>>>     0x7ffeefba61c8 M BlockClosure>cull: 0x110545188: a(n) BlockClosure
>>>     0x7ffeefba6208 M Context>evaluateSignal: 0x110548c98: a(n) Context
>>>     0x7ffeefba6240 M Context>handleSignal: 0x110548c98: a(n) Context
>>>     0x7ffeefba6278 M Context>handleSignal: 0x110548be0: a(n) Context
>>>     0x7ffeefba62b0 M MessageNotUnderstood(Exception)>signal 0x110548a20: a(n) MessageNotUnderstood
>>>     0x7ffeefba62f0 M GtDummyExamplesWithInheritanceSubclassB(Object)>doesNotUnderstand: exampleH 0x1105487d8: a(n) GtDummyExamplesWithInheritanceSubclassB
>>>     0x7ffeefba6328 M GtExampleEvaluator>primitiveProcessExample:withEvaluationContext: 0x110544fb8: a(n) GtExampleEvaluator
>>>  ...
>>>     0x7ffeefbe64d0 M [] in GtExamplesHDReport class(HDReport class)>runPackages: 0x1145e41c8: a(n) GtExamplesHDReport class
>>>     0x7ffeefbe6520 M [] in Set>collect: 0x1206b5ab0: a(n) Set
>>>     0x7ffeefbe6568 M Array(SequenceableCollection)>do: 0x1206b5c50: a(n) Array
>>>        0x1206b5b98 s Set>collect:
>>>        0x1206b5ae0 s GtExamplesHDReport class(HDReport class)>runPackages:
>>>        0x1206b6990 s [] in GtExamplesCommandLineHandler>runPackages
>>>        0x1206b6a48 s BlockClosure>ensure:
>>>        0x1206b6b68 s UIManager class>nonInteractiveDuring:
>>>        0x1206b6c48 s GtExamplesCommandLineHandler>runPackages
>>>        0x1206b6d98 s GtExamplesCommandLineHandler>activate
>>>        0x1206b75d0 s GtExamplesCommandLineHandler class(CommandLineHandler class)>activateWith:
>>>        0x1207d2f00 s [] in PharoCommandLineHandler(BasicCommandLineHandler)>activateSubCommand:
>>>        0x1207e6620 s BlockClosure>on:do:
>>>        0x1207f7ab8 s PharoCommandLineHandler(BasicCommandLineHandler)>activateSubCommand:
>>>        0x120809d40 s PharoCommandLineHandler(BasicCommandLineHandler)>handleSubcommand
>>>        0x12082ca60 s PharoCommandLineHandler(BasicCommandLineHandler)>handleArgument:
>>>        0x120789938 s [] in PharoCommandLineHandler(BasicCommandLineHandler)>activate
>>>        0x1207a83e0 s BlockClosure>on:do:
>>>        0x1207b57a0 s [] in PharoCommandLineHandler(BasicCommandLineHandler)>activate
>>>        0x1207bf830 s [] in BlockClosure>newProcess
>>>
>>> Cheers,
>>> Andrei
>>>
>>>
>>> [1] https://github.com/feenkcom/gtoolkit/issues/1440
>>>
>>>
>>
>> --
>> _,,,^..^,,,_
>> best, Eliot
>>
>

-- 
_,,,^..^,,,_
best, Eliot