[Vm-dev] Ephemerons and VM crash

Eliot Miranda eliot.miranda at gmail.com
Wed May 25 17:46:21 UTC 2016


Hi Guille,

On Wed, May 25, 2016 at 10:43 AM, Eliot Miranda <eliot.miranda at gmail.com>
wrote:

> Hi Guille,
>
>     first, v important. can you put the image(s) you're working with
> somewhere I can download, and send me a URL?  I can be more helpful if I
> can run the tests too.
>
>
> On Wed, May 25, 2016 at 6:24 AM, Guille Polito <guillermopolito at gmail.com>
> wrote:
>
>>
>> Hi again!
>>
>> I snip most of the old thread to keep the relevant.
>>
>> -------- Original Message --------
>>
>> 1 - The mourn table is being collected because it is not marked upon
>> creation
>>
>>> ===============
>>>
>>> I decided to move the assignment of marking out into markObjects:
>>
>>
>> Ok, it looks good for me, I did not know what would be a good place to
>> put them without breaking something :)
>>
>>
>>
>> 2 - Accesses to unscannedEphemerons are unaligned
>>> ===============
>>>
>>> Looks good.  I've changed all the ephemeron methods to use bytesPerOop,
>> cuz the ephemeron queue contains oops, but your fix is just as good.
>>
>>
>> cool! I thought so that would be the best.
>>
>>
>>
>>>
>>>
>>> 3 - While debugging I also had a DNU #intAt: in #primitiveIntegerAt
>>> ==========================
>>>
>>> Question: besides reviewing, how should I commit these fixes? one issue
>>> per commit? all at once?
>>>
>>
>> Typically all at once.  What I try and do is commit a large set of
>> related changes in a single commit, so I wouldn't mix the fixes to
>> ephemerons with, say, fixes to time.  But equally I'd commit any changes I
>> made to help debugging in the same commit.  And all rules are made to be
>> broken ;-).
>>
>> ok!
>>
>>
>> Hey,  _thank you_!  This is really great and ephemerons won't be usable
>> without testing like this.  I'm in your debt.
>>
>>
>> Ha, there is no debt (or at least it would be in the opposite direction).
>>
>> Now I'm working on another crash that I can easily reproduce using this
>> script:
>>
>> Smalltalk supportsQueueingFinalization: true.
>> e := (1 to: 200000) collect: [ :i | Ephemeron key: (ObjectFinalizer
>> receiver: 'test', 'asd' selector: #logCr) value: Object new ].
>> Smalltalk garbageCollect.
>>
>
Why the extra level of indirection?  The Ephemeron /is/ the object
finaliser.  It can send finalise to its key.  Why bother with the extra
level of wrapper?  It's wasteful; especially if we're attaching lots of
ephemerons to things cuz we want to finalise something that has lots of
instances.


> While debugging that in the simulator, we saw with clement that during
>> #fireEphemeronsInRememberedSet, #fireEphemeronsOnEphemeronList we are
>> always firing, tenuring and scavenging ephemerons regardless they are
>> marked or not.
>>
>> fireEphemeronsInRememberedSet
>> [SNIP]
>>          coInterpreter fireEphemeron: ephemeron.
>>          manager
>>             storePointerUnchecked: 0
>>             ofObject: ephemeron
>>             withValue: (self copyAndForward: (manager keyOfEphemeron:
>> ephemeron)).
>>         (self scavengeReferentsOf: ephemeron)
>> [SNIP]
>>
>> So I put an assertion to check that the key of the ephemeron is marked
>> when scavenging and it is not always the case...
>>          self assert: (manager isMarked: (manager keyOfEphemeron:
>> ephemeron)).
>>
>> Is the code wrong? or I'm missing something?
>>
>
> The implementation of ephemerons in the scavenger and the implementation
> in mark-sweep are quite different.
>
> Remember that what we're trying to do is find out if an object is only
> referenced from the transitive closure of ephemerons or not.  And the way
> that we do that in both cases is avoid processing ephemerons whose keys are
> not yet reachable from the roots until all objects reachable from the roots
> have been reached.  We do this by putting "unscanned" ephemerons in a
> queue, saving them until later.
>
> In the scavenger the roots are the objects in the remembered set, the
> interpreter state (newMethod etc) and the stack zone, and GC is performed
> by copying all objects reachable from these roots in past and new spaces
> into future space, possibly tenuring overflowed objects into old space.  In
> the mark-sweep the roots are the specialObjectsArray, the interpreter state
> (newMethod etc) and the stack zone.
>
> In the scavenger (a copying collector) this means that when we process the
> unscanned ephemerons their keys will either have been copied into future
> space or tenured to old space, in which case they were reachable from the
> roots, or they will not have been copied yet, in which case they are
> reachable only from ephemerons.  So in the scavenger marking is irrelevant;
> in fact /no/ objects should be marked when scavenging.  The important thing
> is whether a key is in past and new spaces or is in future and old spaces.
>
> In the mark-sweep this means that when we process the unscanned ephemerons
> their keys will either be marked, in which case they were reachable from
> the roots, or unmarked, in which case they are reachable only from
> ephemerons. So the assert for marked-ness only makes sense in ephemeron
> processing in the mark-sweep collector.
>
> HTH
>
>
>
>> Guille
>>
>>
>>
>>
>>>
>>>
>>> Guille
>>>
>>> -------- Original Message --------
>>>
>>> I see, I probably I need to leave the simulator run for a longer time...
>>> Because running the image using the assertion VM took some time to crash,
>>> it may be the same case.
>>>
>>> -------- Original Message --------
>>>
>>> Some advance on my side,
>>>
>>> - Yes, apparently the image was already broken. I started a new one and
>>> I can open it with the simulator. It seems that as soon as I activate
>>> Ephemerons and I have alive ephemerons the heap corrupts (and since saving
>>> the image fires a gc...).
>>>
>>> - Then, to catch the moment where the image breaks, I prepared my image
>>> for debug with the following script:
>>>
>>> Smalltalk snapshotPrimitive.
>>> f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
>>> d := EphemeronDictionary new.
>>> d at: f put: f.
>>> f := nil.
>>> Smalltalk supportsQueueingFinalization: true.
>>> 1 to: 2 do: [ :i | Smalltalk garbageCollect].
>>>
>>> Then I opened it with the simulator (latest version:
>>> VMMaker.oscog-tfel.1864) and I had several assertion failures in
>>>
>>> keyOfEphemeron: objOop
>>>     "Answer the object the ephemeron guards.  This is its first element."
>>>     self assert: ((self isNonImmediate: objOop) and: [self isEphemeron:
>>> objOop]).
>>>     ^self fetchPointer: 0 ofObject: objOop
>>>
>>> followedKeyOfEphemeron: objOop
>>>     "Answer the object the ephemeron guards.  This is its first element."
>>>     self assert: ((self isNonImmediate: objOop) and: [self isEphemeron:
>>> objOop]).
>>>     ^self followOopField: 0 ofObject: objOop
>>>
>>> So, I noticed the following:
>>>
>>> - Ephemerons in the unscanned list are marked, fired, and the format of
>>> ephemerons is changed to a non-ephemeron format
>>> - But they are not removed from the unscanned list
>>> - then the mark #markAllUnscannedEphemerons fails tons of assertions
>>> because it expects ephemeron objects and the (not-anymore-)ephemeron is
>>> still there in the unscanned list
>>>
>>>      self markInactiveEphemerons ifFalse:
>>>         [self fireAllUnscannedEphemerons].
>>>      self markAllUnscannedEphemerons
>>>
>>> The strange thing is that if I ignore the assertions by proceeding, the
>>> thing continues normally and it the image does not break...
>>>
>>> So I then compiled the VM, with the latest version, and in trace mode to
>>> compare. I attach the log of the stdout. But here there is a résumé
>>>
>>> * First I see the same assertions I saw in the simulator
>>>
>>> (isNonImmediate(objOop)) && (isEphemeron(objOop)) 41155
>>>
>>> (isNonImmediate(objOop)) && (isEphemeron(objOop)) 55477
>>>
>>> * Then there are tons of repetitions of the same assertion:
>>>
>>> !(((GIV(mournQueue) != GIV(nilObj)) &&
>>> (isonObjStack(anEphemeronOrWeakArray, GIV(mournQueue))))) 60030
>>>
>>> Which I understand it means that we are adding to the mourn queue all
>>> the time the same object. Which is strange because it is a WeakArray with a
>>> single slot which turns out to be nil (at least at the moment I inspected
>>> it).
>>>
>>> * Then something breaks (!!) Looks like we are trying to mark a
>>> forwarder. And then I have a follow up of assertions which are probably
>>> caused by the same thing.
>>>
>>> !(isForwarded(field)) 49489
>>>
>>> (classIndexOf(objToScan)) > (isForwardedObjectClassIndexPun()) 49537
>>>
>>> (classIndex <= (tagMask())) || (classIndex >= (arrayClassIndexPun()))
>>> 49403
>>>
>>> objCouldBeClassObj(classObj) 49415
>>>
>>> addressCouldBeOop(objOop) 59997
>>>
>>> addressCouldBeObj(topOfObjStack((0 ==
>>> (fetchPointerofObject(ObjStackTopx, objStack)) ?
>>> fetchPointerofObject(ObjStackNextx, objStack) : objStack))) 60002
>>>
>>> addressCouldBeOop(objOop) 59997
>>>
>>> * Finally there are some more
>>>
>>> !(isForwarded(field)) 49489
>>>
>>> until a segmentation fault arrives while executing a fullGC
>>>
>>> ../results/pharo[0x80ceaa4]
>>> ../results/pharo[0x80cfde4]
>>> ../results/pharo(fullGC+0x9b)[0x80c95ea]
>>> ../results/pharo[0x80e2d49]
>>> ../results/pharo[0x80e5f4a]
>>> ../results/pharo[0x80ebca3]
>>> ../results/pharo(ceStackOverflow+0xd7)[0x809356a]
>>> [0x91002c6]
>>> ../results/pharo(interpret+0x128)[0x8082b38]
>>> ../results/pharo(main+0x23a)[0x810a7db]
>>>
>>> /lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xf3)[0xf73b5a63]
>>> [0x1000]
>>>
>>> I'll continue digging a bit more,
>>> Guille
>>>
>>> -------- Original Message --------
>>>
>>> Hi Guille,
>>>
>>> On Thu, May 19, 2016 at 12:53 AM, Guille Polito <
>>> <guillermopolito at gmail.com>guillermopolito at gmail.com> wrote:
>>>
>>>>
>>>>
>>>> -------- Original Message --------
>>>>
>>>> Hi Guille,
>>>>
>>>> On Wed, May 18, 2016 at 12:46 AM, Guille Polito <
>>>> <guillermopolito at gmail.com>guillermopolito at gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> -------- Original Message --------
>>>>>
>>>>>
>>>>>
>>>>> Hi Guille, Hi Pablo (and welcome),
>>>>>
>>>>> On Tue, May 17, 2016 at 8:37 AM, Guille Polito <
>>>>> <guillermopolito at gmail.com>guillermopolito at gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Hi Eliot, list
>>>>>>
>>>>>> I'm working here with Pablo (Tesone) on moving forward the Ephemeron
>>>>>> implementation.
>>>>>
>>>>>
>>>>
>>>> Where's "here"?   Are you in Lille?
>>>>
>>>> Yup. Pablo is a new Phd student here :)
>>>>
>>>>
>>>>
>>>> We first installed Eliot's changeset, added a #mourn method and an
>>>>>> EphemeronDictionary collection, and then started testing something like
>>>>>> this:
>>>>>>
>>>>>> f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
>>>>>> d := EphemeronDictionary new.
>>>>>>
>>>>>> d at: f put: f.
>>>>>>
>>>>>> f := nil.
>>>>>> Smalltalk garbageCollect.
>>>>>>
>>>>>
>>>>> So this looks like something simulate able.  Are you able to use the
>>>>> simulator?  If not, why not?
>>>>>
>>>>>
>>>>> For some reason I have that bytesToShift when opening the image is
>>>>> negative.
>>>>>
>>>>
>>>> That is to be expected.  In the real VM the heap is located somewhere
>>>> well above the bottom of the address space, typically above the program
>>>> code.  In the simulator the heap is located either at 0 (an interpreter or
>>>> stack VM) or immediately above the code zone (in a Cogit VM).  So when an
>>>> image that has been saved on the real VM is loaded into the simulator all
>>>> oops have to be adjusted down and hence bytesToShift is negative.
>>>>
>>>>
>>>>>     bytesToShift := objectMemory memoryBaseForImageRead -
>>>>> oldBaseAddr.  "adjust pointers for zero base address"
>>>>>
>>>>> So I cannot continue loading because addresses become negative and I
>>>>> have "Improper Store into indexable object kind of errors".
>>>>>
>>>>
>>>> Can you post a back trace?  Where is this happening?
>>>>
>>>> I'm working in latest pharo (5.0), using the same source code that is
>>>> used in the Pharo CI jobs to build the VM (pharo branded). I would expect
>>>> this branch to be stable, maybe it could be a bit behind your branch of
>>>> development, but I think Esteban merges regularly.
>>>>
>>>
>>> Esteban is very busy trying to get UFFI working, and having just got
>>> Pharo 5 released and working on Pharo 6 to be released.  I'd prefer it if
>>> you used the tip of VMMaker.oscog.  Further, that's the package that
>>> Clément is using and he's your expert source of local help.
>>>
>>>
>>>
>>>>
>>>> I'm executing this,
>>>>
>>>> vm := StackInterpreterSimulator newWithOptions: #(#ObjectMemory
>>>> #Spur32BitMemoryManager ).
>>>> vm desiredNumStackPages: 8. "Makes simulation faster by creating fewer
>>>> stack pages."
>>>> (vm openOn: '/home/guille/Pharo/temp/ephemerons/broken.image')
>>>>     openAsMorph; run.
>>>>
>>>> And here is the trace. The error happens while swizzling.
>>>>
>>>> LittleEndianBitmap(Object)>>error:
>>>> LittleEndianBitmap(Object)>>errorImproperStore
>>>> LittleEndianBitmap(Object)>>at:put:
>>>> Spur32BitMMLESimulator>>longAt:put:
>>>> Spur32BitMMLESimulator(SpurMemoryManager)>>swizzleFieldsOfObject:
>>>> Spur32BitMMLESimulator(SpurMemoryManager)>>adjustAllOopsBy:
>>>> Spur32BitMMLESimulator(SpurMemoryManager)>>initializeObjectMemory:
>>>> StackInterpreterSimulatorLSB(StackInterpreter)>>initializeInterpreter:
>>>> [ self initializeInterpreter: bytesToShift ] in
>>>> StackInterpreterSimulatorLSB(StackInterpreterSimulator)>>openOn:extraMemory:
>>>> in Block: [ self initializeInterpreter: bytesToShift ]
>>>>
>>>
>>>  So the image is presumably already broken?  Looks to me like it was
>>> corrupted when saved.   Can you reproduce the problem that led to this
>>> image being broken from a working image?
>>>
>>>> What version of VMMaker.oscog are you using?
>>>>
>>>> Monticello reports VMMaker.oscog-eem.1855.
>>>> However, I'm not sure I have the right number, because source code is
>>>> now in git.
>>>>
>>>
>>> It would be so great if you could use Monticello, not git, and hence be
>>> able to contribute back.
>>>
>>>
>>>>   Are you running in Pharo or Squeak?  If you're in Lille you could
>>>> perhaps visit Clément's office and get him to take a look.  Clément, would
>>>> that be ok?
>>>>
>>>> I'll maybe see it with clement, but today I have to work on some other
>>>> project...
>>>>
>>>> When debugging the VM there are two main levels of support, one is the
>>>>> simulator where there is maximum support for debugging:
>>>>> - asserts on all the time
>>>>> - arbitrary breakpoints
>>>>> - attempting every GC in a copy of the heap before doing the real GC
>>>>> so that bugs in the GC can be investigated without needing to construct a
>>>>> reproducible case after a crash
>>>>> - the Smalltalk environment to inspect and browse
>>>>>
>>>>> The next level is the assert and debug VMs.  If you look in the build
>>>>> directories on the Cog svn branch you'll see that all of them build three
>>>>> VMs, a production VM with maximum optimisation and asserts excluded, an
>>>>> assert VM with -O1 and asserts enabled, and a debug VM with -O0 and asserts
>>>>> enabled.  So if you either don't see the bug in the simulator, or the
>>>>> simulator is too slow for the case being examined, or if the bug doesn't
>>>>> show up in the simulator (the worst of all worlds), build both assert and
>>>>> debug VMs and run with the assert VM first.
>>>>>
>>>>> Well so far we were using a VM compiled for debug with a graphical C
>>>>> debugger. It was not so bad. However, I cannot say I'm missing a better
>>>>> debugger.
>>>>>
>>>>
>>>> "Compiled for debug" is vague.  Do you mean it is compiled with -g -O0,
>>>> or in addition is compiled with -g -O0 -DDEBUGVM=0 -DNDEBUG=1?
>>>>
>>>>
>>>> That would exactly be:
>>>>
>>>> { '-g3'.
>>>>         '-O0'.
>>>>         '-msse2'.
>>>>         '-D_GNU_SOURCE'.
>>>>         '-DITIMER_HEARTBEAT=1'.
>>>>         '-DNO_VM_PROFILE=1'.
>>>>         '-DDEBUGVM=1'.
>>>>         '-DNDEBUG' }
>>>>
>>>> And Esteban told me to compile it with '-UNDEBUG' to get more
>>>> information, but I did not try this yet.
>>>>
>>>
>>> Yes.  NDEBUG is what turns off asserts.  You don't have a proper assert
>>> or debug VM unless you do not use -NDEBUG.
>>>
>>> Thanks!
>>>>
>>>> Note that there is a heap leak checker which can be enabled both in the
>>>>> simulator and the assert and debug VMs.  See the checkForLeaks method and
>>>>> the -leakcheck argument.
>>>>>
>>>>>
>>>>> ok!
>>>>>
>>>>>
>>>>> Without the simulator or the assert and debug VMs you are flying
>>>>> blind.  It is /really/ productive to use the simulator for debugging,
>>>>> provided the bug is reproducible within a short amount of time, as for
>>>>> example your case is above.
>>>>>
>>>>>
>>>>> Ok, gotcha! By this afternoon I'll have some news probably.
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> However, as soon as we garbage collect twice, we have a VM crash. We
>>>>>> started debugging the VM to see if we could have some more clues.
>>>>>>
>>>>>> The first thing we noticed is that the first time the GC runs, the
>>>>>> mournQueue is nil. This is of course expected because the new finalization
>>>>>> mechanism was not active and then there was no need to create the
>>>>>> mournQueue. We saw that the mournQueue is actually created in a lazy
>>>>>> fashion when putting queuing a mourned object (I refer myself to
>>>>>> #queueMourner: and #ensureRoomOnObjStackAt:). So the second time the GC
>>>>>> passes, the mournQueue is there. So far ok, but still crashing.
>>>>>>
>>>>>> The crash happens in the call to
>>>>>>
>>>>>> markAndTraceObjStackandContents(GIV(mournQueue), 1);
>>>>>>
>>>>>> after the
>>>>>>
>>>>>>     if (!markAndTraceContents) {
>>>>>>        return;
>>>>>>     }
>>>>>>
>>>>>> But when understanding why, it starts being less clear to us :). We
>>>>>> used the printObjStack() function and we saw that:
>>>>>>
>>>>>> call printObjStack(markStack)
>>>>>> call printObjStack(weaklingStack)
>>>>>>
>>>>>> and we saw in the console some output that makes sense. However,
>>>>>> printing the mournQueue in the same manner produces some strange output
>>>>>>
>>>>>> call printObjStack(mournQueue)
>>>>>>
>>>>>> head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098
>>>>>> (4098) unmkd
>>>>>>     topx: 14 next:        0x0 free:        0x0
>>>>>>
>>>>>> We noticed that free and next are 0x0 while the others are not...
>>>>>>
>>>>>> Finally we saw there is isValidObjStack(), that gave us the following
>>>>>> results:
>>>>>>
>>>>>> call isValidObjStack(markStack) => 1
>>>>>>
>>>>>> call isValidObjStack(weaklingStack) => 0
>>>>>> p objStackInvalidBecause = "marking but page is unmarked"
>>>>>>
>>>>>> call isValidObjStack(mournQueue) => 0
>>>>>> p objStackInvalidBecause = "marking but page is unmarked"
>>>>>>
>>>>>>
>>>>>> So we assume that the stack creation is wrong? We are a bit lost in
>>>>>> here.
>>>>>>
>>>>>> Guille and Pablo
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> _,,,^..^,,,_
>>>>> best, Eliot
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> _,,,^..^,,,_
>>>> best, Eliot
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> _,,,^..^,,,_
>>> best, Eliot
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> _,,,^..^,,,_
>> best, Eliot
>>
>>
>>
>>
>
>
> --
> _,,,^..^,,,_
> best, Eliot
>



-- 
_,,,^..^,,,_
best, Eliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20160525/7fb12292/attachment-0001.htm


More information about the Vm-dev mailing list