[Vm-dev] Ephemerons and VM crash

Wed May 25 17:43:01 UTC 2016

Hi Guille,

    first, v important. can you put the image(s) you're working with
somewhere I can download, and send me a URL?  I can be more helpful if I
can run the tests too.

On Wed, May 25, 2016 at 6:24 AM, Guille Polito <guillermopolito at gmail.com>
wrote:

>
> Hi again!
>
> I snip most of the old thread to keep the relevant.
>
> -------- Original Message --------
>
> 1 - The mourn table is being collected because it is not marked upon
> creation
>
>> ===============
>>
>> I decided to move the assignment of marking out into markObjects:
>
>
> Ok, it looks good for me, I did not know what would be a good place to put
> them without breaking something :)
>
>
>
> 2 - Accesses to unscannedEphemerons are unaligned
>> ===============
>>
>> Looks good.  I've changed all the ephemeron methods to use bytesPerOop,
> cuz the ephemeron queue contains oops, but your fix is just as good.
>
>
> cool! I thought so that would be the best.
>
>
>
>>
>>
>> 3 - While debugging I also had a DNU #intAt: in #primitiveIntegerAt
>> ==========================
>>
>> Question: besides reviewing, how should I commit these fixes? one issue
>> per commit? all at once?
>>
>
> Typically all at once.  What I try and do is commit a large set of related
> changes in a single commit, so I wouldn't mix the fixes to ephemerons with,
> say, fixes to time.  But equally I'd commit any changes I made to help
> debugging in the same commit.  And all rules are made to be broken ;-).
>
> ok!
>
>
> Hey,  _thank you_!  This is really great and ephemerons won't be usable
> without testing like this.  I'm in your debt.
>
>
> Ha, there is no debt (or at least it would be in the opposite direction).
>
> Now I'm working on another crash that I can easily reproduce using this
> script:
>
> Smalltalk supportsQueueingFinalization: true.
> e := (1 to: 200000) collect: [ :i | Ephemeron key: (ObjectFinalizer
> receiver: 'test', 'asd' selector: #logCr) value: Object new ].
> Smalltalk garbageCollect.
>
>
> While debugging that in the simulator, we saw with clement that during
> #fireEphemeronsInRememberedSet, #fireEphemeronsOnEphemeronList we are
> always firing, tenuring and scavenging ephemerons regardless they are
> marked or not.
>
> fireEphemeronsInRememberedSet
> [SNIP]
>          coInterpreter fireEphemeron: ephemeron.
>          manager
>             storePointerUnchecked: 0
>             ofObject: ephemeron
>             withValue: (self copyAndForward: (manager keyOfEphemeron:
> ephemeron)).
>         (self scavengeReferentsOf: ephemeron)
> [SNIP]
>
> So I put an assertion to check that the key of the ephemeron is marked
> when scavenging and it is not always the case...
>          self assert: (manager isMarked: (manager keyOfEphemeron:
> ephemeron)).
>
> Is the code wrong? or I'm missing something?
>

The implementation of ephemerons in the scavenger and the implementation in
mark-sweep are quite different.

Remember that what we're trying to do is find out if an object is only
referenced from the transitive closure of ephemerons or not.  And the way
that we do that in both cases is avoid processing ephemerons whose keys are
not yet reachable from the roots until all objects reachable from the roots
have been reached.  We do this by putting "unscanned" ephemerons in a
queue, saving them until later.

In the scavenger the roots are the objects in the remembered set, the
interpreter state (newMethod etc) and the stack zone, and GC is performed
by copying all objects reachable from these roots in past and new spaces
into future space, possibly tenuring overflowed objects into old space.  In
the mark-sweep the roots are the specialObjectsArray, the interpreter state
(newMethod etc) and the stack zone.

In the scavenger (a copying collector) this means that when we process the
unscanned ephemerons their keys will either have been copied into future
space or tenured to old space, in which case they were reachable from the
roots, or they will not have been copied yet, in which case they are
reachable only from ephemerons.  So in the scavenger marking is irrelevant;
in fact /no/ objects should be marked when scavenging.  The important thing
is whether a key is in past and new spaces or is in future and old spaces.

In the mark-sweep this means that when we process the unscanned ephemerons
their keys will either be marked, in which case they were reachable from
the roots, or unmarked, in which case they are reachable only from
ephemerons. So the assert for marked-ness only makes sense in ephemeron
processing in the mark-sweep collector.

HTH

> Guille
>
>
>
>>
>>
>> Guille
>>
>> -------- Original Message --------
>>
>> I see, I probably I need to leave the simulator run for a longer time...
>> Because running the image using the assertion VM took some time to crash,
>> it may be the same case.
>>
>> -------- Original Message --------
>>
>> Some advance on my side,
>>
>> - Yes, apparently the image was already broken. I started a new one and I
>> can open it with the simulator. It seems that as soon as I activate
>> Ephemerons and I have alive ephemerons the heap corrupts (and since saving
>> the image fires a gc...).
>>
>> - Then, to catch the moment where the image breaks, I prepared my image
>> for debug with the following script:
>>
>> Smalltalk snapshotPrimitive.
>> f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
>> d := EphemeronDictionary new.
>> d at: f put: f.
>> f := nil.
>> Smalltalk supportsQueueingFinalization: true.
>> 1 to: 2 do: [ :i | Smalltalk garbageCollect].
>>
>> Then I opened it with the simulator (latest version:
>> VMMaker.oscog-tfel.1864) and I had several assertion failures in
>>
>> keyOfEphemeron: objOop
>>     "Answer the object the ephemeron guards.  This is its first element."
>>     self assert: ((self isNonImmediate: objOop) and: [self isEphemeron:
>> objOop]).
>>     ^self fetchPointer: 0 ofObject: objOop
>>
>> followedKeyOfEphemeron: objOop
>>     "Answer the object the ephemeron guards.  This is its first element."
>>     self assert: ((self isNonImmediate: objOop) and: [self isEphemeron:
>> objOop]).
>>     ^self followOopField: 0 ofObject: objOop
>>
>> So, I noticed the following:
>>
>> - Ephemerons in the unscanned list are marked, fired, and the format of
>> ephemerons is changed to a non-ephemeron format
>> - But they are not removed from the unscanned list
>> - then the mark #markAllUnscannedEphemerons fails tons of assertions
>> because it expects ephemeron objects and the (not-anymore-)ephemeron is
>> still there in the unscanned list
>>
>>      self markInactiveEphemerons ifFalse:
>>         [self fireAllUnscannedEphemerons].
>>      self markAllUnscannedEphemerons
>>
>> The strange thing is that if I ignore the assertions by proceeding, the
>> thing continues normally and it the image does not break...
>>
>> So I then compiled the VM, with the latest version, and in trace mode to
>> compare. I attach the log of the stdout. But here there is a résumé
>>
>> * First I see the same assertions I saw in the simulator
>>
>> (isNonImmediate(objOop)) && (isEphemeron(objOop)) 41155
>>
>> (isNonImmediate(objOop)) && (isEphemeron(objOop)) 55477
>>
>> * Then there are tons of repetitions of the same assertion:
>>
>> !(((GIV(mournQueue) != GIV(nilObj)) &&
>> (isonObjStack(anEphemeronOrWeakArray, GIV(mournQueue))))) 60030
>>
>> Which I understand it means that we are adding to the mourn queue all the
>> time the same object. Which is strange because it is a WeakArray with a
>> single slot which turns out to be nil (at least at the moment I inspected
>> it).
>>
>> * Then something breaks (!!) Looks like we are trying to mark a
>> forwarder. And then I have a follow up of assertions which are probably
>> caused by the same thing.
>>
>> !(isForwarded(field)) 49489
>>
>> (classIndexOf(objToScan)) > (isForwardedObjectClassIndexPun()) 49537
>>
>> (classIndex <= (tagMask())) || (classIndex >= (arrayClassIndexPun()))
>> 49403
>>
>> objCouldBeClassObj(classObj) 49415
>>
>> addressCouldBeOop(objOop) 59997
>>
>> addressCouldBeObj(topOfObjStack((0 == (fetchPointerofObject(ObjStackTopx,
>> objStack)) ? fetchPointerofObject(ObjStackNextx, objStack) : objStack)))
>> 60002
>>
>> addressCouldBeOop(objOop) 59997
>>
>> * Finally there are some more
>>
>> !(isForwarded(field)) 49489
>>
>> until a segmentation fault arrives while executing a fullGC
>>
>> ../results/pharo[0x80ceaa4]
>> ../results/pharo[0x80cfde4]
>> ../results/pharo(fullGC+0x9b)[0x80c95ea]
>> ../results/pharo[0x80e2d49]
>> ../results/pharo[0x80e5f4a]
>> ../results/pharo[0x80ebca3]
>> ../results/pharo(ceStackOverflow+0xd7)[0x809356a]
>> [0x91002c6]
>> ../results/pharo(interpret+0x128)[0x8082b38]
>> ../results/pharo(main+0x23a)[0x810a7db]
>>
>> /lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xf3)[0xf73b5a63]
>> [0x1000]
>>
>> I'll continue digging a bit more,
>> Guille
>>
>> -------- Original Message --------
>>
>> Hi Guille,
>>
>> On Thu, May 19, 2016 at 12:53 AM, Guille Polito <
>> <guillermopolito at gmail.com>guillermopolito at gmail.com> wrote:
>>
>>>
>>>
>>> -------- Original Message --------
>>>
>>> Hi Guille,
>>>
>>> On Wed, May 18, 2016 at 12:46 AM, Guille Polito <
>>> <guillermopolito at gmail.com>guillermopolito at gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> -------- Original Message --------
>>>>
>>>>
>>>>
>>>> Hi Guille, Hi Pablo (and welcome),
>>>>
>>>> On Tue, May 17, 2016 at 8:37 AM, Guille Polito <
>>>> <guillermopolito at gmail.com>guillermopolito at gmail.com> wrote:
>>>>
>>>>>
>>>>> Hi Eliot, list
>>>>>
>>>>> I'm working here with Pablo (Tesone) on moving forward the Ephemeron
>>>>> implementation.
>>>>
>>>>
>>>
>>> Where's "here"?   Are you in Lille?
>>>
>>> Yup. Pablo is a new Phd student here :)
>>>
>>>
>>>
>>> We first installed Eliot's changeset, added a #mourn method and an
>>>>> EphemeronDictionary collection, and then started testing something like
>>>>> this:
>>>>>
>>>>> f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
>>>>> d := EphemeronDictionary new.
>>>>>
>>>>> d at: f put: f.
>>>>>
>>>>> f := nil.
>>>>> Smalltalk garbageCollect.
>>>>>
>>>>
>>>> So this looks like something simulate able.  Are you able to use the
>>>> simulator?  If not, why not?
>>>>
>>>>
>>>> For some reason I have that bytesToShift when opening the image is
>>>> negative.
>>>>
>>>
>>> That is to be expected.  In the real VM the heap is located somewhere
>>> well above the bottom of the address space, typically above the program
>>> code.  In the simulator the heap is located either at 0 (an interpreter or
>>> stack VM) or immediately above the code zone (in a Cogit VM).  So when an
>>> image that has been saved on the real VM is loaded into the simulator all
>>> oops have to be adjusted down and hence bytesToShift is negative.
>>>
>>>
>>>>     bytesToShift := objectMemory memoryBaseForImageRead - oldBaseAddr.
>>>> "adjust pointers for zero base address"
>>>>
>>>> So I cannot continue loading because addresses become negative and I
>>>> have "Improper Store into indexable object kind of errors".
>>>>
>>>
>>> Can you post a back trace?  Where is this happening?
>>>
>>> I'm working in latest pharo (5.0), using the same source code that is
>>> used in the Pharo CI jobs to build the VM (pharo branded). I would expect
>>> this branch to be stable, maybe it could be a bit behind your branch of
>>> development, but I think Esteban merges regularly.
>>>
>>
>> Esteban is very busy trying to get UFFI working, and having just got
>> Pharo 5 released and working on Pharo 6 to be released.  I'd prefer it if
>> you used the tip of VMMaker.oscog.  Further, that's the package that
>> Clément is using and he's your expert source of local help.
>>
>>
>>
>>>
>>> I'm executing this,
>>>
>>> vm := StackInterpreterSimulator newWithOptions: #(#ObjectMemory
>>> #Spur32BitMemoryManager ).
>>> vm desiredNumStackPages: 8. "Makes simulation faster by creating fewer
>>> stack pages."
>>> (vm openOn: '/home/guille/Pharo/temp/ephemerons/broken.image')
>>>     openAsMorph; run.
>>>
>>> And here is the trace. The error happens while swizzling.
>>>
>>> LittleEndianBitmap(Object)>>error:
>>> LittleEndianBitmap(Object)>>errorImproperStore
>>> LittleEndianBitmap(Object)>>at:put:
>>> Spur32BitMMLESimulator>>longAt:put:
>>> Spur32BitMMLESimulator(SpurMemoryManager)>>swizzleFieldsOfObject:
>>> Spur32BitMMLESimulator(SpurMemoryManager)>>adjustAllOopsBy:
>>> Spur32BitMMLESimulator(SpurMemoryManager)>>initializeObjectMemory:
>>> StackInterpreterSimulatorLSB(StackInterpreter)>>initializeInterpreter:
>>> [ self initializeInterpreter: bytesToShift ] in
>>> StackInterpreterSimulatorLSB(StackInterpreterSimulator)>>openOn:extraMemory:
>>> in Block: [ self initializeInterpreter: bytesToShift ]
>>>
>>
>>  So the image is presumably already broken?  Looks to me like it was
>> corrupted when saved.   Can you reproduce the problem that led to this
>> image being broken from a working image?
>>
>>> What version of VMMaker.oscog are you using?
>>>
>>> Monticello reports VMMaker.oscog-eem.1855.
>>> However, I'm not sure I have the right number, because source code is
>>> now in git.
>>>
>>
>> It would be so great if you could use Monticello, not git, and hence be
>> able to contribute back.
>>
>>
>>>   Are you running in Pharo or Squeak?  If you're in Lille you could
>>> perhaps visit Clément's office and get him to take a look.  Clément, would
>>> that be ok?
>>>
>>> I'll maybe see it with clement, but today I have to work on some other
>>> project...
>>>
>>> When debugging the VM there are two main levels of support, one is the
>>>> simulator where there is maximum support for debugging:
>>>> - asserts on all the time
>>>> - arbitrary breakpoints
>>>> - attempting every GC in a copy of the heap before doing the real GC so
>>>> that bugs in the GC can be investigated without needing to construct a
>>>> reproducible case after a crash
>>>> - the Smalltalk environment to inspect and browse
>>>>
>>>> The next level is the assert and debug VMs.  If you look in the build
>>>> directories on the Cog svn branch you'll see that all of them build three
>>>> VMs, a production VM with maximum optimisation and asserts excluded, an
>>>> assert VM with -O1 and asserts enabled, and a debug VM with -O0 and asserts
>>>> enabled.  So if you either don't see the bug in the simulator, or the
>>>> simulator is too slow for the case being examined, or if the bug doesn't
>>>> show up in the simulator (the worst of all worlds), build both assert and
>>>> debug VMs and run with the assert VM first.
>>>>
>>>> Well so far we were using a VM compiled for debug with a graphical C
>>>> debugger. It was not so bad. However, I cannot say I'm missing a better
>>>> debugger.
>>>>
>>>
>>> "Compiled for debug" is vague.  Do you mean it is compiled with -g -O0,
>>> or in addition is compiled with -g -O0 -DDEBUGVM=0 -DNDEBUG=1?
>>>
>>>
>>> That would exactly be:
>>>
>>> { '-g3'.
>>>         '-O0'.
>>>         '-msse2'.
>>>         '-D_GNU_SOURCE'.
>>>         '-DITIMER_HEARTBEAT=1'.
>>>         '-DNO_VM_PROFILE=1'.
>>>         '-DDEBUGVM=1'.
>>>         '-DNDEBUG' }
>>>
>>> And Esteban told me to compile it with '-UNDEBUG' to get more
>>> information, but I did not try this yet.
>>>
>>
>> Yes.  NDEBUG is what turns off asserts.  You don't have a proper assert
>> or debug VM unless you do not use -NDEBUG.
>>
>> Thanks!
>>>
>>> Note that there is a heap leak checker which can be enabled both in the
>>>> simulator and the assert and debug VMs.  See the checkForLeaks method and
>>>> the -leakcheck argument.
>>>>
>>>>
>>>> ok!
>>>>
>>>>
>>>> Without the simulator or the assert and debug VMs you are flying
>>>> blind.  It is /really/ productive to use the simulator for debugging,
>>>> provided the bug is reproducible within a short amount of time, as for
>>>> example your case is above.
>>>>
>>>>
>>>> Ok, gotcha! By this afternoon I'll have some news probably.
>>>>
>>>> Thanks a lot!
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> However, as soon as we garbage collect twice, we have a VM crash. We
>>>>> started debugging the VM to see if we could have some more clues.
>>>>>
>>>>> The first thing we noticed is that the first time the GC runs, the
>>>>> mournQueue is nil. This is of course expected because the new finalization
>>>>> mechanism was not active and then there was no need to create the
>>>>> mournQueue. We saw that the mournQueue is actually created in a lazy
>>>>> fashion when putting queuing a mourned object (I refer myself to
>>>>> #queueMourner: and #ensureRoomOnObjStackAt:). So the second time the GC
>>>>> passes, the mournQueue is there. So far ok, but still crashing.
>>>>>
>>>>> The crash happens in the call to
>>>>>
>>>>> markAndTraceObjStackandContents(GIV(mournQueue), 1);
>>>>>
>>>>> after the
>>>>>
>>>>>     if (!markAndTraceContents) {
>>>>>        return;
>>>>>     }
>>>>>
>>>>> But when understanding why, it starts being less clear to us :). We
>>>>> used the printObjStack() function and we saw that:
>>>>>
>>>>> call printObjStack(markStack)
>>>>> call printObjStack(weaklingStack)
>>>>>
>>>>> and we saw in the console some output that makes sense. However,
>>>>> printing the mournQueue in the same manner produces some strange output
>>>>>
>>>>> call printObjStack(mournQueue)
>>>>>
>>>>> head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098 (4098)
>>>>> unmkd
>>>>>     topx: 14 next:        0x0 free:        0x0
>>>>>
>>>>> We noticed that free and next are 0x0 while the others are not...
>>>>>
>>>>> Finally we saw there is isValidObjStack(), that gave us the following
>>>>> results:
>>>>>
>>>>> call isValidObjStack(markStack) => 1
>>>>>
>>>>> call isValidObjStack(weaklingStack) => 0
>>>>> p objStackInvalidBecause = "marking but page is unmarked"
>>>>>
>>>>> call isValidObjStack(mournQueue) => 0
>>>>> p objStackInvalidBecause = "marking but page is unmarked"
>>>>>
>>>>>
>>>>> So we assume that the stack creation is wrong? We are a bit lost in
>>>>> here.
>>>>>
>>>>> Guille and Pablo
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> _,,,^..^,,,_
>>>> best, Eliot
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> _,,,^..^,,,_
>>> best, Eliot
>>>
>>>
>>>
>>
>>
>> --
>> _,,,^..^,,,_
>> best, Eliot
>>
>>
>>
>>
>>
>
>
> --
> _,,,^..^,,,_
> best, Eliot
>
>
>
>

-- 
_,,,^..^,,,_
best, Eliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20160525/821aad06/attachment-0001.htm