[Vm-dev] Ephemerons and VM crash

Wed May 25 18:30:56 UTC 2016

On Wed, May 25, 2016 at 10:46 AM, Eliot Miranda <eliot.miranda at gmail.com>
wrote:

> Hi Guille,
>
> On Wed, May 25, 2016 at 10:43 AM, Eliot Miranda <eliot.miranda at gmail.com>
> wrote:
>
>> Hi Guille,
>>
>>     first, v important. can you put the image(s) you're working with
>> somewhere I can download, and send me a URL?  I can be more helpful if I
>> can run the tests too.
>>
>>
>> On Wed, May 25, 2016 at 6:24 AM, Guille Polito <guillermopolito at gmail.com
>> > wrote:
>>
>>>
>>> Hi again!
>>>
>>> I snip most of the old thread to keep the relevant.
>>>
>>> -------- Original Message --------
>>>
>>> 1 - The mourn table is being collected because it is not marked upon
>>> creation
>>>
>>>> ===============
>>>>
>>>> I decided to move the assignment of marking out into markObjects:
>>>
>>>
>>> Ok, it looks good for me, I did not know what would be a good place to
>>> put them without breaking something :)
>>>
>>>
>>>
>>> 2 - Accesses to unscannedEphemerons are unaligned
>>>> ===============
>>>>
>>>> Looks good.  I've changed all the ephemeron methods to use bytesPerOop,
>>> cuz the ephemeron queue contains oops, but your fix is just as good.
>>>
>>>
>>> cool! I thought so that would be the best.
>>>
>>>
>>>
>>>>
>>>>
>>>> 3 - While debugging I also had a DNU #intAt: in #primitiveIntegerAt
>>>> ==========================
>>>>
>>>> Question: besides reviewing, how should I commit these fixes? one issue
>>>> per commit? all at once?
>>>>
>>>
>>> Typically all at once.  What I try and do is commit a large set of
>>> related changes in a single commit, so I wouldn't mix the fixes to
>>> ephemerons with, say, fixes to time.  But equally I'd commit any changes I
>>> made to help debugging in the same commit.  And all rules are made to be
>>> broken ;-).
>>>
>>> ok!
>>>
>>>
>>> Hey,  _thank you_!  This is really great and ephemerons won't be usable
>>> without testing like this.  I'm in your debt.
>>>
>>>
>>> Ha, there is no debt (or at least it would be in the opposite direction).
>>>
>>> Now I'm working on another crash that I can easily reproduce using this
>>> script:
>>>
>>> Smalltalk supportsQueueingFinalization: true.
>>> e := (1 to: 200000) collect: [ :i | Ephemeron key: (ObjectFinalizer
>>> receiver: 'test', 'asd' selector: #logCr) value: Object new ].
>>> Smalltalk garbageCollect.
>>>
>>
> Why the extra level of indirection?  The Ephemeron /is/ the object
> finaliser.  It can send finalise to its key.  Why bother with the extra
> level of wrapper?  It's wasteful; especially if we're attaching lots of
> ephemerons to things cuz we want to finalise something that has lots of
> instances.
>

Ignore this stupid question.  I see it's an old facility.  And it is a
helpful example :-).  Sorry for the noise!

>
>
>> While debugging that in the simulator, we saw with clement that during
>>> #fireEphemeronsInRememberedSet, #fireEphemeronsOnEphemeronList we are
>>> always firing, tenuring and scavenging ephemerons regardless they are
>>> marked or not.
>>>
>>> fireEphemeronsInRememberedSet
>>> [SNIP]
>>>          coInterpreter fireEphemeron: ephemeron.
>>>          manager
>>>             storePointerUnchecked: 0
>>>             ofObject: ephemeron
>>>             withValue: (self copyAndForward: (manager keyOfEphemeron:
>>> ephemeron)).
>>>         (self scavengeReferentsOf: ephemeron)
>>> [SNIP]
>>>
>>> So I put an assertion to check that the key of the ephemeron is marked
>>> when scavenging and it is not always the case...
>>>          self assert: (manager isMarked: (manager keyOfEphemeron:
>>> ephemeron)).
>>>
>>> Is the code wrong? or I'm missing something?
>>>
>>
>> The implementation of ephemerons in the scavenger and the implementation
>> in mark-sweep are quite different.
>>
>> Remember that what we're trying to do is find out if an object is only
>> referenced from the transitive closure of ephemerons or not.  And the way
>> that we do that in both cases is avoid processing ephemerons whose keys are
>> not yet reachable from the roots until all objects reachable from the roots
>> have been reached.  We do this by putting "unscanned" ephemerons in a
>> queue, saving them until later.
>>
>> In the scavenger the roots are the objects in the remembered set, the
>> interpreter state (newMethod etc) and the stack zone, and GC is performed
>> by copying all objects reachable from these roots in past and new spaces
>> into future space, possibly tenuring overflowed objects into old space.  In
>> the mark-sweep the roots are the specialObjectsArray, the interpreter state
>> (newMethod etc) and the stack zone.
>>
>> In the scavenger (a copying collector) this means that when we process
>> the unscanned ephemerons their keys will either have been copied into
>> future space or tenured to old space, in which case they were reachable
>> from the roots, or they will not have been copied yet, in which case they
>> are reachable only from ephemerons.  So in the scavenger marking is
>> irrelevant; in fact /no/ objects should be marked when scavenging.  The
>> important thing is whether a key is in past and new spaces or is in future
>> and old spaces.
>>
>> In the mark-sweep this means that when we process the unscanned
>> ephemerons their keys will either be marked, in which case they were
>> reachable from the roots, or unmarked, in which case they are reachable
>> only from ephemerons. So the assert for marked-ness only makes sense in
>> ephemeron processing in the mark-sweep collector.
>>
>> HTH
>>
>>
>>
>>> Guille
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>> Guille
>>>>
>>>> -------- Original Message --------
>>>>
>>>> I see, I probably I need to leave the simulator run for a longer
>>>> time... Because running the image using the assertion VM took some time to
>>>> crash, it may be the same case.
>>>>
>>>> -------- Original Message --------
>>>>
>>>> Some advance on my side,
>>>>
>>>> - Yes, apparently the image was already broken. I started a new one and
>>>> I can open it with the simulator. It seems that as soon as I activate
>>>> Ephemerons and I have alive ephemerons the heap corrupts (and since saving
>>>> the image fires a gc...).
>>>>
>>>> - Then, to catch the moment where the image breaks, I prepared my image
>>>> for debug with the following script:
>>>>
>>>> Smalltalk snapshotPrimitive.
>>>> f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
>>>> d := EphemeronDictionary new.
>>>> d at: f put: f.
>>>> f := nil.
>>>> Smalltalk supportsQueueingFinalization: true.
>>>> 1 to: 2 do: [ :i | Smalltalk garbageCollect].
>>>>
>>>> Then I opened it with the simulator (latest version:
>>>> VMMaker.oscog-tfel.1864) and I had several assertion failures in
>>>>
>>>> keyOfEphemeron: objOop
>>>>     "Answer the object the ephemeron guards.  This is its first
>>>> element."
>>>>     self assert: ((self isNonImmediate: objOop) and: [self isEphemeron:
>>>> objOop]).
>>>>     ^self fetchPointer: 0 ofObject: objOop
>>>>
>>>> followedKeyOfEphemeron: objOop
>>>>     "Answer the object the ephemeron guards.  This is its first
>>>> element."
>>>>     self assert: ((self isNonImmediate: objOop) and: [self isEphemeron:
>>>> objOop]).
>>>>     ^self followOopField: 0 ofObject: objOop
>>>>
>>>> So, I noticed the following:
>>>>
>>>> - Ephemerons in the unscanned list are marked, fired, and the format of
>>>> ephemerons is changed to a non-ephemeron format
>>>> - But they are not removed from the unscanned list
>>>> - then the mark #markAllUnscannedEphemerons fails tons of assertions
>>>> because it expects ephemeron objects and the (not-anymore-)ephemeron is
>>>> still there in the unscanned list
>>>>
>>>>      self markInactiveEphemerons ifFalse:
>>>>         [self fireAllUnscannedEphemerons].
>>>>      self markAllUnscannedEphemerons
>>>>
>>>> The strange thing is that if I ignore the assertions by proceeding, the
>>>> thing continues normally and it the image does not break...
>>>>
>>>> So I then compiled the VM, with the latest version, and in trace mode
>>>> to compare. I attach the log of the stdout. But here there is a résumé
>>>>
>>>> * First I see the same assertions I saw in the simulator
>>>>
>>>> (isNonImmediate(objOop)) && (isEphemeron(objOop)) 41155
>>>>
>>>> (isNonImmediate(objOop)) && (isEphemeron(objOop)) 55477
>>>>
>>>> * Then there are tons of repetitions of the same assertion:
>>>>
>>>> !(((GIV(mournQueue) != GIV(nilObj)) &&
>>>> (isonObjStack(anEphemeronOrWeakArray, GIV(mournQueue))))) 60030
>>>>
>>>> Which I understand it means that we are adding to the mourn queue all
>>>> the time the same object. Which is strange because it is a WeakArray with a
>>>> single slot which turns out to be nil (at least at the moment I inspected
>>>> it).
>>>>
>>>> * Then something breaks (!!) Looks like we are trying to mark a
>>>> forwarder. And then I have a follow up of assertions which are probably
>>>> caused by the same thing.
>>>>
>>>> !(isForwarded(field)) 49489
>>>>
>>>> (classIndexOf(objToScan)) > (isForwardedObjectClassIndexPun()) 49537
>>>>
>>>> (classIndex <= (tagMask())) || (classIndex >= (arrayClassIndexPun()))
>>>> 49403
>>>>
>>>> objCouldBeClassObj(classObj) 49415
>>>>
>>>> addressCouldBeOop(objOop) 59997
>>>>
>>>> addressCouldBeObj(topOfObjStack((0 ==
>>>> (fetchPointerofObject(ObjStackTopx, objStack)) ?
>>>> fetchPointerofObject(ObjStackNextx, objStack) : objStack))) 60002
>>>>
>>>> addressCouldBeOop(objOop) 59997
>>>>
>>>> * Finally there are some more
>>>>
>>>> !(isForwarded(field)) 49489
>>>>
>>>> until a segmentation fault arrives while executing a fullGC
>>>>
>>>> ../results/pharo[0x80ceaa4]
>>>> ../results/pharo[0x80cfde4]
>>>> ../results/pharo(fullGC+0x9b)[0x80c95ea]
>>>> ../results/pharo[0x80e2d49]
>>>> ../results/pharo[0x80e5f4a]
>>>> ../results/pharo[0x80ebca3]
>>>> ../results/pharo(ceStackOverflow+0xd7)[0x809356a]
>>>> [0x91002c6]
>>>> ../results/pharo(interpret+0x128)[0x8082b38]
>>>> ../results/pharo(main+0x23a)[0x810a7db]
>>>>
>>>> /lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xf3)[0xf73b5a63]
>>>> [0x1000]
>>>>
>>>> I'll continue digging a bit more,
>>>> Guille
>>>>
>>>> -------- Original Message --------
>>>>
>>>> Hi Guille,
>>>>
>>>> On Thu, May 19, 2016 at 12:53 AM, Guille Polito <
>>>> <guillermopolito at gmail.com>guillermopolito at gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> -------- Original Message --------
>>>>>
>>>>> Hi Guille,
>>>>>
>>>>> On Wed, May 18, 2016 at 12:46 AM, Guille Polito <
>>>>> <guillermopolito at gmail.com>guillermopolito at gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> -------- Original Message --------
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Guille, Hi Pablo (and welcome),
>>>>>>
>>>>>> On Tue, May 17, 2016 at 8:37 AM, Guille Polito <
>>>>>> <guillermopolito at gmail.com>guillermopolito at gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi Eliot, list
>>>>>>>
>>>>>>> I'm working here with Pablo (Tesone) on moving forward the Ephemeron
>>>>>>> implementation.
>>>>>>
>>>>>>
>>>>>
>>>>> Where's "here"?   Are you in Lille?
>>>>>
>>>>> Yup. Pablo is a new Phd student here :)
>>>>>
>>>>>
>>>>>
>>>>> We first installed Eliot's changeset, added a #mourn method and an
>>>>>>> EphemeronDictionary collection, and then started testing something like
>>>>>>> this:
>>>>>>>
>>>>>>> f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
>>>>>>> d := EphemeronDictionary new.
>>>>>>>
>>>>>>> d at: f put: f.
>>>>>>>
>>>>>>> f := nil.
>>>>>>> Smalltalk garbageCollect.
>>>>>>>
>>>>>>
>>>>>> So this looks like something simulate able.  Are you able to use the
>>>>>> simulator?  If not, why not?
>>>>>>
>>>>>>
>>>>>> For some reason I have that bytesToShift when opening the image is
>>>>>> negative.
>>>>>>
>>>>>
>>>>> That is to be expected.  In the real VM the heap is located somewhere
>>>>> well above the bottom of the address space, typically above the program
>>>>> code.  In the simulator the heap is located either at 0 (an interpreter or
>>>>> stack VM) or immediately above the code zone (in a Cogit VM).  So when an
>>>>> image that has been saved on the real VM is loaded into the simulator all
>>>>> oops have to be adjusted down and hence bytesToShift is negative.
>>>>>
>>>>>
>>>>>>     bytesToShift := objectMemory memoryBaseForImageRead -
>>>>>> oldBaseAddr.  "adjust pointers for zero base address"
>>>>>>
>>>>>> So I cannot continue loading because addresses become negative and I
>>>>>> have "Improper Store into indexable object kind of errors".
>>>>>>
>>>>>
>>>>> Can you post a back trace?  Where is this happening?
>>>>>
>>>>> I'm working in latest pharo (5.0), using the same source code that is
>>>>> used in the Pharo CI jobs to build the VM (pharo branded). I would expect
>>>>> this branch to be stable, maybe it could be a bit behind your branch of
>>>>> development, but I think Esteban merges regularly.
>>>>>
>>>>
>>>> Esteban is very busy trying to get UFFI working, and having just got
>>>> Pharo 5 released and working on Pharo 6 to be released.  I'd prefer it if
>>>> you used the tip of VMMaker.oscog.  Further, that's the package that
>>>> Clément is using and he's your expert source of local help.
>>>>
>>>>
>>>>
>>>>>
>>>>> I'm executing this,
>>>>>
>>>>> vm := StackInterpreterSimulator newWithOptions: #(#ObjectMemory
>>>>> #Spur32BitMemoryManager ).
>>>>> vm desiredNumStackPages: 8. "Makes simulation faster by creating fewer
>>>>> stack pages."
>>>>> (vm openOn: '/home/guille/Pharo/temp/ephemerons/broken.image')
>>>>>     openAsMorph; run.
>>>>>
>>>>> And here is the trace. The error happens while swizzling.
>>>>>
>>>>> LittleEndianBitmap(Object)>>error:
>>>>> LittleEndianBitmap(Object)>>errorImproperStore
>>>>> LittleEndianBitmap(Object)>>at:put:
>>>>> Spur32BitMMLESimulator>>longAt:put:
>>>>> Spur32BitMMLESimulator(SpurMemoryManager)>>swizzleFieldsOfObject:
>>>>> Spur32BitMMLESimulator(SpurMemoryManager)>>adjustAllOopsBy:
>>>>> Spur32BitMMLESimulator(SpurMemoryManager)>>initializeObjectMemory:
>>>>> StackInterpreterSimulatorLSB(StackInterpreter)>>initializeInterpreter:
>>>>> [ self initializeInterpreter: bytesToShift ] in
>>>>> StackInterpreterSimulatorLSB(StackInterpreterSimulator)>>openOn:extraMemory:
>>>>> in Block: [ self initializeInterpreter: bytesToShift ]
>>>>>
>>>>
>>>>  So the image is presumably already broken?  Looks to me like it was
>>>> corrupted when saved.   Can you reproduce the problem that led to this
>>>> image being broken from a working image?
>>>>
>>>>> What version of VMMaker.oscog are you using?
>>>>>
>>>>> Monticello reports VMMaker.oscog-eem.1855.
>>>>> However, I'm not sure I have the right number, because source code is
>>>>> now in git.
>>>>>
>>>>
>>>> It would be so great if you could use Monticello, not git, and hence be
>>>> able to contribute back.
>>>>
>>>>
>>>>>   Are you running in Pharo or Squeak?  If you're in Lille you could
>>>>> perhaps visit Clément's office and get him to take a look.  Clément, would
>>>>> that be ok?
>>>>>
>>>>> I'll maybe see it with clement, but today I have to work on some other
>>>>> project...
>>>>>
>>>>> When debugging the VM there are two main levels of support, one is the
>>>>>> simulator where there is maximum support for debugging:
>>>>>> - asserts on all the time
>>>>>> - arbitrary breakpoints
>>>>>> - attempting every GC in a copy of the heap before doing the real GC
>>>>>> so that bugs in the GC can be investigated without needing to construct a
>>>>>> reproducible case after a crash
>>>>>> - the Smalltalk environment to inspect and browse
>>>>>>
>>>>>> The next level is the assert and debug VMs.  If you look in the build
>>>>>> directories on the Cog svn branch you'll see that all of them build three
>>>>>> VMs, a production VM with maximum optimisation and asserts excluded, an
>>>>>> assert VM with -O1 and asserts enabled, and a debug VM with -O0 and asserts
>>>>>> enabled.  So if you either don't see the bug in the simulator, or the
>>>>>> simulator is too slow for the case being examined, or if the bug doesn't
>>>>>> show up in the simulator (the worst of all worlds), build both assert and
>>>>>> debug VMs and run with the assert VM first.
>>>>>>
>>>>>> Well so far we were using a VM compiled for debug with a graphical C
>>>>>> debugger. It was not so bad. However, I cannot say I'm missing a better
>>>>>> debugger.
>>>>>>
>>>>>
>>>>> "Compiled for debug" is vague.  Do you mean it is compiled with -g
>>>>> -O0, or in addition is compiled with -g -O0 -DDEBUGVM=0 -DNDEBUG=1?
>>>>>
>>>>>
>>>>> That would exactly be:
>>>>>
>>>>> { '-g3'.
>>>>>         '-O0'.
>>>>>         '-msse2'.
>>>>>         '-D_GNU_SOURCE'.
>>>>>         '-DITIMER_HEARTBEAT=1'.
>>>>>         '-DNO_VM_PROFILE=1'.
>>>>>         '-DDEBUGVM=1'.
>>>>>         '-DNDEBUG' }
>>>>>
>>>>> And Esteban told me to compile it with '-UNDEBUG' to get more
>>>>> information, but I did not try this yet.
>>>>>
>>>>
>>>> Yes.  NDEBUG is what turns off asserts.  You don't have a proper assert
>>>> or debug VM unless you do not use -NDEBUG.
>>>>
>>>> Thanks!
>>>>>
>>>>> Note that there is a heap leak checker which can be enabled both in
>>>>>> the simulator and the assert and debug VMs.  See the checkForLeaks method
>>>>>> and the -leakcheck argument.
>>>>>>
>>>>>>
>>>>>> ok!
>>>>>>
>>>>>>
>>>>>> Without the simulator or the assert and debug VMs you are flying
>>>>>> blind.  It is /really/ productive to use the simulator for debugging,
>>>>>> provided the bug is reproducible within a short amount of time, as for
>>>>>> example your case is above.
>>>>>>
>>>>>>
>>>>>> Ok, gotcha! By this afternoon I'll have some news probably.
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> However, as soon as we garbage collect twice, we have a VM crash. We
>>>>>>> started debugging the VM to see if we could have some more clues.
>>>>>>>
>>>>>>> The first thing we noticed is that the first time the GC runs, the
>>>>>>> mournQueue is nil. This is of course expected because the new finalization
>>>>>>> mechanism was not active and then there was no need to create the
>>>>>>> mournQueue. We saw that the mournQueue is actually created in a lazy
>>>>>>> fashion when putting queuing a mourned object (I refer myself to
>>>>>>> #queueMourner: and #ensureRoomOnObjStackAt:). So the second time the GC
>>>>>>> passes, the mournQueue is there. So far ok, but still crashing.
>>>>>>>
>>>>>>> The crash happens in the call to
>>>>>>>
>>>>>>> markAndTraceObjStackandContents(GIV(mournQueue), 1);
>>>>>>>
>>>>>>> after the
>>>>>>>
>>>>>>>     if (!markAndTraceContents) {
>>>>>>>        return;
>>>>>>>     }
>>>>>>>
>>>>>>> But when understanding why, it starts being less clear to us :). We
>>>>>>> used the printObjStack() function and we saw that:
>>>>>>>
>>>>>>> call printObjStack(markStack)
>>>>>>> call printObjStack(weaklingStack)
>>>>>>>
>>>>>>> and we saw in the console some output that makes sense. However,
>>>>>>> printing the mournQueue in the same manner produces some strange output
>>>>>>>
>>>>>>> call printObjStack(mournQueue)
>>>>>>>
>>>>>>> head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098
>>>>>>> (4098) unmkd
>>>>>>>     topx: 14 next:        0x0 free:        0x0
>>>>>>>
>>>>>>> We noticed that free and next are 0x0 while the others are not...
>>>>>>>
>>>>>>> Finally we saw there is isValidObjStack(), that gave us the
>>>>>>> following results:
>>>>>>>
>>>>>>> call isValidObjStack(markStack) => 1
>>>>>>>
>>>>>>> call isValidObjStack(weaklingStack) => 0
>>>>>>> p objStackInvalidBecause = "marking but page is unmarked"
>>>>>>>
>>>>>>> call isValidObjStack(mournQueue) => 0
>>>>>>> p objStackInvalidBecause = "marking but page is unmarked"
>>>>>>>
>>>>>>>
>>>>>>> So we assume that the stack creation is wrong? We are a bit lost in
>>>>>>> here.
>>>>>>>
>>>>>>> Guille and Pablo
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> _,,,^..^,,,_
>>>>>> best, Eliot
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> _,,,^..^,,,_
>>>>> best, Eliot
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> _,,,^..^,,,_
>>>> best, Eliot
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> _,,,^..^,,,_
>>> best, Eliot
>>>
>>>
>>>
>>>
>>
>>
>> --
>> _,,,^..^,,,_
>> best, Eliot
>>
>
>
>
> --
> _,,,^..^,,,_
> best, Eliot
>

-- 
_,,,^..^,,,_
best, Eliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20160525/1e616770/attachment-0001.htm