[Vm-dev] Ephemerons and VM crash

Guille Polito guillermopolito at gmail.com
Wed May 25 13:24:10 UTC 2016


Hi again!

I snip most of the old thread to keep the relevant.

-------- Original Message --------
> 1 - The mourn table is being collected because it is not marked upon 
> creation
>
>     ===============
>
> I decided to move the assignment of marking out into markObjects:

Ok, it looks good for me, I did not know what would be a good place to 
put them without breaking something :)
>
>
>     2 - Accesses to unscannedEphemerons are unaligned
>     ===============
>
> Looks good.  I've changed all the ephemeron methods to use 
> bytesPerOop, cuz the ephemeron queue contains oops, but your fix is 
> just as good.

cool! I thought so that would be the best.
>
>
>
>     3 - While debugging I also had a DNU #intAt: in #primitiveIntegerAt
>     ==========================
>
>     Question: besides reviewing, how should I commit these fixes? one
>     issue per commit? all at once?
>
>
> Typically all at once.  What I try and do is commit a large set of 
> related changes in a single commit, so I wouldn't mix the fixes to 
> ephemerons with, say, fixes to time.  But equally I'd commit any 
> changes I made to help debugging in the same commit.  And all rules 
> are made to be broken ;-).
ok!
>
> Hey,  _thank you_!  This is really great and ephemerons won't be 
> usable without testing like this.  I'm in your debt.

Ha, there is no debt (or at least it would be in the opposite direction).

Now I'm working on another crash that I can easily reproduce using this 
script:

Smalltalk supportsQueueingFinalization: true.
e := (1 to: 200000) collect: [ :i | Ephemeron key: (ObjectFinalizer 
receiver: 'test', 'asd' selector: #logCr) value: Object new ].
Smalltalk garbageCollect.


While debugging that in the simulator, we saw with clement that during 
#fireEphemeronsInRememberedSet, #fireEphemeronsOnEphemeronList we are 
always firing, tenuring and scavenging ephemerons regardless they are 
marked or not.

fireEphemeronsInRememberedSet
[SNIP]
          coInterpreter fireEphemeron: ephemeron.
          manager
             storePointerUnchecked: 0
             ofObject: ephemeron
             withValue: (self copyAndForward: (manager keyOfEphemeron: 
ephemeron)).
         (self scavengeReferentsOf: ephemeron)
[SNIP]

So I put an assertion to check that the key of the ephemeron is marked 
when scavenging and it is not always the case...
          self assert: (manager isMarked: (manager keyOfEphemeron: 
ephemeron)).

Is the code wrong? or I'm missing something?

Guille

>
>
>     Guille
>
>     -------- Original Message --------
>>     I see, I probably I need to leave the simulator run for a longer
>>     time... Because running the image using the assertion VM took
>>     some time to crash, it may be the same case.
>>
>>     -------- Original Message --------
>>>     Some advance on my side,
>>>
>>>     - Yes, apparently the image was already broken. I started a new
>>>     one and I can open it with the simulator. It seems that as soon
>>>     as I activate Ephemerons and I have alive ephemerons the heap
>>>     corrupts (and since saving the image fires a gc...).
>>>
>>>     - Then, to catch the moment where the image breaks, I prepared
>>>     my image for debug with the following script:
>>>
>>>     Smalltalk snapshotPrimitive.
>>>     f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
>>>     d := EphemeronDictionary new.
>>>     d at: f put: f.
>>>     f := nil.
>>>     Smalltalk supportsQueueingFinalization: true.
>>>     1 to: 2 do: [ :i | Smalltalk garbageCollect].
>>>
>>>     Then I opened it with the simulator (latest version:
>>>     VMMaker.oscog-tfel.1864) and I had several assertion failures in
>>>
>>>     keyOfEphemeron: objOop
>>>         "Answer the object the ephemeron guards. This is its first
>>>     element."
>>>         self assert: ((self isNonImmediate: objOop) and: [self
>>>     isEphemeron: objOop]).
>>>         ^self fetchPointer: 0 ofObject: objOop
>>>
>>>     followedKeyOfEphemeron: objOop
>>>         "Answer the object the ephemeron guards. This is its first
>>>     element."
>>>         self assert: ((self isNonImmediate: objOop) and: [self
>>>     isEphemeron: objOop]).
>>>         ^self followOopField: 0 ofObject: objOop
>>>
>>>     So, I noticed the following:
>>>
>>>     - Ephemerons in the unscanned list are marked, fired, and the
>>>     format of ephemerons is changed to a non-ephemeron format
>>>     - But they are not removed from the unscanned list
>>>     - then the mark #markAllUnscannedEphemerons fails tons of
>>>     assertions because it expects ephemeron objects and the
>>>     (not-anymore-)ephemeron is still there in the unscanned list
>>>
>>>          self markInactiveEphemerons ifFalse:
>>>             [self fireAllUnscannedEphemerons].
>>>          self markAllUnscannedEphemerons
>>>
>>>     The strange thing is that if I ignore the assertions by
>>>     proceeding, the thing continues normally and it the image does
>>>     not break...
>>>
>>>     So I then compiled the VM, with the latest version, and in trace
>>>     mode to compare. I attach the log of the stdout. But here there
>>>     is a résumé
>>>
>>>     * First I see the same assertions I saw in the simulator
>>>
>>>     (isNonImmediate(objOop)) && (isEphemeron(objOop)) 41155
>>>
>>>     (isNonImmediate(objOop)) && (isEphemeron(objOop)) 55477
>>>
>>>     * Then there are tons of repetitions of the same assertion:
>>>
>>>     !(((GIV(mournQueue) != GIV(nilObj)) &&
>>>     (isonObjStack(anEphemeronOrWeakArray, GIV(mournQueue))))) 60030
>>>
>>>     Which I understand it means that we are adding to the mourn
>>>     queue all the time the same object. Which is strange because it
>>>     is a WeakArray with a single slot which turns out to be nil (at
>>>     least at the moment I inspected it).
>>>
>>>     * Then something breaks (!!) Looks like we are trying to mark a
>>>     forwarder. And then I have a follow up of assertions which are
>>>     probably caused by the same thing.
>>>
>>>     !(isForwarded(field)) 49489
>>>
>>>     (classIndexOf(objToScan)) > (isForwardedObjectClassIndexPun()) 49537
>>>
>>>     (classIndex <= (tagMask())) || (classIndex >=
>>>     (arrayClassIndexPun())) 49403
>>>
>>>     objCouldBeClassObj(classObj) 49415
>>>
>>>     addressCouldBeOop(objOop) 59997
>>>
>>>     addressCouldBeObj(topOfObjStack((0 ==
>>>     (fetchPointerofObject(ObjStackTopx, objStack)) ?
>>>     fetchPointerofObject(ObjStackNextx, objStack) : objStack))) 60002
>>>
>>>     addressCouldBeOop(objOop) 59997
>>>
>>>     * Finally there are some more
>>>
>>>     !(isForwarded(field)) 49489
>>>
>>>     until a segmentation fault arrives while executing a fullGC
>>>
>>>     ../results/pharo[0x80ceaa4]
>>>     ../results/pharo[0x80cfde4]
>>>     ../results/pharo(fullGC+0x9b)[0x80c95ea]
>>>     ../results/pharo[0x80e2d49]
>>>     ../results/pharo[0x80e5f4a]
>>>     ../results/pharo[0x80ebca3]
>>>     ../results/pharo(ceStackOverflow+0xd7)[0x809356a]
>>>     [0x91002c6]
>>>     ../results/pharo(interpret+0x128)[0x8082b38]
>>>     ../results/pharo(main+0x23a)[0x810a7db]
>>>     /lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xf3)[0xf73b5a63]
>>>     [0x1000]
>>>
>>>     I'll continue digging a bit more,
>>>     Guille
>>>
>>>     -------- Original Message --------
>>>>     Hi Guille,
>>>>
>>>>     On Thu, May 19, 2016 at 12:53 AM, Guille Polito
>>>>     <guillermopolito at gmail.com <mailto:guillermopolito at gmail.com>>
>>>>     wrote:
>>>>
>>>>
>>>>
>>>>         -------- Original Message --------
>>>>>         Hi Guille,
>>>>>
>>>>>         On Wed, May 18, 2016 at 12:46 AM, Guille Polito
>>>>>         <guillermopolito at gmail.com
>>>>>         <mailto:guillermopolito at gmail.com>> wrote:
>>>>>
>>>>>             Hello,
>>>>>
>>>>>             -------- Original Message --------
>>>>>>               
>>>>>>
>>>>>>
>>>>>>             Hi Guille, Hi Pablo (and welcome),
>>>>>>
>>>>>>             On Tue, May 17, 2016 at 8:37 AM, Guille Polito
>>>>>>             <guillermopolito at gmail.com
>>>>>>             <mailto:guillermopolito at gmail.com>> wrote:
>>>>>>
>>>>>>
>>>>>>                 Hi Eliot, list
>>>>>>
>>>>>>                 I'm working here with Pablo (Tesone) on moving
>>>>>>                 forward the Ephemeron implementation.
>>>>>>
>>>>>
>>>>>
>>>>>         Where's "here"?   Are you in Lille?
>>>>         Yup. Pablo is a new Phd student here :)
>>>>>
>>>>>
>>>>>>                 We first installed Eliot's changeset, added a
>>>>>>                 #mourn method and an EphemeronDictionary
>>>>>>                 collection, and then started testing something
>>>>>>                 like this:
>>>>>>
>>>>>>                 f := ObjectFinalizer receiver: 'Hello' selector:
>>>>>>                 #logCr.
>>>>>>                 d := EphemeronDictionary new.
>>>>>>
>>>>>>                 d at: f put: f.
>>>>>>
>>>>>>                 f := nil.
>>>>>>                 Smalltalk garbageCollect.
>>>>>>
>>>>>>
>>>>>>             So this looks like something simulate able.  Are you
>>>>>>             able to use the simulator?  If not, why not?
>>>>>
>>>>>             For some reason I have that bytesToShift when opening
>>>>>             the image is negative.
>>>>>
>>>>>
>>>>>         That is to be expected.  In the real VM the heap is
>>>>>         located somewhere well above the bottom of the address
>>>>>         space, typically above the program code.  In the simulator
>>>>>         the heap is located either at 0 (an interpreter or stack
>>>>>         VM) or immediately above the code zone (in a Cogit VM). 
>>>>>         So when an image that has been saved on the real VM is
>>>>>         loaded into the simulator all oops have to be adjusted
>>>>>         down and hence bytesToShift is negative.
>>>>>
>>>>>
>>>>>                 bytesToShift := objectMemory
>>>>>             memoryBaseForImageRead - oldBaseAddr. "adjust pointers
>>>>>             for zero base address"
>>>>>
>>>>>             So I cannot continue loading because addresses become
>>>>>             negative and I have "Improper Store into indexable
>>>>>             object kind of errors".
>>>>>
>>>>>
>>>>>         Can you post a back trace?  Where is this happening?
>>>>         I'm working in latest pharo (5.0), using the same source
>>>>         code that is used in the Pharo CI jobs to build the VM
>>>>         (pharo branded). I would expect this branch to be stable,
>>>>         maybe it could be a bit behind your branch of development,
>>>>         but I think Esteban merges regularly.
>>>>
>>>>
>>>>     Esteban is very busy trying to get UFFI working, and having
>>>>     just got Pharo 5 released and working on Pharo 6 to be
>>>>     released.  I'd prefer it if you used the tip of VMMaker.oscog.
>>>>     Further, that's the package that Clément is using and he's your
>>>>     expert source of local help.
>>>>
>>>>
>>>>         I'm executing this,
>>>>
>>>>         vm := StackInterpreterSimulator newWithOptions:
>>>>         #(#ObjectMemory #Spur32BitMemoryManager ).
>>>>         vm desiredNumStackPages: 8. "Makes simulation faster by
>>>>         creating fewer stack pages."
>>>>         (vm openOn: '/home/guille/Pharo/temp/ephemerons/broken.image')
>>>>             openAsMorph; run.
>>>>
>>>>         And here is the trace. The error happens while swizzling.
>>>>
>>>>         LittleEndianBitmap(Object)>>error:
>>>>         LittleEndianBitmap(Object)>>errorImproperStore
>>>>         LittleEndianBitmap(Object)>>at:put:
>>>>         Spur32BitMMLESimulator>>longAt:put:
>>>>         Spur32BitMMLESimulator(SpurMemoryManager)>>swizzleFieldsOfObject:
>>>>         Spur32BitMMLESimulator(SpurMemoryManager)>>adjustAllOopsBy:
>>>>         Spur32BitMMLESimulator(SpurMemoryManager)>>initializeObjectMemory:
>>>>         StackInterpreterSimulatorLSB(StackInterpreter)>>initializeInterpreter:
>>>>         [ self initializeInterpreter: bytesToShift ] in
>>>>         StackInterpreterSimulatorLSB(StackInterpreterSimulator)>>openOn:extraMemory:
>>>>         in Block: [ self initializeInterpreter: bytesToShift ]
>>>>
>>>>
>>>>      So the image is presumably already broken?  Looks to me like
>>>>     it was corrupted when saved.   Can you reproduce the problem
>>>>     that led to this image being broken from a working image?
>>>>
>>>>>         What version of VMMaker.oscog are you using?
>>>>         Monticello reports VMMaker.oscog-eem.1855.
>>>>         However, I'm not sure I have the right number, because
>>>>         source code is now in git.
>>>>
>>>>
>>>>     It would be so great if you could use Monticello, not git, and
>>>>     hence be able to contribute back.
>>>>
>>>>>           Are you running in Pharo or Squeak?  If you're in Lille
>>>>>         you could perhaps visit Clément's office and get him to
>>>>>         take a look. Clément, would that be ok?
>>>>         I'll maybe see it with clement, but today I have to work on
>>>>         some other project...
>>>>>
>>>>>>             When debugging the VM there are two main levels of
>>>>>>             support, one is the simulator where there is maximum
>>>>>>             support for debugging:
>>>>>>             - asserts on all the time
>>>>>>             - arbitrary breakpoints
>>>>>>             - attempting every GC in a copy of the heap before
>>>>>>             doing the real GC so that bugs in the GC can be
>>>>>>             investigated without needing to construct a
>>>>>>             reproducible case after a crash
>>>>>>             - the Smalltalk environment to inspect and browse
>>>>>>
>>>>>>             The next level is the assert and debug VMs.  If you
>>>>>>             look in the build directories on the Cog svn branch
>>>>>>             you'll see that all of them build three VMs, a
>>>>>>             production VM with maximum optimisation and asserts
>>>>>>             excluded, an assert VM with -O1 and asserts enabled,
>>>>>>             and a debug VM with -O0 and asserts enabled.  So if
>>>>>>             you either don't see the bug in the simulator, or the
>>>>>>             simulator is too slow for the case being examined, or
>>>>>>             if the bug doesn't show up in the simulator (the
>>>>>>             worst of all worlds), build both assert and debug VMs
>>>>>>             and run with the assert VM first.
>>>>>>
>>>>>             Well so far we were using a VM compiled for debug with
>>>>>             a graphical C debugger. It was not so bad. However, I
>>>>>             cannot say I'm missing a better debugger.
>>>>>
>>>>>
>>>>>         "Compiled for debug" is vague.  Do you mean it is compiled
>>>>>         with -g -O0, or in addition is compiled with -g -O0
>>>>>         -DDEBUGVM=0 -DNDEBUG=1?
>>>>         That would exactly be:
>>>>
>>>>         { '-g3'.
>>>>                 '-O0'.
>>>>                 '-msse2'.
>>>>                 '-D_GNU_SOURCE'.
>>>>                 '-DITIMER_HEARTBEAT=1'.
>>>>                 '-DNO_VM_PROFILE=1'.
>>>>                 '-DDEBUGVM=1'.
>>>>                 '-DNDEBUG' }
>>>>
>>>>         And Esteban told me to compile it with '-UNDEBUG' to get
>>>>         more information, but I did not try this yet.
>>>>
>>>>
>>>>     Yes.  NDEBUG is what turns off asserts.  You don't have a
>>>>     proper assert or debug VM unless you do not use -NDEBUG.
>>>>
>>>>         Thanks!
>>>>
>>>>>>             Note that there is a heap leak checker which can be
>>>>>>             enabled both in the simulator and the assert and
>>>>>>             debug VMs. See the checkForLeaks method and the
>>>>>>             -leakcheck argument.
>>>>>
>>>>>             ok!
>>>>>>
>>>>>>             Without the simulator or the assert and debug VMs you
>>>>>>             are flying blind.  It is /really/ productive to use
>>>>>>             the simulator for debugging, provided the bug is
>>>>>>             reproducible within a short amount of time, as for
>>>>>>             example your case is above.
>>>>>
>>>>>             Ok, gotcha! By this afternoon I'll have some news
>>>>>             probably.
>>>>>
>>>>>             Thanks a lot!
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>                 However, as soon as we garbage collect twice, we
>>>>>>                 have a VM crash. We started debugging the VM to
>>>>>>                 see if we could have some more clues.
>>>>>>
>>>>>>                 The first thing we noticed is that the first time
>>>>>>                 the GC runs, the mournQueue is nil. This is of
>>>>>>                 course expected because the new finalization
>>>>>>                 mechanism was not active and then there was no
>>>>>>                 need to create the mournQueue. We saw that the
>>>>>>                 mournQueue is actually created in a lazy fashion
>>>>>>                 when putting queuing a mourned object (I refer
>>>>>>                 myself to #queueMourner: and
>>>>>>                 #ensureRoomOnObjStackAt:). So the second time the
>>>>>>                 GC passes, the mournQueue is there. So far ok,
>>>>>>                 but still crashing.
>>>>>>
>>>>>>                 The crash happens in the call to
>>>>>>
>>>>>>                 markAndTraceObjStackandContents(GIV(mournQueue), 1);
>>>>>>
>>>>>>                 after the
>>>>>>
>>>>>>                     if (!markAndTraceContents) {
>>>>>>                        return;
>>>>>>                     }
>>>>>>
>>>>>>                 But when understanding why, it starts being less
>>>>>>                 clear to us :). We used the printObjStack()
>>>>>>                 function and we saw that:
>>>>>>
>>>>>>                 call printObjStack(markStack)
>>>>>>                 call printObjStack(weaklingStack)
>>>>>>
>>>>>>                 and we saw in the console some output that makes
>>>>>>                 sense. However, printing the mournQueue in the
>>>>>>                 same manner produces some strange output
>>>>>>
>>>>>>                 call printObjStack(mournQueue)
>>>>>>
>>>>>>                 head 0xb06e980 cx 18 (18) fmt 10 (10) sz 4092
>>>>>>                 (4092) myx: 4098 (4098) unmkd
>>>>>>                     topx: 14 next: 0x0 free:   0x0
>>>>>>
>>>>>>                 We noticed that free and next are 0x0 while the
>>>>>>                 others are not...
>>>>>>
>>>>>>                 Finally we saw there is isValidObjStack(), that
>>>>>>                 gave us the following results:
>>>>>>
>>>>>>                 call isValidObjStack(markStack) => 1
>>>>>>
>>>>>>                 call isValidObjStack(weaklingStack) => 0
>>>>>>                 p objStackInvalidBecause = "marking but page is
>>>>>>                 unmarked"
>>>>>>
>>>>>>                 call isValidObjStack(mournQueue) => 0
>>>>>>                 p objStackInvalidBecause = "marking but page is
>>>>>>                 unmarked"
>>>>>>
>>>>>>
>>>>>>                 So we assume that the stack creation is wrong? We
>>>>>>                 are a bit lost in here.
>>>>>>
>>>>>>                 Guille and Pablo
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             -- 
>>>>>>             _,,,^..^,,,_
>>>>>>             best, Eliot
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>         -- 
>>>>>         _,,,^..^,,,_
>>>>>         best, Eliot
>>>>
>>>>
>>>>
>>>>
>>>>     -- 
>>>>     _,,,^..^,,,_
>>>>     best, Eliot
>>>
>>
>
>
>
>
> -- 
> _,,,^..^,,,_
> best, Eliot

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20160525/b1dc2ba3/attachment-0001.htm


More information about the Vm-dev mailing list