[Vm-dev] Reproducible Cog crash from image startup

Tue Feb 28 12:51:30 UTC 2012

On 28 February 2012 13:13, Igor Stasenko <siguctua at gmail.com> wrote:
> On 27 February 2012 22:15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>> let me retry *with* the attachment :(
>>
>> On Mon, Feb 27, 2012 at 12:03 PM, Igor Stasenko <siguctua at gmail.com> wrote:
>>>
>>> On 27 February 2012 10:53, Mariano Martinez Peck <marianopeck at gmail.com>
>>> wrote:
>>> >
>>> >
>>> >
>>> > On Mon, Feb 27, 2012 at 5:20 AM, Eliot Miranda <eliot.miranda at gmail.com>
>>> > wrote:
>>> >>
>>> >> Hi Mariano,
>>> >>
>>> >> On Sun, Feb 26, 2012 at 8:58 AM, Mariano Martinez Peck
>>> >> <marianopeck at gmail.com> wrote:
>>> >>>
>>> >>>
>>> >>> Hi. I have faced a VM crash while using Nautilus browser. It took me a
>>> >>> while, but I finally could make a reproducible crash from image startup. You
>>> >>> can find the image here:
>>> >>>
>>> >>> https://gforge.inria.fr/frs/download.php/30280/Marea.104-Crash.1.image.zip
>>> >>>
>>> >>> What the image is running at startup that causes the crash is:
>>> >>>
>>> >>> | nautilus model ui|
>>> >>> Nautilus instVarNamed: 'groups' put: nil.
>>> >>> model := Nautilus open.
>>> >>> ui := model ui.
>>> >>> ui groupsButtonAction.
>>> >>>
>>> >>> If you need more about the "domain", we can ask Ben, Nautilus
>>> >>> developer.  From what I can see in GDB, it crashes in #mapStackPages
>>> >>> because it does a remap to an OOP that is 0 (zero)
>>> >>>
>>> >>> while (theSP <= frameRcvrOffset) {
>>> >>>                     oop = longAt(theSP);
>>> >>>                     if (!((oop & 1))) {
>>> >>>                         longAtput(theSP, remap(oop));
>>> >>>                     }
>>> >>>                     theSP += BytesPerWord;
>>> >>>                 }
>>> >>>
>>> >>>
>>> >>> Any ideas?
>>> >>
>>> >>
>>> >> The image overflows the weakRoots table in scanning stack pages.  The
>>> >> weakRoots table registers weak objects for scanning at the end of a GC.  It
>>> >> is, unfortunately, fixed size (~2600 entries), and there are lots of
>>> >> WeakMessageSends and WeakAnnouncementSubscriptions on the stack.
>>> >>
>>> >> I found this using aDebug VM with assert enabled (i.e. compiled with
>>> >> NDEBUG /not/ defined).  I increased the table size to 3000 then 6000 before
>>> >> finding it no longer crashed with a weakRoots  table size of 12000.
>>> >>
>>> >
>>> > wow, I never imagine about that.
>>> >
>>> >>
>>> >> a) Looks like weakRoots' size should be configurable either via a
>>> >> start-up flag or an image header constant (with e.g. vmParameter accessors).
>>> >
>>> >
>>> > yes, with vmParameter would be nice, like the external semaphore table.
>>> >
>>> >>
>>> >>
>>> >> b) overflowing the weakRoots table (and possibly other tables) should
>>> >> probably cause the VM to abort with a useful error message.
>>> >>
>>> >
>>> > please!  :)
>>> >
>>> > I have check in the image, before reproducing the bug, and it is not
>>> > that bad:
>>> >
>>> > WeakMessageSend instanceCount 755.
>>> > WeakAnnouncementSubscription instanceCount 538
>>> >
>>> > So...maybe when I do the stuff that reproduces the crash there is
>>> > ANOTHER bug (say a loop for example), that cause to have much more instances
>>> > of those weak stuff?
>>> >
>>> >
>>> hmm.. i hardly believe that UI needs such amount of weak messages to
>>> wire the stuff.. but it is hard to tell, since i'm not an author.
>>
>>
>>
>> Take a look at the attached.  It is taken form the image at a point where an
>> incrementalGC is performed when the weakRootTable has 6000 or more elements.
>>  It shows a very deep call stack full of WeakAnnouncementSubscriptions.
>>
>>
>>>
>>> Also, answering Stephane's question: AFAIK, a weak roots table size is
>>>
>>> not liearly depending on the total number of all weak containers in
>>> your image.
>>
>>
>> The number of weak containers does define an upper bound on the size of the
>> table.  It doesn't necessarily correlate to how many containers are
>> encountered in an incremental GC.
>>
>>
>>> But i might be wrong.
>>> Eliot, can you please explain how this weak roots table populated and
>>> what triggers addition of new element(s) to it, and freeing the entry.
>>
>>
>>
>> So when an incremental GC is performed, any weak collections encountered
>> must be scanned later, after the mark phase of non-objects have completed,
>> so that the GC can discover which elements of weak collections are unmarked
>> and nil these collections.  So in markAndTrace any encountered weak objects
>> get added as "roots" to the weakRootsTable.  Later (either in incrementalGC
>> or fullGC) the weak table is processed and unmarked referents in the weak
>> arrays in the weak table are nilled.  Hence the weak table fills during the
>> mark phase and is emptied in the nilling phase.
>>
>> But in reading the code more carefully I notice that the weak roots table is
>> not used during a full GC.  Instead, during a fullGC nilling is done as each
>> weak container is encountered.  I don't understand how this works yet.
>>  Anyone care to explain?
>>
> yes, i noticed that too.
>
> Weak roots are used only during incremental GC but not full GC, which
> makes sense:
>
> incrementalGC
> ...
> weakRootCount := 0.
>        statSweepCount := statMarkCount := statMkFwdCount := statCompMoveCount := 0.
>        self markPhase: false.
>        self assert: weakRootCount <= WeakRootTableSize.
>        1 to: weakRootCount do:
>                [:i| self finalizeReference: (weakRoots at: i)].
> ...
>
> fullGC
> ...
>        statSweepCount := statMarkCount := statMkFwdCount := statCompMoveCount := 0.
>        self clearRootsTable.
>        youngStart := self startOfMemory.  "process all of memory"
>        self markPhase: true.
>        "Sweep phase returns the number of survivors.
>        Use the up-to-date version instead the one from startup."
>        totalObjectCount := self sweepPhaseForFullGC.
> ...
>
> The bad thing is that , #markAndTrace: is used in both #incrementalGC
> and #fullGC
> and calls:
> self lastPointerOf: oop recordWeakRoot: true.
>
> and that last 'true' is evil, because it will populate weak roots into
> table even during full GC
> which means that a total number of weak containers in a whole image
> should not be more than (err.. how much you said?)
> otherwise *!crash!*.
> Also, note that #fullGC method even doesn't cares to reset
> weakRootCount to zero, which means that
> it can be any value, which was the last number of weak containers
> found during last incremental GC ,
> so its enough to have (1+WeakRootTableSize/2) weak containers
> discovered by incremental GC,
>  for full GC to start corrupting a memory not saying about higher numbers.
>
> To fix that , a #fullGC should either use different #markAndTrace:
> method, which won't records weak roots,
> or actually anything, which will lead to following change:
>
> self lastPointerOf: oop recordWeakRoot: true.
> to be:
> self lastPointerOf: oop recordWeakRoot: (fullGC not).
>
> i think it is easy to imagine an image with lots of weak containers,
> while only few of them (a reasonable number ;)
> can be discovered during single incremental GC.
>
> Another thing, which i would do is to trigger a full GC, if during
> incremental GC a weak roots table overflows.
> But its hard to tell, how easy to implement abortion of marking phase
> due to table overflow..
>
>
>>>
>>> And is the weak roots table size limit reasonably good? Needless to
>>> say, that nobody likes when system hits the wall of hardcoded limits.
>>
>>
>>
>> Hmmm... In VisualWorks, which has a two-space copying generational GC there
>> is no weak root table during incremental GC.  Instead the list of weak
>> objects is threaded through the corpses left behind in from space.  So at
>> least for some GC designs a weak roots table isn't even needed.  I will copy
>> this scheme in my new object representation/GC.  But for now I think just
>> providing a parameter to determine the maximum size is sufficient.
>>
>
> i actually wonder do we really need this structure at all.
> an incremental GC works similarly to full one, except that it operates
> on smaller heap size.
> so, this piece of code:
>        1 to: weakRootCount do:
>                [:i| self finalizeReference: (weakRoots at: i)].
>
> can be replaced by heap-walking procedure from:
>
>  youngStart to: memoryEnd
>
> finding all weak containers and checking them.
> (we can even leave the counter, so if count = 0, you don't do heap walk,
> and if it >0 , you decrement it by 1 when discovering weak container object,
> so heap walk will terminate once counter will reach 0).
>

A dirty fix to this would be directly in :

lastPointerOf: oop recordWeakRoot: recordWeakRoot "<Boolean>"
....
[recordWeakRoot ifTrue:
					["And remember as weak root"
					 weakRootCount := weakRootCount + 1.

					( weakRootCount <= WeakRootTableSize) ifTrue: [
					 weakRoots at: weakRootCount put: oop
          ] ifFalse: [
               self isFullGC
                     ifFalse: [ self handleWeakRootsTableOverflow ]
"and  just ignore if full gc"
          ]
].

now if we put:

weakRootCount := WeakRootTableSize + 1.

in #fullGC method ,  this will result in ignoring weak roots during
full gc mark phase.

ohh ... well, since i'm using #isFullGC
it actually makes sense to just put:

(recordWeakRoot and: [ isFullGC not ])

to get bug busted.
Just not sure, if extra memory read (checking full gc flag ) will
impact the performance that much..

actually for performance, its better to actually 'call' #markAndTrace:
with right argument,
then it will be inlined and compiler will produce no-op there.

-- 
Best regards,
Igor Stasenko.