[Vm-dev] Reproducible Cog crash from image startup

Igor Stasenko siguctua at gmail.com
Tue Feb 28 12:13:16 UTC 2012


On 27 February 2012 22:15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> let me retry *with* the attachment :(
>
> On Mon, Feb 27, 2012 at 12:03 PM, Igor Stasenko <siguctua at gmail.com> wrote:
>>
>> On 27 February 2012 10:53, Mariano Martinez Peck <marianopeck at gmail.com>
>> wrote:
>> >
>> >
>> >
>> > On Mon, Feb 27, 2012 at 5:20 AM, Eliot Miranda <eliot.miranda at gmail.com>
>> > wrote:
>> >>
>> >> Hi Mariano,
>> >>
>> >> On Sun, Feb 26, 2012 at 8:58 AM, Mariano Martinez Peck
>> >> <marianopeck at gmail.com> wrote:
>> >>>
>> >>>
>> >>> Hi. I have faced a VM crash while using Nautilus browser. It took me a
>> >>> while, but I finally could make a reproducible crash from image startup. You
>> >>> can find the image here:
>> >>>
>> >>> https://gforge.inria.fr/frs/download.php/30280/Marea.104-Crash.1.image.zip
>> >>>
>> >>> What the image is running at startup that causes the crash is:
>> >>>
>> >>> | nautilus model ui|
>> >>> Nautilus instVarNamed: 'groups' put: nil.
>> >>> model := Nautilus open.
>> >>> ui := model ui.
>> >>> ui groupsButtonAction.
>> >>>
>> >>> If you need more about the "domain", we can ask Ben, Nautilus
>> >>> developer.  From what I can see in GDB, it crashes in #mapStackPages
>> >>> because it does a remap to an OOP that is 0 (zero)
>> >>>
>> >>> while (theSP <= frameRcvrOffset) {
>> >>>                     oop = longAt(theSP);
>> >>>                     if (!((oop & 1))) {
>> >>>                         longAtput(theSP, remap(oop));
>> >>>                     }
>> >>>                     theSP += BytesPerWord;
>> >>>                 }
>> >>>
>> >>>
>> >>> Any ideas?
>> >>
>> >>
>> >> The image overflows the weakRoots table in scanning stack pages.  The
>> >> weakRoots table registers weak objects for scanning at the end of a GC.  It
>> >> is, unfortunately, fixed size (~2600 entries), and there are lots of
>> >> WeakMessageSends and WeakAnnouncementSubscriptions on the stack.
>> >>
>> >> I found this using aDebug VM with assert enabled (i.e. compiled with
>> >> NDEBUG /not/ defined).  I increased the table size to 3000 then 6000 before
>> >> finding it no longer crashed with a weakRoots  table size of 12000.
>> >>
>> >
>> > wow, I never imagine about that.
>> >
>> >>
>> >> a) Looks like weakRoots' size should be configurable either via a
>> >> start-up flag or an image header constant (with e.g. vmParameter accessors).
>> >
>> >
>> > yes, with vmParameter would be nice, like the external semaphore table.
>> >
>> >>
>> >>
>> >> b) overflowing the weakRoots table (and possibly other tables) should
>> >> probably cause the VM to abort with a useful error message.
>> >>
>> >
>> > please!  :)
>> >
>> > I have check in the image, before reproducing the bug, and it is not
>> > that bad:
>> >
>> > WeakMessageSend instanceCount 755.
>> > WeakAnnouncementSubscription instanceCount 538
>> >
>> > So...maybe when I do the stuff that reproduces the crash there is
>> > ANOTHER bug (say a loop for example), that cause to have much more instances
>> > of those weak stuff?
>> >
>> >
>> hmm.. i hardly believe that UI needs such amount of weak messages to
>> wire the stuff.. but it is hard to tell, since i'm not an author.
>
>
>
> Take a look at the attached.  It is taken form the image at a point where an
> incrementalGC is performed when the weakRootTable has 6000 or more elements.
>  It shows a very deep call stack full of WeakAnnouncementSubscriptions.
>
>
>>
>> Also, answering Stephane's question: AFAIK, a weak roots table size is
>>
>> not liearly depending on the total number of all weak containers in
>> your image.
>
>
> The number of weak containers does define an upper bound on the size of the
> table.  It doesn't necessarily correlate to how many containers are
> encountered in an incremental GC.
>
>
>> But i might be wrong.
>> Eliot, can you please explain how this weak roots table populated and
>> what triggers addition of new element(s) to it, and freeing the entry.
>
>
>
> So when an incremental GC is performed, any weak collections encountered
> must be scanned later, after the mark phase of non-objects have completed,
> so that the GC can discover which elements of weak collections are unmarked
> and nil these collections.  So in markAndTrace any encountered weak objects
> get added as "roots" to the weakRootsTable.  Later (either in incrementalGC
> or fullGC) the weak table is processed and unmarked referents in the weak
> arrays in the weak table are nilled.  Hence the weak table fills during the
> mark phase and is emptied in the nilling phase.
>
> But in reading the code more carefully I notice that the weak roots table is
> not used during a full GC.  Instead, during a fullGC nilling is done as each
> weak container is encountered.  I don't understand how this works yet.
>  Anyone care to explain?
>
yes, i noticed that too.

Weak roots are used only during incremental GC but not full GC, which
makes sense:

incrementalGC
...
weakRootCount := 0.
	statSweepCount := statMarkCount := statMkFwdCount := statCompMoveCount := 0.
	self markPhase: false.
	self assert: weakRootCount <= WeakRootTableSize.
	1 to: weakRootCount do:
		[:i| self finalizeReference: (weakRoots at: i)].
...

fullGC
...
	statSweepCount := statMarkCount := statMkFwdCount := statCompMoveCount := 0.
	self clearRootsTable.
	youngStart := self startOfMemory.  "process all of memory"
	self markPhase: true.
	"Sweep phase returns the number of survivors.
	Use the up-to-date version instead the one from startup."
	totalObjectCount := self sweepPhaseForFullGC.
...

The bad thing is that , #markAndTrace: is used in both #incrementalGC
and #fullGC
and calls:
self lastPointerOf: oop recordWeakRoot: true.

and that last 'true' is evil, because it will populate weak roots into
table even during full GC
which means that a total number of weak containers in a whole image
should not be more than (err.. how much you said?)
otherwise *!crash!*.
Also, note that #fullGC method even doesn't cares to reset
weakRootCount to zero, which means that
it can be any value, which was the last number of weak containers
found during last incremental GC ,
so its enough to have (1+WeakRootTableSize/2) weak containers
discovered by incremental GC,
 for full GC to start corrupting a memory not saying about higher numbers.

To fix that , a #fullGC should either use different #markAndTrace:
method, which won't records weak roots,
or actually anything, which will lead to following change:

self lastPointerOf: oop recordWeakRoot: true.
to be:
self lastPointerOf: oop recordWeakRoot: (fullGC not).

i think it is easy to imagine an image with lots of weak containers,
while only few of them (a reasonable number ;)
can be discovered during single incremental GC.

Another thing, which i would do is to trigger a full GC, if during
incremental GC a weak roots table overflows.
But its hard to tell, how easy to implement abortion of marking phase
due to table overflow..


>>
>> And is the weak roots table size limit reasonably good? Needless to
>> say, that nobody likes when system hits the wall of hardcoded limits.
>
>
>
> Hmmm... In VisualWorks, which has a two-space copying generational GC there
> is no weak root table during incremental GC.  Instead the list of weak
> objects is threaded through the corpses left behind in from space.  So at
> least for some GC designs a weak roots table isn't even needed.  I will copy
> this scheme in my new object representation/GC.  But for now I think just
> providing a parameter to determine the maximum size is sufficient.
>

i actually wonder do we really need this structure at all.
an incremental GC works similarly to full one, except that it operates
on smaller heap size.
so, this piece of code:
	1 to: weakRootCount do:
		[:i| self finalizeReference: (weakRoots at: i)].

can be replaced by heap-walking procedure from:

  youngStart to: memoryEnd

finding all weak containers and checking them.
(we can even leave the counter, so if count = 0, you don't do heap walk,
and if it >0 , you decrement it by 1 when discovering weak container object,
so heap walk will terminate once counter will reach 0).

> --
> best,
> Eliot

-- 
Best regards,
Igor Stasenko.


More information about the Vm-dev mailing list