[squeak-dev] Crashes on snapshot with the new compactor

Ben Coman benjamin.t.coman at gmail.com
Mon Apr 17 07:51:43 UTC 2017


On 29 Mar 2017 10:25 PM, "Eliot Miranda" <eliot.miranda at gmail.com> wrote:
>
> Hi Ben,
>
>
> > On Mar 25, 2017, at 7:41 PM, Ben Coman <btc at openinworld.com> wrote:
> >
> >> On Sun, Mar 26, 2017 at 4:27 AM, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> >> Hi All,
> >>
> >>    a number of people are being affected by crashes on snapshotting the
> >> image, the worst possible time for a crash.  There is a bug in the new
> >> compactor that unfortunately bites when saving.  The compactor is invoked
> >> as part of a full garbage collect after the garbage collector has freed
> >> unreachable objects.  Normally the new compactor makes only a single pass
> >> through the heap, which may not move all the objects that are possible to
> >> move.  (The number of objects that can be moved in a single pass is
> >> limited by available free space.)  But on snapshot the compactor makes as
> >> many passes as are necessary to slide all movable objects down as far as
> >> possible.  Unfortunately there is a bug in this second pass.
> >>
> >> Fixing this bug is now my priority.  I have an example image from Esteban
> >> Lorenzano to test.  I am asking anyone else that can provide an image that
> >> reliably crashes when trying to save it to make the image and changes
> >> available to me for testing if possible.
> >>
> >> In the meantime one may be able to work around the problem by doing a
> >> full garbage collect before snapshot.  This should do a GC with a single
> >> compaction pass which should not fail, and then make it much more likely
> >> that the GC during snapshot will do a single compaction pass, since fewer
> >> objects should be mobile after the single pass compaction in the explicit
> >> GC.
> >
> > Rather than avoid the problem, in which case you'll get fewer samples,
> > can we temporarily have the snapshot create a second file,
> > "my.image.beforeSnapshotGC",
> > so that when it crashes, we'll have a great sample for you?
> >
> > I'm sure we are all keen (and grateful) to get a reliable compactor.
> > The pain is not so much that it crashes, but that the image is corrupted.
> > If it's possible/likely that "my.image.beforeSnapshotGC" might be renamed
> > and successfully opened, I'm sure those of us following bleeding edge
> > are capable and willing to operate like that, to help bring a faster
> > resolution.
>
> This sounds like a good idea but the machinations involved in loading an
> image make it non-trivial. I'd much rather implement lemming debugging in
> the real VM.  In the simulator the VM is cloned on every GC and the GC is
> run in the clone, and repeated in the original if it succeeds.  In the real
> VM it would fork and execute the GC in the child, waiting for the exit
> status.
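
For anyone following along, the fork-and-wait pattern described above can be
sketched roughly as follows. This is plain Python on a POSIX system, not VM
code, and `run_in_child`, `lemming_gc`, `heap` and `gc_pass` are names invented
for the illustration; the real VM would presumably do the equivalent in C
around its own compactor.

```python
import os


def run_in_child(action):
    """Run action() in a forked child; return True if the child exited with 0."""
    pid = os.fork()
    if pid == 0:
        # Child: operates on a copy-on-write clone of the parent's memory,
        # so whatever it does (or breaks) is invisible to the parent.
        try:
            action()
            os._exit(0)
        except BaseException:
            os._exit(1)
    # Parent: wait for the child and inspect its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


def lemming_gc(heap, gc_pass):
    """Try gc_pass(heap) in a child first; repeat it in the parent only on success."""
    if run_in_child(lambda: gc_pass(heap)):
        gc_pass(heap)
        return True
    # The child failed or was killed by a signal: the parent's heap is
    # untouched, so the same pass can be retried as often as needed.
    return False
```

A pass that raises, or that kills the child with a signal, leaves the
parent's `heap` exactly as it was, which is the property that makes the
failure repeatable.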

Slightly different idea: considering the case of save-and-continue with
potentially very large 64-bit images, I was wondering how feasible or
worthwhile it might be to fork a process to do the save, so that the main
process only needs to pause long enough to make a COW clone of the page
table.
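
A minimal sketch of the idea, in Python on POSIX rather than anything from
the VM. `save_in_background`, `state` and `path` are hypothetical names, and
pickle merely stands in for writing the image file; the point is only that
the child sees the heap as it was at the moment of the fork while the parent
carries on.

```python
import os
import pickle


def save_in_background(state, path):
    """Fork; the child writes `state` to `path`. Returns the child's pid."""
    pid = os.fork()
    if pid == 0:
        # Child: sees a copy-on-write snapshot of the parent's memory
        # as it was at the moment of the fork.
        code = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            # Rename only after a complete write, so a crash in the child
            # never leaves a truncated file under the final name.
            os.replace(tmp, path)
            code = 0
        finally:
            os._exit(code)
    # Parent: returns immediately and keeps running.
    return pid
```

The parent should eventually call `os.waitpid(pid, 0)` to reap the child and
check that the save succeeded. Whether this pays off for a real VM depends
on how many pages the parent dirties while the child is still writing.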

Cheers -ben

>
> This approach allows a buggy GC to be repeated as many times as it takes
> to understand it.  And it could be altered to snapshot too, also to a
> different name if desired.
>
> In any case let's hope the issue is moot :-).
>
> >
> > cheers -ben
> >
> >>
> >> To do this in Pharo I would put a full gc here:
> >>
> >> SessionManager>>snapshot: save andQuit: quit
> >> | isImageStarting snapshotResult |
> >> ChangesLog default logSnapshot: save andQuit: quit.
> >>
> >>>> SmalltalkImage current primitiveGarbageCollect.
> >>
> >> self currentSession stop: quit. "Image not usable from here until the
> >> session is restarted!"
> >> ...
> >>
> >> In Squeak I would put a full GC here:
> >>
> >> snapshot: save andQuit: quit withExitCode: exitCode embedded: embeddedFlag
> >> "Mark the changes file and close all files as part of #processShutdownList.
> >> If save is true, save the current state of this Smalltalk in the image file.
> >> If quit is true, then exit to the outer OS shell.
> >> If exitCode is not nil, then use it as exit code.
> >> The latter part of this method runs when resuming a previously saved image.
> >> This resume logic checks for a document file to process when starting up."
> >>
> >> | resuming msg |
> >> Object flushDependents.
> >> Object flushEvents.
> >>
> >> ...
> >> Smalltalk processShutDownList: quit.
> >>>> SmalltalkImage current primitiveGarbageCollect.
> >> Cursor write show.
> >> save ifTrue: [resuming := embeddedFlag
> >> ifTrue: [self snapshotEmbeddedPrimitive]
> >> ifFalse: [self snapshotPrimitive]]  "<-- PC frozen here on image file"
> >> ifFalse: [resuming := false].
> >>
> >> I do apologise for the bug.  I hope it will be fixed within a few days.
> >>
> >> _,,,^..^,,,_
> >> best, Eliot
> >>
> >>
> >>
> >
>