[squeak-dev] Crashes on snapshot with the new compactor

Mon Apr 17 18:26:20 UTC 2017

2017-04-17 9:51 GMT+02:00 Ben Coman <benjamin.t.coman at gmail.com>:

> On 29 Mar 2017 10:25 PM, "Eliot Miranda" <eliot.miranda at gmail.com> wrote:
> >
> > Hi Ben,
> >
> >
> > > On Mar 25, 2017, at 7:41 PM, Ben Coman <btc at openinworld.com> wrote:
> > >
> > >> On Sun, Mar 26, 2017 at 4:27 AM, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> > >> Hi All,
> > >>
> > >>    a number of people are being affected by crashes on snapshotting the
> > >> image, the worst possible time for a crash.  There is a bug in the new
> > >> compactor that unfortunately bites when saving.  The compactor is invoked
> > >> as part of a full garbage collect after the garbage collector has freed
> > >> unreachable objects.  Normally the new compactor makes only a single pass
> > >> through the heap, which may not move all the objects that are possible to
> > >> move.  (The number of objects that can be moved in a single pass is
> > >> limited by available free space.)  But on snapshot the compactor makes as
> > >> many passes as are necessary to slide all movable objects down as far as
> > >> possible.  Unfortunately there is a bug in this second pass.
> > >>
> > >> Fixing this bug is now my priority.  I have an example image from Esteban
> > >> Lorenzano to test.  I am asking anyone else who can provide an image that
> > >> reliably crashes when trying to save it to make the image and changes
> > >> available to me for testing if possible.
> > >>
> > >> In the meantime one may be able to work around the problem by doing a
> > >> full garbage collect before snapshot.  This should do a GC with a single
> > >> compaction pass, which should not fail, and then make it much more likely
> > >> that the GC during snapshot will do a single compaction pass, since fewer
> > >> objects should be mobile after the single-pass compaction in the explicit
> > >> GC.
> > >
> > > Rather than avoid the problem, in which case you'll get fewer samples,
> > > can we temporarily have the snapshot create a second file,
> > > "my.image.beforeSnapshotGC",
> > > so that when it crashes, we'll have a great sample for you?
> > >
> > > I'm sure we are all keen (and grateful) to get a reliable compactor.
> > > The pain is not so much that it crashes, but that the image is corrupted.
> > > If it's possible/likely that "my.image.beforeSnapshotGC" might be renamed
> > > and successfully opened, I'm sure those of us following the bleeding edge
> > > are capable and willing to operate like that, to help bring a faster
> > > resolution.
> >
> > This sounds like a good idea but the machinations involved in loading an
> > image make it non-trivial. I'd much rather implement lemming debugging in
> > the real VM.  In the simulator the VM is cloned on every GC and the GC is
> > run in the clone, and repeated in the original if it succeeds.  In the real
> > VM it would fork and execute the GC in the child, waiting for the exit
> > status.
>
> Slightly different idea: considering the case of save-and-continue with
> potentially very large 64-bit images, I was wondering how feasible/worthwhile
> it might be to fork a process to do the save, so that the main
> process only needs to pause long enough to make a COW clone of the page
> table.
>
> Cheers -ben
>

The fork has another advantage: we can do whatever clean-up we want before
saving (close files, free heap, etc.).
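
To make the fork idea concrete, here is a minimal sketch in C, assuming a
POSIX system with fork() and waitpid().  It is not taken from the VM sources:
run_in_child, run_guarded, write_snapshot, buggy_compactor and
struct heap_image are all hypothetical names made up for illustration.

```c
/* Sketch only: hypothetical names, not the actual VM API. */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef int (*risky_op)(void *arg);     /* returns 0 on success */

/* Run op(arg) in a forked child.  Returns the child's exit status, or -1
   if the child could not be started or was killed by a signal. */
int run_in_child(risky_op op, void *arg)
{
    assert(op != NULL);
    fflush(NULL);               /* do not duplicate buffered output in the child */
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: sees a copy-on-write view of the parent's memory, so any
           clean-up or damage done here does not affect the parent. */
        _exit(op(arg) == 0 ? 0 : 1);
    }
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* "Lemming" pattern: try the operation in a child first and repeat it in
   the parent only if the child succeeded. */
int run_guarded(risky_op op, void *arg)
{
    if (run_in_child(op, arg) != 0)
        return -1;              /* parent state untouched; can retry or debug */
    return op(arg);
}

struct heap_image { const char *path; const void *bytes; size_t size; };

/* Hypothetical stand-in for the snapshot writer: dump the bytes to a file. */
int write_snapshot(void *arg)
{
    const struct heap_image *img = arg;
    FILE *f = fopen(img->path, "wb");
    if (f == NULL)
        return -1;
    size_t n = fwrite(img->bytes, 1, img->size, f);
    return (fclose(f) == 0 && n == img->size) ? 0 : -1;
}

/* Hypothetical stand-in for a GC that crashes. */
int buggy_compactor(void *arg)
{
    (void)arg;
    abort();
}
```

With this shape, run_in_child(write_snapshot, &img) would correspond to
saving from the child, and run_guarded(gc, heap) to running the GC in the
child first and waiting for its exit status.  A real VM would also have to
deal with threads, open file descriptors and external resources across the
fork, which this sketch ignores.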

> >
> > This approach allows a buggy GC to be repeated as many times as it takes
> > to understand it.  And it could be altered to snapshot too, also to a
> > different name if desired.
> >
> > In any case let's hope the issue is moot :-).
> >
> > >
> > > cheers -ben
>
> > >
> > >>
> > >> To do this in Pharo I would put a full gc here:
> > >>
> > >> SessionManager>>snapshot: save andQuit: quit
> > >>     | isImageStarting snapshotResult |
> > >>     ChangesLog default logSnapshot: save andQuit: quit.
> > >>
> > >>>> SmalltalkImage current primitiveGarbageCollect.
> > >>
> > >>     self currentSession stop: quit. "Image not usable from here until the session is restarted!"
> > >>     ...
> > >>
> > >> In Squeak I would put a full GC here:
> > >>
> > >> snapshot: save andQuit: quit withExitCode: exitCode embedded: embeddedFlag
> > >>     "Mark the changes file and close all files as part of #processShutdownList.
> > >>     If save is true, save the current state of this Smalltalk in the image file.
> > >>     If quit is true, then exit to the outer OS shell.
> > >>     If exitCode is not nil, then use it as exit code.
> > >>     The latter part of this method runs when resuming a previously saved image.
> > >>     This resume logic checks for a document file to process when starting up."
> > >>
> > >>     | resuming msg |
> > >>     Object flushDependents.
> > >>     Object flushEvents.
> > >>
> > >>     ...
> > >>     Smalltalk processShutDownList: quit.
> > >>>> SmalltalkImage current primitiveGarbageCollect.
> > >>     Cursor write show.
> > >>     save ifTrue: [resuming := embeddedFlag
> > >>             ifTrue: [self snapshotEmbeddedPrimitive]
> > >>             ifFalse: [self snapshotPrimitive]]  "<-- PC frozen here on image file"
> > >>         ifFalse: [resuming := false].
> > >>
> > >> I do apologise for the bug.  I hope it will be fixed within a few days.
> > >>
> > >> _,,,^..^,,,_
> > >> best, Eliot
> > >>