[Vm-dev] [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

Eliot Miranda eliot.miranda at gmail.com
Fri Nov 29 18:28:17 UTC 2019


Hi Alistair,

On Fri, Nov 29, 2019 at 1:21 AM Alistair Grant <akgrant0710 at gmail.com>
wrote:

> Hi Clément,
>
>
> On Thu, 28 Nov 2019 at 22:36, Clément Béra <bera.clement at gmail.com> wrote:
> >
> > Hi Alistair,
> >
> > I've just investigated the bug tonight and fixed it in
> VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and I was able to run
> the 400 iterations of your script without any crashes.
> > Thanks for the easy reproduction! Last year when I used the GC
> benchmarks provided by Feenk, with ~10Gb workloads, for the DLS paper [1],
> I initially had an image crashing 9 times out of 10
> > when going to 10Gb. I fixed a few bugs on the production GC back then
> (mainly on segment management) which led the benchmarks to run successfully
> 99% of the time. But it was still crashing
> > on 1%, since I was benchmarking on experimental GCs with various changes
> I thought the bug did not happen in the production GC, but it turns out I
> was wrong. And you found a reliable way to
> > reproduce :-). So I could investigate. It's so fun to do lemming
> debugging in the simulator.
>
> We need to thank Juraj here, he was the one who produced the initial
> version of the script which made all of this possible.
>
>
> > The GC bug was basically that when Planning Compactor (Production Full
> GC compactor) decided to do a multiple pass compaction, if it managed to
> compact everything in one go then it would
> > get confused and attempt to compact objects upward instead of downward
> (address wise) on the second attempt, and that's broken and corrupts memory.
> >
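
To illustrate why the direction matters, here is a toy model in Python. It is not the Planning Compactor's code; the slot list, the (src, dst, size) move tuples and the name slide are all made up for the illustration. The point it sketches: a slot-by-slot copy that walks the heap in ascending address order is only safe when every destination lies below its source.

```python
def slide(heap, moves):
    """Copy each (src, dst, size) run slot by slot, visiting the runs in
    ascending address order, as a downward-sliding compactor would.
    Hypothetical toy model: one list element stands for one heap slot."""
    heap = list(heap)  # work on a copy so the caller's heap is untouched
    for src, dst, size in moves:
        for i in range(size):
            heap[dst + i] = heap[src + i]
    return heap


# Downward: free space ('.') sits below objects A and B.
# Each destination is below its source, so nothing live is overwritten.
downward = slide(list("..AABBB"), [(2, 0, 2), (4, 2, 3)])

# Upward: free space sits above A and B. Moving A first lands on top of B,
# so by the time B is "moved" its slots already hold copies of A.
upward = slide(list("AABBB.."), [(0, 2, 2), (2, 4, 3)])

print("".join(downward[:5]))  # live prefix after the downward slide
print("".join(upward[2:]))    # what sits where A and B should now be
```

In the downward case the live prefix comes out as AABBB. In the upward case the same loop leaves AAAAA where AABBB was expected, i.e. B is gone, which is the kind of silent corruption described above.
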
> > I started from this script:
> >
> > | aJson anArray |
> > aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
> > Array streamContents: [ :aStream |
> > 400 timesRepeat: [
> > aStream nextPutAll: (STON fromString: aJson contents).
> > Smalltalk saveSession ] ].
> >
> >
> > It makes me however very sad that you were not able to use the simulator
> to debug this issue, I used it and that's how I tracked down the bug in
> only a few hours. Tracking things down in lldb would have taken me weeks,
> and I would not have been able to do it since I work during the week :-).
> >
> > Therefore I'm going to explain to you my process to reproduce the bug in
> the simulator and to understand where the issue comes from. The mail is
> quite long, but it would be nice if you could track the bug quickly on your
> own next time using the simulator. Of course you can skip if you're not
> interested. @Eliot you may read since I explain how I set up a Pharo 7
> image for simulator debugging, that might come in handy for you at some point.
> >
> > 1] The first thing I did was to reproduce your bug, based on the script,
> both on Cog and Stack VMs compiled from the OpenSmalltalk-VM repository. I
> initially started with Pharo 8, but for some reason that image is quite
> broken (formatter issue? Integrator gone wild?).
>
> That was unlucky timing, there was a bad commit made.  I think it's
> largely tidied up now, still, using the current stable version isn't
> necessarily bad :-)
>
> Just for future reference: the first thing I tried was reproducing it
> on the Pharo 8 minimal image (I did this before the formatter bug
> appeared and kept the same image).  The minimal image has a few
> advantages:
>
> - It's smaller, 14M vs. 54M, so less memory to keep track of (and the
> simulator will be a bit faster)
> - It doesn't have FreeType loaded, so that quickly ruled it out as an
> issue.
> - I wasn't sure if there would be other FFI calls, so this just
> reduced the chances.
>

What we should have done is run the test case using an assert VM with the
leak checker turned on, running in gdb/lldb.  This would have proved the
bug was in GC on snapshot because
- the leak check before GC on snapshot would have succeeded
- the leak check immediately after GC for snapshot, but before snapshot,
would have failed

It may be that the leak check would not have failed, because in
investigating this bug I added a bounds check before probing the leak map,
so the leak map is only probed for pointers that are within the full extent
of the heap (which, because the heap is segmented, may be a much larger
range than the size of the heap).  But know that the leak checker is a
useful tool for pinpointing heap corruption and GC bugs.  It is enabled
by bitwise flags that select which GC activities it applies to (scavenge,
full GC, become; it can be extended to run on FFI calls), and when enabled
it runs before and after each selected phase.
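
As a rough sketch of the idea, and not the VM's actual code, here is a toy model in Python. The flag names and values, the dictionary heap and the functions heap_is_consistent and run_phase are all assumptions made up for the illustration. It shows how a check that passes before a phase and fails after it pins the corruption on that phase.

```python
from enum import IntFlag


class LeakCheck(IntFlag):
    # Hypothetical flag names and bit values, for illustration only.
    FULL_GC = 1
    SCAVENGE = 2
    BECOME = 4


def heap_is_consistent(heap):
    """Toy leak check: every reference must name an object still in the heap."""
    return all(ref in heap for refs in heap.values() for ref in refs)


def run_phase(heap, phase, flag, enabled, log):
    """Run one GC phase; if its flag is enabled, check before and after."""
    if enabled & flag:
        log.append((flag.name, "before", heap_is_consistent(heap)))
    phase(heap)
    if enabled & flag:
        log.append((flag.name, "after", heap_is_consistent(heap)))


def buggy_full_gc(heap):
    # Deliberately broken stand-in for a GC: frees a still-referenced object.
    del heap["b"]


heap = {"a": ["b"], "b": []}  # object "a" references object "b"
log = []
run_phase(heap, buggy_full_gc, LeakCheck.FULL_GC,
          LeakCheck.FULL_GC | LeakCheck.BECOME, log)
print(log)
```

The log then holds a passing "before" entry and a failing "after" entry for the same phase, which is the signature we would have been looking for around the GC on snapshot.
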

When running an assert VM under gdb/lldb one puts a breakpoint in warning,
the routine that outputs assert failure messages, and then runs an image.

When running in the simulator asserts are always run, and the leak checker
can be enabled by sending a message to the interpreter's objectMemory.


> So I switched to Pharo 7 stable. It crashes on both VMs, so I knew the
> bug was unrelated to the JIT. Most bugs on the core VM (besides people
> mis-using FFI, which is by far the most common VM bug reported) are either
> JIT or GC bugs. So we're tracking a GC bug.
> > I then built an image which runs your script at start-up (Smalltalk
> snapshot: true andQuit: true followed by your script, I select all and run
> do-it).
> >
> > 2] Then I started the image in the simulator. First thing I noticed is
> that Pharo 7 is using FFI calls in FreeType, from start-up, and even if
> you're not using text or if you disable FreeType from the setting browser,
> Pharo performs FFI calls for FreeType in the background. FreeType FFI
> calls are incorrectly implemented (the C stack references heap objects which
> are not pinned), therefore these calls corrupt the heap. Running a
> corrupted heap on the VM has undefined behavior, therefore any usage of
> Pharo 7 right now, whether you actually use text or not, whether FreeType is
> enabled or not in the settings, is undefined behavior. I saw in the thread
> Nicolas/Eliot complaining that this is not a VM bug, indeed, pinning
> objects is image-side responsibility and it's not a VM bug. In addition,
> most reported bugs come from people mis-using FFI, so I understand their
> answer. There was however another bug in the GC, but it's very hard for us
> to debug it if it's hidden after image corrupting bugs like the FreeType
> one here.
> > So for that I made that change:
> > FreeTypeSettings>>startUp: resuming
> > "resuming ifTrue:[ self updateFreeType ]"
> > saved, restarted the image, and ensured it was not corrupted (leak
> checker + swizzling in simulation).
> >
> > 3] Then I started the image in the simulator. Turns out the image
> start-up raises an error if libgit cannot be loaded, and then the start-up
> script is not executed due to the exception. So I made that change:
> > LibGitLibrary>>startUp: isImageStarting
> > "isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"
>
> Also for future reference, I'm surprised you didn't hit an FFI call
> trying to get the current working directory.  Making the following
> change in OSPlatform removes the FFI call:
>
> currentWorkingDirectoryPathWithBuffer: aByteString
>     <primitive: 'primitiveGetCurrentWorkingDirectory' module:
> 'UnixOSProcessPlugin' error: ec>
>     ^self primitiveFailed
>
> (if on windows you need to use WinOSProcessorPlugin).
>
>
> > 4] Turns out ZnEasy does not work well in the simulator.


Can you say more on this?


> So I preloaded this line aJson := ZnEasy get: '
> https://data.nasa.gov/resource/y77d-th95.json' asZnUrl. into a Global
> variable. The rest of the script remains the same. I can finally run your
> script in the simulator! Usually we simulate Squeak image and all these
> preliminary steps are not required. But! It is still easier to reproduce
> this bug than most bugs I have to deal with for Android at work, at least I
> don't need to buy an uncommon device from an obscure Chinese vendor to
> reproduce :-).
>
> I put the data in to a file and loaded it :-)
>
> > 5] To shortcut simulation time, since the bug happened around the 60th
> save for me, I built a different script which snapshots the image to
> different image names.
>
> We also updated the script to save to different files.
>
> But did you actually get it to save the image in the simulator?  I'm
> just reproducing your work now but couldn't save an image due to a bug
> in the FileAttributesPluginSimulator.  I've got a fix and will commit
> a bit later.
>
>
> > With a crash at snapshot 59 (only change file written to disk), image 57
> was the latest non corrupted image. I then started the simulator (The
> StackSimulator since we are debugging a GC bug, not the Cog simulator,
> simulation is faster and simpler). I used the standard script available in
> the workspace of the Cog dev image built from the guidelines. [2]
> > | sis |
> > sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager).
> > "sis desiredNumStackPages: 8." "Speeds up scavenging when simulating. Set to e.g. 64 for something like the real VM."
> > "sis assertValidExecutionPointersAtEachStep: false." "Set this to true to turn on an expensive assert that checks for valid stack, frame pointers etc on each bytecode.  Useful when you're adding new bytecodes or exotic execution primitives."
> > sis openOn: 'Save57.image'.
> > sis openAsMorph; run
> > I then let the simulator simulate, went swimming for 1h, and came back
> 1h30 later (with commute time). The bug happened in the simulator at save
> 90, I don't know how long it took to reproduce, but < 1h30. Then I had an
> assertion failure in the compactor:
> >  self assert: (self validRelocationPlanInPass: finalPass) = 0.
> > Good! From there I debugged using lemming debugging (technique described
> in [3], Section 3.2). When the assertion failed, the running simulation was
> the clone. I went up in the debugger to the point where the clone was made, and
> restarted the same GC approximately 40 times during debugging because once
> the heap is corrupted you cannot know anymore what the problem is, but you
> need to trigger the problem to understand. 40 lemmings over that cliff :-)
> Good lemmings.
> >
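
For readers who have not seen the technique, here is a minimal sketch of the clone-and-retry idea in Python. It is a toy, not the simulator's machinery: ToyHeap, risky_gc, send_lemming and the probe callback are hypothetical names, and the "bug" is planted by hand. What it shows is that the pristine state is never run directly; each attempt runs on a fresh copy, so the failure can be replayed as often as needed with different probes.

```python
import copy


class ToyHeap:
    """Hypothetical stand-in for the simulated VM state."""

    def __init__(self, objects):
        self.objects = list(objects)


def risky_gc(heap, probe=None):
    """Stand-in for a GC pass that corrupts the heap partway through."""
    for step in range(len(heap.objects)):
        if probe is not None:
            probe(step, heap)          # inspect state before each step
        if step == 2:
            heap.objects[step] = None  # the planted 'bug'
    assert None not in heap.objects, "heap corrupted"


def send_lemming(pristine, probe=None):
    """Run the risky pass on a deep copy; report whether it survived."""
    clone = copy.deepcopy(pristine)
    try:
        risky_gc(clone, probe)
        return True
    except AssertionError:
        return False


pristine = ToyHeap(["a", "b", "c", "d"])
seen = []
survived = send_lemming(pristine, lambda step, heap: seen.append((step, list(heap.objects))))
print(survived, pristine.objects, seen[-1])
```

Because only the clone is damaged, the next lemming starts from exactly the same state, and the probe can be changed between runs to look at a different part of the pass.
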
> > Then I quickly figured out that the GC was performing two successive
> compactions, and that the second compaction is broken right at the start
> (tries to move objects upward). Then I looked at the glue code in-between
> the 2 compactions, and yeah, in the case where the first compaction has
> compacted everything, the variables are incorrectly set for the second
> compaction. I tried fixing the variables but it's not that easy, so instead
> I just aborted compaction in that case (See VMMaker.oscog-cb.2595).
> >
> > 6] I then compiled a VM from the sources to check Slang translator would
> not complain, it did not. I then built a stack VM (Cog VM seems to be
> broken on tip of tree due to ongoing work for ARMv8 support) and ran your
> script again. I was able to run the 400 iterations without crash. Bug seems
> to be fixed!
> >
> > @Eliot now needs to fix tip of tree, generate the code and produce new
> VMs. ARMv8 support is quite exciting though, given that MacBooks do not
> support 32 bits any more and that the next MacBooks are rumoured to be on
> ARMv8. One wouldn't want to run the VM in a virtual box intel image :-).
> >
> > Alistair, let me know if you have questions. I hope you can work with
> the simulator as efficiently as we can. If you've not seen it, there's this
> screencast where I showed how I used the simulator to debug JIT bugs [4].
> Audio is not very good because my spoken English sucks, but it shows the
> main ideas.
> >
> > [1]
> https://www.researchgate.net/publication/336422106_Lazy_pointer_update_for_low_heap_compaction_pause_times
> > [2] http://www.mirandabanda.org/cogblog/build-image/
> > [3]
> https://www.researchgate.net/publication/328509577_Two_Decades_of_Smalltalk_VM_Development_Live_VM_Development_through_Simulation_Tools
> > [4] https://clementbera.wordpress.com/2018/03/07/sista-vm-screencast/
>
> You wrote in [3]:
>
> "the slightest change in the heap
> might change the bug; any variability in timing or user input
> can result in a different heap and hence in the bug morphing
> or going into hiding."
>
> This was evident in this issue.  While the script (fortunately) would
> always produce a crash, small changes, such as how the initial JSON is
> loaded, or the name of the image that it is saved to, caused fairly
> large changes in the number of loops to trigger the crash.
>
> Also, while trying to reproduce your debug steps above, the image I
> have already has memory leaks, so it isn't hitting the "self assert:
> (self validRelocationPlanInPass: finalPass) = 0" assertion.
>
> Thanks for the links, I'll keep reading.
>
> Thanks again!
> Alistair
>
> > --
> > Clément Béra
> > https://clementbera.github.io/
> > https://clementbera.wordpress.com/
>


-- 
_,,,^..^,,,_
best, Eliot