[Vm-dev] [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)
bera.clement at gmail.com
Fri Nov 29 23:50:55 UTC 2019
On Fri, Nov 29, 2019 at 7:28 PM Eliot Miranda <eliot.miranda at gmail.com> wrote:
> Hi Alistair,
> On Fri, Nov 29, 2019 at 1:21 AM Alistair Grant <akgrant0710 at gmail.com> wrote:
>> Hi Clément,
>> On Thu, 28 Nov 2019 at 22:36, Clément Béra <bera.clement at gmail.com> wrote:
>> > Hi Alistair,
>> > I've just investigated the bug tonight and fixed it in
>> VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and I was able to run
>> the 400 iterations of your script without any crashes.
>> > Thanks for the easy reproduction! Last year, when I used the GC
>> > benchmarks provided by Feenk, with ~10Gb workloads, for the DLS paper,
>> > I initially had an image crashing 9 times out of 10 when going to 10Gb.
>> > I fixed a few bugs on the production GC back then (mainly on segment
>> > management), which led the benchmarks to run successfully 99% of the
>> > time. But it was still crashing the other 1%. Since I was benchmarking
>> > experimental GCs with various changes, I thought the bug did not happen
>> > in the production GC, but it turns out I was wrong. And you found a
>> > reliable way to reproduce :-). So I could investigate. It's so fun to
>> > do lemming debugging in the simulator.
>> We need to thank Juraj here, he was the one who produced the initial
>> version of the script which made all of this possible.
>> > The GC bug was basically that when the PlanningCompactor (the
>> > production full-GC compactor) decided to do a multiple-pass compaction,
>> > if it managed to compact everything in one go, it would get confused
>> > and attempt to compact objects upward instead of downward
>> > (address-wise) on the second pass, which is broken and corrupts memory.
>> > I started from this script:
>> > | aJson anArray |
>> > aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json'.
>> > anArray := Array streamContents: [ :aStream |
>> >     400 timesRepeat: [
>> >         aStream nextPutAll: (STON fromString: aJson contents).
>> >         Smalltalk saveSession ] ].
>> > It makes me however very sad that you were not able to use the
>> > simulator to debug this issue. I used it, and that's how I tracked down
>> > the bug in only a few hours. Tracking things down in lldb would have
>> > taken me weeks, and I would not have been able to do it since I work
>> > during the week.
>> > Therefore I'm going to explain my process to reproduce the bug in the
>> > simulator and to understand where the issue comes from. The mail is
>> > quite long, but it would be nice if you could track the bug down
>> > quickly on your own next time using the simulator. Of course you can
>> > skip it if you're not interested. @Eliot you may want to read it, since
>> > I explain how I set up a Pharo 7 image for simulator debugging; that
>> > might come in handy for you at some point.
>> > 1] The first thing I did was to reproduce your bug, based on the
>> > script, on both the Cog and Stack VMs compiled from the
>> > OpenSmalltalk-VM repository. I initially started with Pharo 8, but for
>> > some reason that image is quite broken (formatter issue? Integrator
>> > gone wild?).
>> That was unlucky timing; a bad commit was made. I think it's largely
>> tidied up now. Still, using the current stable version isn't necessarily
>> bad :-)
>> Just for future reference: the first thing I tried was reproducing it
>> on the Pharo 8 minimal image (I did this before the formatter bug
>> appeared and kept the same image). The minimal image has a few
>> advantages:
>> - It's smaller, 14M vs. 54M, so less memory to keep track of (and the
>> simulator will be a bit faster)
>> - It doesn't have FreeType loaded, so that quickly ruled it out as an
>> issue
>> - I wasn't sure if there would be other FFI calls, so this just
>> reduced the chances.
> What we should have done is run the test case using an assert VM with the
> leak checker turned on, running in gdb/lldb. This would have proved the
> bug was in GC on snapshot because
> - the leak check before GC on snapshot would have succeeded
> - the leak check immediately after GC for snapshot, but before snapshot,
> would have failed
> It may be that the leak check would not have failed, because in
> investigating this bug I added a bounds check before probing the leak map,
> so the leak map is only probed for pointers that are within the full extent
> of the heap (which, because the heap is segmented, may be a much larger
> range than the size of the heap). But know that the leak checker is a
> useful tool for pinpointing heap corruption and GC bugs. The leak checker
> is enabled by bitwise flags that apply to various GC activities (scavenge,
> full GC, become; it can be extended to run on FFI calls), and when enabled
> it runs before and after each phase.
> When running an assert VM under gdb/lldb one puts a breakpoint in warning,
> the routine that outputs assert failure messages, and then runs an image.
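> For example, in a shell (the VM binary and image names here are just
> placeholders):
> 
> $ gdb ./squeak
> (gdb) break warning
> (gdb) run my.image
> 
> Each assert failure then stops the VM in warning, with the offending
> context a few frames up the C stack.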
> When running in the simulator asserts are always run, and the leak checker
> can be enabled by sending a message to the interpreter's objectMemory.
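> For instance, something like this (a sketch only: GCModeFull and
> GCModeNewSpace are VMMaker's GC-mode bit flags, but the exact setter on
> the objectMemory is an assumption to verify in your VMMaker image):
> 
> "enable leak checking around full GCs and scavenges in the simulator"
> sis objectMemory checkForLeaks: GCModeFull + GCModeNewSpace.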
>> > So I switched to Pharo 7 stable. It crashes on both VMs, so I knew the
>> > bug was unrelated to the JIT. Most bugs in the core VM (besides people
>> > mis-using FFI, which is by far the most common VM bug reported) are
>> > either JIT or GC bugs. So we're tracking a GC bug.
>> > I then built an image which runs your script at start-up (Smalltalk
>> > snapshot: true andQuit: true, followed by your script; I select all
>> > and run it).
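>> > The whole doit looks like this (a sketch combining the save with the
>> > script from above):
>> > | aJson anArray |
>> > Smalltalk snapshot: true andQuit: true.
>> > "When the saved image is next started, execution resumes here and the
>> > script below runs."
>> > aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json'.
>> > anArray := Array streamContents: [ :aStream |
>> >     400 timesRepeat: [
>> >         aStream nextPutAll: (STON fromString: aJson contents).
>> >         Smalltalk saveSession ] ].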
>> > 2] Then I started the image in the simulator. The first thing I
>> > noticed is that Pharo 7 makes FFI calls in FreeType from start-up: even
>> > if you're not using text, or if you disable FreeType from the settings
>> > browser, Pharo performs FFI calls for FreeType in the background. The
>> > FreeType FFI calls are incorrectly implemented (the C stack references
>> > heap objects which are not pinned), therefore these calls corrupt the
>> > heap. Running the VM on a corrupted heap is undefined behavior,
>> > therefore any usage of Pharo 7 right now, whether you actually use text
>> > or not, whether FreeType is enabled or not in the settings, is
>> > undefined behavior. I saw in the thread Nicolas/Eliot complaining that
>> > this is not a VM bug; indeed, pinning objects is an image-side
>> > responsibility, so it is not a VM bug. In addition, most reported bugs
>> > come from people mis-using FFI, so I understand their answer. There was
>> > however another bug in the GC, but it's very hard for us to debug it
>> > when it's hidden behind image-corrupting bugs like the FreeType one
>> > here.
>> > So for that I made this change:
>> > FreeTypeSettings>>startUp: resuming
>> >     "resuming ifTrue: [ self updateFreeType ]"
>> > saved, restarted the image, and ensured it was not corrupted (leak
>> > checker + swizzling in simulation).
>> > 3] Then I started the image in the simulator again. It turns out the
>> > image start-up raises an error if libgit cannot be loaded, and then the
>> > start-up script is not executed due to the exception. So I made this
>> > change:
>> > LibGitLibrary>>startUp: isImageStarting
>> >     "isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"
>> Also for future reference, I'm surprised you didn't hit an FFI call
>> trying to get the current working directory. Making the following
>> change in OSPlatform removes the FFI call:
>> currentWorkingDirectoryPathWithBuffer: aByteString
>>     <primitive: 'primitiveGetCurrentWorkingDirectory' module: 'UnixOSProcessPlugin' error: ec>
>>     ^ self primitiveFailed
>> (If on Windows you need to use WinOSProcessorPlugin instead.)
>> > 4] Turns out ZnEasy does not work well in the simulator.
> Can you say more on this?
There's an FFI call in Zodiac; it fails in the simulator, so avoid it if possible.
>> > So I preloaded the result of this line into a global variable:
>> > aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
>> > The rest of the script remains the same. I could finally run your
>> > script in the simulator! Usually we simulate a Squeak image and all
>> > these preliminary steps are not required. But! It is still easier to
>> > reproduce this bug than most bugs I have to deal with for Android at
>> > work; at least I don't need to buy an uncommon device from an obscure
>> > Chinese vendor to reproduce :-).
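>> > Concretely, the preloading can be done with a do-it like this (a
>> > sketch; JsonData is a hypothetical global name):
>> > Smalltalk at: #JsonData put:
>> >     (ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl).
>> > The script then uses (STON fromString: JsonData contents) instead of
>> > fetching over HTTP inside the simulator.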
>> I put the data into a file and loaded it :-)
>> > 5] To shortcut simulation time, since the bug happened around the 60th
>> > save for me, I built a different script which snapshots the image to
>> > different image names, as sketched below.
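>> > Something along these lines (a sketch; JsonData is the preloaded
>> > global from above, and Smalltalk saveAs: saves under a new image name):
>> > | anArray |
>> > anArray := OrderedCollection new.
>> > 1 to: 400 do: [ :i |
>> >     anArray add: (STON fromString: JsonData contents).
>> >     Smalltalk saveAs: 'Save', i printString ].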
>> We also updated the script to save to different files.
>> But did you actually get it to save the image in the simulator? I'm
>> just reproducing your work now but couldn't save an image due to a bug
>> in the FileAttributesPluginSimulator. I've got a fix and will commit
>> a bit later.
>> > With a crash at snapshot 59 (only the change file was written to
>> > disk), image 57 was the latest non-corrupted image. I then started the
>> > simulator (the StackInterpreterSimulator, since we are debugging a GC
>> > bug, not the Cog simulator; simulation is faster and simpler). I used
>> > the standard script available in the workspace of the Cog dev image
>> > built from the guidelines (first link below).
>> > | sis |
>> > sis := StackInterpreterSimulator newWithOptions:
>> >     #(ObjectMemory Spur64BitMemoryManager).
>> > "sis desiredNumStackPages: 8." "Speeds up scavenging when simulating.
>> > Set to e.g. 64 for something like the real VM."
>> > "sis assertValidExecutionPointersAtEachStep: false." "Set this to true
>> > to turn on an expensive assert that checks for valid stack and frame
>> > pointers etc. on each bytecode. Useful when you're adding new bytecodes
>> > or exotic execution primitives."
>> > sis openOn: 'Save57.image'.
>> > sis openAsMorph; run
>> > I then let the simulator simulate, went swimming for 1h, and came back
>> > 1h30 later (with commute time). The bug happened in the simulator at
>> > save 90; I don't know how long it took to reproduce, but less than
>> > 1h30. Then I got an assertion failure in the compactor:
>> > self assert: (self validRelocationPlanInPass: finalPass) = 0.
>> > Good! From there I debugged using lemming debugging (the technique
>> > described in , Section 3.2). When the assertion failed, the running
>> > simulation was the clone. I went up in the debugger to the point where
>> > the clone was made, and restarted the same GC approximately 40 times
>> > during debugging, because once the heap is corrupted you cannot know
>> > any more what the problem is, yet you need to trigger the problem to
>> > understand it. 40 lemmings over that cliff :-) Good lemmings.
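>> > In the simulator the pattern is roughly this (a sketch; cloneSimulation
>> > stands for whatever clone utility your VMMaker version provides, so
>> > treat the selector as an assumption to check):
>> > "Clone the whole simulator state just before the suspicious GC and send
>> > the clone over the cliff; if it hits the assertion, the corruption
>> > happens in this GC, and the original simulation is still intact to
>> > step through."
>> > lemming := sis cloneSimulation.
>> > lemming run.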
>> > Then I quickly figured out that the GC was performing two successive
>> > compactions, and that the second compaction was broken right at the
>> > start (it tried to move objects upward). Then I looked at the glue code
>> > in between the two compactions, and yeah: in the case where the first
>> > compaction has compacted everything, the variables are incorrectly set
>> > up for the second compaction. I tried fixing the variables, but it's
>> > not that easy, so instead I just aborted compaction in that case (see
>> > VMMaker.oscog-cb.2595).
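>> > The shape of the fix is roughly this (a sketch of the idea only, with
>> > a hypothetical variable name; the real change is in
>> > SpurPlanningCompactor in VMMaker.oscog-cb.2595):
>> > "If the first pass already compacted everything, there is nothing left
>> > for a second pass to move; abort instead of running a second pass
>> > whose stale relocation variables would make it move objects upward."
>> > firstPassCompactedEverything ifTrue: [ ^self ].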
>> > 6] I then compiled a VM from the sources to check that the Slang
>> > translator would not complain; it did not. I then built a Stack VM (the
>> > Cog VM seems to be broken on tip of tree due to on-going work for ARMv8
>> > support) and ran your script again. I was able to run the 400
>> > iterations without a crash. The bug seems to be fixed!
>> > @Eliot now needs to fix tip of tree, generate the code and produce new
>> > VMs. ARMv8 support is quite exciting though, given that MacBooks do not
>> > support 32 bits any more and that the next MacBooks are rumoured to be
>> > on ARMv8. One wouldn't want to run the VM in a VirtualBox Intel image
>> > :-).
>> > Alistair, let me know if you have questions. I hope you can work with
>> > the simulator as efficiently as we can. If you've not seen it, there's
>> > a screencast where I showed how I use the simulator to debug JIT bugs
>> > (second link below). The audio is not very good because my spoken
>> > English sucks, but it shows the main ideas.
>> > 
>> >  http://www.mirandabanda.org/cogblog/build-image/
>> > 
>> >  https://clementbera.wordpress.com/2018/03/07/sista-vm-screencast/
>> You wrote in :
>> "the slightest change in the heap
>> might change the bug; any variability in timing or user input
>> can result in a different heap and hence in the bug morphing
>> or going into hiding."
>> This was evident in this issue. While the script (fortunately) would
>> always produce a crash, small changes, such as how the initial JSON is
>> loaded or the name of the image it is saved to, caused fairly large
>> changes in the number of loops needed to trigger the crash.
>> Also, while trying to reproduce your debug steps above, the image I
>> have already has memory leaks, so it isn't hitting the "self assert:
>> (self validRelocationPlanInPass: finalPass) = 0" assertion.
>> Thanks for the links, I'll keep reading.
>> Thanks again!
>> > --
>> > Clément Béra
>> > https://clementbera.github.io/
>> > https://clementbera.wordpress.com/
> best, Eliot