[Vm-dev] [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

Clément Béra bera.clement at gmail.com
Fri Nov 29 13:58:35 UTC 2019


Indeed, I also had an issue with workingDirectoryPath. This is the complete
list of what I changed in the Pharo 7 image. It's not clear to me whether
everything is required (e.g. the Zodiac change, since I ended up using the
global instead):

!FreeTypeSystemSettings class methodsFor: 'settings' stamp: 'cb 11/15/2019
21:22' prior: 25673466!
^ Smalltalk ui theme
newLabelIn: World
for: self
label: 'Available version: None'
getEnabled: nil.! !
!FreeTypeSettings class methodsFor: 'system startup' stamp: 'cb 11/15/2019
21:24' prior: 25665613!
startUp: resuming
"resuming ifTrue:[ self updateFreeType ]"! !
!LGitLibrary class methodsFor: 'system startup' stamp: 'cb 11/15/2019
21:24' prior: 31291584!
startUp: isImageStarting
"isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"! !
!OSPlatform methodsFor: 'accessing' stamp: 'cb 11/15/2019 21:36' prior:
"This method calls the method getPwdViaFFI with an argument of a buffer size.
By default it uses the defaultMaximumPathLength of each subclass as the
buffer size."
^ '/' "self currentWorkingDirectoryPathWithBufferSize: self
! !
!ZdcPluginSSLSession methodsFor: 'initialization' stamp: 'cb 11/15/2019
21:39' prior: 66961330!
"Initialize the receiver"
"[ handle := self primitiveSSLCreate ]
on: PrimitiveFailed
do: [ :exception |
ZdcPluginMissing signal ].
self logging: false"
! !
!PharoFilesOpener methodsFor: 'public' stamp: 'cb 11/27/2019 22:56' prior:
changesFileOrNilReadOnly: readOnly silent: silent
| changesFile |
changesFile := self openChanges: self changesName readOnly: readOnly.
(changesFile isNil and: [ silent not ])
ifTrue: [ self informProblemInChanges: self cannotLocateMsg ].
^ changesFile .
! !
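For reference, the two start-up stubs in the list above could also be applied from a playground instead of being filed in as chunks. This is only a sketch: it assumes both classes are present in the image and uses compile:classified:, with the selectors, bodies and protocol name copied from the chunks above. The image still has to be saved and restarted afterwards.

```smalltalk
"Sketch only: recompile the two class-side start-up hooks with their bodies
commented out, so that neither makes an FFI call when the image starts."
FreeTypeSettings class
	compile: 'startUp: resuming
	"resuming ifTrue: [ self updateFreeType ]"'
	classified: 'system startup'.
LGitLibrary class
	compile: 'startUp: isImageStarting
	"isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"'
	classified: 'system startup'.
```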

On Fri, Nov 29, 2019 at 11:53 AM Clément Béra <bera.clement at gmail.com>
wrote:

> Hi,
> On Fri, Nov 29, 2019, 10:21 Alistair Grant <akgrant0710 at gmail.com> wrote:
>> Hi Clément,
>> On Thu, 28 Nov 2019 at 22:36, Clément Béra <bera.clement at gmail.com>
>> wrote:
>> >
>> > Hi Alistair,
>> >
>> > I've just investigated the bug tonight and fixed it in
>> VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and I was able to run
>> the 400 iterations of your script without any crashes.
>> > Thanks for the easy reproduction! Last year, when I used the GC
>> > benchmarks provided by Feenk, with ~10Gb workloads, for the DLS paper
>> > [1], I initially had an image crashing 9 times out of 10 when going to
>> > 10Gb. I fixed a few bugs in the production GC back then (mainly in
>> > segment management), which led the benchmarks to run successfully 99%
>> > of the time. But it was still crashing the other 1%. Since I was
>> > benchmarking experimental GCs with various changes, I thought the bug
>> > did not happen in the production GC, but it turns out I was wrong. And
>> > you found a reliable way to reproduce it :-). So I could investigate.
>> > It's so fun to do lemming debugging in the simulator.
>> We need to thank Juraj here, he was the one who produced the initial
>> version of the script which made all of this possible.
> Thanks Juraj. Are you both Feenk people?
> Are you mainly working on the VM Alistair? Or just having fun?
>> > The GC bug was basically this: when the Planning Compactor (the
>> > production full GC compactor) decided to do a multi-pass compaction
>> > but managed to compact everything in the first pass, it would get
>> > confused and attempt to compact objects upward instead of downward
>> > (address-wise) on the second pass. That is broken and corrupts memory.
>> >
>> > I started from this script:
>> >
>> > | aJson anArray |
>> > aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
>> > Array streamContents: [ :aStream |
>> >     400 timesRepeat: [
>> >         aStream nextPutAll: (STON fromString: aJson contents).
>> >         Smalltalk saveSession ] ].
>> >
>> >
>> > It makes me however very sad that you were not able to use the
>> simulator to debug this issue, I used it and that's how I tracked down the
>> bug in only a few hours. Tracking things down in lldb would have taken me
>> weeks, and I would not have been able to do it since I work during the week
>> :-).
>> >
>> > Therefore I'm going to explain to you my process for reproducing the
>> > bug in the simulator and understanding where the issue comes from. The
>> > mail is quite long, but it would be nice if you could track down such
>> > bugs quickly on your own next time using the simulator. Of course you
>> > can skip it if you're not interested. @Eliot, you may want to read it
>> > since I explain how I set up a Pharo 7 image for simulator debugging,
>> > which might come in handy for you at some point.
>> >
>> > 1] The first thing I did was to reproduce your bug, based on the
>> > script, on both the Cog and Stack VMs compiled from the
>> > OpenSmalltalk-VM repository. I initially started with Pharo 8, but for
>> > some reason that image is quite broken (formatter issue? Integrator
>> > gone wild?).
>> That was unlucky timing, there was a bad commit made.  I think it's
>> largely tidied up now, still, using the current stable version isn't
>> necessarily bad :-)
>> Just for future reference: the first thing I tried was reproducing it
>> on the Pharo 8 minimal image (I did this before the formatter bug
>> appeared and kept the same image).  The minimal image has a few
>> advantages:
>> - It's smaller, 14M vs. 54M, so less memory to keep track of (and the
>> simulator will be a bit faster)
>> - It doesn't have FreeType loaded, so that quickly ruled it out as an
>> issue.
>> - I wasn't sure if there would be other FFI calls, so this just
>> reduced the chances.
> Not having FreeType and LibGit would be nice indeed. The difference in
> simulation performance between 14Mb and 54Mb is not really an issue for
> me: the bug happened on a > 100Mb heap and simulation is still fairly fast.
> The problem is more to find a reliable way to crash soon after start-up.
> In some cases I start the simulator and go to sleep, but if the next
> morning it hasn't crashed, well, too bad :-(.
> In most cases we reproduce bugs using the Squeak REPL image. See:
> https://github.com/OpenSmalltalk/opensmalltalk-vm/blob/Cog/image/buildspurtrunkreaderimage.sh
> I suggest you try using the simulator on the Squeak REPL image. It's
> convenient: you can run a few things and see what is going on. The REPL
> supports the chunk format (put a ! after each do-it).
> You can build something similar from the minimal Pharo image if you want
> to, but I doubt you'll catch bugs that you can't catch from the Squeak one.
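As a small illustration of the chunk format mentioned above, input typed at such a REPL might look like the following. The expressions are arbitrary examples and are not taken from the build script; the only point is the ! that terminates each do-it.

```smalltalk
3 + 4 !
(1 to: 5) inject: 0 into: [ :sum :each | sum + each ] !
```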
>> > So I switched to Pharo 7 stable. It crashes on both VMs, so I knew the
>> > bug was unrelated to the JIT. Most bugs in the core VM (besides people
>> > mis-using FFI, which is by far the most common VM bug reported) are
>> > either JIT or GC bugs. So we're tracking a GC bug.
>> > I then built an image which runs your script at start-up (Smalltalk
>> snapshot: true andQuit: true followed by your script, I select all and run
>> do-it).
>> >
>> > 2] Then I started the image in the simulator. The first thing I noticed
>> > is that Pharo 7 makes FFI calls in FreeType from start-up. Even if
>> > you're not using text, or if you disable FreeType from the settings
>> > browser, Pharo performs FFI calls for FreeType in the background. The
>> > FreeType FFI calls are incorrectly implemented (the C stack references
>> > heap objects which are not pinned), therefore these calls corrupt the
>> > heap. Running a corrupted heap on the VM has undefined behavior,
>> > therefore any usage of Pharo 7 right now, whether you actually use text
>> > or not, whether FreeType is enabled or not in the settings, is undefined
>> > behavior. I saw Nicolas/Eliot complaining in the thread that this is
>> > not a VM bug. Indeed, pinning objects is an image-side responsibility
>> > and it's not a VM bug. In addition, most reported bugs come from people
>> > mis-using FFI, so I understand their answer. There was however another
>> > bug in the GC, but it's very hard for us to debug it if it's hidden
>> > behind image-corrupting bugs like the FreeType one here.
>> > So for that I made that change:
>> > FreeTypeSettings class>>startUp: resuming
>> > "resuming ifTrue:[ self updateFreeType ]"
>> > saved, restarted the image, and ensured it was not corrupted (leak
>> checker + swizzling in simulation).
>> >
>> > 3] Then I started the image in the simulator again. It turns out the
>> > image start-up raises an error if libgit cannot be loaded, and then the
>> > start-up script is not executed due to the exception. So I made this
>> > change:
>> > LGitLibrary class>>startUp: isImageStarting
>> > "isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"
>> Also for future reference, I'm surprised you didn't hit an FFI call
>> trying to get the current working directory.  Making the following
>> change in OSPlatform removes the FFI call:
>> currentWorkingDirectoryPathWithBuffer: aByteString
>>     <primitive: 'primitiveGetCurrentWorkingDirectory' module:
>> 'UnixOSProcessPlugin' error: ec>
>>     ^self primitiveFailed
>> (if on Windows you need to use Win32OSProcessPlugin).
> Err. Maybe I forgot to write down a few steps here and commented out a
> few other methods... I fixed it and then wrote the mail, so I don't
> remember it all.
> I think there was indeed something accessing the sources or changes files
> and I commented something out in there.
> I'll try to check the changes file later on.
> I don't have access to my laptop right now since I'm at work, so I cannot
> check.
>> > 4] It turns out ZnEasy does not work well in the simulator. So I
>> > preloaded the result of this line:
>> > aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
>> > into a global variable. The rest of the script remains the same. I can
>> > finally run your script in the simulator! Usually we simulate a Squeak
>> > image and all these preliminary steps are not required. But! It is
>> > still easier to reproduce this bug than most bugs I have to deal with
>> > for Android at work: at least I don't need to buy an uncommon device
>> > from an obscure Chinese vendor to reproduce it :-).
>> I put the data in to a file and loaded it :-)
>> > 5] To cut down simulation time, since the bug happened around the 60th
>> > save for me, I built a different script which snapshots the image
>> > under a different image name each time.
>> We also updated the script to save to different files.
>> But did you actually get it to save the image in the simulator?  I'm
>> just reproducing your work now but couldn't save an image due to a bug
>> in the FileAttributesPluginSimulator.  I've got a fix and will commit
>> a bit later.
> Yes, running the script in the simulator generated around 30 images
> (Save57.image to Save90.image). I frequently save from the simulator
> (usually with a Squeak image though). It should work.
> Then running the script again for 400 iterations on the VM I had built
> filled my local SSD :-).
> I don't remember which API I used to save though, maybe we used different
> ones? I try to use snapshot:andQuit: as much as possible to avoid
> unexpected errors, but this time I renamed the image, I don't remember how.
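The preloading from step 4 and the per-iteration renaming from step 5 could look roughly like the sketch below. Two things here are assumptions and not what was actually run: the global name #CachedJson is made up, and saveAs: is only one plausible way to save under a new name, since the API actually used is not recorded in this thread.

```smalltalk
"One-off, in the normal VM: cache the downloaded text in a global so that the
simulator never has to touch the network. #CachedJson is a made-up name."
Smalltalk
	at: #CachedJson
	put: (ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl) contents.

"Start-up script: same loop as the original, but each iteration saves under
a new name. saveAs: is an assumption, see the note above."
Array streamContents: [ :aStream |
	1 to: 400 do: [ :i |
		aStream nextPutAll: (STON fromString: (Smalltalk at: #CachedJson)).
		Smalltalk saveAs: 'Save', i printString ] ].
```

Numbering the saves makes it easy to see which image was the last one written before a crash.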
>> > With a crash at snapshot 59 (only change file written to disk), image
>> 57 was the latest non corrupted image. I then started the simulator (The
>> StackSimulator since we are debugging a GC bug, not the Cog simulator,
>> simulation is faster and simpler). I used the standard script available in
>> the workspace of the Cog dev image built from the guidelines. [2]
>> > | sis |
>> > sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager).
>> > "sis desiredNumStackPages: 8."
>> > "Speeds up scavenging when simulating. Set to e.g. 64 for something
>> >  like the real VM."
>> > "sis assertValidExecutionPointersAtEachStep: false."
>> > "Set this to true to turn on an expensive assert that checks for valid
>> >  stack, frame pointers etc on each bytecode.  Useful when you're adding
>> >  new bytecodes or exotic execution primitives."
>> > sis openOn: 'Save57.image'.
>> > sis openAsMorph; run
>> > I then let the simulator simulate, went swimming for 1h, and came back
>> 1h30 later (with commute time). The bug happened in the simulator at save
>> 90, I don't know how long it took to reproduce, but < 1h30. Then I had an
>> assertion failure in the compactor:
>> >  self assert: (self validRelocationPlanInPass: finalPass) = 0.
>> > Good! From there I debugged using lemming debugging (the technique
>> > described in [3], Section 3.2). When the assertion fails, the
>> > simulation that failed is the clone. I went up in the debugger to the
>> > point where the clone was made, and restarted the same GC approximately
>> > 40 times during debugging: once the heap is corrupted you can no longer
>> > tell what the problem is, but you need to trigger the problem to
>> > understand it. 40 lemmings over that cliff :-) Good lemmings.
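The idea behind lemming debugging can be shown with a toy model that has nothing to do with the real simulator classes: keep the pre-GC state untouched and send a fresh copy over the cliff each time. Every name below is made up for illustration.

```smalltalk
"Toy model of lemming debugging. 'pristineHeap' stands in for the simulated
heap just before the GC; 'brokenGC' stands in for the faulty compaction."
| pristineHeap brokenGC attempts |
pristineHeap := OrderedCollection withAll: #(5 3 8 1).
brokenGC := [ :heap | heap removeAll ].	"destroys whatever it is given"
attempts := 0.
40 timesRepeat: [ | lemming |
	lemming := pristineHeap copy.	"clone before the dangerous step"
	brokenGC value: lemming.	"only the clone gets corrupted"
	attempts := attempts + 1 ].
"The original is untouched after 40 runs, so the next run starts from the same state."
pristineHeap asArray = #(5 3 8 1)
```

Because each run starts from an identical state, the failure can be replayed as many times as it takes to understand it.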
>> >
>> > Then I quickly figured out that the GC was performing two successive
>> compactions, and that the second compaction is broken right at the start
>> (tries to move objects upward). Then I looked at the glue code in-between
>> the 2 compactions, and yeah, in the case where the first compaction has
>> compacted everything, the variables are incorrectly set for the second
>> compaction. I tried fixing the variables but it's not that easy, so instead
>> I just aborted compaction in that case (See VMMaker.oscog-cb.2595).
>> >
>> > 6] I then compiled a VM from the sources to check that the Slang
>> > translator would not complain; it did not. I then built a Stack VM (the
>> > Cog VM seems to be broken on tip of tree due to ongoing work on ARMv8
>> > support) and ran your script again. I was able to run the 400
>> > iterations without a crash. The bug seems to be fixed!
>> >
>> > @Eliot now needs to fix tip of tree, generate the code and produce new
>> > VMs. ARMv8 support is quite exciting though, given that MacBooks do not
>> > support 32 bits any more and that the next MacBooks are rumoured to be
>> > on ARMv8. One wouldn't want to run the VM in a VirtualBox Intel image
>> > :-).
>> >
>> > Alistair, let me know if you have questions. I hope you can work with
>> the simulator as efficiently as we can. If you've not seen it, there's this
>> screencast where I showed how I used the simulator to debug JIT bugs [4].
>> Audio is not very good because my spoken English sucks, but it shows the
>> main ideas.
>> >
>> > [1]
>> https://www.researchgate.net/publication/336422106_Lazy_pointer_update_for_low_heap_compaction_pause_times
>> > [2] http://www.mirandabanda.org/cogblog/build-image/
>> > [3]
>> https://www.researchgate.net/publication/328509577_Two_Decades_of_Smalltalk_VM_Development_Live_VM_Development_through_Simulation_Tools
>> > [4] https://clementbera.wordpress.com/2018/03/07/sista-vm-screencast/
>> You wrote in [3]:
>> "the slightest change in the heap
>> might change the bug; any variability in timing or user input
>> can result in a different heap and hence in the bug morphing
>> or going into hiding."
>> This was evident in this issue.  While the script (fortunately) would
>> always produce a crash, small changes, such as how the initial JSON is
>> loaded, or the name of the image that it is saved to, caused fairly
>> large changes in the number of loops to trigger the crash.
> Yeah, that's the main problem when debugging GCs in general. Pharo is less
> deterministic than Squeak for some reason (things are happening in the
> background doing FFI calls). In both environments, user events are a
> problem. That's why lemming debugging is very handy. And that's why the
> OpenSmalltalk-VM development tools are far superior to those of other VMs
> I've dealt with. The back-in-time features that I used in C++ recently
> are very good though. In OpenSmalltalk-VM, I guess the circular buffer of
> the JIT simulation has a better ratio of time spent on tools to
> productivity gained, and is enough for now.
> And this is a crash. Performance pitfalls are even harder to track down
> IMO.
>> Also, while trying to reproduce your debug steps above, the image I
>> have already has memory leaks, so it isn't hitting the "self assert:
>> (self validRelocationPlanInPass: finalPass) = 0" assertion.
> You have to start simulation on an image that is not already corrupted.
> Did you make sure to comment out the startUp: method in FreeTypeSettings?
> Disabling FreeType in the settings browser is not enough. Then you need to
> save and restart the image, and verify that it is not already corrupted.
> If you're talking about starting simulation from the images saved by the
> script, I did not take the latest one, which crashed, because it was
> already corrupted. I used 57, while 58 was saved and for 59 only the
> changes were saved. You can see at start-up whether swizzling and the
> initial GC find leaks.
>> Thanks for the links, I'll keep reading.
>> Thanks again!
>> Alistair
>> > --
>> > Clément Béra
>> > https://clementbera.github.io/
>> > https://clementbera.wordpress.com/

Clément Béra