[Vm-dev] [Pharo-users] macOS VM builds with AddressSanitizer/LeakSanitizer
leuenberger at inf.unibe.ch
Tue Feb 26 22:31:35 UTC 2019
Alright, back in the game, this time using gperftools.
I finally managed to reproduce a leaking run, but it took me hours to get there, randomly rendering GtCoders and messing around. gperftools instruments malloc and friends, capturing heap profiles every 30 seconds. Now I have profiles from a freshly started, small image, before it went berserk on memory hogging, and when it allocates over 3GB. Now, I have visualizations of allocation sites at these instants, and diffs between them, as well as some thread dumps on the berserk state. I managed to rule out the in-image objects to contribute to memory by cleaning images with garbage collect and cache resets. Some handy resets are these:
FreeTypeCache current removeAll.
These get rid of a leak in NECController, resets the font cache (creates a lot of bitmaps), and ASTs, created by the coders. With
I get objects stats from within the image, and I managed to keep them in the same ballpark for all instants I took profiles of, by inspecting over-populated instances, like this
instances := WeakOrderedCollection withAll: RBMessageNode allInstances.
ReferenceFinder findPathTo: instances atRandom.
FFIExternalResourceManager uniqueInstance instVarNamed: #registry
I also made sure that external objects are clean.
Now, I am pretty confident that I can clean images, so the memory-overload should be sourced in the VM. Still, I cannot reliably reproduce the leak. It just happens after running an image for hours, doing stuff, idling, going to sleep, wake up.
Unfortunately, I did not run it on a debug VM, so nearly no symbols, but I prepared a summary of the analysis anyway, as it already gives some hints that some of you can maybe interpret better than me. Have look at the PDFs in particular, they show the allocation sites at the three interesting instants, (clean, before berserk, after berserk), as well as the diffs in between.
I find that berserk mode is caused by a huge pile of Objective-C objects. The thread samples show that while idling, it seems to be busy processing OS events. But I do not know what kind of objects they actually are.
My current hypothesis is:
The leak is caused by processing a huge pile of OS events being processed (reprocessed?). For some reason, this only happens after a long run time.
Now, how can I accept/reject this hypothesis? Is there a way to log events received by the VM? I also have a core dump of the 3GB process, but I have no idea how I could check it for Objective-C objects grouped by type. If anybody knows how to do that, I might be able to come up with a comprehensive list of objects that lead to the berserk mode. Maybe it is not even are real leak, but just accumulating objects that should be thrown away at some point. Also, I can still not fully reject that the image is involved, maybe it forgets external resources it should not forget at some point.
As I now know again that I can trigger berserk mode, I will switch to the debug VM/Moz2D again with ASan. More symbols, leak reports, might reveal better information.
Anyway, if anybody made it through the whole email and maybe even looked at the ZIP, I would appreciate ideas on how to proceed, what to test, what to inspect. I recently had a few incidents where I realized that one of my implicit assumptions/axioms does not hold, and I kept digging in the wrong direction for too long, so any feedback on this is welcome.
> On 30 Jan 2019, at 18:40, Manuel Leuenberger <leuenberger at inf.unibe.ch> wrote:
> I thought about adding a flag for all ASan magic in the build scripts, then document the procedure in the readme, maybe also more extensive somewhere else. You probably have a better idea where a good place would be than I do.
> I dug into this today again and wrote a parser for the reports (https://github.com/maenu/leak-reporter <https://github.com/maenu/leak-reporter>). It has a not-so-good visualization of merged stacks showing heavy frames as big and solid, it also includes example reports (https://github.com/maenu/leak-reporter/tree/master/res <https://github.com/maenu/leak-reporter/tree/master/res>). The reported leaks only amount for about 1% of leaked memory, not too many objects leaking, but I am not sure if ASan just counts object layout size or objects graphs (probably not). What I notice is that OS events are leaking, as they are copied. Might be an issue. But I am not entirely sure if this explains everything. Maybe the VM just allocates too many objects of some type, but frees them on shutdown, so they might not even appear in the reports. Can a C heap analyzer give me stats about struct instances per struct type? Might be useful to analyze core dumps then.
> <Screen Shot 2019-01-30 at 18.27.22.png>
> Direct leak of 219512 byte(s) in 1193 object(s) allocated from:
> #0 0x10bb1fffc in __sanitizer_mz_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x5dffc)
> #1 0x7fffdd224281 in malloc_zone_malloc (libsystem_malloc.dylib:x86_64+0x2281)
> #2 0x7fffc6de33d5 in AllocateAndInitializeEvent(__CFAllocator const*) (HIToolbox:x86_64+0x33d5)
> #3 0x7fffc6e3fc4d in CopyEventInternal(__CFAllocator const*, OpaqueEventRef*) (HIToolbox:x86_64+0x5fc4d)
> #4 0x7fffc6e49228 in CopyEventAs (HIToolbox:x86_64+0x69228)
> #5 0x7fffc6e1d349 in CreateEventWithCGEvent (HIToolbox:x86_64+0x3d349)
> #6 0x7fffc6e1b0e2 in CreateAndPostEventWithCGEvent(__CGEvent*, unsigned int, unsigned char, __CFMachPortBoost*) (HIToolbox:x86_64+0x3b0e2)
> #7 0x7fffc6e275cd in Convert1CGEvent(unsigned char) (HIToolbox:x86_64+0x475cd)
> #8 0x7fffc6e2745e in MainLoopObserver(unsigned int, OpaqueEventRef*, void*) (HIToolbox:x86_64+0x4745e)
> #9 0x7fffc6de8368 in _NotifyEventLoopObservers (HIToolbox:x86_64+0x8368)
> #10 0x7fffc6e10ea5 in RunCurrentEventLoopInMode (HIToolbox:x86_64+0x30ea5)
> #11 0x7fffc6e10bf8 in ReceiveNextEventCommon (HIToolbox:x86_64+0x30bf8)
> #12 0x7fffc6e10b25 in _BlockUntilNextEventMatchingListInModeWithFilter (HIToolbox:x86_64+0x30b25)
> #13 0x7fffc53a5a53 in _DPSNextEvent (AppKit:x86_64+0x46a53)
> #14 0x7fffc5b217ed in -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] (AppKit:x86_64+0x7c27ed)
> #15 0x10b7ce07c in -[sqSqueakOSXApplication(events) pumpRunLoopEventSendAndSignal:] sqSqueakOSXApplication+events.m:78
> #16 0x10b7ce7d9 in -[sqSqueakOSXApplication(events) pumpRunLoop] sqSqueakOSXApplication+events.m:105
> #17 0x10b7e4898 in vmIOProcessEvents sqSqueakEventsAPI.m:80
> #18 0x10b7e49c7 in ioProcessEvents sqSqueakEventsAPI.m:103
> #19 0x10b5b5653 in checkForEventsMayContextSwitch gcc3x-cointerp.c:62680
> #20 0x10b5c17ed in ceCheckForInterrupts gcc3x-cointerp.c:15188
> #21 0x119a75794 (<unknown module>)
> #22 0x10b566712 in interpret gcc3x-cointerp.c:2754
> #23 0x10b7ebada in -[sqSqueakMainApplication runSqueak] sqSqueakMainApplication.m:201
> #24 0x7fffc93786fc in __NSFirePerformWithOrder (Foundation:x86_64+0xd76fc)
> #25 0x7fffc78cfc56 in __CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ (CoreFoundation:x86_64h+0xa6c56)
> #26 0x7fffc78cfbc6 in __CFRunLoopDoObservers (CoreFoundation:x86_64h+0xa6bc6)
> #27 0x7fffc78b05f8 in __CFRunLoopRun (CoreFoundation:x86_64h+0x875f8)
> #28 0x7fffc78b0033 in CFRunLoopRunSpecific (CoreFoundation:x86_64h+0x87033)
> #29 0x7fffc6e10ebb in RunCurrentEventLoopInMode (HIToolbox:x86_64+0x30ebb)
>> On 29 Jan 2019, at 20:19, Alistair Grant <akgrant0710 at gmail.com <mailto:akgrant0710 at gmail.com>> wrote:
>> On Tue, 29 Jan 2019 at 15:32, Eliot Miranda <eliot.miranda at gmail.com <mailto:eliot.miranda at gmail.com>> wrote:
>>> So this is great. The final step is seeing which FFI method invokes moz2d_draw_target_create_for_data_type and then the storage leak can be fixed.
>>> What would be wonderful is to capture your experience in diagnosing this storage leak as a guide to others. But in what form? One would be a blog post. I’m nervous though because blogs sometimes disappear. One would be some section in one of the READMEs in the source tree. One would be a carefully written email to this list recapitulating or summarizing the experience.
>>> The advantage of the mailing list is that it won’t go away any time soon. So how about writing a new message in this thread which summarizes? It could be written as more of a guide “Storage leaks come in more than one variety... To diagnose a leak in the C heap... To link against a C leak checker...” and include links to the other messages in this thread on the mail list serve for details “Specifics on how to modify the Mac build to link in the leak checker are ...”.
>>> What do you think? What to others think? How do we best capture this for posterity?
>> I would split this in two:
>> - The basic instructions on compiling with the leak checker should go
>> in the HowToBuild.
>> - The description you're suggesting above should go in
>> http://opensmalltalk.org/ <http://opensmalltalk.org/> as Ben suggested, or a wiki page in the
>> OpenSmalltalk-VM repository.
>> Each should provide a link to the other.
>> I'm not a big fan of github wiki pages, but they have two advantages:
>> - They keep the content with the code.
>> - They're relatively easy to download and back-up offline (in the
>> event that we decide to move the code elsewhere).
>> My preference would be http://opensmalltalk.org/ <http://opensmalltalk.org/>, but I'm not sure
>> what the access rights on it are. Actually, I didn't even know it
>> existed, thanks Ben!
>> As Tim said, email lists are hard to search, and can't be easily
>> updated. Blog posts don't really lend themselves to being updated
>> My 2c...
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Vm-dev