Executing the following script produces a segmentation fault:
```Smalltalk
| aJson anArray |
aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
Array streamContents: [ :aStream |
	400 timesRepeat: [
		aStream nextPutAll: (STON fromString: aJson contents).
		Smalltalk saveSession ] ].
```
[crash.dmp.txt](https://github.com/OpenSmalltalk/opensmalltalk-vm/files/3843413/crash.dmp.tx...)
Reproduced on Pharo 8 Mac OS:
```
Image
-----
Pharo8.0.0
Build information: Pharo-8.0.0+build.972.sha.bbb812d1387fd27ef4096e35224cb4425a60ab6c (64 Bit)
Unnamed

Virtual Machine
---------------
/Users/.../Pharo/vms/80-x64/Pharo.app/Contents/MacOS/Pharo
CoInterpreter VMMaker.oscog-eem.2504 uuid: a00b0fad-c04c-47a6-8a11-5dbff110ac11 Jan 5 2019
StackToRegisterMappingCogit VMMaker.oscog-eem.2504 uuid: a00b0fad-c04c-47a6-8a11-5dbff110ac11 Jan 5 2019
VM: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Date: Sat Jan 5 20:00:11 2019 CommitHash: 7a3c6b6 Plugins: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
Mac OS X built on Jan 5 2019 19:11:02 UTC Compiler: 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)
VMMaker versionString VM: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Date: Sat Jan 5 20:00:11 2019 CommitHash: 7a3c6b6 Plugins: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
CoInterpreter VMMaker.oscog-eem.2504 uuid: a00b0fad-c04c-47a6-8a11-5dbff110ac11 Jan 5 2019
StackToRegisterMappingCogit VMMaker.oscog-eem.2504 uuid: a00b0fad-c04c-47a6-8a11-5dbff110ac11 Jan 5 2019
```
Also reproduced on Pharo 8 Mac OS:
```
Virtual Machine
---------------
/Users/syrel/Documents/Pharo/images/GToolkit/../../vms/80-x64/Pharo.app/Contents/MacOS/Pharo
CoInterpreter VMMaker.oscog-eem.2570 uuid: b61e294a-cb2a-4d9a-9e7e-8cc17676c920 Oct 16 2019
StackToRegisterMappingCogit VMMaker.oscog-eem.2570 uuid: b61e294a-cb2a-4d9a-9e7e-8cc17676c920 Oct 16 2019
VM: 201910161212 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Date: Wed Oct 16 14:12:15 2019 CommitHash: da97dd5 Plugins: 201910161212 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
Mac OS X built on Oct 16 2019 12:39:27 GMT Compiler: 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)
```
Also reproduced on Pharo 8 Linux Ubuntu 18.04:
```
Pharo VM version: 5.0-201902062351 Wed Feb 6 23:59:26 UTC 2019 gcc 4.8 [Production Spur 64-bit VM]
Built from: CoInterpreter VMMaker.oscog-eem.2509 uuid: 91e81f64-95de-4914-a960-8f842be3a194 Feb 6 2019
With: StackToRegisterMappingCogit VMMaker.oscog-eem.2509 uuid: 91e81f64-95de-4914-a960-8f842be3a194 Feb 6 2019
Revision: VM: 201902062351 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Date: Wed Feb 6 15:51:18 2019 CommitHash: a838346b Plugins: 201902062351 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
Build host: Linux travis-job-f2b22483-7f84-414f-b833-69f69518c685 4.4.0-101-generic #124~14.04.1-Ubuntu SMP Fri Nov 10 19:05:36 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
plugin path: /dev/shm/p8/pharo-vm/lib/pharo/5.0-201902062351 [default: /dev/shm/p8/pharo-vm/lib/pharo/5.0-201902062351/]
```
This https://github.com/OpenSmalltalk/opensmalltalk-vm/issues/391#issuecomment-55... issue comment reports a similar reproducible script with a different stack dump on segmentation fault.
Also crashes with the latest Pharo Stable VM on Mac OS Mojave
```
Users/andrei/Documents/Pharo/vms/80-x64/Pharo.app/Contents/MacOS/Pharo
CoInterpreter VMMaker.oscog-eem.2509 uuid: 91e81f64-95de-4914-a960-8f842be3a194 Feb 7 2019
StackToRegisterMappingCogit VMMaker.oscog-eem.2509 uuid: 91e81f64-95de-4914-a960-8f842be3a194 Feb 7 2019
VM: 201902062351 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Date: Wed Feb 6 15:51:18 2019 CommitHash: a838346 Plugins: 201902062351 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
Mac OS X built on Feb 7 2019 00:01:47 UTC Compiler: 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)
VMMaker versionString VM: 201902062351 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Date: Wed Feb 6 15:51:18 2019 CommitHash: a838346 Plugins: 201902062351 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
CoInterpreter VMMaker.oscog-eem.2509 uuid: 91e81f64-95de-4914-a960-8f842be3a194 Feb 7 2019
StackToRegisterMappingCogit VMMaker.oscog-eem.2509 uuid: 91e81f64-95de-4914-a960-8f842be3a194 Feb 7 2019
```
``` 0 libsystem_kernel.dylib 0x00007fff6e5bb2c6 __pthread_kill + 10 1 libsystem_pthread.dylib 0x00007fff6e676bf1 pthread_kill + 284 2 libsystem_c.dylib 0x00007fff6e5256a6 abort + 127 3 org.pharo.Pharo 0x000000010913f662 sigsegv + 209 4 libsystem_platform.dylib 0x00007fff6e66bb5d _sigtramp + 29 5 ??? 000000000000000000 0 + 0 6 org.pharo.Pharo 0x00000001090db9d2 markObjects + 463 7 org.pharo.Pharo 0x00000001090db1aa fullGC + 72 8 org.pharo.Pharo 0x00000001090facdf snapshot + 206 9 org.pharo.Pharo 0x00000001090ef837 primitiveSnapshot + 11 10 org.pharo.Pharo 0x00000001090bb38d interpret + 17947 11 org.pharo.Pharo 0x00000001090c4c1f enterSmalltalkExecutiveImplementation + 149 12 org.pharo.Pharo 0x00000001090b6fe6 interpret + 628 13 org.pharo.Pharo 0x0000000109140a82 -[sqSqueakMainApplication runSqueak] + 393 14 com.apple.Foundation 0x00007fff4489fc4a __NSFirePerformWithOrder + 362 15 com.apple.CoreFoundation 0x00007fff42591928 __CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ + 23 16 com.apple.CoreFoundation 0x00007fff4259185d __CFRunLoopDoObservers + 451 17 com.apple.CoreFoundation 0x00007fff42533f80 __CFRunLoopRun + 1136 18 com.apple.CoreFoundation 0x00007fff425338be CFRunLoopRunSpecific + 455 19 com.apple.HIToolbox 0x00007fff4181f96b RunCurrentEventLoopInMode + 292 20 com.apple.HIToolbox 0x00007fff4181f5ad ReceiveNextEventCommon + 355 21 com.apple.HIToolbox 0x00007fff4181f436 _BlockUntilNextEventMatchingListInModeWithFilter + 64 22 com.apple.AppKit 0x00007fff3fbb9987 _DPSNextEvent + 965 23 com.apple.AppKit 0x00007fff3fbb871f -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 1361 24 com.apple.AppKit 0x00007fff3fbb283c -[NSApplication run] + 699 25 com.apple.AppKit 0x00007fff3fba1d7c NSApplicationMain + 777 26 libdyld.dylib 0x00007fff6e4803d5 start + 1 ```
This corroborates what Pablo told me and what Pavel found. Thanks.
If one starts up the latest Pharo 8 64-bit image using an assert VM with leak checking turned on, one finds that the image is already corrupted. It is neither the script nor the VM that is at fault, but the initial image, which is already corrupted.
Here's what I get when I launch the latest 64-bit 8.0 image using an assert VM with leak checking turned on (commentary after the run):
``` $ pharo64cavm --lldb --leakcheck 1 gccrash8-64.image gccrash.st
run --leakcheck 1 /Users/eliot/Documents/Pharo/images/Pharo 8.0 - 64bit/gccrash8-64.image gccrash.st
(lldb) target create "/Users/eliot/oscogvm/build.macos64x64/pharo.cog.spur/PharoAssert.app/Contents/MacOS/Pharo" Current executable set to '/Users/eliot/oscogvm/build.macos64x64/pharo.cog.spur/PharoAssert.app/Contents/MacOS/Pharo' (x86_64). (lldb) settings set -- target.run-args "/Users/eliot/Documents/Pharo/images/Pharo 8.0 - 64bit/gccrash8-64.image" "gccrash.st" (lldb) b Pharo`warning Breakpoint 1: where = Pharo`warning + 9 at gcc3x-cointerp.c:44, address = 0x0000000100002079 (lldb) run --leakcheck 1 gccrash8-64.image gccrash.st Process 13469 launched: '/Users/eliot/oscogvm/build.macos64x64/pharo.cog.spur/PharoAssert.app/Contents/MacOS/Pharo' (x86_64) 2019-11-13 21:53:49.806855-0800 Pharo[13469:7168447] MessageTracer: load_domain_whitelist_search_tree:73: Search tree file's format version number (0) is not supported 2019-11-13 21:53:49.806878-0800 Pharo[13469:7168447] MessageTracer: Falling back to default whitelist object leak in 0x110cc6720 @ 0 = 0x11a542c50 object leak in 0x110d402e0 @ 1 = 0x11a542c50 object leak in 0x110d40418 @ 1 = 0x11a702018 object leak in 0x110d40790 @ 1 = 0x11a53bec8 object leak in 0x110d557f8 @ 4 = 0x11a542c88 object leak in 0x110e1a940 @ 0 = 0x11a8ffb48 object leak in 0x110e1a9a0 @ 0 = 0xfffffffffb3d0000 object leak in 0x110e1a9f8 @ 0 = 0x11a70bdc0 object leak in 0x110e1a9f8 @ 5 = 0x11a702018 object leak in 0x110e1aa30 @ 4 = 0x11a73a320 object leak in 0x110e1aa30 @ 5 = 0x11a702050 object leak in 0x110e1aa68 @ 0 = 0xfffffffffb3d0000 object leak in 0x110e1ab08 @ 1 = 0x11a70bdc0 object leak in 0x110e1ab20 @ 1 = 0x11a70bdf8 object leak in 0x110e1ab38 @ 1 = 0x11a70be68 object leak in 0x110e1ab50 @ 1 = 0x11a73a7a8 object leak in 0x110e1ab68 @ 1 = 0x11a70bea0 object leak in 0x110e1ab80 @ 1 = 0x11a70bed8 object leak in 0x110e1ab98 @ 1 = 0x11a70bf10 object leak in 0x110e1abb0 @ 1 = 0x11a70be30 object leak in 0x110e1ae00 @ 0 = 0x11a73af40 object leak in 0x110e1ae20 @ 1 = 0x11a73af68 object leak in 0x110e1e258 @ 0 = 0x11a73af98 object leak in 
0x110e1e290 @ 5 = 0x11a7058b0 object leak in 0x110e1e448 @ 1 = 0x11a73bbe8 object leak in 0x110e1e6a0 @ 1 = 0x11a73d3a0 object leak in 0x110e1e6b8 @ 1 = 0x11a73d478 object leak in 0x110e1ea48 @ 1 = 0x11a73f950 object leak in 0x110e1ebe0 @ 1 = 0x11a740a48 object leak in 0x110e1f348 @ 1 = 0x11a745698 object leak in 0x110e1f648 @ 1 = 0x11a7475a0 object leak in 0x110e1fbb8 @ 1 = 0x11a74ad98 object leak in 0x110e20008 @ 1 = 0x11a74db10 object leak in 0x110e20068 @ 1 = 0x11a74dee8 object leak in 0x110e20128 @ 1 = 0x11a74e650 object leak in 0x110e20188 @ 1 = 0x11a74ea28 object leak in 0x110e20398 @ 1 = 0x11a74feb0 object leak in 0x110e207e8 @ 1 = 0x11a752b08 object leak in 0x110e217d8 @ 1 = 0x11a75cbb8 object leak in 0x110e21958 @ 1 = 0x11a75dad0 object leak in 0x110e21fa0 @ 1 = 0x11a761ac0 object leak in 0x110e22180 @ 1 = 0x11a762dc8 object leak in 0x110e222d0 @ 1 = 0x11a763b00 object leak in 0x110e22420 @ 1 = 0x11a764820 object leak in 0x110e225b8 @ 1 = 0x11a765828 object leak in 0x110e22798 @ 1 = 0x11a766ad0 object leak in 0x110e22948 @ 1 = 0x11a767bf8 object leak in 0x110e22bd0 @ 1 = 0x11a7695f0 object leak in 0x110e22cf0 @ 1 = 0x11a76a1a8 object leak in 0x110e22d80 @ 1 = 0x11a76a760 object leak in 0x110e234a0 @ 1 = 0x11a76f050 object leak in 0x110e23ce0 @ 1 = 0x11a774438 Pharo was compiled with optimization - stepping may behave oddly; variables may not be available. Process 13469 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 frame #0: 0x0000000100002079 Pharo`warning(s="checkHeapIntegrityclassIndicesShouldBeValid(0, 1) 56761") at gcc3x-cointerp.c:44 [opt] 41 sqInt warnpid, erroronwarn; 42 void 43 warning(char *s) { /* Print an error message but don't necessarily exit. */ -> 44 if (erroronwarn) error(s); 45 if (warnpid) 46 printf("\n%s pid %ld\n", s, (long)warnpid); 47 else Target 0: (Pharo) stopped. 
(lldb) thr b * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 * frame #0: 0x0000000100002079 Pharo`warning(s="checkHeapIntegrityclassIndicesShouldBeValid(0, 1) 56761") at gcc3x-cointerp.c:44 [opt] frame #1: 0x000000010003872f Pharo`runLeakCheckerFor(gcModes=1) at gcc3x-cointerp.c:56761 [opt] frame #2: 0x0000000100012d1e Pharo`loadInitialContext at gcc3x-cointerp.c:65476 [opt] frame #3: 0x000000010000231d Pharo`interpret at gcc3x-cointerp.c:2770 [opt] frame #4: 0x00000001000bd0c9 Pharo`-[sqSqueakMainApplication runSqueak](self=0x000000010045e840, _cmd=<unavailable>) at sqSqueakMainApplication.m:201 [opt] frame #5: 0x00007fff319e809c Foundation`__NSFirePerformWithOrder + 360 frame #6: 0x00007fff2f85c257 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ + 23 frame #7: 0x00007fff2f85c17f CoreFoundation`__CFRunLoopDoObservers + 527 frame #8: 0x00007fff2f83e6f8 CoreFoundation`__CFRunLoopRun + 1240 frame #9: 0x00007fff2f83df93 CoreFoundation`CFRunLoopRunSpecific + 483 frame #10: 0x00007fff2eb28d96 HIToolbox`RunCurrentEventLoopInMode + 286 frame #11: 0x00007fff2eb28a0f HIToolbox`ReceiveNextEventCommon + 366 frame #12: 0x00007fff2eb28884 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 frame #13: 0x00007fff2cdd7a3b AppKit`_DPSNextEvent + 2085 frame #14: 0x00007fff2d56de34 AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 frame #15: 0x00007fff2cdcc84d AppKit`-[NSApplication run] + 764 frame #16: 0x00007fff2cd9ba3a AppKit`NSApplicationMain + 804 frame #17: 0x00007fff5777c015 libdyld.dylib`start + 1 ```
My script pharo64cavm runs an assert VM, specifically /Users/eliot/oscogvm/build.macos64x64/pharo.cog.spur/PharoAssert.app/Contents/MacOS/Pharo. Supplying the --lldb argument has the script launch lldb (a low-level debugger for native executables) on the VM.
This command places a breakpoint in the error/warning output routine called when the VM wants to report that asserts have failed, leaks in the heap have been found, etc.
```
(lldb) b Pharo`warning
```
This command then launches the VM under the control of the debugger:
```
run --leakcheck 1 gccrash8-64.image gccrash.st
```
The argument to leak check is a combination of the following flags:
```Smalltalk
SpurMemoryManager >> setCheckForLeaks: integerFlags
	"0 = do nothing.
	 1 = check for leaks on fullGC (GCModeFull).
	 2 = check for leaks on scavenge (GCModeNewSpace).
	 4 = check for leaks on incremental (GCModeIncremental).
	 8 = check for leaks on become (GCModeBecome).
	 16 = check for leaks on image segments (GCModeImageSegment)."
	checkForLeaks := integerFlags
```
If GCModeFull is set then the VM performs a leak check on loading the initial image. From the backtrace you can see that the VM has not yet started running; loadInitialContext is the routine that sets up the VM to run from the context that performed the snapshot:
```
Pharo`warning(s="checkHeapIntegrityclassIndicesShouldBeValid(0, 1) 56761") at gcc3x-cointerp.c:44 [opt]
frame #1: 0x000000010003872f Pharo`runLeakCheckerFor(gcModes=1) at gcc3x-cointerp.c:56761 [opt]
frame #2: 0x0000000100012d1e Pharo`loadInitialContext at gcc3x-cointerp.c:65476 [opt]
frame #3: 0x000000010000231d Pharo`interpret at gcc3x-cointerp.c:2770 [opt]
frame #4: 0x00000001000bd0c9 Pharo`-[sqSqueakMainApplication runSqueak](self=0x000000010045e840, _cmd=<unavailable>) at sqSqueakMainApplication.m:201 [opt]
frame #5: 0x00007fff319e809c Foundation`__NSFirePerformWithOrder + 360
```
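Since the flag values form a bit mask, they can be combined. For example (a hypothetical invocation, not one used in this session), checking for leaks on both full GC (1) and become (8) would pass their sum, 9:
```
run --leakcheck 9 gccrash8-64.image gccrash.st
```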
The following leak report shows that there are many leaks in this image:
```
object leak in 0x110cc6720 @ 0 = 0x11a542c50
object leak in 0x110d402e0 @ 1 = 0x11a542c50
object leak in 0x110d40418 @ 1 = 0x11a702018
object leak in 0x110d40790 @ 1 = 0x11a53bec8
object leak in 0x110d557f8 @ 4 = 0x11a542c88
object leak in 0x110e1a940 @ 0 = 0x11a8ffb48
object leak in 0x110e1a9a0 @ 0 = 0xfffffffffb3d0000
object leak in 0x110e1a9f8 @ 0 = 0x11a70bdc0
object leak in 0x110e1a9f8 @ 5 = 0x11a702018
object leak in 0x110e1aa30 @ 4 = 0x11a73a320
object leak in 0x110e1aa30 @ 5 = 0x11a702050
object leak in 0x110e1aa68 @ 0 = 0xfffffffffb3d0000
...
```
Let's take a look at some of these objects. In lldb we can call the VM's debug printing routines, just as we can in the simulator:
```
(lldb) call printOop(0x110cc6720)
0x110cc6720: a(n) FreeTypeCacheEntry 0x11a542c50 0x110ca0fb8 0x221 0x21 0x81ffae4000000004 0x110d25ea8
(lldb) call printOop(0x110d402e0)
0x110d402e0: a(n) Association 0x21 0x11a542c50
(lldb) call printOop(0x110d402e0)
0x110d402e0: a(n) Association 0x21 0x11a542c50
(sqInt) $1 = 0
(lldb) print whereIs(0x11a542c50)
(char *) $2 = 0x0000000100110864 " is no where obvious"
(lldb) call printOop(0x110d40418)
0x110d40418: a(n) Association 0x16000000001 0x11a702018
(sqInt) $3 = 0
(lldb) call printOop(0x110d40790)
0x110d40790: a(n) Association 0x21 0x11a53bec8
(sqInt) $4 = 0
(lldb) call printOop(0x110d557f8)
0x110d557f8: a(n) FreeTypeCacheEntry 0x110d55ad0 0x110ca0fb8 0x181 0x7c1 0x11a542c88 0x110d55718
```
So the first suspect (to me) looks like external C memory management in FreeType font management. Let me suggest adding a step to the release process that checks the validity of images before they are released. Let me also suggest appointing a team to examine FreeType font management with the leak checker and related tools, to find and fix these issues, which I think have been around for quite a while.
An interesting question to ask here: can you tag image memory as read-only during an FFI call-out, for debugging purposes? If writes to image memory are required, can they be sandboxed? If writes to a display area are required, can that area be protected by no-read/write guard pages before and after the screen buffer, to trap overwrites and stray reads?
John M. McIntosh, Corporate Smalltalk Consulting Ltd, https://www.linkedin.com/in/smalltalk
John, one could easily add that facility, but I believe the problem is more likely to do with dangling pointers than with FreeType writing into the heap. I suspect that, on a previous save or restart, pointers to C memory allocated in the run before the current one are not invalidated and are still used. I believe the FFI itself is not at fault but that it is not being used properly: stale pointers are being followed and memory corruption is occurring.
As I said above, what is needed is to check that a valid image is created and that stale pointers are invalidated.
This is an age-old problem with Smalltalk programs that use external memory, external handles, descriptors, etc. There is a style which deals well with this, and it should be followed:
1. when creating references to external C memory that may persist across a snapshot (such as to an opened external font object), the object that references the external memory should be registered in some set.
2. The Smalltalk system sends a post-snapshot resumption message to any and all classes that register for this service, with an argument that indicates whether the system is continuing after a snapshot or loading the snapshot in a fresh run.
3. When loading the snapshot in a fresh run all objects that reference external C memory are visited very early in system startup and each ensures that it invalidates any and all pointers to external memory, file descriptors, etc.
Using this style we do not have to close and reopen around a snapshot, but we do have to perform the invalidation early enough so that there is no chance of accessing anything external before all invalidations are complete.
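The style described above can be sketched in Smalltalk roughly as follows. The class and method names here are illustrative, not the actual Pharo API; a real implementation would hook into the existing class-side startUp:/shutDown: registration protocol:

```Smalltalk
"1. Registration: an object holding a pointer to external C memory
    adds itself to a class-side registry when it acquires the pointer."
ExternalFontHandle >> setPointer: anExternalAddress
	pointer := anExternalAddress.
	self class registry add: self

"2/3. On startup in a fresh OS process, every registered object is
    visited very early and invalidates its external pointer, before
    anything can follow a stale address from the previous session."
ExternalFontHandle class >> startUp: resuming
	resuming ifTrue: [
		self registry do: [ :each | each invalidate ] ]

ExternalFontHandle >> invalidate
	pointer := nil
```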
Further, using a registry of objects is much better than using, for example, allInstances, because typically there are few objects (tens, hundreds at most, not thousands) that reference external resources, and they may be of various classes. The registry can reach them in time roughly linear in the size of the registry, independent of image size, whereas using allInstances accesses objects in time proportional to the product of the number of classes and the image size. Clearly that does not scale as the system gets more complex and the image grows. Startup time is very important: I led the VisualWorks team through this exercise and we were able to reduce start-up times from hundreds of milliseconds to forty milliseconds (IIRC) in the VW 3.0 timeframe.
This issue also recurs in Pharo 7:
```
Image
-----
Pharo7.0.4
Build information: Pharo-7.0.4+build.168.sha.ccd1f6489120f58ddeacb2cac77cd3a0f0dcfbe6 (64 Bit)
Unnamed

Virtual Machine
---------------
/Users/jurajkubelka/Pharo/vms/70-x64/Pharo.app/Contents/MacOS/Pharo
CoInterpreter VMMaker.oscog-eem.2504 uuid: a00b0fad-c04c-47a6-8a11-5dbff110ac11 Jan 5 2019
StackToRegisterMappingCogit VMMaker.oscog-eem.2504 uuid: a00b0fad-c04c-47a6-8a11-5dbff110ac11 Jan 5 2019
VM: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Date: Sat Jan 5 20:00:11 2019 CommitHash: 7a3c6b6 Plugins: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
Mac OS X built on Jan 5 2019 19:11:02 UTC Compiler: 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)
VMMaker versionString VM: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Date: Sat Jan 5 20:00:11 2019 CommitHash: 7a3c6b6 Plugins: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
CoInterpreter VMMaker.oscog-eem.2504 uuid: a00b0fad-c04c-47a6-8a11-5dbff110ac11 Jan 5 2019
StackToRegisterMappingCogit VMMaker.oscog-eem.2504 uuid: a00b0fad-c04c-47a6-8a11-5dbff110ac11 Jan 5 2019
```
This also happens in Pharo 6:
```
Image
-----
Pharo6.0
Latest update: #60547
Unnamed

Virtual Machine
---------------
/Users/jurajkubelka/Pharo/vms/61-x64/Pharo.app/Contents/MacOS/Pharo
CoInterpreter VMMaker.oscog-eem.2401 uuid: 29232e0e-c9e3-41d8-ae75-519db862e02c Jun 28 2018
StackToRegisterMappingCogit VMMaker.oscog-eem.2401 uuid: 29232e0e-c9e3-41d8-ae75-519db862e02c Jun 28 2018
VM: 201806281256 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Date: Thu Jun 28 14:56:30 2018 CommitHash: a8a1dc1 Plugins: 201806281256 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
Mac OS X built on Jun 28 2018 13:07:33 UTC Compiler: 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)
VMMaker versionString VM: 201806281256 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Date: Thu Jun 28 14:56:30 2018 CommitHash: a8a1dc1 Plugins: 201806281256 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
CoInterpreter VMMaker.oscog-eem.2401 uuid: 29232e0e-c9e3-41d8-ae75-519db862e02c Jun 28 2018
StackToRegisterMappingCogit VMMaker.oscog-eem.2401 uuid: 29232e0e-c9e3-41d8-ae75-519db862e02c Jun 28 2018
```
Just to confirm that it is probably not a garbage-collection problem: I could not reproduce this in the latest Squeak trunk. I did not use Zinc, because it is too difficult to install in Squeak, and replaced it with WebClient. STON is available (installed through Squit/Squot git support):
```Smalltalk
| aJson anArray |
aJson := WebClient httpGet: 'https://data.nasa.gov/resource/y77d-th95.json'.
Array streamContents: [ :aStream |
	400 timesRepeat: [
		aStream nextPutAll: (STON fromString: aJson content).
		Smalltalk saveSession ] ].
```
The resulting image file is 540 MB.
Hi Eliot and Nicolas,
Nicolas, thanks for checking on Squeak, that is useful to know.
Juraj and I have both built assert VMs and been trying to reproduce Eliot's findings.
If I add a couple of print statements to the code and run with a normal VM, I get a number of:
``` evicted zombie process from run queue ```
messages.
If I run gccrash.st with the assert VM I get:
``` ((classIndex >= 1) && (classIndex <= (classTablePageSize()))) 50170
((classIndex >= 1) && (classIndex <= (classTablePageSize()))) 50170 ```
before the process seg faults.
Do these provide any additional information to help track down the issue? (I'll include more complete information below)
I tried running a headless VM and printing instance counts about FreeType external objects in a clean image:
``` $ vmh/pharo Pharo.image eval "'ftcount.st' asFileReference fileIn" FTBBox -> 0 FTBitmap -> 0 FTBitmapSize -> 0 FTCharMapRec -> 0 FTFaceRec -> 0 FTGeneric -> 0 FTGlyphMetrics -> 0 FTGlyphSlotRec -> 0 FTListRec -> 0 FTMatrix -> 0 FTOutline -> 0 FTSizeMetrics -> 0 FTSizeRec -> 0 FTVector -> 0 FT2Handle -> 0 FT2Face -> 0 FT2Library -> 1 a FT2Library(@ 16r00000000)<0x0> File @ ftcount.st ```
Once the image has been started normally, the pointer in FT2Library becomes non-zero.
To me this suggests that, rather than the image being delivered in a corrupt state, the corruption happens early in session resumption.
More complete crash dump with normal VM:
``` $ vmga/pharo-ui Pharo.image gccrash.st Start Loop: 1 Start Loop: 2 Start Loop: 3
...
Start Loop: 13 Start
evicted zombie process from run queue Loop: 14 Start
evicted zombie process from run queue
evicted zombie process from run queue
evicted zombie process from run queue Loop: 15
...
Start Loop: 50 Start
Segmentation fault Fri Nov 15 08:14:26 2019
/home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo Pharo VM version: 5.0-201902062351 Wed Feb 6 23:59:26 UTC 2019 gcc 4.8 [Production Spur 64-bit VM] Built from: CoInterpreter VMMaker.oscog-eem.2509 uuid: 91e81f64-95de-4914-a960-8f842be3a194 Feb 6 2019 With: StackToRegisterMappingCogit VMMaker.oscog-eem.2509 uuid: 91e81f64-95de-4914-a960-8f842be3a194 Feb 6 2019 Revision: VM: 201902062351 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Date: Wed Feb 6 15:51:18 2019 CommitHash: a838346b Plugins: 201902062351 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Build host: Linux travis-job-f2b22483-7f84-414f-b833-69f69518c685 4.4.0-101-generic #124~14.04.1-Ubuntu SMP Fri Nov 10 19:05:36 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux plugin path: /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351 [default: /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/]
C stack backtrace & registers: rax 0x00000000 rbx 0x00000011 rcx 0x00000033 rdx 0x00000000 rdi 0x00000011 rsi 0x00000011 rbp 0x09928f00 rsp 0x7ffe8478f9a0 r8 0x000d131c r9 0x006898e7 r10 0x7ffe8478fa40 r11 0x5dce5052 r12 0x099b0a08 r13 0x80000000000000 r14 0x02cefb70 r15 0x0993b688 rip 0x0043d9f0 */home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo[0x43d9f0] /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo[0x41b0b3] /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo[0x41cb0e] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f2b7ee8a890] /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo[0x43d9f0] /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo(markAndTrace+0x141)[0x43f3b1] /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo[0x452fe4] /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo(fullGC+0x46)[0x457366] /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo[0x457a50] /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo[0x457e33] /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo(interpret+0xa1f5)[0x467625] /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo[0x468866] /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo(interpret+0x246)[0x45d676] /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo(main+0x2fa)[0x41a59a] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f2b7eaa8b97] /home/alistair/pharo8/gccrash/vmga/pharo-vm/lib/pharo/5.0-201902062351/pharo[0x41a8c4] [0x0]
Smalltalk stack dump: 0x7ffe847d3b20 I SessionManager>launchSnapshot:andQuit: 0x2d744f8: a(n) SessionManager 0x7ffe847d3b90 I [] in SessionManager>snapshot:andQuit: 0x2d744f8: a(n) SessionManager 0x7ffe847d3bd0 I [] in INVALID RECEIVER>newProcess 0x2a39178 is in new space
Most recent primitives snip... wait snapshotPrimitive **IncrementalGC** **FullGC**
stack page bytes 8192 available headroom 5576 minimum unused headroom 4672
(Segmentation fault) vmga/pharo-ui: line 11: 322 Aborted (core dumped) "$DIR"/"pharo-vm/pharo" "$@" ```
Running with the assert VM in lldb:
``` (lldb) run Pharo.image gccrash.st Process 251 launched: '/home/alistair/pharo8/gccrash/vm/lib/pharo/5.0-201911140223/pharo' (x86_64)
((classIndex >= 1) && (classIndex <= (classTablePageSize()))) 50170
((classIndex >= 1) && (classIndex <= (classTablePageSize()))) 50170 Process 251 stopped * thread #1, name = 'pharo', stop reason = signal SIGIO frame #0: 0x00007ffff6a2403f libc.so.6`__select + 95 libc.so.6`__select: -> 0x7ffff6a2403f <+95>: cmpq $-0x1000, %rax ; imm = 0xF000 0x7ffff6a24045 <+101>: ja 0x7ffff6a2407a ; <+154> 0x7ffff6a24047 <+103>: movl %r9d, %edi 0x7ffff6a2404a <+106>: movl %eax, 0xc(%rsp) (lldb) bt * thread #1, name = 'pharo', stop reason = signal SIGIO * frame #0: 0x00007ffff6a2403f libc.so.6`__select + 95 frame #1: 0x00000000004a91ce pharo`aioPoll(microSeconds=50000) at aio.c:316 frame #2: 0x00007ffff358d6de vm-display-X11.so`display_ioRelinquishProcessorForMicroseconds(microSeconds=<unavailable>) at sqUnixX11.c:4943 frame #3: 0x0000000000419d3e pharo`ioRelinquishProcessorForMicroseconds(us=<unavailable>) at sqUnixMain.c:588 frame #4: 0x0000000000468de6 pharo`primitiveRelinquishProcessor at gcc3x-cointerp.c:34635 frame #5: 0x00000000004235f9 pharo`interpret at gcc3x-cointerp.c:6203 frame #6: 0x000000000042b5f1 pharo`enterSmalltalkExecutiveImplementation at gcc3x-cointerp.c:16294 frame #7: 0x000000000041d6d1 pharo`interpret at gcc3x-cointerp.c:2772 frame #8: 0x000000000041bae6 pharo`main(argc=3, argv=0x00007fffffffe288, envp=<unavailable>) at sqUnixMain.c:2150 frame #9: 0x00007ffff692eb97 libc.so.6`__libc_start_main + 231 frame #10: 0x00000000004197ca pharo`_start + 42 ```
Thanks, Eliot!
Hi Alistair,
To get lldb to break when it reports an assert failure so that you can investigate, put a breakpoint on warning:
lldb> b Pharo`warning
Hi All, to simplify checking there is now a generated image checker. This is a cut-down VM that only loads an image and runs the leak checker, answering 0 (unix's OK exit code) if the image is free of leaks, and non-zero if it is leaky. The program takes a -verbose/--verbose argument that will cause it to list the leaks or write a reassuring message if there are none.
This can be built for mac in build.macos64x64/squeak.stack.spur & build.macos32x86/squeak.stack.spur by saying make production (image leak checker) and it produces a program called validImage in squeak.stack.spur/build/vm. I saw that it took 2 seconds to load and check a 1Gb image so it should be fast enough to be used in a CI context.
See https://github.com/OpenSmalltalk/opensmalltalk-vm/commit/f83bde2bf5c325ce26f...
HTH
Bravo, this is a really good idea :-)
Dave
Hi Eliot,
I've spent some more time trying to track this down...
I've been working with a minimal pharo image, which can be downloaded from:
http://files.pharo.org/image/80/latest-minimal-64.zip
The minimal image doesn't have FreeType loaded yet, so we can rule out FreeType as the cause of this particular issue (not to say that FreeType doesn't have problems of its own).
Running the following script with the minimal image and a debug VM:
```
| aJson anArray |
aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
Array streamContents: [ :aStream |
    1 to: 400 do: [ :i |
        Stdio stdout << 'ExternalAddress: '; print: ExternalAddress instanceCount; lf; << 'Start'; lf; flush.
        aStream nextPutAll: (STON fromString: aJson contents).
        Smalltalk garbageCollect.
        Stdio stdout << 'Loop: '; print: i; lf; flush ] ].
```
Shows two things:
1. There are no instances of ExternalAddress, so the chance of this being FFI related seems quite small.
2. The script runs to completion.
So there appears to be no memory corruption up to this stage.
Modifying the script once more to save the image instead of just garbage collecting:
```
| aJson anArray |
aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
Array streamContents: [ :aStream |
    1 to: 400 do: [ :i |
        Stdio stdout << 'ExternalAddress: '; print: ExternalAddress instanceCount; lf; << 'Start'; lf; flush.
        aStream nextPutAll: (STON fromString: aJson contents).
        Smalltalk saveImageInFileNamed: 'Save.', i asString, '.image'.
        Stdio stdout << 'Loop: '; print: i; lf; flush ] ].
```
Results in the segmentation fault. In this case it was while saving the 90th image:
``` $ ./validImage Save.88.image $ ./validImage Save.89.image Segmentation fault (core dumped) ```
`Save.90.image` doesn't exist, but `Save.90.changes` does.
I'll try attaching a file containing the terminal output (as much as was buffered).
Please let me know if you disagree with any of my reasoning.
The only difference between the two scripts is that the second one writes the image to disk, which suggests that the image saving is the likely cause of the issue.
What do you think?
Thanks! Alistair
To paraphrase github's error message: `We don't support uploading text files, try again with TXT`.
A short extract:
``` (isMarked(oop)) || (obj == (hiddenRootsObject())) 60126
isPostMobile(fwd) 60129
isMarked(oop) 60112
isPostMobile(fwd) 60115
(isMarked(oop)) || (obj == (hiddenRootsObject())) 60126
isPostMobile(fwd) 60129 Loop: 89 ExternalAddress: 0 Start
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
objCouldBeClassObj(classObj) 51687
Segmentation fault Mon Nov 18 13:20:09 2019
/dev/shm/minimal/vmdbg/lib/pharo/5.0-201911160620/pharo Pharo VM version: 5.0-201911160620 Mon Nov 18 11:31:44 CET 2019 clang [Debug Spur 64-bit VM] Built from: CoInterpreter VMMaker.oscog-eem.2585 uuid: 5282e96d-1d2e-4039-a905-429834c37da2 Nov 18 2019 With: StackToRegisterMappingCogit VMMaker.oscog-eem.2585 uuid: 5282e96d-1d2e-4039-a905-429834c37da2 Nov 18 2019 Revision: VM: 201911160620 alistair@alistair-Precision-3541:opensmalltalk-vm/opensmalltalk-vm Date: Fri Nov 15 22:20:29 2019 CommitHash: 21112274a Plugins: 201911160620 alistair@alistair-Precision-3541:opensmalltalk-vm/opensmalltalk-vm Build host: Linux 2273a8715b2d 5.0.0-36-generic #39~18.04.1-Ubuntu SMP Tue Nov 12 11:09:50 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux plugin path: vmdbg/lib/pharo/5.0-201911160620 [default: /dev/shm/minimal/vmdbg/lib/pharo/5.0-201911160620/]
C stack backtrace & registers: rax 0x00000030 rbx 0x00000000 rcx 0x00000270 rdx 0x034c8810 rdi 0x00000033 rsi 0x00000000 rbp 0x7ffc9e69e070 rsp 0x7ffc9e69de80 r8 0x00000000 r9 0x00000022 r10 0xffffffde r11 0x00000000 r12 0x004198a0 r13 0x7ffc9e708c80 r14 0x00000000 r15 0x00000000 rip 0x0048762f *vmdbg/lib/pharo/5.0-201911160620/pharo(markAndTrace+0x87f)[0x48762f] vmdbg/lib/pharo/5.0-201911160620/pharo[0x41afa4] vmdbg/lib/pharo/5.0-201911160620/pharo[0x41e0d8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f90fa38c890] vmdbg/lib/pharo/5.0-201911160620/pharo(markAndTrace+0x87f)[0x48762f] vmdbg/lib/pharo/5.0-201911160620/pharo[0x481651] vmdbg/lib/pharo/5.0-201911160620/pharo(fullGC+0xea)[0x4805fa] vmdbg/lib/pharo/5.0-201911160620/pharo[0x4c9f65] vmdbg/lib/pharo/5.0-201911160620/pharo[0x4add41] vmdbg/lib/pharo/5.0-201911160620/pharo(interpret+0xa661)[0x42a6d1] vmdbg/lib/pharo/5.0-201911160620/pharo[0x43e032] vmdbg/lib/pharo/5.0-201911160620/pharo(interpret+0x140)[0x4201b0] vmdbg/lib/pharo/5.0-201911160620/pharo(main+0x2d6)[0x41d8d6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f90f9faab97] vmdbg/lib/pharo/5.0-201911160620/pharo(_start+0x2a)[0x4198ca] [0x0] ```
Just a bit more:
``` $ /dev/shm/minimal/validImage --verbose Save.88.image Image Save.88.image is free of leaks
$ /dev/shm/minimal/validImage --verbose Save.89.image | wc -l 3574
$ /dev/shm/minimal/validImage --verbose Save.89.image | head object leak in 0x5363788 @ 5 = 0x53e34e8 object leak in 0x5363788 @ 9 = 0x53e3538 object leak in 0x5363788 @ 10 = 0x53e3560 object leak in 0x5363788 @ 11 = 0x53e3580 object leak in 0x5363788 @ 16 = 0x53e35c0 object leak in 0x5363788 @ 17 = 0x53e35f0 object leak in 0x5363788 @ 21 = 0x53e3610 object leak in 0x5363848 @ 8 = 0x53e3680 object leak in 0x5363848 @ 10 = 0x53e36d0 object leak in 0x5363848 @ 12 = 0x53e3710 ```
P.P.S. It would be great to be able to `call validateImage()` from within the debugger.
If I run gccrash.st with the assert VM I get:
((classIndex >= 1) && (classIndex <= (classTablePageSize()))) 50170 ((classIndex >= 1) && (classIndex <= (classTablePageSize()))) 50170
before the process seg faults.
Do these provide any additional information to help track down the issue? (I'll include more complete information below)
The way to use the assert failures to track down bugs is to run under the debugger, place a breakpoint in warning (the print function for assert failures), and then run. When you reach an assert fail you can then print the stack trace, etc. So the assert message on its own tells you something is wrong, but one needs to use the debugging facilities to get more information.
P.P.S. It would be great to be able to `call validateImage()` from within the debugger.
You can run the leak checker. You have to be careful because it scans stack frames, etc., and so needs to be called at a sensible point. Experience shows what these points are, but generally around the invocation of a primitive is a good point, or when the VM is polling for events, etc. But halfway through building a frame is probably not a good idea.
In any case, here are the functions:
```
runLeakCheckerForFreeSpace(sqInt gcModes)
runLeakCheckerFor(sqInt gcModes)
runLeakCheckerForexcludeUnmarkedObjsclassIndicesShouldBeValid(sqInt gcModes, sqInt excludeUnmarkedObjs, sqInt classIndicesShouldBeValid)
```
And the GC modes are in VMMaker, see GCModeFull, etc.
```
Aeolus.image$ finddefine GCMode ../spur64src/vm/gcc3x-cointerp.c
#define GCModeBecome 8
#define GCModeFreeSpace 32
#define GCModeFull 1
#define GCModeImageSegment 16
#define GCModeIncremental 4
#define GCModeNewSpace 2
```
To this we should of course add GCModePreFFICall & GCModePostFFICall. I may implement this soon but am concentrating on getting the ARMv8 processor simulator working.
Hi Eliot,
Tracing back from the first time a warning is printed in the assert VM, it looks like what is happening is that a `BlockClosure` in the currently executing context is being moved during garbage collection, but that the FP isn't being updated.
I have a script which walks gdb to the point where the FP corruption occurs, but ignoring the multiple steps it takes to get to the correct iteration:
Setting a breakpoint in `scavengeReferentsOf()` at line 42259 in `gcc3x-cointerp.c` and then:
``` (gdb) list 42254 if (oopisGreaterThanOrEqualToandLessThan(referent, ((futureSpace()).start), futureSurvivorStart())) { 42255 newLocation = referent; 42256 foundNewReferentOrIsWeakling = 1; 42257 } 42258 else { 42259 newLocation = copyAndForward(referent); 42260 if (((newLocation & (tagMask())) == 0) 42261 && (oopisLessThan(newLocation, GIV(newSpaceLimit)))) { 42262 foundNewReferentOrIsWeakling = 1; 42263 } (gdb) call printCallStack() 0x7ffffffe3f70 I SessionManager>launchSnapshot:andQuit: 0x1508860: a(n) SessionManager 0x7ffffffe3fe0 I [] in SessionManager>snapshot:andQuit: 0x1508860: a(n) SessionManager 0x7ffffffe4020 I [] in BlockClosure>newProcess 0x118a728: a(n) BlockClosure
(gdb) n (gdb) call printCallStack() 0x7ffffffe3f70 I SessionManager>launchSnapshot:andQuit: 0x1508860: a(n) SessionManager 0x7ffffffe3fe0 I [] in SessionManager>snapshot:andQuit: 0x1508860: a(n) SessionManager 0x7ffffffe4020 I [] in INVALID RECEIVER>newProcess 0x118a728 0x118a728 is a forwarded object to 0x488fca0 of slot size 7 hdr8 ..... ```
Setting a breakpoint at line 46909:
``` (gdb) list 46904 /* begin initFutureSpaceStart */ 46905 oldStart = GIV(futureSurvivorStart); 46906 GIV(futureSurvivorStart) = (GIV(futureSpace).start); 46907 GIV(pastSpaceStart) = oldStart; 46908 assert(oopisGreaterThanOrEqualToandLessThanOrEqualTo(GIV(pastSpaceStart), ((pastSpace()).start), ((pastSpace()).limit))); 46909 GIV(freeStart) = ((eden()).start); 46910 /* begin initSpaceForAllocationCheck:limit: */ 46911 aNewSpace = (&(eden())); 46912 if (!(GIV(memory) == null)) { 46913 if (checkAllocFiller()) { (gdb) call printCallStack() 0x7ffffffe3f70 I SessionManager>launchSnapshot:andQuit: 0x1508860: a(n) SessionManager 0x7ffffffe3fe0 I [] in SessionManager>snapshot:andQuit: 0x1508860: a(n) SessionManager 0x7ffffffe4020 I [] in INVALID RECEIVER>newProcess 0x118a728 0x118a728 is a forwarded object to 0x488fca0 of slot size 7 hdr8 ..... (gdb) n 46911 aNewSpace = (&(eden())); (gdb) call printCallStack() 0x7ffffffe3f70 I SessionManager>launchSnapshot:andQuit: 0x1508860: a(n) SessionManager 0x7ffffffe3fe0 I [] in SessionManager>snapshot:andQuit: 0x1508860: a(n) SessionManager 0x7ffffffe4020 I [] in INVALID RECEIVER>newProcess 0x118a728 is in new space (gdb) ```
The VM actually crashes some time later; however, the call stack shows an invalid receiver.
Shouldn't the FP be updated to point to the moved `BlockClosure`?
I tried to investigate the sensitivity of the crash probability to the eden size, and there definitely is some. In my testing setup, if the eden size was set to around 2 MB, it crashed. For values around 1 MB or 10 MB, crashes did not occur or were significantly less probable. However, there seems to be no strict limit.
-- Pavel
Incidentally, I played around with the VM on a big machine (>1.4 TB RAM) and tried to set a large eden bytes count (around 100M or more); it crashed early on.
yes, it is possible. I was talking about my testing setup for this crash where I have a small script that causes it very quickly so, to some extent, it generates very similar object memory every time. And in these conditions, there is clear Eden size dependency on the probability of the crash.
Hi Alistair,
This is great, especially because when I discussed the problem with Guille last week, he told me that they suspected a cause other than FreeType and FFI. So it looks like their hypothesis was right. (Guille is coming back from fatherhood leave and is restarting work a couple of days a week.) Apparently, what Pablo was looking into before leaving on vacation was whether the forwarder logic, or something related to the write barrier and the GC, was failing to nil one stack element. But I may have misunderstood. Tx
Hi Stef,
Apparently, what Pablo was looking into before leaving on vacation was whether the forwarder logic, or something related to the write barrier and the GC, was failing to nil one stack element. But I may have misunderstood.
That aligns with one thing I'm seeing: the frame pointer not being updated after an object is moved and its old copy converted to a forwarder.
However, there are other assertions being printed before the frame pointer corruption which I don't understand yet.
It would be great if Pablo could pass on what he knows after he gets back from leave.
Thanks, Alistair
Just for the record:
``` Thread 1 "pharo" hit Breakpoint 5, scavengeReferentsOf (referrer=22055008) at /home/alistair/opensmalltalk-vm/opensmalltalk-vm/spur64src/vm/gcc3x-cointerp.c:42158 42158 assert((!(isForwarded(referrer))) 0x7ffffffa7f80 I SessionManager>launchSnapshot:andQuit: 0x1508860: a(n) SessionManager 0x7ffffffa7ff0 I [] in SessionManager>snapshot:andQuit: 0x1508860: a(n) SessionManager 0x7ffffffa8030 I [] in BlockClosure>newProcess 0xef01d8: a(n) BlockClosure (gdb) call printFrame(0x7ffffffa8030) 0x7ffffffa8030 is on a free page?! $2 = 0 ```
The call frame being on a free page doesn't look good.
Last week I was trying to find a simpler way to reproduce it (still a work in progress, BTW). @pavel-krivanek had originally handed us the following script, which reproduces it with a high degree of certainty, but it depends on opening Calypso (which opens new processes...):
```smalltalk
20 timesRepeat: [ ClyFullBrowser openOnMethod: SourceSansProRegular class >> #fontContents ].
1 to: 20 do: [ :i | Smalltalk snapshot: true andQuit: false ].
```
I've been trying to reduce it to something that depends on neither Calypso nor Pharo's startup. I got the following script, which takes longer to run and does not reproduce the problem as often as Pavel's:
```smalltalk
2000 timesRepeat: [
    [ 1000 timesRepeat: [
        [ (1 to: 100000) collect: [ :e | e ** 17 ] ] forkAt: 38 ] ] fork ].
1 to: 20 do: [ :i | Smalltalk snapshotPrimitive ].
```
The good thing about this last script is that it also shows that the problem has nothing to do with the startup mechanism or whatever is running in it.
Enabling the leak checker shows output when snapshotting, but once the leak has occurred it's too late: the leaking objects are already corrupted.
Hi Guille,
Can you try running this script from a minimal image:
```
| aJson anArray |
aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
Array streamContents: [ :aStream |
    1 to: 400 do: [ :i |
        Stdio stdout << 'ExternalAddress: '; print: ExternalAddress instanceCount; lf; << 'Start'; lf; flush.
        aStream nextPutAll: (STON fromString: aJson contents).
        Smalltalk saveImageInFileNamed: 'Save.', i asString, '.image'.
        Stdio stdout << 'Loop: '; print: i; lf; flush ] ].
```
I'm using a minimal image from 18 Nov 2019 and it normally crashes on the 90th iteration. On my machine it takes about 50 seconds to crash. The good bit, though, is that the addresses are the same on each run, so breakpoints can easily be set when working backwards.
``` pharo -headless Pharo.image gccrash.st ```
I could reproduce it with that too, but I would like a case that I can run in a 100kb image in the simulator :) Otherwise this is super hard to debug...
If you get it reproducible in the simulator please let me know. I'm working in gdb.
Hi Alistair,
I've just investigated the bug tonight and fixed it in VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and was able to run the 400 iterations of your script without any crashes. Thanks for the easy reproduction! Last year, when I used the GC benchmarks provided by Feenk, with ~10 GB workloads, for the DLS paper [1], I initially had an image crashing 9 times out of 10 when going to 10 GB. I fixed a few bugs in the production GC back then (mainly in segment management), which led the benchmarks to run successfully 99% of the time. But it was still crashing 1% of the time; since I was benchmarking experimental GCs with various changes, I thought the bug did not happen in the production GC, but it turns out I was wrong. And you found a reliable way to reproduce it :-), so I could investigate. It's so much fun to do lemming debugging in the simulator.
The GC bug was basically that when the PlanningCompactor (the production full GC compactor) decided to do a multiple-pass compaction, if it managed to compact everything in one go it would get confused and attempt to compact objects upward instead of downward (address-wise) on the second pass, and that's broken and corrupts memory.
I started from this script:
```
| aJson anArray |
aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
Array streamContents: [ :aStream |
    400 timesRepeat: [
        aStream nextPutAll: (STON fromString: aJson contents).
        Smalltalk saveSession ] ].
```
It makes me very sad, however, that you were not able to use the simulator to debug this issue. I used it, and that's how I tracked down the bug in only a few hours. Tracking things down in lldb would have taken me weeks, and I would not have been able to do it since I work during the week :-).
Therefore I'm going to explain my process for reproducing the bug in the simulator and understanding where the issue comes from. The mail is quite long, but it would be nice if you could track down such a bug quickly on your own next time using the simulator. Of course you can skip it if you're not interested. @Eliot, you may want to read it, since I explain how I set up a Pharo 7 image for simulator debugging; that might come in handy for you at some point.
1] The first thing I did was to reproduce your bug, based on the script, on both a Cog and a Stack VM compiled from the OpenSmalltalk-VM repository. I initially started with Pharo 8, but for some reason that image is quite broken (formatter issue? integrator gone wild?), so I switched to Pharo 7 stable. It crashes on both VMs, so I knew the bug was unrelated to the JIT. Most bugs in the core VM (besides people mis-using FFI, which is by far the most common VM bug reported) are either JIT or GC bugs, so we're tracking a GC bug. I then built an image which runs your script at start-up (Smalltalk snapshot: true andQuit: true followed by your script; I select all and run it as a do-it).
2] Then I started the image in the simulator. The first thing I noticed is that Pharo 7 makes FFI calls into FreeType from start-up; even if you're not using text, or if you disable FreeType from the settings browser, Pharo performs FreeType FFI calls in the background. The FreeType FFI calls are incorrectly implemented (the C stack references heap objects which are not pinned), therefore these calls corrupt the heap. Running a corrupted heap on the VM is undefined behavior, so any usage of Pharo 7 right now, whether you actually use text or not, and whether FreeType is enabled in the settings or not, is undefined behavior. I saw Nicolas/Eliot complaining in the thread that this is not a VM bug; indeed, pinning objects is an image-side responsibility, so it's not a VM bug. In addition, most reported bugs come from people mis-using FFI, so I understand their answer. There was, however, another bug in the GC, but it's very hard for us to debug it if it's hidden behind image-corrupting bugs like the FreeType one here. So for that I made this change: FreeTypeSettings>>startUp: resuming "resuming ifTrue:[ self updateFreeType ]" — then saved, restarted the image, and ensured it was not corrupted (leak checker + swizzling in simulation).
3] Then I started the image in the simulator. It turns out the image start-up raises an error if libgit cannot be loaded, and then the start-up script is not executed because of the exception. So I made this change: LibGitLibrary>>startUp: isImageStarting "isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"
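Laid out as methods, the two start-up changes above (FreeType and LibGit, each disabled by commenting out the method body) read:

```Smalltalk
FreeTypeSettings >> startUp: resuming
	"resuming ifTrue: [ self updateFreeType ]"

LibGitLibrary >> startUp: isImageStarting
	"isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"
```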
4] It turns out ZnEasy does not work well in the simulator. So I preloaded the result of this line, aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl, into a global variable. The rest of the script remains the same. I can finally run your script in the simulator! Usually we simulate a Squeak image and all these preliminary steps are not required. But! It is still easier to reproduce this bug than most bugs I have to deal with for Android at work; at least I don't need to buy an uncommon device from an obscure Chinese vendor to reproduce it :-).
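A minimal sketch of the preload (the global name NasaJson is my invention; any global works):

```Smalltalk
"Run once outside the simulator, so the simulated run needs no network access."
Smalltalk at: #NasaJson put:
	(ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl).
"The script then sends (STON fromString: NasaJson contents) as before."
```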
5] To shortcut simulation time, since the bug happened around the 60th save for me, I built a different script which snapshots the image to different image names. With a crash at snapshot 59 (only the changes file was written to disk), image 57 was the latest non-corrupted image. I then started the simulator (the StackInterpreterSimulator, since we are debugging a GC bug, not the Cog simulator: simulation is faster and simpler). I used the standard script available in the workspace of the Cog dev image built from the guidelines [2]: | sis | sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager). "sis desiredNumStackPages: 8." "Speeds up scavenging when simulating. Set to e.g. 64 for something like the real VM." "sis assertValidExecutionPointersAtEachStep: false." "Set this to true to turn on an expensive assert that checks for valid stack, frame pointers etc on each bytecode. Useful when you're adding new bytecodes or exotic execution primitives." sis openOn: 'Save57.image'. sis openAsMorph; run I then let the simulator run, went swimming for 1h, and came back 1h30 later (with commute time). The bug happened in the simulator at save 90; I don't know how long it took to reproduce, but < 1h30. Then I had an assertion failure in the compactor: self assert: (self validRelocationPlanInPass: finalPass) = 0. Good! From there I debugged using lemming debugging (a technique described in [3], Section 3.2). When the assertion fails, the running simulation is the clone. I went up in the debugger to the point where the clone was made, and restarted the same GC approximately 40 times during debugging, because once the heap is corrupted you cannot tell any more what the problem is, yet you need to trigger the problem to understand it. 40 lemmings over that cliff :-) Good lemmings.
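For readability, here is the same simulator launch script, laid out as in the Cog dev image workspace:

```Smalltalk
| sis |
sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager).
"sis desiredNumStackPages: 8."  "Speeds up scavenging when simulating.
	Set to e.g. 64 for something like the real VM."
"sis assertValidExecutionPointersAtEachStep: false."  "Set this to true to turn on
	an expensive assert that checks for valid stack, frame pointers etc on each bytecode."
sis openOn: 'Save57.image'.
sis openAsMorph; run
```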
Then I quickly figured out that the GC was performing two successive compactions, and that the second compaction was broken right at the start (it tries to move objects upward). Then I looked at the glue code in between the two compactions, and yeah: in the case where the first compaction has compacted everything, the variables are incorrectly set up for the second compaction. I tried fixing the variables, but it's not that easy, so instead I just abort compaction in that case (see VMMaker.oscog-cb.2595).
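As a hypothetical sketch (these selectors are my invention; the real change is in the compactor, VMMaker.oscog-cb.2595), the workaround amounts to:

```Smalltalk
"If the first pass already compacted everything, stop here instead of running
a second pass with stale variables that would move objects upward."
self runFirstCompactionPass.
self firstPassCompactedEverything ifTrue: [ ^ self ].
self runSecondCompactionPass
```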
6] I then compiled a VM from the sources to check that the Slang translator would not complain; it did not. I then built a Stack VM (the Cog VM seems to be broken at tip of tree due to on-going work on ARMv8 support) and ran your script again. I was able to run the 400 iterations without a crash. The bug seems to be fixed!
@Eliot now needs to fix tip of tree, generate the code and produce new VMs. ARMv8 support is quite exciting though, given that MacBooks do not support 32 bits any more and that the next MacBooks are rumoured to be on ARMv8. One wouldn't want to run the VM in a VirtualBox Intel image :-).
Alistair, let me know if you have questions. I hope you can work with the simulator as efficiently as we can. If you've not seen it, there's a screencast where I show how I used the simulator to debug JIT bugs [4]. The audio is not very good because my spoken English sucks, but it shows the main ideas.
[1] https://www.researchgate.net/publication/336422106_Lazy_pointer_update_for_l... [2] http://www.mirandabanda.org/cogblog/build-image/ [3] https://www.researchgate.net/publication/328509577_Two_Decades_of_Smalltalk_... [4] https://clementbera.wordpress.com/2018/03/07/sista-vm-screencast/
On Wed, Nov 27, 2019 at 12:24 PM Alistair Grant notifications@github.com wrote:
If you get it reproducible in the simulator please let me know. I'm working in gdb.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/OpenSmalltalk/opensmalltalk-vm/issues/444?email_source=notifications&email_token=AIJPEW42WBGG6FAUZKODAXDQVZKITA5CNFSM4JNBNJH2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFJGINQ#issuecomment-559047734, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIJPEW6JWKIV3TOLN3WI4CTQVZKITANCNFSM4JNBNJHQ .
Hi Clément, Hi All,
On Thu, Nov 28, 2019 at 1:36 PM Clément Béra bera.clement@gmail.com wrote:
Hi Alistair,
I've just investigated the bug tonight and fixed it in VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and was able to run the 400 iterations of your script without any crashes. Thanks for the easy reproduction! Last year, when I used the GC benchmarks provided by Feenk, with ~10Gb workloads, for the DLS paper [1], I initially had an image crashing 9 times out of 10 when going to 10Gb. I fixed a few bugs in the production GC back then (mainly in segment management), which led the benchmarks to run successfully 99% of the time. But it was still crashing the other 1%; since I was benchmarking experimental GCs with various changes, I thought the bug did not happen in the production GC, but it turns out I was wrong. And you found a reliable way to reproduce :-). So I could investigate. It's so fun to do lemming debugging in the simulator.
Yeah :-) For those unaware of what this is, it is simply the ability to repeat GC crashes in the simulator.
What we do, by default, is that every time we run a GC in the simulator we first create a complete deep copy of the VM interpreter, memory manager and heap, and then run the GC in the deep copy. If that crashes because of a bug in the GC, we can loop from then on, taking fresh copies of the VM state and running the GC until we understand, and then fix, the bug. The problem with normal GC debugging is that the final state of the heap is so corrupted one cannot make sense of it and cannot figure out where the bug is. Here we always preserve the initial, valid state of the heap.
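In outline (the selectors here are hypothetical, not the actual simulator API), the lemming-debugging loop looks like:

```Smalltalk
"Each GC runs in a deep copy; the original, valid heap is never mutated,
so a crashing GC can be replayed from the same state as often as needed."
| clone |
[ clone := simulator veryDeepCopy.   "a fresh lemming"
  clone fullGC ]                      "run the GC in the copy"
	on: AssertionFailure
	do: [ :ex | "inspect the clone, refine the hypothesis, send another lemming" ]
```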
The GC bug was basically that when the Planning Compactor (the production full-GC compactor) decided to do a multiple-pass compaction, if it managed to compact everything in one go it would get confused and attempt to compact objects upward instead of downward (address-wise) on the second pass, and that's broken and corrupts memory.
Fantastic, Clément. Thank you.
I started from this script:
| aJson anArray | aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl. Array streamContents: [ :aStream | 400 timesRepeat: [ aStream nextPutAll: (STON fromString: aJson contents). Smalltalk saveSession ] ].
It makes me however very sad that you were not able to use the simulator to debug this issue. I used it, and that's how I tracked down the bug in only a few hours. Tracking things down in lldb would have taken me weeks, and I would not have been able to do it since I work during the week :-).
Therefore I'm going to explain my process to you: how I reproduced the bug in the simulator and understood where the issue comes from. The mail is quite long, but it would be nice if you could track down such a bug quickly on your own next time using the simulator. Of course you can skip it if you're not interested. @Eliot you may want to read it, since I explain how I set up a Pharo 7 image for simulator debugging; that might come in handy for you at some point.
1] The first thing I did was to reproduce your bug, based on the script, on both the Cog and Stack VMs compiled from the OpenSmalltalk-VM repository. I initially started with Pharo 8, but for some reason that image is quite broken (formatter issue? Integrator gone wild?). So I switched to Pharo 7 stable. It crashes on both VMs, so I knew the bug was unrelated to the JIT. Most bugs in the core VM (besides people mis-using FFI, which is by far the most common VM bug reported) are either JIT or GC bugs. So we're tracking a GC bug. I then built an image which runs your script at start-up (Smalltalk snapshot: true andQuit: true followed by your script; I select all and run the do-it).
2] Then I started the image in the simulator. The first thing I noticed is that Pharo 7 makes FFI calls into FreeType from start-up: even if you're not using text, or if you disable FreeType in the settings browser, Pharo performs FreeType FFI calls in the background. The FreeType FFI calls are incorrectly implemented (the C stack references heap objects which are not pinned), therefore these calls corrupt the heap.
This is a critical bug to fix. It also blinded me to my own bug in the compactor. Apologies to all for my stubbornness in pointing the finger at FreeType. But the bug here is still serious.
Further, it is at least conceptually possible to simulate FFI calls, or rather to simulate the marshalling of FFI calls so that FFI calls can be made within simulation. We could therefore check in the simulator for unpinned objects, etc., but we would still be vulnerable to unsafe foreign code corrupting the system. Any FFI call is potentially unsafe, because it is infeasible (performance-wise) to e.g. page-protect the Smalltalk heap and VM state during an FFI call, and completely impossible to do this for a threaded (non-blocking) FFI call.
Running the VM on a corrupted heap is undefined behavior, therefore any usage of Pharo 7 right now, whether you actually use text or not, and whether FreeType is enabled in the settings or not, is undefined behavior. I saw in the thread Nicolas/Eliot complaining that this is not a VM bug; indeed, pinning objects is an image-side responsibility, so it's not a VM bug. In addition, most reported bugs come from people mis-using FFI, so I understand their answer. There was however another bug in the GC, but it's very hard for us to debug it while it's hidden behind image-corrupting bugs like the FreeType one. So I made this change: FreeTypeSettings>>startUp: resuming "resuming ifTrue:[ self updateFreeType ]" then saved, restarted the image, and ensured it was not corrupted (leak checker + swizzling in simulation).
3] Then I started the image in the simulator. It turns out the image start-up raises an error if libgit cannot be loaded, and then the start-up script is not executed because of the exception. So I made this change: LibGitLibrary>>startUp: isImageStarting "isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"
4] It turns out ZnEasy does not work well in the simulator. So I preloaded the result of this line, aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl, into a global variable. The rest of the script remains the same. I can finally run your script in the simulator! Usually we simulate a Squeak image and all these preliminary steps are not required. But! It is still easier to reproduce this bug than most bugs I have to deal with for Android at work; at least I don't need to buy an uncommon device from an obscure Chinese vendor to reproduce it :-).
5] To shortcut simulation time, since the bug happened around the 60th save for me, I built a different script which snapshots the image to different image names. With a crash at snapshot 59 (only the changes file was written to disk), image 57 was the latest non-corrupted image. I then started the simulator (the StackInterpreterSimulator, since we are debugging a GC bug, not the Cog simulator: simulation is faster and simpler). I used the standard script available in the workspace of the Cog dev image built from the guidelines [2]: | sis | sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager). "sis desiredNumStackPages: 8." "Speeds up scavenging when simulating. Set to e.g. 64 for something like the real VM." "sis assertValidExecutionPointersAtEachStep: false." "Set this to true to turn on an expensive assert that checks for valid stack, frame pointers etc on each bytecode. Useful when you're adding new bytecodes or exotic execution primitives." sis openOn: 'Save57.image'. sis openAsMorph; run I then let the simulator run, went swimming for 1h, and came back 1h30 later (with commute time). The bug happened in the simulator at save 90; I don't know how long it took to reproduce, but < 1h30. Then I had an assertion failure in the compactor: self assert: (self validRelocationPlanInPass: finalPass) = 0. Good! From there I debugged using lemming debugging (a technique described in [3], Section 3.2). When the assertion fails, the running simulation is the clone. I went up in the debugger to the point where the clone was made, and restarted the same GC approximately 40 times during debugging, because once the heap is corrupted you cannot tell any more what the problem is, yet you need to trigger the problem to understand it. 40 lemmings over that cliff :-) Good lemmings.
:-)
Then I quickly figured out that the GC was performing two successive compactions, and that the second compaction was broken right at the start (it tries to move objects upward). Then I looked at the glue code in between the two compactions, and yeah: in the case where the first compaction has compacted everything, the variables are incorrectly set up for the second compaction. I tried fixing the variables, but it's not that easy, so instead I just abort compaction in that case (see VMMaker.oscog-cb.2595).
6] I then compiled a VM from the sources to check that the Slang translator would not complain; it did not. I then built a Stack VM (the Cog VM seems to be broken at tip of tree due to on-going work on ARMv8 support) and ran your script again. I was able to run the 400 iterations without a crash. The bug seems to be fixed!
@Eliot now needs to fix tip of tree, generate the code and produce new VMs. ARMv8 support is quite exciting though, given that MacBooks do not support 32 bits any more and that the next MacBooks are rumoured to be on ARMv8. One wouldn't want to run the VM in a VirtualBox Intel image :-).
Right. I'm committing improvements to the ImageLeakChecker first, so that it detects out-of-bounds oops such as those created by the compactor bug. I have this working; I am just committing/generating first. Then I'll fix source generation, which is temporarily broken by the ARMv8 work. It should be easy to fix. Then I can generate fresh VMs.
Alistair, let me know if you have questions. I hope you can work with the simulator as efficiently as we can. If you've not seen it, there's a screencast where I show how I used the simulator to debug JIT bugs [4]. The audio is not very good because my spoken English sucks, but it shows the main ideas.
[1] https://www.researchgate.net/publication/336422106_Lazy_pointer_update_for_l... [2] http://www.mirandabanda.org/cogblog/build-image/ [3] https://www.researchgate.net/publication/328509577_Two_Decades_of_Smalltalk_... [4] https://clementbera.wordpress.com/2018/03/07/sista-vm-screencast/
On Wed, Nov 27, 2019 at 12:24 PM Alistair Grant notifications@github.com wrote:
If you get it reproducible in the simulator please let me know. I'm working in gdb.
-- Clément Béra https://clementbera.github.io/ https://clementbera.wordpress.com/
Hi Clément,
On Thu, 28 Nov 2019 at 22:36, Clément Béra bera.clement@gmail.com wrote:
Hi Alistair,
I've just investigated the bug tonight and fixed it in VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and was able to run the 400 iterations of your script without any crashes. Thanks for the easy reproduction! Last year, when I used the GC benchmarks provided by Feenk, with ~10Gb workloads, for the DLS paper [1], I initially had an image crashing 9 times out of 10 when going to 10Gb. I fixed a few bugs in the production GC back then (mainly in segment management), which led the benchmarks to run successfully 99% of the time. But it was still crashing the other 1%; since I was benchmarking experimental GCs with various changes, I thought the bug did not happen in the production GC, but it turns out I was wrong. And you found a reliable way to reproduce :-). So I could investigate. It's so fun to do lemming debugging in the simulator.
We need to thank Juraj here, he was the one who produced the initial version of the script which made all of this possible.
The GC bug was basically that when the Planning Compactor (the production full-GC compactor) decided to do a multiple-pass compaction, if it managed to compact everything in one go it would get confused and attempt to compact objects upward instead of downward (address-wise) on the second pass, and that's broken and corrupts memory.
I started from this script:
| aJson anArray | aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl. Array streamContents: [ :aStream | 400 timesRepeat: [ aStream nextPutAll: (STON fromString: aJson contents). Smalltalk saveSession ] ].
It makes me however very sad that you were not able to use the simulator to debug this issue. I used it, and that's how I tracked down the bug in only a few hours. Tracking things down in lldb would have taken me weeks, and I would not have been able to do it since I work during the week :-).
Therefore I'm going to explain my process to you: how I reproduced the bug in the simulator and understood where the issue comes from. The mail is quite long, but it would be nice if you could track down such a bug quickly on your own next time using the simulator. Of course you can skip it if you're not interested. @Eliot you may want to read it, since I explain how I set up a Pharo 7 image for simulator debugging; that might come in handy for you at some point.
1] The first thing I did was to reproduce your bug, based on the script, on both the Cog and Stack VMs compiled from the OpenSmalltalk-VM repository. I initially started with Pharo 8, but for some reason that image is quite broken (formatter issue? Integrator gone wild?).
That was unlucky timing; a bad commit had been made. I think it's largely tidied up now. Still, using the current stable version isn't necessarily bad :-)
Just for future reference: the first thing I tried was reproducing it on the Pharo 8 minimal image (I did this before the formatter bug appeared and kept the same image). The minimal image has a few advantages:
- It's smaller, 14M vs. 54M, so there's less memory to keep track of (and the simulator will be a bit faster).
- It doesn't have FreeType loaded, which quickly ruled that out as an issue.
- I wasn't sure if there would be other FFI calls, so this just reduced the chances.
So I switched to Pharo 7 stable. It crashes on both VMs, so I knew the bug was unrelated to the JIT. Most bugs in the core VM (besides people mis-using FFI, which is by far the most common VM bug reported) are either JIT or GC bugs. So we're tracking a GC bug. I then built an image which runs your script at start-up (Smalltalk snapshot: true andQuit: true followed by your script; I select all and run the do-it).
2] Then I started the image in the simulator. The first thing I noticed is that Pharo 7 makes FFI calls into FreeType from start-up: even if you're not using text, or if you disable FreeType in the settings browser, Pharo performs FreeType FFI calls in the background. The FreeType FFI calls are incorrectly implemented (the C stack references heap objects which are not pinned), therefore these calls corrupt the heap. Running the VM on a corrupted heap is undefined behavior, therefore any usage of Pharo 7 right now, whether you actually use text or not, and whether FreeType is enabled in the settings or not, is undefined behavior. I saw in the thread Nicolas/Eliot complaining that this is not a VM bug; indeed, pinning objects is an image-side responsibility, so it's not a VM bug. In addition, most reported bugs come from people mis-using FFI, so I understand their answer. There was however another bug in the GC, but it's very hard for us to debug it while it's hidden behind image-corrupting bugs like the FreeType one. So I made this change: FreeTypeSettings>>startUp: resuming "resuming ifTrue:[ self updateFreeType ]" then saved, restarted the image, and ensured it was not corrupted (leak checker + swizzling in simulation).
3] Then I started the image in the simulator. It turns out the image start-up raises an error if libgit cannot be loaded, and then the start-up script is not executed because of the exception. So I made this change: LibGitLibrary>>startUp: isImageStarting "isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"
Also for future reference, I'm surprised you didn't hit an FFI call trying to get the current working directory. Making the following change in OSPlatform removes the FFI call:
currentWorkingDirectoryPathWithBuffer: aByteString <primitive: 'primitiveGetCurrentWorkingDirectory' module: 'UnixOSProcessPlugin' error: ec> ^self primitiveFailed
(If on Windows, you need to use WinOSProcessorPlugin.)
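Formatted as a method, that change reads:

```Smalltalk
currentWorkingDirectoryPathWithBuffer: aByteString
	<primitive: 'primitiveGetCurrentWorkingDirectory' module: 'UnixOSProcessPlugin' error: ec>
	^ self primitiveFailed
```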
4] It turns out ZnEasy does not work well in the simulator. So I preloaded the result of this line, aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl, into a global variable. The rest of the script remains the same. I can finally run your script in the simulator! Usually we simulate a Squeak image and all these preliminary steps are not required. But! It is still easier to reproduce this bug than most bugs I have to deal with for Android at work; at least I don't need to buy an uncommon device from an obscure Chinese vendor to reproduce it :-).
I put the data in to a file and loaded it :-)
5] To shortcut simulation time, since the bug happened around the 60th save for me, I built a different script which snapshots the image to different image names.
We also updated the script to save to different files.
But did you actually get it to save the image in the simulator? I'm just reproducing your work now, but couldn't save an image due to a bug in the FileAttributesPluginSimulator. I've got a fix and will commit it a bit later.
With a crash at snapshot 59 (only the changes file was written to disk), image 57 was the latest non-corrupted image. I then started the simulator (the StackInterpreterSimulator, since we are debugging a GC bug, not the Cog simulator: simulation is faster and simpler). I used the standard script available in the workspace of the Cog dev image built from the guidelines [2]: | sis | sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager). "sis desiredNumStackPages: 8." "Speeds up scavenging when simulating. Set to e.g. 64 for something like the real VM." "sis assertValidExecutionPointersAtEachStep: false." "Set this to true to turn on an expensive assert that checks for valid stack, frame pointers etc on each bytecode. Useful when you're adding new bytecodes or exotic execution primitives." sis openOn: 'Save57.image'. sis openAsMorph; run I then let the simulator run, went swimming for 1h, and came back 1h30 later (with commute time). The bug happened in the simulator at save 90; I don't know how long it took to reproduce, but < 1h30. Then I had an assertion failure in the compactor: self assert: (self validRelocationPlanInPass: finalPass) = 0. Good! From there I debugged using lemming debugging (a technique described in [3], Section 3.2). When the assertion fails, the running simulation is the clone. I went up in the debugger to the point where the clone was made, and restarted the same GC approximately 40 times during debugging, because once the heap is corrupted you cannot tell any more what the problem is, yet you need to trigger the problem to understand it. 40 lemmings over that cliff :-) Good lemmings.
Then I quickly figured out that the GC was performing two successive compactions, and that the second compaction was broken right at the start (it tries to move objects upward). Then I looked at the glue code in between the two compactions, and yeah: in the case where the first compaction has compacted everything, the variables are incorrectly set up for the second compaction. I tried fixing the variables, but it's not that easy, so instead I just abort compaction in that case (see VMMaker.oscog-cb.2595).
6] I then compiled a VM from the sources to check that the Slang translator would not complain; it did not. I then built a Stack VM (the Cog VM seems to be broken at tip of tree due to on-going work on ARMv8 support) and ran your script again. I was able to run the 400 iterations without a crash. The bug seems to be fixed!
@Eliot now needs to fix tip of tree, generate the code and produce new VMs. ARMv8 support is quite exciting though, given that MacBooks do not support 32 bits any more and that the next MacBooks are rumoured to be on ARMv8. One wouldn't want to run the VM in a VirtualBox Intel image :-).
Alistair, let me know if you have questions. I hope you can work with the simulator as efficiently as we can. If you've not seen it, there's a screencast where I show how I used the simulator to debug JIT bugs [4]. The audio is not very good because my spoken English sucks, but it shows the main ideas.
[1] https://www.researchgate.net/publication/336422106_Lazy_pointer_update_for_l... [2] http://www.mirandabanda.org/cogblog/build-image/ [3] https://www.researchgate.net/publication/328509577_Two_Decades_of_Smalltalk_... [4] https://clementbera.wordpress.com/2018/03/07/sista-vm-screencast/
You wrote in [3]:
"the slightest change in the heap might change the bug; any variability in timing or user input can result in a different heap and hence in the bug morphing or going into hiding."
This was evident in this issue. While the script (fortunately) would always produce a crash, small changes, such as how the initial JSON is loaded or the name of the image it is saved to, caused fairly large changes in the number of loops needed to trigger the crash.
Also, while trying to reproduce your debug steps above, the image I have already has memory leaks, so it isn't hitting the "self assert: (self validRelocationPlanInPass: finalPass) = 0" assertion.
Thanks for the links, I'll keep reading.
Thanks again! Alistair
-- Clément Béra https://clementbera.github.io/ https://clementbera.wordpress.com/
Hi,
On Fri, Nov 29, 2019, 10:21 Alistair Grant akgrant0710@gmail.com wrote:
Hi Clément,
On Thu, 28 Nov 2019 at 22:36, Clément Béra bera.clement@gmail.com wrote:
Hi Alistair,
I've just investigated the bug tonight and fixed it in VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and was able to run the 400 iterations of your script without any crashes. Thanks for the easy reproduction! Last year, when I used the GC benchmarks provided by Feenk, with ~10Gb workloads, for the DLS paper [1], I initially had an image crashing 9 times out of 10 when going to 10Gb. I fixed a few bugs in the production GC back then (mainly in segment management), which led the benchmarks to run successfully 99% of the time. But it was still crashing the other 1%; since I was benchmarking experimental GCs with various changes, I thought the bug did not happen in the production GC, but it turns out I was wrong. And you found a reliable way to reproduce :-). So I could investigate. It's so fun to do lemming debugging in the simulator.
We need to thank Juraj here, he was the one who produced the initial version of the script which made all of this possible.
Thanks Juraj. Are you both Feenk people?
Are you mainly working on the VM Alistair? Or just having fun?
The GC bug was basically that when the Planning Compactor (the production full-GC compactor) decided to do a multiple-pass compaction, if it managed to compact everything in one go it would get confused and attempt to compact objects upward instead of downward (address-wise) on the second pass, and that's broken and corrupts memory.
I started from this script:
| aJson anArray | aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl. Array streamContents: [ :aStream | 400 timesRepeat: [ aStream nextPutAll: (STON fromString: aJson contents). Smalltalk saveSession ] ].
It makes me however very sad that you were not able to use the simulator to debug this issue. I used it, and that's how I tracked down the bug in only a few hours. Tracking things down in lldb would have taken me weeks, and I would not have been able to do it since I work during the week :-).
Therefore I'm going to explain my process to you: how I reproduced the bug in the simulator and understood where the issue comes from. The mail is quite long, but it would be nice if you could track down such a bug quickly on your own next time using the simulator. Of course you can skip it if you're not interested. @Eliot you may want to read it, since I explain how I set up a Pharo 7 image for simulator debugging; that might come in handy for you at some point.
1] The first thing I did was to reproduce your bug, based on the script, on both the Cog and Stack VMs compiled from the OpenSmalltalk-VM repository. I initially started with Pharo 8, but for some reason that image is quite broken (formatter issue? Integrator gone wild?).
That was unlucky timing; a bad commit had been made. I think it's largely tidied up now. Still, using the current stable version isn't necessarily bad :-)
Just for future reference: the first thing I tried was reproducing it on the Pharo 8 minimal image (I did this before the formatter bug appeared and kept the same image). The minimal image has a few advantages:
- It's smaller, 14M vs. 54M, so there's less memory to keep track of (and the simulator will be a bit faster).
- It doesn't have FreeType loaded, which quickly ruled that out as an issue.
- I wasn't sure if there would be other FFI calls, so this just reduced the chances.
Not having FreeType and LibGit would be nice indeed. The difference in simulation performance between 14Mb and 54Mb is not really an issue for me; the bug happened on a > 100Mb heap and simulation is still fairly fast. The problem is more to find a reliable way to crash soon after start-up: in some cases I start the simulator and go to sleep, but if the next morning it hasn't crashed, well, too bad :-(.
In most cases we reproduce bugs using the Squeak REPL image. See: https://github.com/OpenSmalltalk/opensmalltalk-vm/blob/Cog/image/buildspurtr... I suggest you try using the simulator on the Squeak REPL; it's convenient, you can run a few things and see what is going on. The REPL supports chunk format (put a ! after each do-it). You can build something similar from the minimal Pharo if you want to, but I doubt you'll catch bugs there that you can't catch from the Squeak one.
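For example, a chunk-format script for the REPL image terminates each do-it with a bang (the statements here are just an illustration):

```Smalltalk
"Each do-it ends with a ! in chunk format."
Transcript showln: 'hello from the REPL'!
Smalltalk saveSession!
```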
So I switched to Pharo 7 stable. It crashes on both VMs, so I knew the bug was unrelated to the JIT. Most bugs in the core VM (besides people mis-using FFI, which is by far the most common VM bug reported) are either JIT or GC bugs. So we're tracking a GC bug. I then built an image which runs your script at start-up (Smalltalk snapshot: true andQuit: true followed by your script; I select all and run the do-it).
2] Then I started the image in the simulator. First thing I noticed is
that Pharo 7 is using FFI calls in FreeType, from start-up, and even if you're not using text or if you disable FreeType from the setting browser, Pharo performs in the backgrounds FFI calls for freetype. FreeType FFI calls are incorrectly implemented (the C stack references heap object which are not pinned), therefore these calls corrupts the heap. Running a corrupted heap on the VM has undefined behavior, therefore any usage of Pharo 7 right now, wether you actually text or not, wether freetype is enabled or not in the settings, is undefined behavior. I saw in the thread Nicolas/Eliot complaining that this is not a VM bug, indeed, pinning objects is image-side responsibility and it's not a VM bug. In addition, most reported bug comes from people mis-using FFI, so I understand their answer. There was however another bug in the GC, but it's very hard for us to debug it if it's hidden after image corrupting bugs like the FreeType one here.
So for that I made this change:

FreeTypeSettings class>>startUp: resuming
	"resuming ifTrue:[ self updateFreeType ]"

then saved, restarted the image, and ensured it was not corrupted (leak checker + swizzling in simulation).
3] Then I started the image in the simulator. It turns out the image start-up raises an error if libgit cannot be loaded, and then the start-up script is not executed due to the exception. So I made this change:

LGitLibrary class>>startUp: isImageStarting
	"isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"
Also for future reference, I'm surprised you didn't hit an FFI call trying to get the current working directory. Making the following change in OSPlatform removes the FFI call:

currentWorkingDirectoryPathWithBuffer: aByteString
	<primitive: 'primitiveGetCurrentWorkingDirectory' module: 'UnixOSProcessPlugin' error: ec>
	^ self primitiveFailed

(if on Windows you need to use the WinOSProcessorPlugin).
Err, maybe I forgot to write down a few steps here and commented a few other methods... I fixed it and then wrote the mail; I don't remember it all. I think there was indeed something accessing the sources or changes files and I commented something out in there. I'll try to check the changes file later on. I don't have access to my laptop right now (I'm at work), so I cannot check.
4] It turns out ZnEasy does not work well in the simulator. So I preloaded the result of aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl into a global variable. The rest of the script remains the same. I can finally run your script in the simulator! Usually we simulate a Squeak image and all these preliminary steps are not required. But! It is still easier to reproduce this bug than most bugs I have to deal with for Android at work; at least I don't need to buy an uncommon device from an obscure Chinese vendor to reproduce :-).
I put the data in to a file and loaded it :-)
5] To shortcut simulation time, since the bug happened around the 60th save for me, I built a different script which snapshots the image to different image names.
We also updated the script to save to different files.
But did you actually get it to save the image in the simulator? I'm just reproducing your work now but couldn't save an image due to a bug in the FileAttributesPluginSimulator. I've got a fix and will commit a bit later.
Yes, running the script in the simulator generated around 30 images for me (Save57.image to Save90.image). I frequently use saving from the simulator (usually with a Squeak image though). It should work. Then, running the script again for 400 iterations from the VM filled my local SSD :-). I don't remember which API I used to save, though; maybe we used different ones? I try to use snapshot:andQuit: as much as possible to avoid unexpected errors, but this time I renamed, I don't remember how.
With a crash at snapshot 59 (only the changes file written to disk), image 57 was the latest non-corrupted image. I then started the simulator (the StackInterpreterSimulator, not the Cog simulator, since we are debugging a GC bug; simulation is faster and simpler). I used the standard script available in the workspace of the Cog dev image built from the guidelines. [2]
| sis |
sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager).
"sis desiredNumStackPages: 8." "Speeds up scavenging when simulating. Set to e.g. 64 for something like the real VM."
"sis assertValidExecutionPointersAtEachStep: false." "Set this to true to turn on an expensive assert that checks for valid stack, frame pointers etc on each bytecode. Useful when you're adding new bytecodes or exotic execution primitives."
sis openOn: 'Save57.image'.
sis openAsMorph; run

I then let the simulator simulate, went swimming for 1h, and came back 1h30 later (with commute time). The bug happened in the simulator at save 90; I don't know how long it took to reproduce, but < 1h30. Then I had an assertion failure in the compactor:
self assert: (self validRelocationPlanInPass: finalPass) = 0.

Good! From there I debugged using lemming debugging (the technique described in [3], Section 3.2). When the assertion has failed, the simulation is the clone. I went up in the debugger to the point where the clone was made, and restarted the same GC approximately 40 times during debugging, because once the heap is corrupted you can no longer tell what the problem is, yet you need to trigger the problem to understand it. 40 lemmings over that cliff :-) Good lemmings.
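For readers who have not seen [3], the lemming idea can be sketched like this (illustrative Python only; names like lemming_debug are made up and this is not the simulator's API). You clone the pristine state before the risky operation, run the operation on the clone, and when the clone dies you discard it and send the next clone over the cliff, gathering one more observation each time while the original stays uncorrupted:

```python
import copy

def lemming_debug(state, risky_op, max_attempts=40):
    """Run risky_op on a fresh deep copy of state each attempt.
    Failed clones are discarded; the original state stays pristine.
    Illustrative sketch of 'lemming debugging', not a real API."""
    for attempt in range(1, max_attempts + 1):
        clone = copy.deepcopy(state)      # the lemming
        try:
            return attempt, risky_op(clone)
        except AssertionError:
            # clone is now corrupted; throw it away and re-clone
            continue
    raise RuntimeError("bug not understood after %d lemmings" % max_attempts)

# A hypothetical 'GC' that only succeeds once we have observed it 3 times:
calls = {"n": 0}
def flaky_gc(heap):
    calls["n"] += 1
    heap.append("compacted")
    assert calls["n"] >= 3, "heap corrupted"
    return heap

attempt, result = lemming_debug(["obj1", "obj2"], flaky_gc)
# The original list is untouched; only the successful clone was kept.
```

The point is that each retry starts from an identical heap, which is exactly what re-entering the same GC from the simulator's debugger gives you.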
Then I quickly figured out that the GC was performing two successive compactions, and that the second compaction was broken right at the start (it tries to move objects upward). Then I looked at the glue code in between the two compactions, and yeah: in the case where the first compaction has compacted everything, the variables are incorrectly set up for the second compaction. I tried fixing the variables, but it's not that easy, so instead I just aborted compaction in that case (see VMMaker.oscog-cb.2595).
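To make the failure mode concrete, here is a toy model (illustrative Python only; the real code is Spur's PlanningCompactor in VMMaker, and all names below are made up). A sliding compactor may only relocate objects toward lower addresses; if the first pass already compacted everything but stale bookkeeping still schedules a second pass with a free pointer above the objects, the "plan" asks for upward moves:

```python
# Toy model of sliding (planning) compaction -- illustrative only.

def plan_compaction(objects, free_start):
    """Plan new addresses for live objects {addr: size}, sliding them
    toward low addresses starting at free_start."""
    plan, to = {}, free_start
    for addr in sorted(objects):
        plan[addr] = to            # object at addr is planned to move to 'to'
        to += objects[addr]
    return plan, to                # 'to' is the free pointer after the pass

def valid(plan):
    """A sliding compactor may only move objects downward (address-wise)."""
    return all(new <= old for old, new in plan.items())

live = {100: 10, 120: 10, 200: 10}
plan1, free = plan_compaction(live, 0)
assert valid(plan1)                # first pass: everything slides down

# Stale state: everything was already compacted, yet a second pass is
# started with the (now higher) free pointer -- the plan moves objects
# upward, which would corrupt memory. The fix in VMMaker.oscog-cb.2595
# was to abort compaction in this case instead.
moved = {plan1[a]: s for a, s in live.items()}
plan2, _ = plan_compaction(moved, free)
assert not valid(plan2)            # upward moves: the broken second pass
```

The sketch shows why aborting is safe: after a complete first pass there is nothing left to move, so the only "work" a second pass could do is wrong.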
6] I then compiled a VM from the sources to check that the Slang translator would not complain; it did not. I then built a Stack VM (the Cog VM seems to be broken at tip of tree due to on-going work on ARMv8 support) and ran your script again. I was able to run the 400 iterations without a crash. The bug seems to be fixed!
@Eliot now needs to fix tip of tree, generate the code and produce new VMs. ARMv8 support is quite exciting though, given that MacBooks do not support 32 bits any more and that the next MacBooks are rumoured to be on ARMv8. One wouldn't want to run the VM in a VirtualBox Intel image :-).
Alistair, let me know if you have questions. I hope you can work with the simulator as efficiently as we can. If you've not seen it, there's a screencast where I showed how I used the simulator to debug JIT bugs [4]. The audio is not very good because my spoken English sucks, but it shows the main ideas.
[1] https://www.researchgate.net/publication/336422106_Lazy_pointer_update_for_l...
[3] https://www.researchgate.net/publication/328509577_Two_Decades_of_Smalltalk_...
[4] https://clementbera.wordpress.com/2018/03/07/sista-vm-screencast/
You wrote in [3]:
"the slightest change in the heap might change the bug; any variability in timing or user input can result in a different heap and hence in the bug morphing or going into hiding."
This was evident in this issue. While the script (fortunately) would always produce a crash, small changes, such as how the initial JSON is loaded or the name of the image it is saved to, caused fairly large changes in the number of loops needed to trigger the crash.
Yeah, that's the main problem when debugging a GC in general. Pharo is less deterministic than Squeak for some reason (things are happening in the background doing FFI calls). In both environments user events are a problem. That's why lemming debugging is very handy, and why the OpenSmalltalk-VM development tools are far superior to those of other VMs I've dealt with. The back-in-time features that I used in C++ recently are very good though; in OpenSmalltalk-VM I guess the circular buffer of the JIT simulation has a better time-spent-on-tools to productivity ratio and is enough for now.
And this is a crash. Performance pitfall issues are even harder to track down IMO.
Also, while trying to reproduce your debug steps above, the image I have already has memory leaks, so it isn't hitting the "self assert: (self validRelocationPlanInPass: finalPass) = 0" assertion.
You have to start simulation on an image that is not already corrupted. Did you make sure to comment out the startUp: method in FreeTypeSettings? Disabling FreeType in the settings browser is not enough. Then you need to save and restart the image, and verify it is not already corrupted. If you're talking about starting simulation from the images saved by the script, I did not take the latest one which crashed, because it was already corrupted; I used 57, while 58 was saved and for 59 only the changes were saved. You can see at start-up whether swizzling and the initial GC find leaks.
Thanks for the links, I'll keep reading.
Thanks again! Alistair
-- Clément Béra https://clementbera.github.io/ https://clementbera.wordpress.com/
Alistair,
Indeed I also had an issue with workingDirectoryPath. This is the complete list of what I changed in the Pharo 7 image; it's not clear to me if everything is required (like the Zodiac change: I ended up using the global instead):
!FreeTypeSystemSettings class methodsFor: 'settings' stamp: 'cb 11/15/2019 21:22' prior: 25673466!
ft2LibraryVersion
	^ Smalltalk ui theme newLabelIn: World for: self label: 'Available version: None' getEnabled: nil.! !

!FreeTypeSettings class methodsFor: 'system startup' stamp: 'cb 11/15/2019 21:24' prior: 25665613!
startUp: resuming
	"resuming ifTrue:[ self updateFreeType ]"! !

!LGitLibrary class methodsFor: 'system startup' stamp: 'cb 11/15/2019 21:24' prior: 31291584!
startUp: isImageStarting
	"isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"! !

!OSPlatform methodsFor: 'accessing' stamp: 'cb 11/15/2019 21:36' prior: 53154052!
currentWorkingDirectoryPath
	"This method calls the method getPwdViaFFI with arugement of a buffer size. By default it uses the defaultMaximumPathLength of each subclass as the buffer size."
	^ '/' "self currentWorkingDirectoryPathWithBufferSize: self defaultMaximumPathLength"! !

!ZdcPluginSSLSession methodsFor: 'initialization' stamp: 'cb 11/15/2019 21:39' prior: 66961330!
initialize
	"Initialize the receiver"
	"[ handle := self primitiveSSLCreate ] on: PrimitiveFailed do: [ :exception | ZdcPluginMissing signal ]. self logging: false"! !

!PharoFilesOpener methodsFor: 'public' stamp: 'cb 11/27/2019 22:56' prior: 54156796!
changesFileOrNilReadOnly: readOnly silent: silent
	| changesFile |
	changesFile := self openChanges: self changesName readOnly: readOnly.
	(changesFile isNil and: [ silent not ])
		ifTrue: [ self informProblemInChanges: self cannotLocateMsg ].
	^ changesFile! !
On Fri, Nov 29, 2019 at 11:53 AM Clément Béra bera.clement@gmail.com wrote:
Hi,
On Fri, Nov 29, 2019, 10:21 Alistair Grant akgrant0710@gmail.com wrote:
Hi Clément,
On Thu, 28 Nov 2019 at 22:36, Clément Béra bera.clement@gmail.com wrote:
Hi Alistair,
I've just investigated the bug tonight and fixed it in
VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and I was able to run the 400 iterations of your script without any crashes.
Thanks for the easy reproduction! Last year, when I used the GC benchmarks provided by Feenk, with ~10Gb workloads, for the DLS paper [1], I initially had an image crashing 9 times out of 10 when going to 10Gb. I fixed a few bugs in the production GC back then (mainly in segment management), which led the benchmarks to run successfully 99% of the time. But it was still crashing the other 1%; since I was benchmarking experimental GCs with various changes, I thought the bug did not happen in the production GC, but it turns out I was wrong. And you found a reliable way to reproduce :-). So I could investigate. It's so fun to do lemming debugging in the simulator.
We need to thank Juraj here; he was the one who produced the initial version of the script which made all of this possible.
Thanks Juraj. Are you both Feenk people?
Are you mainly working on the VM Alistair? Or just having fun?
The GC bug was basically that when the Planning Compactor (the production full-GC compactor) decided to do a multiple-pass compaction, if it managed to compact everything in one go then it would get confused and attempt to compact objects upward instead of downward (address-wise) on the second pass, and that's broken and corrupts memory.
I started from this script:

| aJson anArray |
aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
Array streamContents: [ :aStream |
	400 timesRepeat: [
		aStream nextPutAll: (STON fromString: aJson contents).
		Smalltalk saveSession ] ].
However, it makes me very sad that you were not able to use the simulator to debug this issue. I used it, and that's how I tracked down the bug in only a few hours; tracking things down in lldb would have taken me weeks, and I would not have been able to do it since I work during the week :-).
Therefore I'm going to explain to you my process for reproducing the bug in the simulator and understanding where the issue comes from. The mail is quite long, but it would be nice if you could track down such a bug quickly on your own next time using the simulator. Of course you can skip it if you're not interested. @Eliot, you may want to read it, since I explain how I set up a Pharo 7 image for simulator debugging; that might come in handy for you at some point.
1] The first thing I did was to reproduce your bug, based on the script, on both the Cog and Stack VMs compiled from the OpenSmalltalk-VM repository. I initially started with Pharo 8, but for some reason that image is quite broken (formatter issue? Integrator gone wild?).
That was unlucky timing; there was a bad commit made. I think it's largely tidied up now. Still, using the current stable version isn't necessarily bad :-)
Just for future reference: the first thing I tried was reproducing it on the Pharo 8 minimal image (I did this before the formatter bug appeared and kept the same image). The minimal image has a few advantages:
- It's smaller, 14M vs. 54M, so there's less memory to keep track of (and the simulator will be a bit faster).
- It doesn't have FreeType loaded, so that quickly ruled it out as an issue.
- I wasn't sure if there would be other FFI calls, so this just reduced the chances.
Hi Clément,
(for anyone else reading this, the thread had become quite long, I've chopped out quite a bit assuming the context is already familiar)
On Fri, 29 Nov 2019 at 11:53, Clément Béra bera.clement@gmail.com wrote:
Thanks Juraj. Are you both Feenk people? Are you mainly working on the VM Alistair? Or just having fun?
Yes, we're both feenk people. :-)
I'm not focusing on the VM as such, but when we have issues with the VM I'm one of the people that tend to look at it.
Not having FreeType and LibGit would be nice indeed. The difference between simulation performance 14Mb-54Mb is not really an issue for me, the bug happened on > 100Mb heap and simulation is still fairly fast. The problem is more to find a reliable way to crash soon after start-up, in some cases I start the simulator, go to sleep, but if the next morning it hasn't crashed, well, too bad :-(.
In most cases we reproduce bugs using the Squeak REPL image. See: https://github.com/OpenSmalltalk/opensmalltalk-vm/blob/Cog/image/buildspurtr... I suggest you try using the simulator on the Squeak REPL; it's convenient, you can run a few things and see what is going on. The REPL supports chunk format (put a ! after each do-it). You can build something similar from the minimal Pharo if you want to, but I doubt you'll catch bugs there that you can't catch from the Squeak one.
I've actually done both of these in the past: used the Squeak REPL and built a Pharo version.
Err, maybe I forgot to write down a few steps here and commented a few other methods... I fixed it and then wrote the mail; I don't remember it all. I think there was indeed something accessing the sources or changes files and I commented something out in there. I'll try to check the changes file later on. I don't have access to my laptop right now (I'm at work), so I cannot check.
I saw the list of changes you made, thanks. I avoided the LibGit and Zodiac issues by using the minimal image.
Yes, running the script in the simulator generated around 30 images for me (Save57.image to Save90.image). I frequently use saving from the simulator (usually with a Squeak image though). It should work. Then, running the script again for 400 iterations from the VM filled my local SSD :-). I don't remember which API I used to save, though; maybe we used different ones? I try to use snapshot:andQuit: as much as possible to avoid unexpected errors, but this time I renamed, I don't remember how.
That's the difference alright: #saveImageInFileNamed: checks that the parent directory exists first, which uses FileAttributesPlugin, while #snapshot:andQuit: doesn't do those checks.
Yeah that's the main problem when debugging GC in general. Pharo is less deterministic than Squeak for some reason (things are happening in the background doing FFI calls).
I think this will have more to do with the fact that Pharo has #processPreemptionYields true, while Squeak has it false. It means that every IO and timer event can effectively change the active process (if there are multiple at the same priority), so process completion is much less deterministic.
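The effect can be sketched with a toy round-robin scheduler (illustrative Python only; real scheduling lives in the VM, and this model is a simplification). With preemption-yields semantics, a preempted process goes to the back of its priority's run queue, so every interrupt reorders same-priority processes; without it, the preempted process keeps its turn and ordering is stable:

```python
from collections import deque

def run(processes, preempt_at, preemption_yields):
    """Toy same-priority scheduler. Each process needs 'work' steps;
    'preempt_at' is a set of global step counts at which an interrupt
    (IO/timer event) preempts the active process."""
    queue = deque(processes)                 # [(name, remaining_work), ...]
    order, step = [], 0
    while queue:
        name, work = queue.popleft()
        while work:
            step += 1
            work -= 1
            if step in preempt_at and work:
                if preemption_yields:
                    queue.append((name, work))      # to the back: reordered
                else:
                    queue.appendleft((name, work))  # keeps its turn
                break
        else:
            order.append(name)                      # process finished
    return order

procs = [("A", 3), ("B", 3)]
# Without yielding on preemption, completion order is stable:
assert run(procs, {2}, False) == ["A", "B"]
# With yielding on preemption, the same interrupt changes the order:
assert run(procs, {2}, True) == ["B", "A"]
```

So the same workload with the same interrupt timing completes in a different order under the two settings, which is exactly the kind of nondeterminism that makes a GC bug's heap layout hard to reproduce.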
In both environments user events are a problem. That's why lemming debugging is very handy, and why the OpenSmalltalk-VM development tools are far superior to those of other VMs I've dealt with. The back-in-time features that I used in C++ recently are very good though; in OpenSmalltalk-VM I guess the circular buffer of the JIT simulation has a better time-spent-on-tools to productivity ratio and is enough for now.
Yep, I'll be reading your papers.
And this is a crash. Performance pitfall issues are even harder to track down IMO.
Also, while trying to reproduce your debug steps above, the image I have already has memory leaks, so it isn't hitting the "self assert: (self validRelocationPlanInPass: finalPass) = 0" assertion.
You have to start simulation on an image that is not already corrupted. Did you make sure to comment out the startUp: method in FreeTypeSettings? Disabling FreeType in the settings browser is not enough. Then you need to save and restart the image, and verify it is not already corrupted. If you're talking about starting simulation from the images saved by the script, I did not take the latest one which crashed, because it was already corrupted; I used 57, while 58 was saved and for 59 only the changes were saved. You can see at start-up whether swizzling and the initial GC find leaks.
This image didn't show any problems with validImage, but you're right, I'll need to go one image back.
Thanks! Alistair
Hi Alistair,
On Fri, Nov 29, 2019 at 1:21 AM Alistair Grant akgrant0710@gmail.com wrote:
Hi Clément,
On Thu, 28 Nov 2019 at 22:36, Clément Béra bera.clement@gmail.com wrote:
Hi Alistair,
1] The first thing I did was to reproduce your bug, based on the script, on both the Cog and Stack VMs compiled from the OpenSmalltalk-VM repository. I initially started with Pharo 8, but for some reason that image is quite broken (formatter issue? Integrator gone wild?).
That was unlucky timing; there was a bad commit made. I think it's largely tidied up now. Still, using the current stable version isn't necessarily bad :-)
Just for future reference: the first thing I tried was reproducing it on the Pharo 8 minimal image (I did this before the formatter bug appeared and kept the same image). The minimal image has a few advantages:
- It's smaller, 14M vs. 54M, so there's less memory to keep track of (and the simulator will be a bit faster).
- It doesn't have FreeType loaded, so that quickly ruled it out as an issue.
- I wasn't sure if there would be other FFI calls, so this just reduced the chances.
What we should have done is run the test case using an assert VM with the leak checker turned on, running in gdb/lldb. This would have proved the bug was in the GC on snapshot, because:
- the leak check before the GC on snapshot would have succeeded
- the leak check immediately after the GC for snapshot, but before the snapshot, would have failed
It may be that the leak check would not have failed, because while investigating this bug I added a bounds check before probing the leak map, so the leak map is only probed for pointers that are within the full extent of the heap (which, because the heap is segmented, may be a much larger range than the size of the heap). But know that the leak checker is a useful tool for pinpointing heap corruption and GC bugs. The leak checker is enabled by bitwise flags that apply to various GC activities (scavenge, full GC, become, and it can be extended to run on FFI calls), and when enabled it runs before and after each phase.
When running an assert VM under gdb/lldb, one puts a breakpoint in warning, the routine that outputs assert-failure messages, and then runs an image.
When running in the simulator, asserts are always enabled, and the leak checker can be turned on by sending a message to the interpreter's objectMemory.
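As a sketch, enabling the leak checker in a simulator session might look like this. The selector setCheckForLeaks: and the GCMode flag names are from memory and may differ in your VMMaker version; treat them as assumptions and check the objectMemory protocol in your image:

```Smalltalk
"Hypothetical sketch: turn on leak checking before and after scavenges
 and full GCs in a running simulation. Selector and flag names are
 assumptions; verify them against your VMMaker version."
sis objectMemory setCheckForLeaks: GCModeNewSpace + GCModeFull.
sis run
```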
So I switched to Pharo 7 stable. It crashes on both VMs, so I knew the bug was unrelated to the JIT. Most bugs in the core VM (besides people misusing FFI, which is by far the most common kind of VM bug reported) are either JIT or GC bugs. So we're tracking a GC bug.
I then built an image which runs your script at start-up (Smalltalk snapshot: true andQuit: true followed by your script; I select all and run a do-it).
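A minimal sketch of what such a do-it might look like, assuming the description above (the exact script is my reconstruction, not the original do-it):

```Smalltalk
"Sketch of the start-up do-it: select all and 'do it'. The image
 snapshots and quits; when the saved image is next started, execution
 resumes after the snapshot line and runs the reproduction script."
| aJson |
aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
Smalltalk snapshot: true andQuit: true.
Array streamContents: [ :aStream |
	400 timesRepeat: [
		aStream nextPutAll: (STON fromString: aJson contents).
		Smalltalk saveSession ] ]
```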
2] Then I started the image in the simulator. The first thing I noticed is that Pharo 7 makes FFI calls in FreeType from start-up: even if you're not using text, or if you disable FreeType from the settings browser, Pharo performs FreeType FFI calls in the background. The FreeType FFI calls are incorrectly implemented (the C stack references heap objects which are not pinned), therefore these calls corrupt the heap. Running the VM on a corrupted heap is undefined behavior, therefore any usage of Pharo 7 right now, whether you actually use text or not, and whether FreeType is enabled in the settings or not, is undefined behavior. I saw in the thread Nicolas/Eliot complaining that this is not a VM bug; indeed, pinning objects is an image-side responsibility, so it's not a VM bug. In addition, most reported bugs come from people misusing FFI, so I understand their answer. There was, however, another bug in the GC, but it's very hard for us to debug it when it's hidden behind image-corrupting bugs like the FreeType one here.
So for that I made this change:
```Smalltalk
FreeTypeSettings >> startUp: resuming
	"resuming ifTrue: [ self updateFreeType ]"
```
I saved, restarted the image, and ensured it was not corrupted (leak checker + swizzling in simulation).
3] Then I started the image in the simulator. It turns out the image start-up raises an error if libgit cannot be loaded, and then the start-up script is not executed because of the exception. So I made this change:
```Smalltalk
LibGitLibrary >> startUp: isImageStarting
	"isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"
```
Also for future reference, I'm surprised you didn't hit an FFI call trying to get the current working directory. Making the following change in OSPlatform removes the FFI call:
```Smalltalk
currentWorkingDirectoryPathWithBuffer: aByteString
	<primitive: 'primitiveGetCurrentWorkingDirectory' module: 'UnixOSProcessPlugin' error: ec>
	^ self primitiveFailed
```
(If you're on Windows you need to use WinOSProcessorPlugin.)
4] Turns out ZnEasy does not work well in the simulator.
Can you say more on this?
So I preloaded this line into a global variable:
```Smalltalk
aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
```
The rest of the script remains the same. I can finally run your script in the simulator! Usually we simulate a Squeak image and all these preliminary steps are not required. But it is still easier to reproduce this bug than most bugs I have to deal with for Android at work; at least I don't need to buy an uncommon device from an obscure Chinese vendor to reproduce it :-).
I put the data in to a file and loaded it :-)
5] To shortcut simulation time, since the bug happened around the 60th save for me, I built a different script which snapshots the image to different image names.
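A sketch of what such a numbered-snapshot script might look like. The use of Smalltalk saveAs: and the exact loop are my assumptions based on the description; the image names follow the Save57 naming that appears later in the thread:

```Smalltalk
"Hypothetical sketch: snapshot under a numbered image name on each
 iteration, so the last non-corrupted image can be located afterwards
 and used as the simulator's starting point."
| count |
count := 0.
Array streamContents: [ :aStream |
	400 timesRepeat: [
		count := count + 1.
		aStream nextPutAll: (STON fromString: aJson contents).
		Smalltalk saveAs: 'Save' , count printString ] ]
```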
We also updated the script to save to different files.
But did you actually get it to save the image in the simulator? I'm just reproducing your work now but couldn't save an image due to a bug in the FileAttributesPluginSimulator. I've got a fix and will commit a bit later.
With a crash at snapshot 59 (only the change file was written to disk), image 57 was the latest non-corrupted image. I then started the simulator (the StackInterpreterSimulator, since we are debugging a GC bug, not the Cog simulator; simulation is faster and simpler). I used the standard script available in the workspace of the Cog dev image built from the guidelines. [2]
```Smalltalk
| sis |
sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager).
"sis desiredNumStackPages: 8." "Speeds up scavenging when simulating. Set to e.g. 64 for something like the real VM."
"sis assertValidExecutionPointersAtEachStep: false." "Set this to true to turn on an expensive assert that checks for valid stack, frame pointers etc. on each bytecode. Useful when you're adding new bytecodes or exotic execution primitives."
sis openOn: 'Save57.image'.
sis openAsMorph; run
```
I then let the simulator simulate, went swimming for 1h, and came back 1h30 later (with commute time). The bug happened in the simulator at save 90; I don't know how long it took to reproduce, but it was less than 1h30. Then I got an assertion failure in the compactor:
```Smalltalk
self assert: (self validRelocationPlanInPass: finalPass) = 0.
```
Good! From there I debugged using lemming debugging (the technique described in [3], Section 3.2). When the assertion has failed, the simulation is the clone. I went up in the debugger to the point where the clone was made, and restarted the same GC approximately 40 times during debugging, because once the heap is corrupted you cannot tell anymore what the problem is, yet you need to trigger the problem to understand it. 40 lemmings over that cliff :-) Good lemmings.
Then I quickly figured out that the GC was performing two successive compactions, and that the second compaction was broken right at the start (it tries to move objects upward). Then I looked at the glue code in between the two compactions, and yes: in the case where the first compaction has compacted everything, the variables are incorrectly set up for the second compaction. I tried fixing the variables, but it's not that easy, so instead I just aborted compaction in that case (see VMMaker.oscog-cb.2595).
6] I then compiled a VM from the sources to check that the Slang translator would not complain; it did not. I then built a Stack VM (the Cog VM seems to be broken at the tip of the tree due to ongoing work on ARMv8 support) and ran your script again. I was able to run the 400 iterations without a crash. The bug seems to be fixed!
@Eliot now needs to fix the tip of the tree, generate the code, and produce new VMs. ARMv8 support is quite exciting though, given that MacBooks do not support 32 bits any more and that the next MacBooks are rumoured to be on ARMv8. One wouldn't want to run the VM in a VirtualBox Intel image :-).
Alistair, let me know if you have questions. I hope you can work with the simulator as efficiently as we can. If you've not seen it, there's a screencast where I showed how I use the simulator to debug JIT bugs [4]. The audio is not very good because my spoken English sucks, but it shows the main ideas.
[1]
https://www.researchgate.net/publication/336422106_Lazy_pointer_update_for_l...
https://www.researchgate.net/publication/328509577_Two_Decades_of_Smalltalk_...
[4] https://clementbera.wordpress.com/2018/03/07/sista-vm-screencast/
You wrote in [3]:
"the slightest change in the heap might change the bug; any variability in timing or user input can result in a different heap and hence in the bug morphing or going into hiding."
This was evident in this issue. While the script (fortunately) would always produce a crash, small changes, such as how the initial JSON is loaded or the name of the image it is saved to, caused fairly large changes in the number of loops needed to trigger the crash.
Also, while trying to reproduce your debug steps above, the image I have already has memory leaks, so it isn't hitting the "self assert: (self validRelocationPlanInPass: finalPass) = 0" assertion.
Thanks for the links, I'll keep reading.
Thanks again! Alistair
-- Clément Béra https://clementbera.github.io/ https://clementbera.wordpress.com/
On Fri, Nov 29, 2019 at 7:28 PM Eliot Miranda eliot.miranda@gmail.com wrote:
Hi Alistair,
On Fri, Nov 29, 2019 at 1:21 AM Alistair Grant akgrant0710@gmail.com wrote:
Hi Clément,
On Thu, 28 Nov 2019 at 22:36, Clément Béra bera.clement@gmail.com wrote:
Hi Alistair,
I've just investigated the bug tonight and fixed it in
VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and I was able to run the 400 iterations of your script without any crashes.
Thanks for the easy reproduction! Last year when I used the GC benchmarks provided by Feenk, with ~10Gb workloads, for the DLS paper [1], I initially had an image crashing 9 times out of 10 when going to 10Gb. I fixed a few bugs in the production GC back then (mainly in segment management), which led the benchmarks to run successfully 99% of the time. But it was still crashing 1% of the time; since I was benchmarking experimental GCs with various changes, I thought the bug did not happen in the production GC, but it turns out I was wrong. And you found a reliable way to reproduce :-). So I could investigate. It's so fun to do lemming debugging in the simulator.
We need to thank Juraj here, he was the one who produced the initial version of the script which made all of this possible.
The GC bug was basically that when the Planning Compactor (the production full-GC compactor) decided to do a multiple-pass compaction, if it managed to compact everything in one go then it would get confused and attempt to compact objects upward instead of downward (address-wise) on the second pass, and that's broken and corrupts memory.
I started from this script:
```Smalltalk
| aJson anArray |
aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
Array streamContents: [ :aStream |
	400 timesRepeat: [
		aStream nextPutAll: (STON fromString: aJson contents).
		Smalltalk saveSession ] ].
```
4] Turns out ZnEasy does not work well in the simulator.
Can you say more on this?
FFI call in Zodiac > fail in sim > avoid it if possible,
-- _,,,^..^,,,_ best, Eliot
This conversation is already quite long, and I need somewhere to keep more detailed information, so I've created a summary page at:
https://akgrant43.github.io/Initial-Post/
and will be adding more details.
Fixed in VMMaker.oscog.2595. I could run all the iterations of the script with the fix.
Eliot now needs to fix tip of tree (Cog VM, stack works) and regenerate the source code here.
I don't mark it as fixed since I'd like the reporter to confirm he cannot reproduce with a fixed VM once available.
Thanks a lot, Clement!
Doru
-- feenk.com
"Innovation comes in the least expected form. That is, if it is expected, it already happened."
Hi Clement,
On Thu, 28 Nov 2019 at 22:40, Clement Bera notifications@github.com wrote:
Fixed in VMMaker.oscog.2595. I could run all the iterations of the script with the fix.
Thanks!
I can confirm that the script completes all 400 iterations without crashing, in fact without any assertion warnings at all. We'll do some additional testing in the coming days, but I think it's safe to assume that the root cause has been solved.
Eliot now needs to fix tip of tree (Cog VM, stack works) and regenerate the source code here.
I don't mark it as fixed since I'd like the reporter to confirm he cannot reproduce with a fixed VM once available.
Thanks also for your detailed write-up (sent separately). I'll go through it step by step and respond tomorrow (a bit late now :-)).
Thanks again, Alistair
Closed #444.
This was some amazing work to observe. Thank you everyone involved!
Cheers, Doru
-- feenk.com
"We are all great at making mistakes."
Very good job! Thanks Alistair for your tenacity, and Clement for letting us learn these advanced techniques. A pity that the snippet did not fail in Squeak...
Hi Nicolas,
On Sun, 1 Dec 2019 at 21:10, Nicolas Cellier notifications@github.com wrote:
Very good job! Thanks Alistair for your tenacity and Clement for letting us learn the advanced technics. A pity that the snippet did not fail in Squeak...
This was interesting in that before this year I can't remember the last time the VM crashed for me (other than my own mistakes while working on plugins). Then this year it has got to the point where it is significantly impacting our productivity. I guess it is something about the size of the images we're using that just happens to line up with the trigger conditions.
Anyway, once Juraj had made it reproducible it was worthwhile committing 100% to fixing it.
Thanks for testing it on Squeak, that's something I had on my ToDo list, so it saved me some time.
Cheers, Alistair
On Mon, Dec 2, 2019 at 11:44 AM Alistair Grant akgrant0710@gmail.com wrote:
This was interesting in that before this year I can't remember the last time the VM crashed for me (other than my own mistakes while working on plugins). Then this year it has got to the point where it is significantly impacting our productivity. I guess it is something about the size of the images we're using that just happens to line up with the trigger conditions.
Yeah, that bug would happen only when the compactor decided to go for a multiple-pass compaction, which typically means the Eden size is far smaller than the old-space size (Eden memory is usually used to hold the first field of moved objects in the compactor).
My personal desktop has more RAM now than I used to have, so later this week I'll try to run the VM on a 100+Gb heap (maybe by loading the NetBeans bench multiple times) and see how it holds up under the workload. There's a lot of GC tuning to do for such heaps. Right now the VM is tuned by default for heaps < 100Mb. See [1] for tips on GC tuning. I need to set up the compilation environment on my desktop machine anyway, hoping the C compiler will benefit from having more threads and more RAM to work with, to lower compilation time.
Saving and reloading multi-Gb heaps is also not very good, because at start-up the VM merges old space into a single memory segment. It would be better to reshape the heap into different segments. We tried with Sophie a few years ago, but it's not that easy. If you want to give it a try...
[1] https://clementbera.wordpress.com/2017/03/12/tuning-the-pharo-garbage-collec...
Thanks Nicolas. I have to say I explained it because a while ago you resolved a bug that happened when compiling with -O2 but not with -O0, and then explained on the mailing list how you solved it, and I really enjoyed reading that mail. It's really nice to share knowledge here.
I would speculate the snippet did not fail in Squeak either because the Stream implementation is slightly different, or because the default settings for Eden size and similar parameters are different in Pharo. I don't know.
Hi Clément,
Yeah that bug would happen only when the compactor would decide to go for a multiple passes compaction, which typically means Eden size is far smaller than old space size (Eden memory is usually used to hold first field of moved objects in the compactor).
Our base image is around 220MB at the moment, which seems to pretty well match what you're saying.
There's a lot of GC tuning to do for such heaps.
Is there something I can read on tuning for these scenarios?
Thanks! Alistair
Doh!
Reading https://clementbera.wordpress.com/2017/03/12/tuning-the-pharo-garbage-collec... now.
Assuming you're using Moose, the paragraph *Experience report >> loading a 200Mb Moose model* looks relevant :-)
If you're using a machine with much more RAM than the heap size, you can tune things so that the VM uses a fixed amount of extra memory (for example 50Mb extra, through growth headroom and Eden size) but runs faster. You can also tune the full-GC ratio, but you have to be more careful there, since the extra memory used is proportional to the heap size.
The problem is that the GC cannot be tuned by default for powerful machines, or the VM won't start on small devices such as the Pie Nano, and the VM has to work for everyone. In the context of an IDE or a software-analysis tool, you may want to assume that the machine has at least 2Gb of RAM and not care about saving 50Mb.
I'm sure you can set up something so that the base settings in the Feenk image are tuned up. It's not clear you'll gain more than 10-20% at 220Mb, though.
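For reference, a sketch of the kind of image-side tuning do-it involved. The VM parameter indices below (25 for growth headroom, 24 for the shrink threshold, 45 for the desired Eden size) are from memory and are assumptions; verify them against the vmParameterAt: documentation for your VM before relying on them:

```Smalltalk
"Hypothetical tuning do-it for a machine with plenty of RAM.
 Parameter indices are assumptions; check your VM's vmParameterAt:
 documentation. Larger Eden means fewer scavenges; larger headroom
 means fewer heap grow/shrink cycles."
Smalltalk vm parameterAt: 45 put: 64 * 1024 * 1024.   "desired Eden size: 64Mb"
Smalltalk vm parameterAt: 25 put: 128 * 1024 * 1024.  "growth headroom: 128Mb"
Smalltalk vm parameterAt: 24 put: 256 * 1024 * 1024   "shrink threshold: 256Mb"
```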
Hi Clément,
This isn't a Moose image but the latest GToolkit image (there's plenty of room for reducing the size of the image, but that will come after the functionality has stabilised).
The problem is that the GC cannot be tuned by default for strong machines or the VM won't start on small devices such as the Pie Nano, and the VM has to work for everyone.
Understood. :-)
If you're using a machine with much more RAM than the heap size, you can tune things so that it uses an extra fixed amount of memory (for example 50Mb extra through growth headroom and eden size) but runs faster. You can also tune the full GC ratio but you have to be more careful since the extra memory used is proportional to heap size.
Our development machines all have much more RAM than the heap size, so I'll play with tuning the headroom and eden sizes.
I'm sure you can set-up something so that the base settings in the Feenk image is tuned up. It's not clear you'll gain more than 10-20% though at 220Mb.
The image starts at 220Mb, but while running, the virtual memory usage of the process can increase by a Gb or more, so I think the tuning will still be worthwhile.
Thanks! Alistair