Hi All,
there is a VM bug in 64-bit Spur with the Sista V1 bytecode set and full blocks. The symptom is that when waiting for a remote Monticello repository to update and/or deliver a package version the system crashes in JITTED code after what appears to be some kind of wait.
This is a reliably occurring bug b ut maddeningly difficult to reproduce. The bug reliably occurs when interacting with a remote rep[ository (e.g. http://source.squeak.org/VMMaker) when the server is "cold", and hence makes the image wait. Every time I have tried to repeat the failing sequence the crash has not occurre3d, I think because the server is now "hot" and serves up the version quickly. Today I even tried shutting down my machine for over an hour and rebooting. But I could not get the crash to occur even though it seems to me that every time I try it the first time in the4 day it does crash.
This is an important bug to fix. If it cannot be fixed then full blocks and Sista V1 are not ready for use in the upcoming Squeak release. I am looking for help in debugging this.
- is anyone else uising the 64-bit VM with full blocks and Sista V1 who sees hard VM crashes? If so, under what circumstances?
- is it possible to flush caches in the http://source.squeak.org/VMMaker server, or could people tolerate me rebooting the server?
- is there a way of introducing network delays in Mac OS that might help me induce the bug?
- can anyone think of any other strategies I might take to try and reproduce this?
I may have to try and reproduce e the bug in the simulator to have a chance of identifying the bug. Does anyone have a good enough mental model of the Monticello server interaction and have energy to help me figure this one out?
Here is some information from the last crash I did see in the debugger (alas it is incomplete; there are a number of additional pieces of info I could have collected).
(lldb) thr b
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
* frame #0: 0x000000010de5700a
frame #1: 0x000000010dd7b174
frame #2: 0x000000010dd45f1c
frame #3: 0x000000010dd44534
frame #4: 0x000000010dd44c60
(lldb) x/10i 0x000000010de5700a
(lldb) call printStackCallStackOf($rbp)
0x7ffeefbdfc30 M Heap>upHeap: 0x11273ca90: a(n) Heap
0x7ffeefbdfc68 M Heap>add: 0x11273ca90: a(n) Heap
0x7ffeefbdfca0 M Delay class>scheduleDelay:from: 0x1123ebfb8: a(n) Delay class
0x7ffeefbdfcf0 M Delay class>handleTimerEvent 0x1123ebfb8: a(n) Delay class
0x7ffeefbdfd20 M Delay class>runTimerEventLoop 0x1123ebfb8: a(n) Delay class
(lldb) x/10i 0x000000010dd7b174
0x10dd7b174: 48 8b 55 10 movq 0x10(%rbp), %rdx
0x10dd7b178: 48 89 ec movq %rbp, %rsp
0x10dd7b17b: 5d popq %rbp
0x10dd7b17c: c2 10 00 retq $0x10
0x10dd7b17f: cc int3
0x10dd7b180: cc int3
0x10dd7b181: cc int3
0x10dd7b182: cc int3
0x10dd7b183: cc int3
0x10dd7b184: cc int3
(lldb) print whereIs(0x000000010dd7b174)
(char *) $0 = 0x00000001000f83ff " is in generated methods"
(lldb) call printCogMethodFor((void *)0x000000010dd7b174)
0x10dd7afc0 <-> 0x10dd7b198: method: 0x112f23c10 selector: 0x112232c20 add:
(lldb) print whereIs(0x000000010de5700a)
(char *) $1 = 0x00000001000f83ff " is in generated methods"
(lldb) call printCogMethodFor((void *)0x000000010de5700a)
0x10de56ba0 <-> 0x10de57078: method: 0x1126ec218 prim 23856 selector: 0x7ffeefbf3d20
this method ends up being the fitted version of Delay class>> startTimerEventLoop _,,,^..^,,,_ best, Eliot
It sounds like simulating a terrible (dropping packets) or painfully slow (kbps speeds) network connection might be a more controlled way to reproduce the issue: https://apple.stackexchange.com/questions/24066/how-to-simulate-slow-interne...
On Fri, Sep 13, 2019 at 11:16 PM Eliot Miranda eliot.miranda@gmail.com wrote:
Hi All,
there is a VM bug in 64-bit Spur with the Sista V1 bytecode set and
full blocks. The symptom is that when waiting for a remote Monticello repository to update and/or deliver a package version the system crashes in JITTED code after what appears to be some kind of wait.
This is a reliably occurring bug b ut maddeningly difficult to reproduce. The bug reliably occurs when interacting with a remote rep[ository (e.g. http://source.squeak.org/VMMaker) when the server is "cold", and hence makes the image wait. Every time I have tried to repeat the failing sequence the crash has not occurre3d, I think because the server is now "hot" and serves up the version quickly. Today I even tried shutting down my machine for over an hour and rebooting. But I could not get the crash to occur even though it seems to me that every time I try it the first time in the4 day it does crash.
This is an important bug to fix. If it cannot be fixed then full blocks and Sista V1 are not ready for use in the upcoming Squeak release. I am looking for help in debugging this.
- is anyone else uising the 64-bit VM with full blocks and Sista V1 who
sees hard VM crashes? If so, under what circumstances?
- is it possible to flush caches in the http://source.squeak.org/VMMaker
server, or could people tolerate me rebooting the server?
- is there a way of introducing network delays in Mac OS that might help
me induce the bug?
- can anyone think of any other strategies I might take to try and
reproduce this?
I may have to try and reproduce e the bug in the simulator to have a chance of identifying the bug. Does anyone have a good enough mental model of the Monticello server interaction and have energy to help me figure this one out?
Here is some information from the last crash I did see in the debugger (alas it is incomplete; there are a number of additional pieces of info I could have collected).
(lldb) thr b
- thread #1, queue = 'com.apple.main-thread', stop reason =
EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
frame #0: 0x000000010de5700a
frame #1: 0x000000010dd7b174
frame #2: 0x000000010dd45f1c
frame #3: 0x000000010dd44534
frame #4: 0x000000010dd44c60
(lldb) x/10i 0x000000010de5700a
(lldb) call printStackCallStackOf($rbp)
0x7ffeefbdfc30 M Heap>upHeap: 0x11273ca90: a(n) Heap 0x7ffeefbdfc68 M Heap>add: 0x11273ca90: a(n) Heap 0x7ffeefbdfca0 M Delay class>scheduleDelay:from: 0x1123ebfb8: a(n)
Delay class
0x7ffeefbdfcf0 M Delay class>handleTimerEvent 0x1123ebfb8: a(n) Delay
class
0x7ffeefbdfd20 M Delay class>runTimerEventLoop 0x1123ebfb8: a(n) Delay
class
(lldb) x/10i 0x000000010dd7b174
0x10dd7b174: 48 8b 55 10 movq 0x10(%rbp), %rdx 0x10dd7b178: 48 89 ec movq %rbp, %rsp 0x10dd7b17b: 5d popq %rbp 0x10dd7b17c: c2 10 00 retq $0x10 0x10dd7b17f: cc int3 0x10dd7b180: cc int3 0x10dd7b181: cc int3 0x10dd7b182: cc int3 0x10dd7b183: cc int3 0x10dd7b184: cc int3
(lldb) print whereIs(0x000000010dd7b174)
(char *) $0 = 0x00000001000f83ff " is in generated methods"
(lldb) call printCogMethodFor((void *)0x000000010dd7b174)
0x10dd7afc0 <-> 0x10dd7b198: method: 0x112f23c10
selector: 0x112232c20 add:
(lldb) print whereIs(0x000000010de5700a)
(char *) $1 = 0x00000001000f83ff " is in generated methods"
(lldb) call printCogMethodFor((void *)0x000000010de5700a)
0x10de56ba0 <-> 0x10de57078: method: 0x1126ec218 prim
23856 selector: 0x7ffeefbf3d20
this method ends up being the fitted version of Delay class>> startTimerEventLoop _,,,^..^,,,_ best, Eliot
Hi,
- can anyone think of any other strategies I might take to try and
reproduce this?
at the risk of stating the obvious, in cases similar to this rr (+ VDB) helped me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE
Never used with opensmalltalk VM, though.
HTH, Jan
Hi Jan, Thanks, a recording debugger would indeed be a tremendous help! Unfortunately, I'm unable to reproduce the bug under linux, only macosx... Do you know something equivalent to rr for mac?
Le dim. 22 sept. 2019 à 09:55, Jan Vrany jan.vrany@fit.cvut.cz a écrit :
Hi,
- can anyone think of any other strategies I might take to try and
reproduce this?
at the risk of stating the obvious, in cases similar to this rr (+ VDB) helped me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE
Never used with opensmalltalk VM, though.
HTH, Jan
On 2019-09-22 10:57, Nicolas Cellier wrote:
Hi Jan, Thanks, a recording debugger would indeed be a tremendous help! Unfortunately, I'm unable to reproduce the bug under linux, only macosx... Do you know something equivalent to rr for mac?
No, I don't. There might be something though, I never had a mac...
Jan
Le dim. 22 sept. 2019 à 09:55, Jan Vrany jan.vrany@fit.cvut.cz a écrit :
Hi,
- can anyone think of any other strategies I might take to try and
reproduce this?
at the risk of stating the obvious, in cases similar to this rr (+ VDB) helped me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE
Never used with opensmalltalk VM, though.
HTH, Jan
Follow up: so we have a problem of repeatability... I found nothing like rr for macos, and CPU watchpoint capability is too restricted to be usefull.
So what I did is save a crash.image attempting to crash at resumption time: Smalltalk saveAs: 'crash'. Smalltalk snapshot: true andQuit: true. 5 timesRepeat: [SocketTest suite run].
Then I launched the vm under lldb and gathered $pc stats to see if I could observe some reproducibility (lldb) p/x $pc - methodZoneBase 0x00000000000a0e94 0x0000000000090f70 0x000000000009f824 0x000000000009bfc8 0x000000000009d030 0x000000000009c7c0 0x00000000000951e0 0x00000000000949c8 <- once 0x0000000000094418 0x00000000000949c8 <— twice 0x000000000009b848 0x00000000000a0e94 <- twice 0x000000000009a176 0x000000006dfff2e0 ??? 0x00000000000996d1 0x00000000000949c8 <- thrice good candidate for watchpoint set 0x000000000009ae70 0x000000000009ae70 0x00000000000a18e7 0x0000000000095470 0x000000000009c458 0x00000000000a0a7c 0x0000000000093e89 0x000000000009beb0 0x0000000000095a37 0x0000000000095408 0x00000000000a06bc 0x000000000009cd48 0x0000000000095408 <- twice 0x000000000009ba41 0x0000000000095408 <- thrice new candidate for watchpoint set 0x0000000000045258 <— smallest ever 0x000000000009c7c0 0x0000000000095408 <- fourth ! the watch point was set here, why didn’t it stop??? argh methodZoneBase was incremented after I set the watchpoint!!! 0x00000000000a12dc 0x00000000000951e0 0x00000000000a0e24 0x000000000009ff8c 0x000000000009c7f6 0x000000000009a626
Then I have chosen to watch the most frequent crash address, that is methodZoneBase+0x0000000000095408, Unfortunately, s commented above, I missed one (I set the watchpoint too early, before methodZoneBase took its definitive address).
And after a few (a bunch of) trials, here is what I found:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 28175 exited with status = 9 (0x00000009) Process 28185 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/Squeak.app/Contents/MacOS/Squeak' (x86_64) Squeak was compiled with optimization - stepping may behave oddly; variables may not be available. Process 28185 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 6.1 frame #0: 0x000000010007bc22 Squeak`-[sqSqueakOSXApplication doHeadlessSetup](self=0x000000010066fc00, _cmd="doHeadlessSetup") at sqSqueakOSXApplication.m:125 [opt] 122 //No super call here as we've already set headless from command line or info.plist 123 extern BOOL gSqueakHeadless; 124 // Notice that setActivationPolicy: is available in OSX 10.6 and later -> 125 if ([NSApp respondsToSelector:@selector(setActivationPolicy:)]) { 126 if (gSqueakHeadless) { 127 [NSApp setActivationPolicy:NSApplicationActivationPolicyProhibited]; 128 } else { Target 0: (Squeak) stopped. (lldb) wa s e -s 1 -- methodZoneBase+0x0000000000095408 Watchpoint created: Watchpoint 33: addr = 0x107c58138 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) cont Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910324 new value: -3689348814741910836 Process 28185 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 33 frame #0: 0x00000001000731dc Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6320 [opt] 6317 sqInt rawHeader; 6318 6319 (method->cmType = CMMethod); -> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 Target 0: (Squeak) stopped. (lldb) Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910836 new value: -3689571654825360894 Process 28185 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 33 frame #0: 0x0000000100073255 Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] 6336 (method->selector = selector); 6337 (method->cmNumArgs = argumentCountOfMethodHeader(methodHeader)); 6338 (method->cmHasMovableLiteral = hasMovableLiteral); -> 6339 if ((method->cmRefersToYoung = hasYoungReferent)) { 6340 addToYoungReferrers(method); 6341 } 6342 (method->cmUsageCount = initialMethodUsageCount()); Target 0: (Squeak) stopped. (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 33 * frame #0: 0x0000000100073255 Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] frame #1: 0x000000010004f90c Squeak`compileCogMethod [inlined] generateCogMethod(selector=<unavailable>) at cogitX64SysV.c:7098 [opt] frame #2: 0x000000010004f7a5 Squeak`compileCogMethod(selector=4435118304) at cogitX64SysV.c:27780 [opt] frame #3: 0x000000010004bc54 Squeak`cogselector(aMethodObj=4468481712, aSelectorOop=4435118304) at cogitX64SysV.c:4833 [opt] frame #4: 0x0000000100013d4d Squeak`primitiveClosureValue at gcc3x-cointerp.c:27237 [opt] frame #5: 0x0000000107bd1925 frame #6: 0x000000010000236b Squeak`interpret at gcc3x-cointerp.c:2767 [opt] frame #7: 0x000000010008ed55 Squeak`-[sqSqueakMainApplication runSqueak](self=0x000000010066fc00, _cmd=<unavailable>) at sqSqueakMainApplication.m:201 [opt] frame #8: 0x00007fff44b077b8 Foundation`__NSFirePerformWithOrder + 360 frame #9: 0x00007fff4297cf57 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ + 23 frame #10: 0x00007fff4297ce7f CoreFoundation`__CFRunLoopDoObservers + 527 frame #11: 0x00007fff4295f3f8 CoreFoundation`__CFRunLoopRun + 1240 frame #12: 0x00007fff4295ec93 CoreFoundation`CFRunLoopRunSpecific + 483 frame #13: 0x00007fff41c49d96 HIToolbox`RunCurrentEventLoopInMode + 286 frame #14: 0x00007fff41c49a0f HIToolbox`ReceiveNextEventCommon + 366 frame #15: 0x00007fff41c49884 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 frame #16: 0x00007fff3fef9a73 AppKit`_DPSNextEvent + 2085 frame #17: 0x00007fff4068fe34 AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 frame #18: 0x00007fff3feee885 AppKit`-[NSApplication run] + 764 frame #19: 0x00007fff3febda72 AppKit`NSApplicationMain + 804 frame #20: 0x00007fff6a88c015 libdyld.dylib`start + 1 (lldb)
Here I don't know how to proceed, because I have no knowledge of this part of the VM, it's also quite late now... Eliot, does it ring some bell?
Le dim. 22 sept. 2019 à 11:21, Jan Vrany jan.vrany@fit.cvut.cz a écrit :
On 2019-09-22 10:57, Nicolas Cellier wrote:
Hi Jan, Thanks, a recording debugger would indeed be a tremendous help! Unfortunately, I'm unable to reproduce the bug under linux, only macosx... Do you know something equivalent to rr for mac?
No, I don't. There might be something though, I never had a mac...
Jan
Le dim. 22 sept. 2019 à 09:55, Jan Vrany jan.vrany@fit.cvut.cz a écrit :
Hi,
- can anyone think of any other strategies I might take to try and
reproduce this?
at the risk of stating the obvious, in cases similar to this rr (+ VDB) helped me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE
Never used with opensmalltalk VM, though.
HTH, Jan
Le mar. 24 sept. 2019 à 00:25, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Follow up: so we have a problem of repeatability... I found nothing like rr for macos, and CPU watchpoint capability is too restricted to be usefull.
So what I did is save a crash.image attempting to crash at resumption time: Smalltalk saveAs: 'crash'. Smalltalk snapshot: true andQuit: true. 5 timesRepeat: [SocketTest suite run].
Then I launched the vm under lldb and gathered $pc stats to see if I could observe some reproducibility (lldb) p/x $pc - methodZoneBase 0x00000000000a0e94 0x0000000000090f70 0x000000000009f824 0x000000000009bfc8 0x000000000009d030 0x000000000009c7c0 0x00000000000951e0 0x00000000000949c8 <- once 0x0000000000094418 0x00000000000949c8 <— twice 0x000000000009b848 0x00000000000a0e94 <- twice 0x000000000009a176 0x000000006dfff2e0 ??? 0x00000000000996d1 0x00000000000949c8 <- thrice good candidate for watchpoint set 0x000000000009ae70 0x000000000009ae70 0x00000000000a18e7 0x0000000000095470 0x000000000009c458 0x00000000000a0a7c 0x0000000000093e89 0x000000000009beb0 0x0000000000095a37 0x0000000000095408 0x00000000000a06bc 0x000000000009cd48 0x0000000000095408 <- twice 0x000000000009ba41 0x0000000000095408 <- thrice new candidate for watchpoint set 0x0000000000045258 <— smallest ever 0x000000000009c7c0 0x0000000000095408 <- fourth ! the watch point was set here, why didn’t it stop??? argh methodZoneBase was incremented after I set the watchpoint!!! 0x00000000000a12dc 0x00000000000951e0 0x00000000000a0e24 0x000000000009ff8c 0x000000000009c7f6 0x000000000009a626
Then I have chosen to watch the most frequent crash address, that is methodZoneBase+0x0000000000095408, Unfortunately, s commented above, I missed one (I set the watchpoint too early, before methodZoneBase took its definitive address).
And after a few (a bunch of) trials, here is what I found:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 28175 exited with status = 9 (0x00000009) Process 28185 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/Squeak.app/Contents/MacOS/Squeak' (x86_64) Squeak was compiled with optimization - stepping may behave oddly; variables may not be available. Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 6.1 frame #0: 0x000000010007bc22 Squeak`-[sqSqueakOSXApplication
doHeadlessSetup](self=0x000000010066fc00, _cmd="doHeadlessSetup") at sqSqueakOSXApplication.m:125 [opt] 122 //No super call here as we've already set headless from command line or info.plist 123 extern BOOL gSqueakHeadless; 124 // Notice that setActivationPolicy: is available in OSX 10.6 and later -> 125 if ([NSApp respondsToSelector:@selector(setActivationPolicy:)]) { 126 if (gSqueakHeadless) { 127 [NSApp setActivationPolicy:NSApplicationActivationPolicyProhibited]; 128 } else { Target 0: (Squeak) stopped. (lldb) wa s e -s 1 -- methodZoneBase+0x0000000000095408 Watchpoint created: Watchpoint 33: addr = 0x107c58138 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) cont Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910324 new value: -3689348814741910836 Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 33 frame #0: 0x00000001000731dc
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6320 [opt] 6317 sqInt rawHeader; 6318 6319 (method->cmType = CMMethod); -> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 Target 0: (Squeak) stopped. (lldb) Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910836 new value: -3689571654825360894 Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 33 frame #0: 0x0000000100073255
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] 6336 (method->selector = selector); 6337 (method->cmNumArgs = argumentCountOfMethodHeader(methodHeader)); 6338 (method->cmHasMovableLiteral = hasMovableLiteral); -> 6339 if ((method->cmRefersToYoung = hasYoungReferent)) { 6340 addToYoungReferrers(method); 6341 } 6342 (method->cmUsageCount = initialMethodUsageCount()); Target 0: (Squeak) stopped. (lldb) bt
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 33
- frame #0: 0x0000000100073255
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] frame #1: 0x000000010004f90c Squeak`compileCogMethod [inlined] generateCogMethod(selector=<unavailable>) at cogitX64SysV.c:7098 [opt] frame #2: 0x000000010004f7a5 Squeak`compileCogMethod(selector=4435118304) at cogitX64SysV.c:27780 [opt] frame #3: 0x000000010004bc54 Squeak`cogselector(aMethodObj=4468481712, aSelectorOop=4435118304) at cogitX64SysV.c:4833 [opt] frame #4: 0x0000000100013d4d Squeak`primitiveClosureValue at gcc3x-cointerp.c:27237 [opt] frame #5: 0x0000000107bd1925 frame #6: 0x000000010000236b Squeak`interpret at gcc3x-cointerp.c:2767 [opt] frame #7: 0x000000010008ed55 Squeak`-[sqSqueakMainApplication runSqueak](self=0x000000010066fc00, _cmd=<unavailable>) at sqSqueakMainApplication.m:201 [opt] frame #8: 0x00007fff44b077b8 Foundation`__NSFirePerformWithOrder + 360 frame #9: 0x00007fff4297cf57 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__
- 23 frame #10: 0x00007fff4297ce7f CoreFoundation`__CFRunLoopDoObservers +
527 frame #11: 0x00007fff4295f3f8 CoreFoundation`__CFRunLoopRun + 1240 frame #12: 0x00007fff4295ec93 CoreFoundation`CFRunLoopRunSpecific + 483 frame #13: 0x00007fff41c49d96 HIToolbox`RunCurrentEventLoopInMode + 286 frame #14: 0x00007fff41c49a0f HIToolbox`ReceiveNextEventCommon + 366 frame #15: 0x00007fff41c49884 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 frame #16: 0x00007fff3fef9a73 AppKit`_DPSNextEvent + 2085 frame #17: 0x00007fff4068fe34 AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 frame #18: 0x00007fff3feee885 AppKit`-[NSApplication run] + 764 frame #19: 0x00007fff3febda72 AppKit`NSApplicationMain + 804 frame #20: 0x00007fff6a88c015 libdyld.dylib`start + 1 (lldb)
Here I don't know how to proceed, because I have no knowledge of this part of the VM, it's also quite late now... Eliot, does it ring some bell?
Ah, forget it, I thought it was overwriting, but it is the first write on yet virgin methodeZone... (it's just that the scenario was different from all the previous run)
Le dim. 22 sept. 2019 à 11:21, Jan Vrany jan.vrany@fit.cvut.cz a écrit :
On 2019-09-22 10:57, Nicolas Cellier wrote:
Hi Jan, Thanks, a recording debugger would indeed be a tremendous help! Unfortunately, I'm unable to reproduce the bug under linux, only macosx... Do you know something equivalent to rr for mac?
No, I don't. There might be something though, I never had a mac...
Jan
Le dim. 22 sept. 2019 à 09:55, Jan Vrany jan.vrany@fit.cvut.cz a écrit :
Hi,
- can anyone think of any other strategies I might take to try and
reproduce this?
at the risk of stating the obvious, in cases similar to this rr (+ VDB) helped me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE
Never used with opensmalltalk VM, though.
HTH, Jan
I noticed that many crash happen in testSendTimeout. This test is organizing a race: - Smalltalk fill the socket send buffer - the OS tries to drain
The other thing I noticed is that [SocketTest suite run] takes 7 to 8s on macos while only 2 to 3s on ubuntu and windows So it might be that some test times out on macos, while it doesn't on other OS.
This would explain that many crash also happen in JITted Timer loop, and that we cannot observe it in other OSes.
The access to OS ressources and the race may also explain some randomization of the crash...
So one idea would be to make the test timeout in linux too, see if we can make it crash then try using rr. Maybe it's possible with a huge Socket buffer, a smaller image side buffer (we could reduce the size from 1000 to 1 so as to increase overhead).
It's just theories I tried to construct while biking, not in front of a keyboard, and I won't be able to verify until this evening... But the watchpoint strategy is too much vain shooting in the dark, we need more ellaborated plans ;)
Le mar. 24 sept. 2019 à 00:28, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Le mar. 24 sept. 2019 à 00:25, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Follow up: so we have a problem of repeatability... I found nothing like rr for macos, and CPU watchpoint capability is too restricted to be usefull.
So what I did is save a crash.image attempting to crash at resumption time: Smalltalk saveAs: 'crash'. Smalltalk snapshot: true andQuit: true. 5 timesRepeat: [SocketTest suite run].
Then I launched the vm under lldb and gathered $pc stats to see if I could observe some reproducibility (lldb) p/x $pc - methodZoneBase 0x00000000000a0e94 0x0000000000090f70 0x000000000009f824 0x000000000009bfc8 0x000000000009d030 0x000000000009c7c0 0x00000000000951e0 0x00000000000949c8 <- once 0x0000000000094418 0x00000000000949c8 <— twice 0x000000000009b848 0x00000000000a0e94 <- twice 0x000000000009a176 0x000000006dfff2e0 ??? 0x00000000000996d1 0x00000000000949c8 <- thrice good candidate for watchpoint set 0x000000000009ae70 0x000000000009ae70 0x00000000000a18e7 0x0000000000095470 0x000000000009c458 0x00000000000a0a7c 0x0000000000093e89 0x000000000009beb0 0x0000000000095a37 0x0000000000095408 0x00000000000a06bc 0x000000000009cd48 0x0000000000095408 <- twice 0x000000000009ba41 0x0000000000095408 <- thrice new candidate for watchpoint set 0x0000000000045258 <— smallest ever 0x000000000009c7c0 0x0000000000095408 <- fourth ! the watch point was set here, why didn’t it stop??? argh methodZoneBase was incremented after I set the watchpoint!!! 0x00000000000a12dc 0x00000000000951e0 0x00000000000a0e24 0x000000000009ff8c 0x000000000009c7f6 0x000000000009a626
Then I have chosen to watch the most frequent crash address, that is methodZoneBase+0x0000000000095408, Unfortunately, s commented above, I missed one (I set the watchpoint too early, before methodZoneBase took its definitive address).
And after a few (a bunch of) trials, here is what I found:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 28175 exited with status = 9 (0x00000009) Process 28185 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/Squeak.app/Contents/MacOS/Squeak' (x86_64) Squeak was compiled with optimization - stepping may behave oddly; variables may not be available. Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 6.1 frame #0: 0x000000010007bc22 Squeak`-[sqSqueakOSXApplication
doHeadlessSetup](self=0x000000010066fc00, _cmd="doHeadlessSetup") at sqSqueakOSXApplication.m:125 [opt] 122 //No super call here as we've already set headless from command line or info.plist 123 extern BOOL gSqueakHeadless; 124 // Notice that setActivationPolicy: is available in OSX 10.6 and later -> 125 if ([NSApp respondsToSelector:@selector(setActivationPolicy:)]) { 126 if (gSqueakHeadless) { 127 [NSApp setActivationPolicy:NSApplicationActivationPolicyProhibited]; 128 } else { Target 0: (Squeak) stopped. (lldb) wa s e -s 1 -- methodZoneBase+0x0000000000095408 Watchpoint created: Watchpoint 33: addr = 0x107c58138 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) cont Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910324 new value: -3689348814741910836 Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 33 frame #0: 0x00000001000731dc
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6320 [opt] 6317 sqInt rawHeader; 6318 6319 (method->cmType = CMMethod); -> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 Target 0: (Squeak) stopped. (lldb) Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910836 new value: -3689571654825360894 Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 33 frame #0: 0x0000000100073255
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] 6336 (method->selector = selector); 6337 (method->cmNumArgs = argumentCountOfMethodHeader(methodHeader)); 6338 (method->cmHasMovableLiteral = hasMovableLiteral); -> 6339 if ((method->cmRefersToYoung = hasYoungReferent)) { 6340 addToYoungReferrers(method); 6341 } 6342 (method->cmUsageCount = initialMethodUsageCount()); Target 0: (Squeak) stopped. (lldb) bt
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 33
- frame #0: 0x0000000100073255
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] frame #1: 0x000000010004f90c Squeak`compileCogMethod [inlined] generateCogMethod(selector=<unavailable>) at cogitX64SysV.c:7098 [opt] frame #2: 0x000000010004f7a5 Squeak`compileCogMethod(selector=4435118304) at cogitX64SysV.c:27780 [opt] frame #3: 0x000000010004bc54 Squeak`cogselector(aMethodObj=4468481712, aSelectorOop=4435118304) at cogitX64SysV.c:4833 [opt] frame #4: 0x0000000100013d4d Squeak`primitiveClosureValue at gcc3x-cointerp.c:27237 [opt] frame #5: 0x0000000107bd1925 frame #6: 0x000000010000236b Squeak`interpret at gcc3x-cointerp.c:2767 [opt] frame #7: 0x000000010008ed55 Squeak`-[sqSqueakMainApplication runSqueak](self=0x000000010066fc00, _cmd=<unavailable>) at sqSqueakMainApplication.m:201 [opt] frame #8: 0x00007fff44b077b8 Foundation`__NSFirePerformWithOrder + 360 frame #9: 0x00007fff4297cf57 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__
- 23 frame #10: 0x00007fff4297ce7f CoreFoundation`__CFRunLoopDoObservers +
527 frame #11: 0x00007fff4295f3f8 CoreFoundation`__CFRunLoopRun + 1240 frame #12: 0x00007fff4295ec93 CoreFoundation`CFRunLoopRunSpecific + 483 frame #13: 0x00007fff41c49d96 HIToolbox`RunCurrentEventLoopInMode + 286 frame #14: 0x00007fff41c49a0f HIToolbox`ReceiveNextEventCommon + 366 frame #15: 0x00007fff41c49884 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 frame #16: 0x00007fff3fef9a73 AppKit`_DPSNextEvent + 2085 frame #17: 0x00007fff4068fe34 AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 frame #18: 0x00007fff3feee885 AppKit`-[NSApplication run] + 764 frame #19: 0x00007fff3febda72 AppKit`NSApplicationMain + 804 frame #20: 0x00007fff6a88c015 libdyld.dylib`start + 1 (lldb)
Here I don't know how to proceed, because I have no knowledge of this part of the VM, it's also quite late now... Eliot, does it ring some bell?
Ah, forget it, I thought it was overwriting, but it is the first write on yet virgin methodeZone... (it's just that the scenario was different from all the previous run)
Le dim. 22 sept. 2019 à 11:21, Jan Vrany jan.vrany@fit.cvut.cz a écrit :
On 2019-09-22 10:57, Nicolas Cellier wrote:
Hi Jan, Thanks, a recording debugger would indeed be a tremendous help! Unfortunately, I'm unable to reproduce the bug under linux, only macosx... Do you know something equivalent to rr for mac?
No, I don't. There might be something though, I never had a mac...
Jan
Le dim. 22 sept. 2019 à 09:55, Jan Vrany jan.vrany@fit.cvut.cz a écrit :
Hi,
- can anyone think of any other strategies I might take to try and
reproduce this?
at the risk of stating the obvious, in cases similar to this rr (+ VDB) helped me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE
Never used with opensmalltalk VM, though.
HTH, Jan
On Wed, 25 Sep 2019 at 15:32, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
I noticed that many crash happen in testSendTimeout. This test is organizing a race:
- Smalltalk fill the socket send buffer
- the OS tries to drain
The other thing I noticed is that [SocketTest suite run] takes 7 to 8s on macos while only 2 to 3s on ubuntu and windows So it might be that some test times out on macos, while it doesn't on other OS.
This would explain that many crash also happen in JITted Timer loop, and that we cannot observe it in other OSes.
The access to OS ressources and the race may also explain some randomization of the crash...
So one idea would be to make the test timeout in linux too, see if we can make it crash then try using rr. Maybe it's possible with a huge Socket buffer, a smaller image side buffer (we could reduce the size from 1000 to 1 so as to increase overhead).
How that might be done spiked my interest, so I had a poke around... * https://serverfault.com/questions/799605/set-large-buffer-queue-on-a-network... * https://coderwall.com/p/wuwoja/simulate-network-latency-to-debug-connection-... * https://stackoverflow.com/questions/614795/simulate-delayed-and-dropped-pack...
But those might not directly affect the send buffer, so maybe an alternative... * https://utcc.utoronto.ca/~cks/space/blog/linux/TCPSendbufferDefaultSize * https://stackoverflow.com/questions/47350028/how-to-tune-linux-network-buffe... * https://www.cyberciti.biz/faq/linux-tcp-tuning/
cheers -ben
P.S. Probably not specifically useful, but Oh Wow! just-too-impressive and more-than-you-could-possible-want-to-know about Linux networking... https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networki...
Thanks Ben, nice references! I opted for a simple reduction of TestCase timeout and/or Smalltalk buffer size, and could easily obtain a TestCase timeout on linux. But unfortunately, no crash, no rr... I have reduced the crash to testSendTimeout on macos, something like
Smalltalk saveAs: 'crash'. Smalltalk snapshot: true andQuit: true. 20 timesRepeat: [[(SocketTest selector: #testSendTimeout) runCase] on: TestFailure do: [:exc|]].
Hoping that narrowing will improve repeatability...
Le mer. 25 sept. 2019 à 14:58, Ben Coman btc@openinworld.com a écrit :
On Wed, 25 Sep 2019 at 15:32, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
I noticed that many crash happen in testSendTimeout. This test is organizing a race:
- Smalltalk fill the socket send buffer
- the OS tries to drain
The other thing I noticed is that [SocketTest suite run] takes 7 to 8s on macos while only 2 to 3s on ubuntu and windows So it might be that some test times out on macos, while it doesn't on other OS.
This would explain that many crash also happen in JITted Timer loop, and that we cannot observe it in other OSes.
The access to OS ressources and the race may also explain some randomization of the crash...
So one idea would be to make the test timeout in linux too, see if we can make it crash then try using rr. Maybe it's possible with a huge Socket buffer, a smaller image side buffer (we could reduce the size from 1000 to 1 so as to increase overhead).
How that might be done spiked my interest, so I had a poke around...
https://serverfault.com/questions/799605/set-large-buffer-queue-on-a-network...
https://coderwall.com/p/wuwoja/simulate-network-latency-to-debug-connection-...
https://stackoverflow.com/questions/614795/simulate-delayed-and-dropped-pack...
But those might not directly affect the send buffer, so maybe an alternative...
https://stackoverflow.com/questions/47350028/how-to-tune-linux-network-buffe...
cheers -ben
P.S. Probably not specifically useful, but Oh Wow! just-too-impressive and more-than-you-could-possible-want-to-know about Linux networking...
https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networki...
Hi Nicolas,
On Wed, Sep 25, 2019 at 12:32 AM Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
I noticed that many crash happen in testSendTimeout. This test is organizing a race:
- Smalltalk fill the socket send buffer
- the OS tries to drain
The other thing I noticed is that [SocketTest suite run] takes 7 to 8s on macos while only 2 to 3s on ubuntu and windows So it might be that some test times out on macos, while it doesn't on other OS.
This would explain that many crash also happen in JITted Timer loop, and that we cannot observe it in other OSes.
Yes, that is, I think, an important clue. The other one is that, given that the crash always happens only early in a run, plus the address at which it crashes in machine code, suggests to me that the crash occurs when the interpreter invokes a block in the JITted startTimerLoop.
What worries me is that the JITted code for startTimerLoop looks completely corrupted, not merely moved or overwritten (the method header looks OK up to the selector field, which is the last field in the header). So I'm wondering whether it could be a bug in support code overwriting machine code, and hence that not might be related to a 32-bit vs 64-bit code issue. I was hoping your recent sqUnixSocket.c changes might work but I'm sure they have no bearing.
The access to OS ressources and the race may also explain some
randomization of the crash...
So one idea would be to make the test timeout in linux too, see if we can make it crash then try using rr. Maybe it's possible with a huge Socket buffer, a smaller image side buffer (we could reduce the size from 1000 to 1 so as to increase overhead).
Yes, that's a good idea. But I do think we need to set up a reliable remote server, som thing that acts like an infinite sink for receiving a par4ticular large Monticello package, so that ev very time the example runs the same package is indeed copied to a remote server. Would it be possible to construct such a fake Monticello server?
What I'm wanting the server to do is a) take a while to respond to the Monticello package contents request b) to never include a particular package which the example would always try and upload
The test case would be to copy a particular version of VMMaker from package-cache to the remote "fake" server
It's just theories I tried to construct while biking, not in front of a keyboard, and I won't be able to verify until this evening... But the watchpoint strategy is too much vain shooting in the dark, we need more ellaborated plans ;)
Does anyone see the crash on macOS in 32 bits?
Le mar. 24 sept. 2019 à 00:28, Nicolas Cellier <
nicolas.cellier.aka.nice@gmail.com> a écrit :
Le mar. 24 sept. 2019 à 00:25, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Follow up: so we have a problem of repeatability... I found nothing like rr for macos, and CPU watchpoint capability is too restricted to be usefull.
So what I did is save a crash.image attempting to crash at resumption time: Smalltalk saveAs: 'crash'. Smalltalk snapshot: true andQuit: true. 5 timesRepeat: [SocketTest suite run].
Then I launched the vm under lldb and gathered $pc stats to see if I could observe some reproducibility (lldb) p/x $pc - methodZoneBase 0x00000000000a0e94 0x0000000000090f70 0x000000000009f824 0x000000000009bfc8 0x000000000009d030 0x000000000009c7c0 0x00000000000951e0 0x00000000000949c8 <- once 0x0000000000094418 0x00000000000949c8 <— twice 0x000000000009b848 0x00000000000a0e94 <- twice 0x000000000009a176 0x000000006dfff2e0 ??? 0x00000000000996d1 0x00000000000949c8 <- thrice good candidate for watchpoint set 0x000000000009ae70 0x000000000009ae70 0x00000000000a18e7 0x0000000000095470 0x000000000009c458 0x00000000000a0a7c 0x0000000000093e89 0x000000000009beb0 0x0000000000095a37 0x0000000000095408 0x00000000000a06bc 0x000000000009cd48 0x0000000000095408 <- twice 0x000000000009ba41 0x0000000000095408 <- thrice new candidate for watchpoint set 0x0000000000045258 <— smallest ever 0x000000000009c7c0 0x0000000000095408 <- fourth ! the watch point was set here, why didn’t it stop??? argh methodZoneBase was incremented after I set the watchpoint!!! 0x00000000000a12dc 0x00000000000951e0 0x00000000000a0e24 0x000000000009ff8c 0x000000000009c7f6 0x000000000009a626
Then I have chosen to watch the most frequent crash address, that is methodZoneBase+0x0000000000095408, Unfortunately, s commented above, I missed one (I set the watchpoint too early, before methodZoneBase took its definitive address).
And after a few (a bunch of) trials, here is what I found:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 28175 exited with status = 9 (0x00000009) Process 28185 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/Squeak.app/Contents/MacOS/Squeak' (x86_64) Squeak was compiled with optimization - stepping may behave oddly; variables may not be available. Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint
6.1 frame #0: 0x000000010007bc22 Squeak`-[sqSqueakOSXApplication doHeadlessSetup](self=0x000000010066fc00, _cmd="doHeadlessSetup") at sqSqueakOSXApplication.m:125 [opt] 122 //No super call here as we've already set headless from command line or info.plist 123 extern BOOL gSqueakHeadless; 124 // Notice that setActivationPolicy: is available in OSX 10.6 and later -> 125 if ([NSApp respondsToSelector:@selector(setActivationPolicy:)]) { 126 if (gSqueakHeadless) { 127 [NSApp setActivationPolicy:NSApplicationActivationPolicyProhibited]; 128 } else { Target 0: (Squeak) stopped. (lldb) wa s e -s 1 -- methodZoneBase+0x0000000000095408 Watchpoint created: Watchpoint 33: addr = 0x107c58138 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) cont Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910324 new value: -3689348814741910836 Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 33 frame #0: 0x00000001000731dc
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6320 [opt] 6317 sqInt rawHeader; 6318 6319 (method->cmType = CMMethod); -> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 Target 0: (Squeak) stopped. (lldb) Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910836 new value: -3689571654825360894 Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 33 frame #0: 0x0000000100073255
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] 6336 (method->selector = selector); 6337 (method->cmNumArgs = argumentCountOfMethodHeader(methodHeader)); 6338 (method->cmHasMovableLiteral = hasMovableLiteral); -> 6339 if ((method->cmRefersToYoung = hasYoungReferent)) { 6340 addToYoungReferrers(method); 6341 } 6342 (method->cmUsageCount = initialMethodUsageCount()); Target 0: (Squeak) stopped. (lldb) bt
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 33
- frame #0: 0x0000000100073255
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] frame #1: 0x000000010004f90c Squeak`compileCogMethod [inlined] generateCogMethod(selector=<unavailable>) at cogitX64SysV.c:7098 [opt] frame #2: 0x000000010004f7a5 Squeak`compileCogMethod(selector=4435118304) at cogitX64SysV.c:27780 [opt] frame #3: 0x000000010004bc54 Squeak`cogselector(aMethodObj=4468481712, aSelectorOop=4435118304) at cogitX64SysV.c:4833 [opt] frame #4: 0x0000000100013d4d Squeak`primitiveClosureValue at gcc3x-cointerp.c:27237 [opt] frame #5: 0x0000000107bd1925 frame #6: 0x000000010000236b Squeak`interpret at gcc3x-cointerp.c:2767 [opt] frame #7: 0x000000010008ed55 Squeak`-[sqSqueakMainApplication runSqueak](self=0x000000010066fc00, _cmd=<unavailable>) at sqSqueakMainApplication.m:201 [opt] frame #8: 0x00007fff44b077b8 Foundation`__NSFirePerformWithOrder + 360 frame #9: 0x00007fff4297cf57 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__
- 23 frame #10: 0x00007fff4297ce7f CoreFoundation`__CFRunLoopDoObservers
- 527 frame #11: 0x00007fff4295f3f8 CoreFoundation`__CFRunLoopRun + 1240 frame #12: 0x00007fff4295ec93 CoreFoundation`CFRunLoopRunSpecific +
483 frame #13: 0x00007fff41c49d96 HIToolbox`RunCurrentEventLoopInMode + 286 frame #14: 0x00007fff41c49a0f HIToolbox`ReceiveNextEventCommon + 366 frame #15: 0x00007fff41c49884 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 frame #16: 0x00007fff3fef9a73 AppKit`_DPSNextEvent + 2085 frame #17: 0x00007fff4068fe34 AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 frame #18: 0x00007fff3feee885 AppKit`-[NSApplication run] + 764 frame #19: 0x00007fff3febda72 AppKit`NSApplicationMain + 804 frame #20: 0x00007fff6a88c015 libdyld.dylib`start + 1 (lldb)
Here I don't know how to proceed, because I have no knowledge of this part of the VM, it's also quite late now... Eliot, does it ring some bell?
Ah, forget it, I thought it was overwriting, but it is the first write on yet virgin methodeZone... (it's just that the scenario was different from all the previous run)
Le dim. 22 sept. 2019 à 11:21, Jan Vrany jan.vrany@fit.cvut.cz a
écrit :
On 2019-09-22 10:57, Nicolas Cellier wrote:
Hi Jan, Thanks, a recording debugger would indeed be a tremendous help! Unfortunately, I'm unable to reproduce the bug under linux, only macosx... Do you know something equivalent to rr for mac?
No, I don't. There might be something though, I never had a mac...
Jan
Le dim. 22 sept. 2019 à 09:55, Jan Vrany jan.vrany@fit.cvut.cz a écrit :
Hi,
> > - can anyone think of any other strategies I might take to try and > reproduce this? > at the risk of stating the obvious, in cases similar to this rr (+ VDB) helped me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE
Never used with opensmalltalk VM, though.
HTH, Jan
Hi Eliot, so I have set watchpoint to 4 previous crash and got this:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 98150 exited with status = 9 (0x00000009) Process 98164 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/SqueakDebug.app/Contents/MacOS/Squeak' (x86_64) Process 98164 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 frame #0: 0x00000001001618ec Squeak`-[sqSqueakMainApplication setupMenus](self=0x000000010076e1a0, _cmd="setupMenus") at sqSqueakMainApplication.m:123 120 121 - (void) setupMenus { 122 // nothing to do so far since the menu is setup in the MainMenu.nib file -> 123 } 124 125 - (void) setupTimers { 126 extern void SetUpTimers(void); Target 0: (Squeak) stopped. (lldb) wa se e -s 1 -- 0x107c992b4 Watchpoint created: Watchpoint 35: addr = 0x107c992b4 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x107c929e0 Watchpoint created: Watchpoint 34: addr = 0x107c929e0 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c945df Watchpoint created: Watchpoint 36: addr = 0x107c945df size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c92da3 Watchpoint created: Watchpoint 37: addr = 0x107c92da3 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) c Process 98164 resuming
Watchpoint 34 hit: new value: -3689348814741910456 Process 98164 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 34 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689348814741910324 new value: -3689912642157150198 Process 98164 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 37 frame #0: 0x000000010012de84 Squeak`fillInMethodHeadersizeselector(method=0x0000000107c92da0, size=1240, selector=4435376352) at cogitX64SysV.c:6321 6318 6319 (method->cmType = CMMethod); 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); -> 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 6324 /* If the method has already been cogged (e.g. Newspeak accessors) then Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 36 hit: new value: -3689348814741910520 Process 98164 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 36 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="\b???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=8) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 35 hit: new value: -3689348814741910456 Process 98164 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 35 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689912642157150198 new value: -5066732099722018553 Process 98164 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 37 frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at gcc3x-cointerp.c:64151 64148 /* StackInterpreter>>#forceInterruptCheck */ 64149 sqInt 64150 forceInterruptCheck(void) -> 64151 { DECL_MAYBE_SQ_GLOBAL_STRUCT 64152 void (*iccFunc)(); 64153 StackPage *thePage; 64154 Target 0: (Squeak) stopped. (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 37 * frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at gcc3x-cointerp.c:64151 frame #1: 0x00007fff6ab9af5a libsystem_platform.dylib`_sigtramp + 26 frame #2: 0x0000000107c10699 frame #3: 0x0000000107c9646f frame #4: 0x0000000107c9633e frame #5: 0x0000000107c962ac frame #6: 0x0000000107c961e9 frame #7: 0x0000000107c8fcf8 frame #8: 0x0000000107c98605 frame #9: 0x0000000107c8bbfe frame #10: 0x0000000107c1100e frame #11: 0x0000000107c8c4ad frame #12: 0x0000000107c0cef6 frame #13: 0x0000000107c8c2a5 frame #14: 0x0000000107c8bb4c frame #15: 0x0000000107c0cef6 frame #16: 0x0000000107c8ba05 frame #17: 0x0000000107c8b85b frame #18: 0x0000000107c1100e frame #19: 0x0000000107c8b7e1 frame #20: 0x0000000107c57018 frame #21: 0x0000000107c018c8 (lldb) call printCallStack()
Watchpoint 34 hit: old value: -3689348814741910456 new value: 4435376368 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation. (lldb) call printCogMethodFor((void*)0x0000000107c10699)
Watchpoint 34 hit: old value: 4435376368 new value: 0 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation.
I don't know how to proceed, but maybe there's enough clue for you...
Le lun. 30 sept. 2019 à 01:28, Eliot Miranda eliot.miranda@gmail.com a écrit :
Hi Nicolas,
On Wed, Sep 25, 2019 at 12:32 AM Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
I noticed that many crash happen in testSendTimeout. This test is organizing a race:
- Smalltalk fill the socket send buffer
- the OS tries to drain
The other thing I noticed is that [SocketTest suite run] takes 7 to 8s on macos while only 2 to 3s on ubuntu and windows So it might be that some test times out on macos, while it doesn't on other OS.
This would explain that many crash also happen in JITted Timer loop, and that we cannot observe it in other OSes.
Yes, that is, I think, an important clue. The other one is that, given that the crash always happens only early in a run, plus the address at which it crashes in machine code, suggests to me that the crash occurs when the interpreter invokes a block in the JITted startTimerLoop.
What worries me is that the JITted code for startTimerLoop looks completely corrupted, not merely moved or overwritten (the method header looks OK up to the selector field, which is the last field in the header). So I'm wondering whether it could be a bug in support code overwriting machine code, and hence that not might be related to a 32-bit vs 64-bit code issue. I was hoping your recent sqUnixSocket.c changes might work but I'm sure they have no bearing.
The access to OS ressources and the race may also explain some
randomization of the crash...
So one idea would be to make the test timeout in linux too, see if we can make it crash then try using rr. Maybe it's possible with a huge Socket buffer, a smaller image side buffer (we could reduce the size from 1000 to 1 so as to increase overhead).
Yes, that's a good idea. But I do think we need to set up a reliable remote server, som thing that acts like an infinite sink for receiving a par4ticular large Monticello package, so that ev very time the example runs the same package is indeed copied to a remote server. Would it be possible to construct such a fake Monticello server?
What I'm wanting the server to do is a) take a while to respond to the Monticello package contents request b) to never include a particular package which the example would always try and upload
The test case would be to copy a particular version of VMMaker from package-cache to the remote "fake" server
It's just theories I tried to construct while biking, not in front of a keyboard, and I won't be able to verify until this evening... But the watchpoint strategy is too much vain shooting in the dark, we need more ellaborated plans ;)
Does anyone see the crash on macOS in 32 bits?
Le mar. 24 sept. 2019 à 00:28, Nicolas Cellier <
nicolas.cellier.aka.nice@gmail.com> a écrit :
Le mar. 24 sept. 2019 à 00:25, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Follow up: so we have a problem of repeatability... I found nothing like rr for macos, and CPU watchpoint capability is too restricted to be usefull.
So what I did is save a crash.image attempting to crash at resumption time: Smalltalk saveAs: 'crash'. Smalltalk snapshot: true andQuit: true. 5 timesRepeat: [SocketTest suite run].
Then I launched the vm under lldb and gathered $pc stats to see if I could observe some reproducibility (lldb) p/x $pc - methodZoneBase 0x00000000000a0e94 0x0000000000090f70 0x000000000009f824 0x000000000009bfc8 0x000000000009d030 0x000000000009c7c0 0x00000000000951e0 0x00000000000949c8 <- once 0x0000000000094418 0x00000000000949c8 <— twice 0x000000000009b848 0x00000000000a0e94 <- twice 0x000000000009a176 0x000000006dfff2e0 ??? 0x00000000000996d1 0x00000000000949c8 <- thrice good candidate for watchpoint set 0x000000000009ae70 0x000000000009ae70 0x00000000000a18e7 0x0000000000095470 0x000000000009c458 0x00000000000a0a7c 0x0000000000093e89 0x000000000009beb0 0x0000000000095a37 0x0000000000095408 0x00000000000a06bc 0x000000000009cd48 0x0000000000095408 <- twice 0x000000000009ba41 0x0000000000095408 <- thrice new candidate for watchpoint set 0x0000000000045258 <— smallest ever 0x000000000009c7c0 0x0000000000095408 <- fourth ! the watch point was set here, why didn’t it stop??? argh methodZoneBase was incremented after I set the watchpoint!!! 0x00000000000a12dc 0x00000000000951e0 0x00000000000a0e24 0x000000000009ff8c 0x000000000009c7f6 0x000000000009a626
Then I have chosen to watch the most frequent crash address, that is methodZoneBase+0x0000000000095408, Unfortunately, s commented above, I missed one (I set the watchpoint too early, before methodZoneBase took its definitive address).
And after a few (a bunch of) trials, here is what I found:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 28175 exited with status = 9 (0x00000009) Process 28185 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/Squeak.app/Contents/MacOS/Squeak' (x86_64) Squeak was compiled with optimization - stepping may behave oddly; variables may not be available. Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint
6.1 frame #0: 0x000000010007bc22 Squeak`-[sqSqueakOSXApplication doHeadlessSetup](self=0x000000010066fc00, _cmd="doHeadlessSetup") at sqSqueakOSXApplication.m:125 [opt] 122 //No super call here as we've already set headless from command line or info.plist 123 extern BOOL gSqueakHeadless; 124 // Notice that setActivationPolicy: is available in OSX 10.6 and later -> 125 if ([NSApp respondsToSelector:@selector(setActivationPolicy:)]) { 126 if (gSqueakHeadless) { 127 [NSApp setActivationPolicy:NSApplicationActivationPolicyProhibited]; 128 } else { Target 0: (Squeak) stopped. (lldb) wa s e -s 1 -- methodZoneBase+0x0000000000095408 Watchpoint created: Watchpoint 33: addr = 0x107c58138 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) cont Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910324 new value: -3689348814741910836 Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
33 frame #0: 0x00000001000731dc Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6320 [opt] 6317 sqInt rawHeader; 6318 6319 (method->cmType = CMMethod); -> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 Target 0: (Squeak) stopped. (lldb) Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910836 new value: -3689571654825360894 Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
33 frame #0: 0x0000000100073255 Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] 6336 (method->selector = selector); 6337 (method->cmNumArgs = argumentCountOfMethodHeader(methodHeader)); 6338 (method->cmHasMovableLiteral = hasMovableLiteral); -> 6339 if ((method->cmRefersToYoung = hasYoungReferent)) { 6340 addToYoungReferrers(method); 6341 } 6342 (method->cmUsageCount = initialMethodUsageCount()); Target 0: (Squeak) stopped. (lldb) bt
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
33
- frame #0: 0x0000000100073255
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] frame #1: 0x000000010004f90c Squeak`compileCogMethod [inlined] generateCogMethod(selector=<unavailable>) at cogitX64SysV.c:7098 [opt] frame #2: 0x000000010004f7a5 Squeak`compileCogMethod(selector=4435118304) at cogitX64SysV.c:27780 [opt] frame #3: 0x000000010004bc54 Squeak`cogselector(aMethodObj=4468481712, aSelectorOop=4435118304) at cogitX64SysV.c:4833 [opt] frame #4: 0x0000000100013d4d Squeak`primitiveClosureValue at gcc3x-cointerp.c:27237 [opt] frame #5: 0x0000000107bd1925 frame #6: 0x000000010000236b Squeak`interpret at gcc3x-cointerp.c:2767 [opt] frame #7: 0x000000010008ed55 Squeak`-[sqSqueakMainApplication runSqueak](self=0x000000010066fc00, _cmd=<unavailable>) at sqSqueakMainApplication.m:201 [opt] frame #8: 0x00007fff44b077b8 Foundation`__NSFirePerformWithOrder + 360 frame #9: 0x00007fff4297cf57 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__
- 23 frame #10: 0x00007fff4297ce7f CoreFoundation`__CFRunLoopDoObservers
- 527 frame #11: 0x00007fff4295f3f8 CoreFoundation`__CFRunLoopRun + 1240 frame #12: 0x00007fff4295ec93 CoreFoundation`CFRunLoopRunSpecific +
483 frame #13: 0x00007fff41c49d96 HIToolbox`RunCurrentEventLoopInMode + 286 frame #14: 0x00007fff41c49a0f HIToolbox`ReceiveNextEventCommon + 366 frame #15: 0x00007fff41c49884 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 frame #16: 0x00007fff3fef9a73 AppKit`_DPSNextEvent + 2085 frame #17: 0x00007fff4068fe34 AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 frame #18: 0x00007fff3feee885 AppKit`-[NSApplication run] + 764 frame #19: 0x00007fff3febda72 AppKit`NSApplicationMain + 804 frame #20: 0x00007fff6a88c015 libdyld.dylib`start + 1 (lldb)
Here I don't know how to proceed, because I have no knowledge of this part of the VM, it's also quite late now... Eliot, does it ring some bell?
Ah, forget it, I thought it was overwriting, but it is the first write on yet virgin methodeZone... (it's just that the scenario was different from all the previous run)
Le dim. 22 sept. 2019 à 11:21, Jan Vrany jan.vrany@fit.cvut.cz a
écrit :
On 2019-09-22 10:57, Nicolas Cellier wrote:
Hi Jan, Thanks, a recording debugger would indeed be a tremendous help! Unfortunately, I'm unable to reproduce the bug under linux, only macosx... Do you know something equivalent to rr for mac?
No, I don't. There might be something though, I never had a mac...
Jan
Le dim. 22 sept. 2019 à 09:55, Jan Vrany jan.vrany@fit.cvut.cz a écrit :
> Hi, > >> >> - can anyone think of any other strategies I might take to try and >> reproduce this? >> > at the risk of stating the obvious, in cases similar to this rr (+ > VDB) > helped > me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE > > Never used with opensmalltalk VM, though. > > HTH, Jan
-- _,,,^..^,,,_ best, Eliot
Hi all, So I finally got it! That's our specific handling of SIGIO with forceInterruptCheck. Don't you JIT the signal handler!
The signal handler is using a different stack pointer so as to not interfere with the program. Until Cog decide to jump into JITTED code from within the signal handler. It now share a stack pointer pointing somewhere in cog methodZone, and anything it does with it's stack (call a function, save registers...) will destroy the JITTED code. It even happened when I used the debugger which shares the interrupt handler stack pointer!
Why does it not happen in linux, and why only Mac OS 64, I have no clue... But we can't decently live very long with such handling.
Le mer. 2 oct. 2019 à 23:29, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Hi Eliot, so I have set watchpoint to 4 previous crash and got this:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 98150 exited with status = 9 (0x00000009) Process 98164 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/SqueakDebug.app/Contents/MacOS/Squeak' (x86_64) Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 frame #0: 0x00000001001618ec Squeak`-[sqSqueakMainApplication
setupMenus](self=0x000000010076e1a0, _cmd="setupMenus") at sqSqueakMainApplication.m:123 120 121 - (void) setupMenus { 122 // nothing to do so far since the menu is setup in the MainMenu.nib file -> 123 } 124 125 - (void) setupTimers { 126 extern void SetUpTimers(void); Target 0: (Squeak) stopped. (lldb) wa se e -s 1 -- 0x107c992b4 Watchpoint created: Watchpoint 35: addr = 0x107c992b4 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x107c929e0 Watchpoint created: Watchpoint 34: addr = 0x107c929e0 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c945df Watchpoint created: Watchpoint 36: addr = 0x107c945df size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c92da3 Watchpoint created: Watchpoint 37: addr = 0x107c92da3 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) c Process 98164 resuming
Watchpoint 34 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 34 frame #0: 0x0000000100102067
Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689348814741910324 new value: -3689912642157150198 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 37 frame #0: 0x000000010012de84
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c92da0, size=1240, selector=4435376352) at cogitX64SysV.c:6321 6318 6319 (method->cmType = CMMethod); 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); -> 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 6324 /* If the method has already been cogged (e.g. Newspeak accessors) then Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 36 hit: new value: -3689348814741910520 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 36 frame #0: 0x0000000100102067
Squeak`byteAtPointerput(ptr="\b???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=8) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 35 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 35 frame #0: 0x0000000100102067
Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689912642157150198 new value: -5066732099722018553 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 37 frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at
gcc3x-cointerp.c:64151 64148 /* StackInterpreter>>#forceInterruptCheck */ 64149 sqInt 64150 forceInterruptCheck(void) -> 64151 { DECL_MAYBE_SQ_GLOBAL_STRUCT 64152 void (*iccFunc)(); 64153 StackPage *thePage; 64154 Target 0: (Squeak) stopped. (lldb) bt
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 37
- frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at
gcc3x-cointerp.c:64151 frame #1: 0x00007fff6ab9af5a libsystem_platform.dylib`_sigtramp + 26 frame #2: 0x0000000107c10699 frame #3: 0x0000000107c9646f frame #4: 0x0000000107c9633e frame #5: 0x0000000107c962ac frame #6: 0x0000000107c961e9 frame #7: 0x0000000107c8fcf8 frame #8: 0x0000000107c98605 frame #9: 0x0000000107c8bbfe frame #10: 0x0000000107c1100e frame #11: 0x0000000107c8c4ad frame #12: 0x0000000107c0cef6 frame #13: 0x0000000107c8c2a5 frame #14: 0x0000000107c8bb4c frame #15: 0x0000000107c0cef6 frame #16: 0x0000000107c8ba05 frame #17: 0x0000000107c8b85b frame #18: 0x0000000107c1100e frame #19: 0x0000000107c8b7e1 frame #20: 0x0000000107c57018 frame #21: 0x0000000107c018c8 (lldb) call printCallStack()
Watchpoint 34 hit: old value: -3689348814741910456 new value: 4435376368 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation. (lldb) call printCogMethodFor((void*)0x0000000107c10699)
Watchpoint 34 hit: old value: 4435376368 new value: 0 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation.
I don't know how to proceed, but maybe there's enough clue for you...
Le lun. 30 sept. 2019 à 01:28, Eliot Miranda eliot.miranda@gmail.com a écrit :
Hi Nicolas,
On Wed, Sep 25, 2019 at 12:32 AM Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
I noticed that many crash happen in testSendTimeout. This test is organizing a race:
- Smalltalk fill the socket send buffer
- the OS tries to drain
The other thing I noticed is that [SocketTest suite run] takes 7 to 8s on macos while only 2 to 3s on ubuntu and windows So it might be that some test times out on macos, while it doesn't on other OS.
This would explain that many crash also happen in JITted Timer loop, and that we cannot observe it in other OSes.
Yes, that is, I think, an important clue. The other one is that, given that the crash always happens only early in a run, plus the address at which it crashes in machine code, suggests to me that the crash occurs when the interpreter invokes a block in the JITted startTimerLoop.
What worries me is that the JITted code for startTimerLoop looks completely corrupted, not merely moved or overwritten (the method header looks OK up to the selector field, which is the last field in the header). So I'm wondering whether it could be a bug in support code overwriting machine code, and hence that not might be related to a 32-bit vs 64-bit code issue. I was hoping your recent sqUnixSocket.c changes might work but I'm sure they have no bearing.
The access to OS ressources and the race may also explain some
randomization of the crash...
So one idea would be to make the test timeout in linux too, see if we can make it crash then try using rr. Maybe it's possible with a huge Socket buffer, a smaller image side buffer (we could reduce the size from 1000 to 1 so as to increase overhead).
Yes, that's a good idea. But I do think we need to set up a reliable remote server, som thing that acts like an infinite sink for receiving a par4ticular large Monticello package, so that ev very time the example runs the same package is indeed copied to a remote server. Would it be possible to construct such a fake Monticello server?
What I'm wanting the server to do is a) take a while to respond to the Monticello package contents request b) to never include a particular package which the example would always try and upload
The test case would be to copy a particular version of VMMaker from package-cache to the remote "fake" server
It's just theories I tried to construct while biking, not in front of a keyboard, and I won't be able to verify until this evening... But the watchpoint strategy is too much vain shooting in the dark, we need more ellaborated plans ;)
Does anyone see the crash on macOS in 32 bits?
Le mar. 24 sept. 2019 à 00:28, Nicolas Cellier <
nicolas.cellier.aka.nice@gmail.com> a écrit :
Le mar. 24 sept. 2019 à 00:25, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Follow up: so we have a problem of repeatability... I found nothing like rr for macos, and CPU watchpoint capability is too restricted to be usefull.
So what I did is save a crash.image attempting to crash at resumption time: Smalltalk saveAs: 'crash'. Smalltalk snapshot: true andQuit: true. 5 timesRepeat: [SocketTest suite run].
Then I launched the vm under lldb and gathered $pc stats to see if I could observe some reproducibility (lldb) p/x $pc - methodZoneBase 0x00000000000a0e94 0x0000000000090f70 0x000000000009f824 0x000000000009bfc8 0x000000000009d030 0x000000000009c7c0 0x00000000000951e0 0x00000000000949c8 <- once 0x0000000000094418 0x00000000000949c8 <— twice 0x000000000009b848 0x00000000000a0e94 <- twice 0x000000000009a176 0x000000006dfff2e0 ??? 0x00000000000996d1 0x00000000000949c8 <- thrice good candidate for watchpoint set 0x000000000009ae70 0x000000000009ae70 0x00000000000a18e7 0x0000000000095470 0x000000000009c458 0x00000000000a0a7c 0x0000000000093e89 0x000000000009beb0 0x0000000000095a37 0x0000000000095408 0x00000000000a06bc 0x000000000009cd48 0x0000000000095408 <- twice 0x000000000009ba41 0x0000000000095408 <- thrice new candidate for watchpoint set 0x0000000000045258 <— smallest ever 0x000000000009c7c0 0x0000000000095408 <- fourth ! the watch point was set here, why didn’t it stop??? argh methodZoneBase was incremented after I set the watchpoint!!! 0x00000000000a12dc 0x00000000000951e0 0x00000000000a0e24 0x000000000009ff8c 0x000000000009c7f6 0x000000000009a626
Then I have chosen to watch the most frequent crash address, that is methodZoneBase+0x0000000000095408, Unfortunately, s commented above, I missed one (I set the watchpoint too early, before methodZoneBase took its definitive address).
And after a few (a bunch of) trials, here is what I found:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 28175 exited with status = 9 (0x00000009) Process 28185 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/Squeak.app/Contents/MacOS/Squeak' (x86_64) Squeak was compiled with optimization - stepping may behave oddly; variables may not be available. Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint
6.1 frame #0: 0x000000010007bc22 Squeak`-[sqSqueakOSXApplication doHeadlessSetup](self=0x000000010066fc00, _cmd="doHeadlessSetup") at sqSqueakOSXApplication.m:125 [opt] 122 //No super call here as we've already set headless from command line or info.plist 123 extern BOOL gSqueakHeadless; 124 // Notice that setActivationPolicy: is available in OSX 10.6 and later -> 125 if ([NSApp respondsToSelector:@selector(setActivationPolicy:)]) { 126 if (gSqueakHeadless) { 127 [NSApp setActivationPolicy:NSApplicationActivationPolicyProhibited]; 128 } else { Target 0: (Squeak) stopped. (lldb) wa s e -s 1 -- methodZoneBase+0x0000000000095408 Watchpoint created: Watchpoint 33: addr = 0x107c58138 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) cont Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910324 new value: -3689348814741910836 Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
33 frame #0: 0x00000001000731dc Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6320 [opt] 6317 sqInt rawHeader; 6318 6319 (method->cmType = CMMethod); -> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 Target 0: (Squeak) stopped. (lldb) Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910836 new value: -3689571654825360894 Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
33 frame #0: 0x0000000100073255 Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] 6336 (method->selector = selector); 6337 (method->cmNumArgs = argumentCountOfMethodHeader(methodHeader)); 6338 (method->cmHasMovableLiteral = hasMovableLiteral); -> 6339 if ((method->cmRefersToYoung = hasYoungReferent)) { 6340 addToYoungReferrers(method); 6341 } 6342 (method->cmUsageCount = initialMethodUsageCount()); Target 0: (Squeak) stopped. (lldb) bt
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
33
- frame #0: 0x0000000100073255
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] frame #1: 0x000000010004f90c Squeak`compileCogMethod [inlined] generateCogMethod(selector=<unavailable>) at cogitX64SysV.c:7098 [opt] frame #2: 0x000000010004f7a5 Squeak`compileCogMethod(selector=4435118304) at cogitX64SysV.c:27780 [opt] frame #3: 0x000000010004bc54 Squeak`cogselector(aMethodObj=4468481712, aSelectorOop=4435118304) at cogitX64SysV.c:4833 [opt] frame #4: 0x0000000100013d4d Squeak`primitiveClosureValue at gcc3x-cointerp.c:27237 [opt] frame #5: 0x0000000107bd1925 frame #6: 0x000000010000236b Squeak`interpret at gcc3x-cointerp.c:2767 [opt] frame #7: 0x000000010008ed55 Squeak`-[sqSqueakMainApplication runSqueak](self=0x000000010066fc00, _cmd=<unavailable>) at sqSqueakMainApplication.m:201 [opt] frame #8: 0x00007fff44b077b8 Foundation`__NSFirePerformWithOrder + 360 frame #9: 0x00007fff4297cf57 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__
- 23 frame #10: 0x00007fff4297ce7f
CoreFoundation`__CFRunLoopDoObservers + 527 frame #11: 0x00007fff4295f3f8 CoreFoundation`__CFRunLoopRun + 1240 frame #12: 0x00007fff4295ec93 CoreFoundation`CFRunLoopRunSpecific
- 483 frame #13: 0x00007fff41c49d96 HIToolbox`RunCurrentEventLoopInMode
- 286 frame #14: 0x00007fff41c49a0f HIToolbox`ReceiveNextEventCommon +
366 frame #15: 0x00007fff41c49884 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 frame #16: 0x00007fff3fef9a73 AppKit`_DPSNextEvent + 2085 frame #17: 0x00007fff4068fe34 AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 frame #18: 0x00007fff3feee885 AppKit`-[NSApplication run] + 764 frame #19: 0x00007fff3febda72 AppKit`NSApplicationMain + 804 frame #20: 0x00007fff6a88c015 libdyld.dylib`start + 1 (lldb)
Here I don't know how to proceed, because I have no knowledge of this part of the VM, it's also quite late now... Eliot, does it ring some bell?
Ah, forget it, I thought it was overwriting, but it is the first write on yet virgin methodeZone... (it's just that the scenario was different from all the previous run)
Le dim. 22 sept. 2019 à 11:21, Jan Vrany jan.vrany@fit.cvut.cz a
écrit :
On 2019-09-22 10:57, Nicolas Cellier wrote: > Hi Jan, > Thanks, a recording debugger would indeed be a tremendous help! > Unfortunately, I'm unable to reproduce the bug under linux, only > macosx... > Do you know something equivalent to rr for mac?
No, I don't. There might be something though, I never had a mac...
Jan
> > Le dim. 22 sept. 2019 à 09:55, Jan Vrany jan.vrany@fit.cvut.cz a > écrit : > >> Hi, >> >>> >>> - can anyone think of any other strategies I might take to try and >>> reproduce this? >>> >> at the risk of stating the obvious, in cases similar to this rr (+ >> VDB) >> helped >> me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE >> >> Never used with opensmalltalk VM, though. >> >> HTH, Jan
-- _,,,^..^,,,_ best, Eliot
Maybe I draw conclusions too quickly. But clearly, the signal handler has a stack pointer pointing in methodZone.
(lldb) p/x $rsp (unsigned long) $211 = 0x0000000107c92da0 (lldb) p/x methodZoneBase (sqInt) $212 = 0x0000000107c01d30 (lldb) p/x limitAddress (usqInt) $213 = 0x0000000107d5e000
It remains to see how that happened.
Le jeu. 3 oct. 2019 à 05:49, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Hi all, So I finally got it! That's our specific handling of SIGIO with forceInterruptCheck. Don't you JIT the signal handler!
The signal handler is using a different stack pointer so as to not interfere with the program. Until Cog decide to jump into JITTED code from within the signal handler. It now share a stack pointer pointing somewhere in cog methodZone, and anything it does with it's stack (call a function, save registers...) will destroy the JITTED code. It even happened when I used the debugger which shares the interrupt handler stack pointer!
Why does it not happen in linux, and why only Mac OS 64, I have no clue... But we can't decently live very long with such handling.
Le mer. 2 oct. 2019 à 23:29, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Hi Eliot, so I have set watchpoint to 4 previous crash and got this:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 98150 exited with status = 9 (0x00000009) Process 98164 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/SqueakDebug.app/Contents/MacOS/Squeak' (x86_64) Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 frame #0: 0x00000001001618ec Squeak`-[sqSqueakMainApplication
setupMenus](self=0x000000010076e1a0, _cmd="setupMenus") at sqSqueakMainApplication.m:123 120 121 - (void) setupMenus { 122 // nothing to do so far since the menu is setup in the MainMenu.nib file -> 123 } 124 125 - (void) setupTimers { 126 extern void SetUpTimers(void); Target 0: (Squeak) stopped. (lldb) wa se e -s 1 -- 0x107c992b4 Watchpoint created: Watchpoint 35: addr = 0x107c992b4 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x107c929e0 Watchpoint created: Watchpoint 34: addr = 0x107c929e0 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c945df Watchpoint created: Watchpoint 36: addr = 0x107c945df size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c92da3 Watchpoint created: Watchpoint 37: addr = 0x107c92da3 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) c Process 98164 resuming
Watchpoint 34 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 34 frame #0: 0x0000000100102067
Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689348814741910324 new value: -3689912642157150198 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 37 frame #0: 0x000000010012de84
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c92da0, size=1240, selector=4435376352) at cogitX64SysV.c:6321 6318 6319 (method->cmType = CMMethod); 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); -> 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 6324 /* If the method has already been cogged (e.g. Newspeak accessors) then Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 36 hit: new value: -3689348814741910520 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 36 frame #0: 0x0000000100102067
Squeak`byteAtPointerput(ptr="\b???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=8) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 35 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 35 frame #0: 0x0000000100102067
Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689912642157150198 new value: -5066732099722018553 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 37 frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at
gcc3x-cointerp.c:64151 64148 /* StackInterpreter>>#forceInterruptCheck */ 64149 sqInt 64150 forceInterruptCheck(void) -> 64151 { DECL_MAYBE_SQ_GLOBAL_STRUCT 64152 void (*iccFunc)(); 64153 StackPage *thePage; 64154 Target 0: (Squeak) stopped. (lldb) bt
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 37
- frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at
gcc3x-cointerp.c:64151 frame #1: 0x00007fff6ab9af5a libsystem_platform.dylib`_sigtramp + 26 frame #2: 0x0000000107c10699 frame #3: 0x0000000107c9646f frame #4: 0x0000000107c9633e frame #5: 0x0000000107c962ac frame #6: 0x0000000107c961e9 frame #7: 0x0000000107c8fcf8 frame #8: 0x0000000107c98605 frame #9: 0x0000000107c8bbfe frame #10: 0x0000000107c1100e frame #11: 0x0000000107c8c4ad frame #12: 0x0000000107c0cef6 frame #13: 0x0000000107c8c2a5 frame #14: 0x0000000107c8bb4c frame #15: 0x0000000107c0cef6 frame #16: 0x0000000107c8ba05 frame #17: 0x0000000107c8b85b frame #18: 0x0000000107c1100e frame #19: 0x0000000107c8b7e1 frame #20: 0x0000000107c57018 frame #21: 0x0000000107c018c8 (lldb) call printCallStack()
Watchpoint 34 hit: old value: -3689348814741910456 new value: 4435376368 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation. (lldb) call printCogMethodFor((void*)0x0000000107c10699)
Watchpoint 34 hit: old value: 4435376368 new value: 0 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation.
I don't know how to proceed, but maybe there's enough clue for you...
Le lun. 30 sept. 2019 à 01:28, Eliot Miranda eliot.miranda@gmail.com a écrit :
Hi Nicolas,
On Wed, Sep 25, 2019 at 12:32 AM Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
I noticed that many crash happen in testSendTimeout. This test is organizing a race:
- Smalltalk fill the socket send buffer
- the OS tries to drain
The other thing I noticed is that [SocketTest suite run] takes 7 to 8s on macos while only 2 to 3s on ubuntu and windows So it might be that some test times out on macos, while it doesn't on other OS.
This would explain that many crash also happen in JITted Timer loop, and that we cannot observe it in other OSes.
Yes, that is, I think, an important clue. The other one is that, given that the crash always happens only early in a run, plus the address at which it crashes in machine code, suggests to me that the crash occurs when the interpreter invokes a block in the JITted startTimerLoop.
What worries me is that the JITted code for startTimerLoop looks completely corrupted, not merely moved or overwritten (the method header looks OK up to the selector field, which is the last field in the header). So I'm wondering whether it could be a bug in support code overwriting machine code, and hence that not might be related to a 32-bit vs 64-bit code issue. I was hoping your recent sqUnixSocket.c changes might work but I'm sure they have no bearing.
The access to OS ressources and the race may also explain some
randomization of the crash...
So one idea would be to make the test timeout in linux too, see if we can make it crash then try using rr. Maybe it's possible with a huge Socket buffer, a smaller image side buffer (we could reduce the size from 1000 to 1 so as to increase overhead).
Yes, that's a good idea. But I do think we need to set up a reliable remote server, som thing that acts like an infinite sink for receiving a par4ticular large Monticello package, so that ev very time the example runs the same package is indeed copied to a remote server. Would it be possible to construct such a fake Monticello server?
What I'm wanting the server to do is a) take a while to respond to the Monticello package contents request b) to never include a particular package which the example would always try and upload
The test case would be to copy a particular version of VMMaker from package-cache to the remote "fake" server
It's just theories I tried to construct while biking, not in front of a keyboard, and I won't be able to verify until this evening... But the watchpoint strategy is too much vain shooting in the dark, we need more ellaborated plans ;)
Does anyone see the crash on macOS in 32 bits?
Le mar. 24 sept. 2019 à 00:28, Nicolas Cellier <
nicolas.cellier.aka.nice@gmail.com> a écrit :
Le mar. 24 sept. 2019 à 00:25, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Follow up: so we have a problem of repeatability... I found nothing like rr for macos, and CPU watchpoint capability is too restricted to be usefull.
So what I did is save a crash.image attempting to crash at resumption time: Smalltalk saveAs: 'crash'. Smalltalk snapshot: true andQuit: true. 5 timesRepeat: [SocketTest suite run].
Then I launched the vm under lldb and gathered $pc stats to see if I could observe some reproducibility (lldb) p/x $pc - methodZoneBase 0x00000000000a0e94 0x0000000000090f70 0x000000000009f824 0x000000000009bfc8 0x000000000009d030 0x000000000009c7c0 0x00000000000951e0 0x00000000000949c8 <- once 0x0000000000094418 0x00000000000949c8 <— twice 0x000000000009b848 0x00000000000a0e94 <- twice 0x000000000009a176 0x000000006dfff2e0 ??? 0x00000000000996d1 0x00000000000949c8 <- thrice good candidate for watchpoint set 0x000000000009ae70 0x000000000009ae70 0x00000000000a18e7 0x0000000000095470 0x000000000009c458 0x00000000000a0a7c 0x0000000000093e89 0x000000000009beb0 0x0000000000095a37 0x0000000000095408 0x00000000000a06bc 0x000000000009cd48 0x0000000000095408 <- twice 0x000000000009ba41 0x0000000000095408 <- thrice new candidate for watchpoint set 0x0000000000045258 <— smallest ever 0x000000000009c7c0 0x0000000000095408 <- fourth ! the watch point was set here, why didn’t it stop??? argh methodZoneBase was incremented after I set the watchpoint!!! 0x00000000000a12dc 0x00000000000951e0 0x00000000000a0e24 0x000000000009ff8c 0x000000000009c7f6 0x000000000009a626
Then I have chosen to watch the most frequent crash address, that is methodZoneBase+0x0000000000095408, Unfortunately, s commented above, I missed one (I set the watchpoint too early, before methodZoneBase took its definitive address).
And after a few (a bunch of) trials, here is what I found:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 28175 exited with status = 9 (0x00000009) Process 28185 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/Squeak.app/Contents/MacOS/Squeak' (x86_64) Squeak was compiled with optimization - stepping may behave oddly; variables may not be available. Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason =
breakpoint 6.1 frame #0: 0x000000010007bc22 Squeak`-[sqSqueakOSXApplication doHeadlessSetup](self=0x000000010066fc00, _cmd="doHeadlessSetup") at sqSqueakOSXApplication.m:125 [opt] 122 //No super call here as we've already set headless from command line or info.plist 123 extern BOOL gSqueakHeadless; 124 // Notice that setActivationPolicy: is available in OSX 10.6 and later -> 125 if ([NSApp respondsToSelector:@selector(setActivationPolicy:)]) { 126 if (gSqueakHeadless) { 127 [NSApp setActivationPolicy:NSApplicationActivationPolicyProhibited]; 128 } else { Target 0: (Squeak) stopped. (lldb) wa s e -s 1 -- methodZoneBase+0x0000000000095408 Watchpoint created: Watchpoint 33: addr = 0x107c58138 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) cont Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910324 new value: -3689348814741910836 Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason =
watchpoint 33 frame #0: 0x00000001000731dc Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6320 [opt] 6317 sqInt rawHeader; 6318 6319 (method->cmType = CMMethod); -> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 Target 0: (Squeak) stopped. (lldb) Process 28185 resuming
Watchpoint 33 hit: old value: -3689348814741910836 new value: -3689571654825360894 Process 28185 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason =
watchpoint 33 frame #0: 0x0000000100073255 Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] 6336 (method->selector = selector); 6337 (method->cmNumArgs = argumentCountOfMethodHeader(methodHeader)); 6338 (method->cmHasMovableLiteral = hasMovableLiteral); -> 6339 if ((method->cmRefersToYoung = hasYoungReferent)) { 6340 addToYoungReferrers(method); 6341 } 6342 (method->cmUsageCount = initialMethodUsageCount()); Target 0: (Squeak) stopped. (lldb) bt
- thread #1, queue = 'com.apple.main-thread', stop reason =
watchpoint 33
- frame #0: 0x0000000100073255
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, selector=4435118304) at cogitX64SysV.c:6339 [opt] frame #1: 0x000000010004f90c Squeak`compileCogMethod [inlined] generateCogMethod(selector=<unavailable>) at cogitX64SysV.c:7098 [opt] frame #2: 0x000000010004f7a5 Squeak`compileCogMethod(selector=4435118304) at cogitX64SysV.c:27780 [opt] frame #3: 0x000000010004bc54 Squeak`cogselector(aMethodObj=4468481712, aSelectorOop=4435118304) at cogitX64SysV.c:4833 [opt] frame #4: 0x0000000100013d4d Squeak`primitiveClosureValue at gcc3x-cointerp.c:27237 [opt] frame #5: 0x0000000107bd1925 frame #6: 0x000000010000236b Squeak`interpret at gcc3x-cointerp.c:2767 [opt] frame #7: 0x000000010008ed55 Squeak`-[sqSqueakMainApplication runSqueak](self=0x000000010066fc00, _cmd=<unavailable>) at sqSqueakMainApplication.m:201 [opt] frame #8: 0x00007fff44b077b8 Foundation`__NSFirePerformWithOrder
- 360 frame #9: 0x00007fff4297cf57
CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__
- 23 frame #10: 0x00007fff4297ce7f
CoreFoundation`__CFRunLoopDoObservers + 527 frame #11: 0x00007fff4295f3f8 CoreFoundation`__CFRunLoopRun + 1240 frame #12: 0x00007fff4295ec93 CoreFoundation`CFRunLoopRunSpecific
- 483 frame #13: 0x00007fff41c49d96 HIToolbox`RunCurrentEventLoopInMode
- 286 frame #14: 0x00007fff41c49a0f HIToolbox`ReceiveNextEventCommon +
366 frame #15: 0x00007fff41c49884 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 frame #16: 0x00007fff3fef9a73 AppKit`_DPSNextEvent + 2085 frame #17: 0x00007fff4068fe34 AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 frame #18: 0x00007fff3feee885 AppKit`-[NSApplication run] + 764 frame #19: 0x00007fff3febda72 AppKit`NSApplicationMain + 804 frame #20: 0x00007fff6a88c015 libdyld.dylib`start + 1 (lldb)
Here I don't know how to proceed, because I have no knowledge of this part of the VM, it's also quite late now... Eliot, does it ring some bell?
Ah, forget it, I thought it was overwriting, but it is the first write on yet virgin methodeZone... (it's just that the scenario was different from all the previous run)
Le dim. 22 sept. 2019 à 11:21, Jan Vrany jan.vrany@fit.cvut.cz a
écrit :
> > On 2019-09-22 10:57, Nicolas Cellier wrote: > > Hi Jan, > > Thanks, a recording debugger would indeed be a tremendous help! > > Unfortunately, I'm unable to reproduce the bug under linux, only > > macosx... > > Do you know something equivalent to rr for mac? > > No, I don't. There might be something though, I never had a mac... > > Jan > > > > > Le dim. 22 sept. 2019 à 09:55, Jan Vrany jan.vrany@fit.cvut.cz a > > écrit : > > > >> Hi, > >> > >>> > >>> - can anyone think of any other strategies I might take to try > and > >>> reproduce this? > >>> > >> at the risk of stating the obvious, in cases similar to this rr (+ > >> VDB) > >> helped > >> me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE > >> > >> Never used with opensmalltalk VM, though. > >> > >> HTH, Jan >
-- _,,,^..^,,,_ best, Eliot
I confirm that each crash follows a SIGIO -> forceInterruptCheck() with corrupted stack pointer pointing to methodZone (that is executable zone so program counter). I did: 1) remove static declaration of methodZoneBase and limitAddress in cogitX64Sysv.c so as to make them exported globals 2) declared them as extern sqInt in gcc-cointerp.c 3) instrumented forceInterruptCheck() so as to emit a warning and/or set a breakpoint:
if ( getsp() > methodZoneBase && getsp() <= limitAddress ) { printf("pb in forceInterruptCheck\n"); }
Unlike what I said, forceInterruptCheck() does nothing special and certainly not invoke Smalltalk code nor JITTED code. The interruptCheckChain function that it could invoke is generally null (except for debug/simulation purpose), and always null in the crash i observed. The signal handler also generally operate on same stack (unless specifically asked via sigaction() - which is not the case). My earler diagnostic was driven by the false assumption that handler would use the interrupt stack pointer (IST), and thus that only special stack pointer manipulation happening in trampoline could corrupt it, hmmm that was late night blurred minded ;)
What comes to my mind: - the signal SIGIO might happen while we are manipulating the stack pointer, hence the heisenbug - the stack pointer could be offseted by some incorrect code (but why no error would happen before SIGIO then?)
I don't have clear understanding of all spur machinery details that would help further diagnosis Eliot, doesn't it sound interesting?
Le jeu. 3 oct. 2019 à 08:10, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Maybe I draw conclusions too quickly. But clearly, the signal handler has a stack pointer pointing in methodZone.
(lldb) p/x $rsp (unsigned long) $211 = 0x0000000107c92da0 (lldb) p/x methodZoneBase (sqInt) $212 = 0x0000000107c01d30 (lldb) p/x limitAddress (usqInt) $213 = 0x0000000107d5e000
It remains to see how that happened.
Le jeu. 3 oct. 2019 à 05:49, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Hi all, So I finally got it! That's our specific handling of SIGIO with forceInterruptCheck. Don't you JIT the signal handler!
The signal handler is using a different stack pointer so as to not interfere with the program. Until Cog decide to jump into JITTED code from within the signal handler. It now share a stack pointer pointing somewhere in cog methodZone, and anything it does with it's stack (call a function, save registers...) will destroy the JITTED code. It even happened when I used the debugger which shares the interrupt handler stack pointer!
Why does it not happen in linux, and why only Mac OS 64, I have no clue... But we can't decently live very long with such handling.
Le mer. 2 oct. 2019 à 23:29, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Hi Eliot, so I have set watchpoint to 4 previous crash and got this:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 98150 exited with status = 9 (0x00000009) Process 98164 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/SqueakDebug.app/Contents/MacOS/Squeak' (x86_64) Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint
1.1 frame #0: 0x00000001001618ec Squeak`-[sqSqueakMainApplication setupMenus](self=0x000000010076e1a0, _cmd="setupMenus") at sqSqueakMainApplication.m:123 120 121 - (void) setupMenus { 122 // nothing to do so far since the menu is setup in the MainMenu.nib file -> 123 } 124 125 - (void) setupTimers { 126 extern void SetUpTimers(void); Target 0: (Squeak) stopped. (lldb) wa se e -s 1 -- 0x107c992b4 Watchpoint created: Watchpoint 35: addr = 0x107c992b4 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x107c929e0 Watchpoint created: Watchpoint 34: addr = 0x107c929e0 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c945df Watchpoint created: Watchpoint 36: addr = 0x107c945df size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c92da3 Watchpoint created: Watchpoint 37: addr = 0x107c92da3 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) c Process 98164 resuming
Watchpoint 34 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 34 frame #0: 0x0000000100102067
Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689348814741910324 new value: -3689912642157150198 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 37 frame #0: 0x000000010012de84
Squeak`fillInMethodHeadersizeselector(method=0x0000000107c92da0, size=1240, selector=4435376352) at cogitX64SysV.c:6321 6318 6319 (method->cmType = CMMethod); 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); -> 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 6324 /* If the method has already been cogged (e.g. Newspeak accessors) then Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 36 hit: new value: -3689348814741910520 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 36 frame #0: 0x0000000100102067
Squeak`byteAtPointerput(ptr="\b???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=8) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 35 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 35 frame #0: 0x0000000100102067
Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689912642157150198 new value: -5066732099722018553 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 37 frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at
gcc3x-cointerp.c:64151 64148 /* StackInterpreter>>#forceInterruptCheck */ 64149 sqInt 64150 forceInterruptCheck(void) -> 64151 { DECL_MAYBE_SQ_GLOBAL_STRUCT 64152 void (*iccFunc)(); 64153 StackPage *thePage; 64154 Target 0: (Squeak) stopped. (lldb) bt
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint 37
- frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at
gcc3x-cointerp.c:64151 frame #1: 0x00007fff6ab9af5a libsystem_platform.dylib`_sigtramp + 26 frame #2: 0x0000000107c10699 frame #3: 0x0000000107c9646f frame #4: 0x0000000107c9633e frame #5: 0x0000000107c962ac frame #6: 0x0000000107c961e9 frame #7: 0x0000000107c8fcf8 frame #8: 0x0000000107c98605 frame #9: 0x0000000107c8bbfe frame #10: 0x0000000107c1100e frame #11: 0x0000000107c8c4ad frame #12: 0x0000000107c0cef6 frame #13: 0x0000000107c8c2a5 frame #14: 0x0000000107c8bb4c frame #15: 0x0000000107c0cef6 frame #16: 0x0000000107c8ba05 frame #17: 0x0000000107c8b85b frame #18: 0x0000000107c1100e frame #19: 0x0000000107c8b7e1 frame #20: 0x0000000107c57018 frame #21: 0x0000000107c018c8 (lldb) call printCallStack()
Watchpoint 34 hit: old value: -3689348814741910456 new value: 4435376368 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation. (lldb) call printCogMethodFor((void*)0x0000000107c10699)
Watchpoint 34 hit: old value: 4435376368 new value: 0 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation.
I don't know how to proceed, but maybe there's enough clue for you...
Le lun. 30 sept. 2019 à 01:28, Eliot Miranda eliot.miranda@gmail.com a écrit :
Hi Nicolas,
On Wed, Sep 25, 2019 at 12:32 AM Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
I noticed that many crash happen in testSendTimeout. This test is organizing a race:
- Smalltalk fill the socket send buffer
- the OS tries to drain
The other thing I noticed is that [SocketTest suite run] takes 7 to 8s on macos while only 2 to 3s on ubuntu and windows So it might be that some test times out on macos, while it doesn't on other OS.
This would explain that many crash also happen in JITted Timer loop, and that we cannot observe it in other OSes.
Yes, that is, I think, an important clue. The other one is that, given that the crash always happens only early in a run, plus the address at which it crashes in machine code, suggests to me that the crash occurs when the interpreter invokes a block in the JITted startTimerLoop.
What worries me is that the JITted code for startTimerLoop looks completely corrupted, not merely moved or overwritten (the method header looks OK up to the selector field, which is the last field in the header). So I'm wondering whether it could be a bug in support code overwriting machine code, and hence that not might be related to a 32-bit vs 64-bit code issue. I was hoping your recent sqUnixSocket.c changes might work but I'm sure they have no bearing.
The access to OS ressources and the race may also explain some
randomization of the crash...
So one idea would be to make the test timeout in linux too, see if we can make it crash then try using rr. Maybe it's possible with a huge Socket buffer, a smaller image side buffer (we could reduce the size from 1000 to 1 so as to increase overhead).
Yes, that's a good idea. But I do think we need to set up a reliable remote server, som thing that acts like an infinite sink for receiving a par4ticular large Monticello package, so that ev very time the example runs the same package is indeed copied to a remote server. Would it be possible to construct such a fake Monticello server?
What I'm wanting the server to do is a) take a while to respond to the Monticello package contents request b) to never include a particular package which the example would always try and upload
The test case would be to copy a particular version of VMMaker from package-cache to the remote "fake" server
It's just theories I tried to construct while biking, not in front of a keyboard, and I won't be able to verify until this evening... But the watchpoint strategy is too much vain shooting in the dark, we need more ellaborated plans ;)
Does anyone see the crash on macOS in 32 bits?
Le mar. 24 sept. 2019 à 00:28, Nicolas Cellier <
nicolas.cellier.aka.nice@gmail.com> a écrit :
Le mar. 24 sept. 2019 à 00:25, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
> Follow up: so we have a problem of repeatability... > I found nothing like rr for macos, and CPU watchpoint capability is > too restricted to be usefull. > > So what I did is save a crash.image attempting to crash at > resumption time: > Smalltalk saveAs: 'crash'. > Smalltalk snapshot: true andQuit: true. > 5 timesRepeat: [SocketTest suite run]. > > Then I launched the vm under lldb and gathered $pc stats to see if I > could observe some reproducibility > (lldb) p/x $pc - methodZoneBase > 0x00000000000a0e94 > 0x0000000000090f70 > 0x000000000009f824 > 0x000000000009bfc8 > 0x000000000009d030 > 0x000000000009c7c0 > 0x00000000000951e0 > 0x00000000000949c8 <- once > 0x0000000000094418 > 0x00000000000949c8 <— twice > 0x000000000009b848 > 0x00000000000a0e94 <- twice > 0x000000000009a176 > 0x000000006dfff2e0 ??? > 0x00000000000996d1 > 0x00000000000949c8 <- thrice good candidate for watchpoint set > 0x000000000009ae70 > 0x000000000009ae70 > 0x00000000000a18e7 > 0x0000000000095470 > 0x000000000009c458 > 0x00000000000a0a7c > 0x0000000000093e89 > 0x000000000009beb0 > 0x0000000000095a37 > 0x0000000000095408 > 0x00000000000a06bc > 0x000000000009cd48 > 0x0000000000095408 <- twice > 0x000000000009ba41 > 0x0000000000095408 <- thrice new candidate for watchpoint set > 0x0000000000045258 <— smallest ever > 0x000000000009c7c0 > 0x0000000000095408 <- fourth ! the watch point was set here, why > didn’t it stop??? argh methodZoneBase was incremented after I set the > watchpoint!!! > 0x00000000000a12dc > 0x00000000000951e0 > 0x00000000000a0e24 > 0x000000000009ff8c > 0x000000000009c7f6 > 0x000000000009a626 > > Then I have chosen to watch the most frequent crash address, > that is methodZoneBase+0x0000000000095408, > Unfortunately, s commented above, I missed one (I set the watchpoint > too early, before methodZoneBase took its definitive address). > > And after a few (a bunch of) trials, here is what I found: > > (lldb) run ../../image/crash.image > There is a running process, kill it and restart?: [Y/n] > Process 28175 exited with status = 9 (0x00000009) > Process 28185 launched: > '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/Squeak.app/Contents/MacOS/Squeak' > (x86_64) > Squeak was compiled with optimization - stepping may behave oddly; > variables may not be available. > Process 28185 stopped > * thread #1, queue = 'com.apple.main-thread', stop reason = > breakpoint 6.1 > frame #0: 0x000000010007bc22 Squeak`-[sqSqueakOSXApplication > doHeadlessSetup](self=0x000000010066fc00, _cmd="doHeadlessSetup") at > sqSqueakOSXApplication.m:125 [opt] > 122 //No super call here as we've already set headless from > command line or info.plist > 123 extern BOOL gSqueakHeadless; > 124 // Notice that setActivationPolicy: is available in OSX > 10.6 and later > -> 125 if ([NSApp respondsToSelector:@selector(setActivationPolicy:)]) > { > 126 if (gSqueakHeadless) { > 127 [NSApp > setActivationPolicy:NSApplicationActivationPolicyProhibited]; > 128 } else { > Target 0: (Squeak) stopped. > (lldb) wa s e -s 1 -- methodZoneBase+0x0000000000095408 > Watchpoint created: Watchpoint 33: addr = 0x107c58138 size = 1 state > = enabled type = w > new value: -3689348814741910324 > (lldb) cont > Process 28185 resuming > > Watchpoint 33 hit: > old value: -3689348814741910324 > new value: -3689348814741910836 > Process 28185 stopped > * thread #1, queue = 'com.apple.main-thread', stop reason = > watchpoint 33 > frame #0: 0x00000001000731dc > Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, > selector=4435118304) at cogitX64SysV.c:6320 [opt] > 6317 sqInt rawHeader; > 6318 > 6319 (method->cmType = CMMethod); > -> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); > 6321 (method->blockSize = size); > 6322 (method->methodObject = methodObj); > 6323 > Target 0: (Squeak) stopped. > (lldb) > Process 28185 resuming > > Watchpoint 33 hit: > old value: -3689348814741910836 > new value: -3689571654825360894 > Process 28185 stopped > * thread #1, queue = 'com.apple.main-thread', stop reason = > watchpoint 33 > frame #0: 0x0000000100073255 > Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, > selector=4435118304) at cogitX64SysV.c:6339 [opt] > 6336 (method->selector = selector); > 6337 (method->cmNumArgs = > argumentCountOfMethodHeader(methodHeader)); > 6338 (method->cmHasMovableLiteral = hasMovableLiteral); > -> 6339 if ((method->cmRefersToYoung = hasYoungReferent)) { > 6340 addToYoungReferrers(method); > 6341 } > 6342 (method->cmUsageCount = initialMethodUsageCount()); > Target 0: (Squeak) stopped. > (lldb) bt > * thread #1, queue = 'com.apple.main-thread', stop reason = > watchpoint 33 > * frame #0: 0x0000000100073255 > Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, > selector=4435118304) at cogitX64SysV.c:6339 [opt] > frame #1: 0x000000010004f90c Squeak`compileCogMethod [inlined] > generateCogMethod(selector=<unavailable>) at cogitX64SysV.c:7098 [opt] > frame #2: 0x000000010004f7a5 > Squeak`compileCogMethod(selector=4435118304) at cogitX64SysV.c:27780 [opt] > frame #3: 0x000000010004bc54 > Squeak`cogselector(aMethodObj=4468481712, aSelectorOop=4435118304) at > cogitX64SysV.c:4833 [opt] > frame #4: 0x0000000100013d4d Squeak`primitiveClosureValue at > gcc3x-cointerp.c:27237 [opt] > frame #5: 0x0000000107bd1925 > frame #6: 0x000000010000236b Squeak`interpret at > gcc3x-cointerp.c:2767 [opt] > frame #7: 0x000000010008ed55 Squeak`-[sqSqueakMainApplication > runSqueak](self=0x000000010066fc00, _cmd=<unavailable>) at > sqSqueakMainApplication.m:201 [opt] > frame #8: 0x00007fff44b077b8 Foundation`__NSFirePerformWithOrder > + 360 > frame #9: 0x00007fff4297cf57 > CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ > + 23 > frame #10: 0x00007fff4297ce7f > CoreFoundation`__CFRunLoopDoObservers + 527 > frame #11: 0x00007fff4295f3f8 CoreFoundation`__CFRunLoopRun + > 1240 > frame #12: 0x00007fff4295ec93 > CoreFoundation`CFRunLoopRunSpecific + 483 > frame #13: 0x00007fff41c49d96 > HIToolbox`RunCurrentEventLoopInMode + 286 > frame #14: 0x00007fff41c49a0f HIToolbox`ReceiveNextEventCommon + > 366 > frame #15: 0x00007fff41c49884 > HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 > frame #16: 0x00007fff3fef9a73 AppKit`_DPSNextEvent + 2085 > frame #17: 0x00007fff4068fe34 AppKit`-[NSApplication(NSEvent) > _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 > frame #18: 0x00007fff3feee885 AppKit`-[NSApplication run] + 764 > frame #19: 0x00007fff3febda72 AppKit`NSApplicationMain + 804 > frame #20: 0x00007fff6a88c015 libdyld.dylib`start + 1 > (lldb) > > Here I don't know how to proceed, because I have no knowledge of > this part of the VM, it's also quite late now... > Eliot, does it ring some bell? > > Ah, forget it, I thought it was overwriting, but it is the first write on yet virgin methodeZone... (it's just that the scenario was different from all the previous run)
Le dim. 22 sept. 2019 à 11:21, Jan Vrany jan.vrany@fit.cvut.cz a > écrit : > >> >> On 2019-09-22 10:57, Nicolas Cellier wrote: >> > Hi Jan, >> > Thanks, a recording debugger would indeed be a tremendous help! >> > Unfortunately, I'm unable to reproduce the bug under linux, only >> > macosx... >> > Do you know something equivalent to rr for mac? >> >> No, I don't. There might be something though, I never had a mac... >> >> Jan >> >> > >> > Le dim. 22 sept. 2019 à 09:55, Jan Vrany jan.vrany@fit.cvut.cz >> a >> > écrit : >> > >> >> Hi, >> >> >> >>> >> >>> - can anyone think of any other strategies I might take to try >> and >> >>> reproduce this? >> >>> >> >> at the risk of stating the obvious, in cases similar to this rr >> (+ >> >> VDB) >> >> helped >> >> me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE >> >> >> >> Never used with opensmalltalk VM, though. >> >> >> >> HTH, Jan >> >
-- _,,,^..^,,,_ best, Eliot
If I comment /* signal(SIGIO,forceInterruptCheck) */ in platforms//unix/vm/aio.c then I don't experiment crashes anymore
One solution if we want to manipulate the rsp safely is - to define a specific stack for signal handling via sigaltstack() - to use that specific stack via sigaction() with SA_ONSTACK flag
That's too low level and I prefer some assistance here. There are several calls to signal in the VM so they should all be reviewed.
Le jeu. 3 oct. 2019 à 22:10, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
I confirm that each crash follows a SIGIO -> forceInterruptCheck() with corrupted stack pointer pointing to methodZone (that is executable zone so program counter). I did:
- remove static declaration of methodZoneBase and limitAddress in
cogitX64Sysv.c so as to make them exported globals 2) declared them as extern sqInt in gcc-cointerp.c 3) instrumented forceInterruptCheck() so as to emit a warning and/or set a breakpoint:
if ( getsp() > methodZoneBase && getsp() <= limitAddress ) { printf("pb in forceInterruptCheck\n"); }
Unlike what I said, forceInterruptCheck() does nothing special and certainly not invoke Smalltalk code nor JITTED code. The interruptCheckChain function that it could invoke is generally null (except for debug/simulation purpose), and always null in the crash i observed. The signal handler also generally operate on same stack (unless specifically asked via sigaction() - which is not the case). My earler diagnostic was driven by the false assumption that handler would use the interrupt stack pointer (IST), and thus that only special stack pointer manipulation happening in trampoline could corrupt it, hmmm that was late night blurred minded ;)
What comes to my mind:
- the signal SIGIO might happen while we are manipulating the stack
pointer, hence the heisenbug
- the stack pointer could be offseted by some incorrect code (but why no
error would happen before SIGIO then?)
I don't have clear understanding of all spur machinery details that would help further diagnosis Eliot, doesn't it sound interesting?
Le jeu. 3 oct. 2019 à 08:10, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Maybe I draw conclusions too quickly. But clearly, the signal handler has a stack pointer pointing in methodZone.
(lldb) p/x $rsp (unsigned long) $211 = 0x0000000107c92da0 (lldb) p/x methodZoneBase (sqInt) $212 = 0x0000000107c01d30 (lldb) p/x limitAddress (usqInt) $213 = 0x0000000107d5e000
It remains to see how that happened.
Le jeu. 3 oct. 2019 à 05:49, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Hi all, So I finally got it! That's our specific handling of SIGIO with forceInterruptCheck. Don't you JIT the signal handler!
The signal handler is using a different stack pointer so as to not interfere with the program. Until Cog decide to jump into JITTED code from within the signal handler. It now share a stack pointer pointing somewhere in cog methodZone, and anything it does with it's stack (call a function, save registers...) will destroy the JITTED code. It even happened when I used the debugger which shares the interrupt handler stack pointer!
Why does it not happen in linux, and why only Mac OS 64, I have no clue... But we can't decently live very long with such handling.
Le mer. 2 oct. 2019 à 23:29, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Hi Eliot, so I have set watchpoint to 4 previous crash and got this:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 98150 exited with status = 9 (0x00000009) Process 98164 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/SqueakDebug.app/Contents/MacOS/Squeak' (x86_64) Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint
1.1 frame #0: 0x00000001001618ec Squeak`-[sqSqueakMainApplication setupMenus](self=0x000000010076e1a0, _cmd="setupMenus") at sqSqueakMainApplication.m:123 120 121 - (void) setupMenus { 122 // nothing to do so far since the menu is setup in the MainMenu.nib file -> 123 } 124 125 - (void) setupTimers { 126 extern void SetUpTimers(void); Target 0: (Squeak) stopped. (lldb) wa se e -s 1 -- 0x107c992b4 Watchpoint created: Watchpoint 35: addr = 0x107c992b4 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x107c929e0 Watchpoint created: Watchpoint 34: addr = 0x107c929e0 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c945df Watchpoint created: Watchpoint 36: addr = 0x107c945df size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c92da3 Watchpoint created: Watchpoint 37: addr = 0x107c92da3 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) c Process 98164 resuming
Watchpoint 34 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
34 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689348814741910324 new value: -3689912642157150198 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
37 frame #0: 0x000000010012de84 Squeak`fillInMethodHeadersizeselector(method=0x0000000107c92da0, size=1240, selector=4435376352) at cogitX64SysV.c:6321 6318 6319 (method->cmType = CMMethod); 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); -> 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 6324 /* If the method has already been cogged (e.g. Newspeak accessors) then Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 36 hit: new value: -3689348814741910520 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
36 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="\b???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=8) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 35 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
35 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689912642157150198 new value: -5066732099722018553 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
37 frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at gcc3x-cointerp.c:64151 64148 /* StackInterpreter>>#forceInterruptCheck */ 64149 sqInt 64150 forceInterruptCheck(void) -> 64151 { DECL_MAYBE_SQ_GLOBAL_STRUCT 64152 void (*iccFunc)(); 64153 StackPage *thePage; 64154 Target 0: (Squeak) stopped. (lldb) bt
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
37
- frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at
gcc3x-cointerp.c:64151 frame #1: 0x00007fff6ab9af5a libsystem_platform.dylib`_sigtramp + 26 frame #2: 0x0000000107c10699 frame #3: 0x0000000107c9646f frame #4: 0x0000000107c9633e frame #5: 0x0000000107c962ac frame #6: 0x0000000107c961e9 frame #7: 0x0000000107c8fcf8 frame #8: 0x0000000107c98605 frame #9: 0x0000000107c8bbfe frame #10: 0x0000000107c1100e frame #11: 0x0000000107c8c4ad frame #12: 0x0000000107c0cef6 frame #13: 0x0000000107c8c2a5 frame #14: 0x0000000107c8bb4c frame #15: 0x0000000107c0cef6 frame #16: 0x0000000107c8ba05 frame #17: 0x0000000107c8b85b frame #18: 0x0000000107c1100e frame #19: 0x0000000107c8b7e1 frame #20: 0x0000000107c57018 frame #21: 0x0000000107c018c8 (lldb) call printCallStack()
Watchpoint 34 hit: old value: -3689348814741910456 new value: 4435376368 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation. (lldb) call printCogMethodFor((void*)0x0000000107c10699)
Watchpoint 34 hit: old value: 4435376368 new value: 0 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation.
I don't know how to proceed, but maybe there's enough clue for you...
Le lun. 30 sept. 2019 à 01:28, Eliot Miranda eliot.miranda@gmail.com a écrit :
Hi Nicolas,
On Wed, Sep 25, 2019 at 12:32 AM Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
I noticed that many crash happen in testSendTimeout. This test is organizing a race:
- Smalltalk fill the socket send buffer
- the OS tries to drain
The other thing I noticed is that [SocketTest suite run] takes 7 to 8s on macos while only 2 to 3s on ubuntu and windows So it might be that some test times out on macos, while it doesn't on other OS.
This would explain that many crash also happen in JITted Timer loop, and that we cannot observe it in other OSes.
Yes, that is, I think, an important clue. The other one is that, given that the crash always happens only early in a run, plus the address at which it crashes in machine code, suggests to me that the crash occurs when the interpreter invokes a block in the JITted startTimerLoop.
What worries me is that the JITted code for startTimerLoop looks completely corrupted, not merely moved or overwritten (the method header looks OK up to the selector field, which is the last field in the header). So I'm wondering whether it could be a bug in support code overwriting machine code, and hence that not might be related to a 32-bit vs 64-bit code issue. I was hoping your recent sqUnixSocket.c changes might work but I'm sure they have no bearing.
The access to OS ressources and the race may also explain some
randomization of the crash...
So one idea would be to make the test timeout in linux too, see if we can make it crash then try using rr. Maybe it's possible with a huge Socket buffer, a smaller image side buffer (we could reduce the size from 1000 to 1 so as to increase overhead).
Yes, that's a good idea. But I do think we need to set up a reliable remote server, som thing that acts like an infinite sink for receiving a par4ticular large Monticello package, so that ev very time the example runs the same package is indeed copied to a remote server. Would it be possible to construct such a fake Monticello server?
What I'm wanting the server to do is a) take a while to respond to the Monticello package contents request b) to never include a particular package which the example would always try and upload
The test case would be to copy a particular version of VMMaker from package-cache to the remote "fake" server
It's just theories I tried to construct while biking, not in front of a keyboard, and I won't be able to verify until this evening... But the watchpoint strategy is too much vain shooting in the dark, we need more ellaborated plans ;)
Does anyone see the crash on macOS in 32 bits?
Le mar. 24 sept. 2019 à 00:28, Nicolas Cellier <
nicolas.cellier.aka.nice@gmail.com> a écrit :
> > > Le mar. 24 sept. 2019 à 00:25, Nicolas Cellier < > nicolas.cellier.aka.nice@gmail.com> a écrit : > >> Follow up: so we have a problem of repeatability... >> I found nothing like rr for macos, and CPU watchpoint capability is >> too restricted to be usefull. >> >> So what I did is save a crash.image attempting to crash at >> resumption time: >> Smalltalk saveAs: 'crash'. >> Smalltalk snapshot: true andQuit: true. >> 5 timesRepeat: [SocketTest suite run]. >> >> Then I launched the vm under lldb and gathered $pc stats to see if >> I could observe some reproducibility >> (lldb) p/x $pc - methodZoneBase >> 0x00000000000a0e94 >> 0x0000000000090f70 >> 0x000000000009f824 >> 0x000000000009bfc8 >> 0x000000000009d030 >> 0x000000000009c7c0 >> 0x00000000000951e0 >> 0x00000000000949c8 <- once >> 0x0000000000094418 >> 0x00000000000949c8 <— twice >> 0x000000000009b848 >> 0x00000000000a0e94 <- twice >> 0x000000000009a176 >> 0x000000006dfff2e0 ??? >> 0x00000000000996d1 >> 0x00000000000949c8 <- thrice good candidate for watchpoint set >> 0x000000000009ae70 >> 0x000000000009ae70 >> 0x00000000000a18e7 >> 0x0000000000095470 >> 0x000000000009c458 >> 0x00000000000a0a7c >> 0x0000000000093e89 >> 0x000000000009beb0 >> 0x0000000000095a37 >> 0x0000000000095408 >> 0x00000000000a06bc >> 0x000000000009cd48 >> 0x0000000000095408 <- twice >> 0x000000000009ba41 >> 0x0000000000095408 <- thrice new candidate for watchpoint set >> 0x0000000000045258 <— smallest ever >> 0x000000000009c7c0 >> 0x0000000000095408 <- fourth ! the watch point was set here, why >> didn’t it stop??? argh methodZoneBase was incremented after I set the >> watchpoint!!! >> 0x00000000000a12dc >> 0x00000000000951e0 >> 0x00000000000a0e24 >> 0x000000000009ff8c >> 0x000000000009c7f6 >> 0x000000000009a626 >> >> Then I have chosen to watch the most frequent crash address, >> that is methodZoneBase+0x0000000000095408, >> Unfortunately, s commented above, I missed one (I set the >> watchpoint too early, before methodZoneBase took its definitive address). >> >> And after a few (a bunch of) trials, here is what I found: >> >> (lldb) run ../../image/crash.image >> There is a running process, kill it and restart?: [Y/n] >> Process 28175 exited with status = 9 (0x00000009) >> Process 28185 launched: >> '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/Squeak.app/Contents/MacOS/Squeak' >> (x86_64) >> Squeak was compiled with optimization - stepping may behave oddly; >> variables may not be available. >> Process 28185 stopped >> * thread #1, queue = 'com.apple.main-thread', stop reason = >> breakpoint 6.1 >> frame #0: 0x000000010007bc22 Squeak`-[sqSqueakOSXApplication >> doHeadlessSetup](self=0x000000010066fc00, _cmd="doHeadlessSetup") at >> sqSqueakOSXApplication.m:125 [opt] >> 122 //No super call here as we've already set headless from >> command line or info.plist >> 123 extern BOOL gSqueakHeadless; >> 124 // Notice that setActivationPolicy: is available in OSX >> 10.6 and later >> -> 125 if ([NSApp respondsToSelector:@selector(setActivationPolicy:)]) >> { >> 126 if (gSqueakHeadless) { >> 127 [NSApp >> setActivationPolicy:NSApplicationActivationPolicyProhibited]; >> 128 } else { >> Target 0: (Squeak) stopped. >> (lldb) wa s e -s 1 -- methodZoneBase+0x0000000000095408 >> Watchpoint created: Watchpoint 33: addr = 0x107c58138 size = 1 >> state = enabled type = w >> new value: -3689348814741910324 >> (lldb) cont >> Process 28185 resuming >> >> Watchpoint 33 hit: >> old value: -3689348814741910324 >> new value: -3689348814741910836 >> Process 28185 stopped >> * thread #1, queue = 'com.apple.main-thread', stop reason = >> watchpoint 33 >> frame #0: 0x00000001000731dc >> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, >> selector=4435118304) at cogitX64SysV.c:6320 [opt] >> 6317 sqInt rawHeader; >> 6318 >> 6319 (method->cmType = CMMethod); >> -> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); >> 6321 (method->blockSize = size); >> 6322 (method->methodObject = methodObj); >> 6323 >> Target 0: (Squeak) stopped. >> (lldb) >> Process 28185 resuming >> >> Watchpoint 33 hit: >> old value: -3689348814741910836 >> new value: -3689571654825360894 >> Process 28185 stopped >> * thread #1, queue = 'com.apple.main-thread', stop reason = >> watchpoint 33 >> frame #0: 0x0000000100073255 >> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, >> selector=4435118304) at cogitX64SysV.c:6339 [opt] >> 6336 (method->selector = selector); >> 6337 (method->cmNumArgs = >> argumentCountOfMethodHeader(methodHeader)); >> 6338 (method->cmHasMovableLiteral = hasMovableLiteral); >> -> 6339 if ((method->cmRefersToYoung = hasYoungReferent)) { >> 6340 addToYoungReferrers(method); >> 6341 } >> 6342 (method->cmUsageCount = initialMethodUsageCount()); >> Target 0: (Squeak) stopped. >> (lldb) bt >> * thread #1, queue = 'com.apple.main-thread', stop reason = >> watchpoint 33 >> * frame #0: 0x0000000100073255 >> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, >> selector=4435118304) at cogitX64SysV.c:6339 [opt] >> frame #1: 0x000000010004f90c Squeak`compileCogMethod [inlined] >> generateCogMethod(selector=<unavailable>) at cogitX64SysV.c:7098 [opt] >> frame #2: 0x000000010004f7a5 >> Squeak`compileCogMethod(selector=4435118304) at cogitX64SysV.c:27780 [opt] >> frame #3: 0x000000010004bc54 >> Squeak`cogselector(aMethodObj=4468481712, aSelectorOop=4435118304) at >> cogitX64SysV.c:4833 [opt] >> frame #4: 0x0000000100013d4d Squeak`primitiveClosureValue at >> gcc3x-cointerp.c:27237 [opt] >> frame #5: 0x0000000107bd1925 >> frame #6: 0x000000010000236b Squeak`interpret at >> gcc3x-cointerp.c:2767 [opt] >> frame #7: 0x000000010008ed55 Squeak`-[sqSqueakMainApplication >> runSqueak](self=0x000000010066fc00, _cmd=<unavailable>) at >> sqSqueakMainApplication.m:201 [opt] >> frame #8: 0x00007fff44b077b8 >> Foundation`__NSFirePerformWithOrder + 360 >> frame #9: 0x00007fff4297cf57 >> CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ >> + 23 >> frame #10: 0x00007fff4297ce7f >> CoreFoundation`__CFRunLoopDoObservers + 527 >> frame #11: 0x00007fff4295f3f8 CoreFoundation`__CFRunLoopRun + >> 1240 >> frame #12: 0x00007fff4295ec93 >> CoreFoundation`CFRunLoopRunSpecific + 483 >> frame #13: 0x00007fff41c49d96 >> HIToolbox`RunCurrentEventLoopInMode + 286 >> frame #14: 0x00007fff41c49a0f HIToolbox`ReceiveNextEventCommon >> + 366 >> frame #15: 0x00007fff41c49884 >> HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 >> frame #16: 0x00007fff3fef9a73 AppKit`_DPSNextEvent + 2085 >> frame #17: 0x00007fff4068fe34 AppKit`-[NSApplication(NSEvent) >> _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 >> frame #18: 0x00007fff3feee885 AppKit`-[NSApplication run] + 764 >> frame #19: 0x00007fff3febda72 AppKit`NSApplicationMain + 804 >> frame #20: 0x00007fff6a88c015 libdyld.dylib`start + 1 >> (lldb) >> >> Here I don't know how to proceed, because I have no knowledge of >> this part of the VM, it's also quite late now... >> Eliot, does it ring some bell? >> >> > Ah, forget it, I thought it was overwriting, but it is the first > write on yet virgin methodeZone... > (it's just that the scenario was different from all the previous run) > > Le dim. 22 sept. 2019 à 11:21, Jan Vrany jan.vrany@fit.cvut.cz a >> écrit : >> >>> >>> On 2019-09-22 10:57, Nicolas Cellier wrote: >>> > Hi Jan, >>> > Thanks, a recording debugger would indeed be a tremendous help! >>> > Unfortunately, I'm unable to reproduce the bug under linux, only >>> > macosx... >>> > Do you know something equivalent to rr for mac? >>> >>> No, I don't. There might be something though, I never had a mac... >>> >>> Jan >>> >>> > >>> > Le dim. 22 sept. 2019 à 09:55, Jan Vrany jan.vrany@fit.cvut.cz >>> a >>> > écrit : >>> > >>> >> Hi, >>> >> >>> >>> >>> >>> - can anyone think of any other strategies I might take to try >>> and >>> >>> reproduce this? >>> >>> >>> >> at the risk of stating the obvious, in cases similar to this rr >>> (+ >>> >> VDB) >>> >> helped >>> >> me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE >>> >> >>> >> Never used with opensmalltalk VM, though. >>> >> >>> >> HTH, Jan >>> >>
-- _,,,^..^,,,_ best, Eliot
Hi Nicolas,
I am only reading your (very impressive) emails here, and not really looking at the code. But for whatever it may be worth:
I am not surprised if OS X and Linux behave differently with respect to signal delivery and handling. OS X started with a Mach kernel that was based on lightweight threads, and it evolved to be a flavor of unix with a unix process model based on threads. Linux started as a unix clone with unix processes that were gradually evolved to act like threads. Nowadays they are probably more or less the same in most practical respects, but the mechanisms for delivering signals to threads and/or processes might be different in the details.
IIUC, your hypothesis is that signals are being delivered to threads at some point at which the state of the thread stack is somehow not consistent with what is expected (somehow related to competing uses of stacks in the interpreter and jit frames). That makes sense to me.
You suggest a possible solution of defining a specific stack for signal handling via sigaltstack() and using that stack via sigaction() with SA_ONSTACK flag. That is probably the right thing to do.
One reason I think it may be the right thing to do is that Eliot and/or I did something very similar for the signal forwarding mechanism in OSProcessPlugin. Checking the old commit notices (because I do not really recall the details), I added SA_ONSTACK to deal with some Cog (stack intepreter) issues back in 2012.
Here is the commit notice that explains the issue at that time:
Name: VMConstruction-Plugins-OSProcessPlugin-dtl.37 Author: dtl Time: 22 December 2012, 3:08:51.228 pm UUID: e3c06c1b-76d8-4025-9d13-5fe582618c62 Ancestors: VMConstruction-Plugins-OSProcessPlugin-dtl.36
OSProcessPlugin 4.5.1
Use an alternate signal stack to avoid consuming native stack space. This prevents stack usage problems when running on Cog (stack depth exceeded when a signal is being handled). Adapted from VMConstruction-Plugins-OSProcessPlugin.oscog-eem.38 with the following changes:
- Always use sigaltstack if the platform supports it (i.e. do not test for Cog). - If SQ_ONSTACK is not defined, then assume the platform cannot support signalstack and use the default stack. Use macros to define away unsupportable code. - Fix some type declarations to address int/pointer comparison - Use #isDefined:inSmalltalk:comment:ifTrue:ifFalse:
And for reference here is the generated C code from UnixOSProcessPlugin>>setSignalNumber:handler (my original version, oscog is different but the same in essence). It sets SA_ONSTACK for the signal stack, and also SA_RESTART because the VM code generally cannot be expected to handle restarting interrupted primitives rigorously, therefore use the simple default.
/* Set a signal handler. The C code translator will convert #sig:nal: into 'signal(parm1, parm2)' */ static void * setSignalNumberhandler(sqInt signalNumber, void *signalHandlerAddress) { struct sigaction oldHandlerAction; struct sigaction sigHandlerAction; # ifdef SA_ONSTACK // true if platform supports sigaltstack sigHandlerAction.sa_sigaction = signalHandlerAddress; sigHandlerAction.sa_flags = SA_ONSTACK | SA_RESTART; sigemptyset(&sigHandlerAction.sa_mask); if ((sigaction(signalNumber, (&sigHandlerAction), (&oldHandlerAction))) != 0) { perror("signal"); } return oldHandlerAction.sa_sigaction; # else return signal(signalNumber, signalHandlerAddress); # endif // SA_ONSTACK return null; }
Dave
On Thu, Oct 03, 2019 at 10:47:03PM +0200, Nicolas Cellier wrote:
If I comment /* signal(SIGIO,forceInterruptCheck) */ in platforms//unix/vm/aio.c then I don't experiment crashes anymore
One solution if we want to manipulate the rsp safely is
- to define a specific stack for signal handling via sigaltstack()
- to use that specific stack via sigaction() with SA_ONSTACK flag
That's too low level and I prefer some assistance here. There are several calls to signal in the VM so they should all be reviewed.
Le jeu. 3 oct. 2019 ?? 22:10, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a ??crit :
I confirm that each crash follows a SIGIO -> forceInterruptCheck() with corrupted stack pointer pointing to methodZone (that is executable zone so program counter). I did:
- remove static declaration of methodZoneBase and limitAddress in
cogitX64Sysv.c so as to make them exported globals 2) declared them as extern sqInt in gcc-cointerp.c 3) instrumented forceInterruptCheck() so as to emit a warning and/or set a breakpoint:
if ( getsp() > methodZoneBase && getsp() <= limitAddress ) { printf("pb in forceInterruptCheck\n"); }
Unlike what I said, forceInterruptCheck() does nothing special and certainly not invoke Smalltalk code nor JITTED code. The interruptCheckChain function that it could invoke is generally null (except for debug/simulation purpose), and always null in the crash i observed. The signal handler also generally operate on same stack (unless specifically asked via sigaction() - which is not the case). My earler diagnostic was driven by the false assumption that handler would use the interrupt stack pointer (IST), and thus that only special stack pointer manipulation happening in trampoline could corrupt it, hmmm that was late night blurred minded ;)
What comes to my mind:
- the signal SIGIO might happen while we are manipulating the stack
pointer, hence the heisenbug
- the stack pointer could be offseted by some incorrect code (but why no
error would happen before SIGIO then?)
I don't have clear understanding of all spur machinery details that would help further diagnosis Eliot, doesn't it sound interesting?
Le jeu. 3 oct. 2019 ?? 08:10, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a ??crit :
Maybe I draw conclusions too quickly. But clearly, the signal handler has a stack pointer pointing in methodZone.
(lldb) p/x $rsp (unsigned long) $211 = 0x0000000107c92da0 (lldb) p/x methodZoneBase (sqInt) $212 = 0x0000000107c01d30 (lldb) p/x limitAddress (usqInt) $213 = 0x0000000107d5e000
It remains to see how that happened.
Le jeu. 3 oct. 2019 ?? 05:49, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a ??crit :
Hi all, So I finally got it! That's our specific handling of SIGIO with forceInterruptCheck. Don't you JIT the signal handler!
The signal handler is using a different stack pointer so as to not interfere with the program. Until Cog decide to jump into JITTED code from within the signal handler. It now share a stack pointer pointing somewhere in cog methodZone, and anything it does with it's stack (call a function, save registers...) will destroy the JITTED code. It even happened when I used the debugger which shares the interrupt handler stack pointer!
Why does it not happen in linux, and why only Mac OS 64, I have no clue... But we can't decently live very long with such handling.
Le mer. 2 oct. 2019 ?? 23:29, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a ??crit :
Hi Eliot, so I have set watchpoint to 4 previous crash and got this:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 98150 exited with status = 9 (0x00000009) Process 98164 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/SqueakDebug.app/Contents/MacOS/Squeak' (x86_64) Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint
1.1 frame #0: 0x00000001001618ec Squeak`-[sqSqueakMainApplication setupMenus](self=0x000000010076e1a0, _cmd="setupMenus") at sqSqueakMainApplication.m:123 120 121 - (void) setupMenus { 122 // nothing to do so far since the menu is setup in the MainMenu.nib file -> 123 } 124 125 - (void) setupTimers { 126 extern void SetUpTimers(void); Target 0: (Squeak) stopped. (lldb) wa se e -s 1 -- 0x107c992b4 Watchpoint created: Watchpoint 35: addr = 0x107c992b4 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x107c929e0 Watchpoint created: Watchpoint 34: addr = 0x107c929e0 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c945df Watchpoint created: Watchpoint 36: addr = 0x107c945df size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c92da3 Watchpoint created: Watchpoint 37: addr = 0x107c92da3 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) c Process 98164 resuming
Watchpoint 34 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
34 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689348814741910324 new value: -3689912642157150198 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
37 frame #0: 0x000000010012de84 Squeak`fillInMethodHeadersizeselector(method=0x0000000107c92da0, size=1240, selector=4435376352) at cogitX64SysV.c:6321 6318 6319 (method->cmType = CMMethod); 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); -> 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 6324 /* If the method has already been cogged (e.g. Newspeak accessors) then Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 36 hit: new value: -3689348814741910520 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
36 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="\b???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=8) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 35 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
35 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689912642157150198 new value: -5066732099722018553 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
37 frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at gcc3x-cointerp.c:64151 64148 /* StackInterpreter>>#forceInterruptCheck */ 64149 sqInt 64150 forceInterruptCheck(void) -> 64151 { DECL_MAYBE_SQ_GLOBAL_STRUCT 64152 void (*iccFunc)(); 64153 StackPage *thePage; 64154 Target 0: (Squeak) stopped. (lldb) bt
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
37
- frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at
gcc3x-cointerp.c:64151 frame #1: 0x00007fff6ab9af5a libsystem_platform.dylib`_sigtramp + 26 frame #2: 0x0000000107c10699 frame #3: 0x0000000107c9646f frame #4: 0x0000000107c9633e frame #5: 0x0000000107c962ac frame #6: 0x0000000107c961e9 frame #7: 0x0000000107c8fcf8 frame #8: 0x0000000107c98605 frame #9: 0x0000000107c8bbfe frame #10: 0x0000000107c1100e frame #11: 0x0000000107c8c4ad frame #12: 0x0000000107c0cef6 frame #13: 0x0000000107c8c2a5 frame #14: 0x0000000107c8bb4c frame #15: 0x0000000107c0cef6 frame #16: 0x0000000107c8ba05 frame #17: 0x0000000107c8b85b frame #18: 0x0000000107c1100e frame #19: 0x0000000107c8b7e1 frame #20: 0x0000000107c57018 frame #21: 0x0000000107c018c8 (lldb) call printCallStack()
Watchpoint 34 hit: old value: -3689348814741910456 new value: 4435376368 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation. (lldb) call printCogMethodFor((void*)0x0000000107c10699)
Watchpoint 34 hit: old value: 4435376368 new value: 0 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation.
I don't know how to proceed, but maybe there's enough clue for you...
Le lun. 30 sept. 2019 ?? 01:28, Eliot Miranda eliot.miranda@gmail.com a ??crit :
Hi Nicolas,
On Wed, Sep 25, 2019 at 12:32 AM Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
> > I noticed that many crash happen in testSendTimeout. > This test is organizing a race: > - Smalltalk fill the socket send buffer > - the OS tries to drain > > The other thing I noticed is that [SocketTest suite run] takes 7 to > 8s on macos while only 2 to 3s on ubuntu and windows > So it might be that some test times out on macos, while it doesn't on > other OS. > > This would explain that many crash also happen in JITted Timer loop, > and that we cannot observe it in other OSes. >
Yes, that is, I think, an important clue. The other one is that, given that the crash always happens only early in a run, plus the address at which it crashes in machine code, suggests to me that the crash occurs when the interpreter invokes a block in the JITted startTimerLoop.
What worries me is that the JITted code for startTimerLoop looks completely corrupted, not merely moved or overwritten (the method header looks OK up to the selector field, which is the last field in the header). So I'm wondering whether it could be a bug in support code overwriting machine code, and hence that not might be related to a 32-bit vs 64-bit code issue. I was hoping your recent sqUnixSocket.c changes might work but I'm sure they have no bearing.
The access to OS ressources and the race may also explain some > randomization of the crash... > > So one idea would be to make the test timeout in linux too, see if we > can make it crash then try using rr. > Maybe it's possible with a huge Socket buffer, a smaller image side > buffer (we could reduce the size from 1000 to 1 so as to increase overhead). >
Yes, that's a good idea. But I do think we need to set up a reliable remote server, som thing that acts like an infinite sink for receiving a par4ticular large Monticello package, so that ev very time the example runs the same package is indeed copied to a remote server. Would it be possible to construct such a fake Monticello server?
What I'm wanting the server to do is a) take a while to respond to the Monticello package contents request b) to never include a particular package which the example would always try and upload
The test case would be to copy a particular version of VMMaker from package-cache to the remote "fake" server
> > It's just theories I tried to construct while biking, not in front of > a keyboard, and I won't be able to verify until this evening... > But the watchpoint strategy is too much vain shooting in the dark, we > need more ellaborated plans ;) >
Does anyone see the crash on macOS in 32 bits?
Le mar. 24 sept. 2019 ?? 00:28, Nicolas Cellier < > nicolas.cellier.aka.nice@gmail.com> a ??crit : > >> >> >> Le mar. 24 sept. 2019 ?? 00:25, Nicolas Cellier < >> nicolas.cellier.aka.nice@gmail.com> a ??crit : >> >>> Follow up: so we have a problem of repeatability... >>> I found nothing like rr for macos, and CPU watchpoint capability is >>> too restricted to be usefull. >>> >>> So what I did is save a crash.image attempting to crash at >>> resumption time: >>> Smalltalk saveAs: 'crash'. >>> Smalltalk snapshot: true andQuit: true. >>> 5 timesRepeat: [SocketTest suite run]. >>> >>> Then I launched the vm under lldb and gathered $pc stats to see if >>> I could observe some reproducibility >>> (lldb) p/x $pc - methodZoneBase >>> 0x00000000000a0e94 >>> 0x0000000000090f70 >>> 0x000000000009f824 >>> 0x000000000009bfc8 >>> 0x000000000009d030 >>> 0x000000000009c7c0 >>> 0x00000000000951e0 >>> 0x00000000000949c8 <- once >>> 0x0000000000094418 >>> 0x00000000000949c8 <??? twice >>> 0x000000000009b848 >>> 0x00000000000a0e94 <- twice >>> 0x000000000009a176 >>> 0x000000006dfff2e0 ??? >>> 0x00000000000996d1 >>> 0x00000000000949c8 <- thrice good candidate for watchpoint set >>> 0x000000000009ae70 >>> 0x000000000009ae70 >>> 0x00000000000a18e7 >>> 0x0000000000095470 >>> 0x000000000009c458 >>> 0x00000000000a0a7c >>> 0x0000000000093e89 >>> 0x000000000009beb0 >>> 0x0000000000095a37 >>> 0x0000000000095408 >>> 0x00000000000a06bc >>> 0x000000000009cd48 >>> 0x0000000000095408 <- twice >>> 0x000000000009ba41 >>> 0x0000000000095408 <- thrice new candidate for watchpoint set >>> 0x0000000000045258 <??? smallest ever >>> 0x000000000009c7c0 >>> 0x0000000000095408 <- fourth ! the watch point was set here, why >>> didn???t it stop??? argh methodZoneBase was incremented after I set the >>> watchpoint!!! >>> 0x00000000000a12dc >>> 0x00000000000951e0 >>> 0x00000000000a0e24 >>> 0x000000000009ff8c >>> 0x000000000009c7f6 >>> 0x000000000009a626 >>> >>> Then I have chosen to watch the most frequent crash address, >>> that is methodZoneBase+0x0000000000095408, >>> Unfortunately, s commented above, I missed one (I set the >>> watchpoint too early, before methodZoneBase took its definitive address). >>> >>> And after a few (a bunch of) trials, here is what I found: >>> >>> (lldb) run ../../image/crash.image >>> There is a running process, kill it and restart?: [Y/n] >>> Process 28175 exited with status = 9 (0x00000009) >>> Process 28185 launched: >>> '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/Squeak.app/Contents/MacOS/Squeak' >>> (x86_64) >>> Squeak was compiled with optimization - stepping may behave oddly; >>> variables may not be available. >>> Process 28185 stopped >>> * thread #1, queue = 'com.apple.main-thread', stop reason = >>> breakpoint 6.1 >>> frame #0: 0x000000010007bc22 Squeak`-[sqSqueakOSXApplication >>> doHeadlessSetup](self=0x000000010066fc00, _cmd="doHeadlessSetup") at >>> sqSqueakOSXApplication.m:125 [opt] >>> 122 //No super call here as we've already set headless from >>> command line or info.plist >>> 123 extern BOOL gSqueakHeadless; >>> 124 // Notice that setActivationPolicy: is available in OSX >>> 10.6 and later >>> -> 125 if ([NSApp respondsToSelector:@selector(setActivationPolicy:)]) >>> { >>> 126 if (gSqueakHeadless) { >>> 127 [NSApp >>> setActivationPolicy:NSApplicationActivationPolicyProhibited]; >>> 128 } else { >>> Target 0: (Squeak) stopped. >>> (lldb) wa s e -s 1 -- methodZoneBase+0x0000000000095408 >>> Watchpoint created: Watchpoint 33: addr = 0x107c58138 size = 1 >>> state = enabled type = w >>> new value: -3689348814741910324 >>> (lldb) cont >>> Process 28185 resuming >>> >>> Watchpoint 33 hit: >>> old value: -3689348814741910324 >>> new value: -3689348814741910836 >>> Process 28185 stopped >>> * thread #1, queue = 'com.apple.main-thread', stop reason = >>> watchpoint 33 >>> frame #0: 0x00000001000731dc >>> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, >>> selector=4435118304) at cogitX64SysV.c:6320 [opt] >>> 6317 sqInt rawHeader; >>> 6318 >>> 6319 (method->cmType = CMMethod); >>> -> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); >>> 6321 (method->blockSize = size); >>> 6322 (method->methodObject = methodObj); >>> 6323 >>> Target 0: (Squeak) stopped. >>> (lldb) >>> Process 28185 resuming >>> >>> Watchpoint 33 hit: >>> old value: -3689348814741910836 >>> new value: -3689571654825360894 >>> Process 28185 stopped >>> * thread #1, queue = 'com.apple.main-thread', stop reason = >>> watchpoint 33 >>> frame #0: 0x0000000100073255 >>> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, >>> selector=4435118304) at cogitX64SysV.c:6339 [opt] >>> 6336 (method->selector = selector); >>> 6337 (method->cmNumArgs = >>> argumentCountOfMethodHeader(methodHeader)); >>> 6338 (method->cmHasMovableLiteral = hasMovableLiteral); >>> -> 6339 if ((method->cmRefersToYoung = hasYoungReferent)) { >>> 6340 addToYoungReferrers(method); >>> 6341 } >>> 6342 (method->cmUsageCount = initialMethodUsageCount()); >>> Target 0: (Squeak) stopped. >>> (lldb) bt >>> * thread #1, queue = 'com.apple.main-thread', stop reason = >>> watchpoint 33 >>> * frame #0: 0x0000000100073255 >>> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, >>> selector=4435118304) at cogitX64SysV.c:6339 [opt] >>> frame #1: 0x000000010004f90c Squeak`compileCogMethod [inlined] >>> generateCogMethod(selector=<unavailable>) at cogitX64SysV.c:7098 [opt] >>> frame #2: 0x000000010004f7a5 >>> Squeak`compileCogMethod(selector=4435118304) at cogitX64SysV.c:27780 [opt] >>> frame #3: 0x000000010004bc54 >>> Squeak`cogselector(aMethodObj=4468481712, aSelectorOop=4435118304) at >>> cogitX64SysV.c:4833 [opt] >>> frame #4: 0x0000000100013d4d Squeak`primitiveClosureValue at >>> gcc3x-cointerp.c:27237 [opt] >>> frame #5: 0x0000000107bd1925 >>> frame #6: 0x000000010000236b Squeak`interpret at >>> gcc3x-cointerp.c:2767 [opt] >>> frame #7: 0x000000010008ed55 Squeak`-[sqSqueakMainApplication >>> runSqueak](self=0x000000010066fc00, _cmd=<unavailable>) at >>> sqSqueakMainApplication.m:201 [opt] >>> frame #8: 0x00007fff44b077b8 >>> Foundation`__NSFirePerformWithOrder + 360 >>> frame #9: 0x00007fff4297cf57 >>> CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ >>> + 23 >>> frame #10: 0x00007fff4297ce7f >>> CoreFoundation`__CFRunLoopDoObservers + 527 >>> frame #11: 0x00007fff4295f3f8 CoreFoundation`__CFRunLoopRun + >>> 1240 >>> frame #12: 0x00007fff4295ec93 >>> CoreFoundation`CFRunLoopRunSpecific + 483 >>> frame #13: 0x00007fff41c49d96 >>> HIToolbox`RunCurrentEventLoopInMode + 286 >>> frame #14: 0x00007fff41c49a0f HIToolbox`ReceiveNextEventCommon >>> + 366 >>> frame #15: 0x00007fff41c49884 >>> HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 >>> frame #16: 0x00007fff3fef9a73 AppKit`_DPSNextEvent + 2085 >>> frame #17: 0x00007fff4068fe34 AppKit`-[NSApplication(NSEvent) >>> _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 >>> frame #18: 0x00007fff3feee885 AppKit`-[NSApplication run] + 764 >>> frame #19: 0x00007fff3febda72 AppKit`NSApplicationMain + 804 >>> frame #20: 0x00007fff6a88c015 libdyld.dylib`start + 1 >>> (lldb) >>> >>> Here I don't know how to proceed, because I have no knowledge of >>> this part of the VM, it's also quite late now... >>> Eliot, does it ring some bell? >>> >>> >> Ah, forget it, I thought it was overwriting, but it is the first >> write on yet virgin methodeZone... >> (it's just that the scenario was different from all the previous run) >> >> Le dim. 22 sept. 2019 ?? 11:21, Jan Vrany jan.vrany@fit.cvut.cz a >>> ??crit : >>> >>>> >>>> On 2019-09-22 10:57, Nicolas Cellier wrote: >>>> > Hi Jan, >>>> > Thanks, a recording debugger would indeed be a tremendous help! >>>> > Unfortunately, I'm unable to reproduce the bug under linux, only >>>> > macosx... >>>> > Do you know something equivalent to rr for mac? >>>> >>>> No, I don't. There might be something though, I never had a mac... >>>> >>>> Jan >>>> >>>> > >>>> > Le dim. 22 sept. 2019 ?? 09:55, Jan Vrany jan.vrany@fit.cvut.cz >>>> a >>>> > ??crit : >>>> > >>>> >> Hi, >>>> >> >>>> >>> >>>> >>> - can anyone think of any other strategies I might take to try >>>> and >>>> >>> reproduce this? >>>> >>> >>>> >> at the risk of stating the obvious, in cases similar to this rr >>>> (+ >>>> >> VDB) >>>> >> helped >>>> >> me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE >>>> >> >>>> >> Never used with opensmalltalk VM, though. >>>> >> >>>> >> HTH, Jan >>>> >>>
-- _,,,^..^,,,_ best, Eliot
On Thu, Oct 3, 2019 at 1:47 PM Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
If I comment /* signal(SIGIO,forceInterruptCheck) */ in platforms//unix/vm/aio.c then I don't experiment crashes anymore
One solution if we want to manipulate the rsp safely is
- to define a specific stack for signal handling via sigaltstack()
- to use that specific stack via sigaction() with SA_ONSTACK flag
That's the right solution as far as I understand these issues.
That's too low level and I prefer some assistance here. There are several calls to signal in the VM so they should all be reviewed.
Right, I will do so today.
Le jeu. 3 oct. 2019 à 22:10, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
I confirm that each crash follows a SIGIO -> forceInterruptCheck() with corrupted stack pointer pointing to methodZone (that is executable zone so program counter). I did:
- remove static declaration of methodZoneBase and limitAddress in
cogitX64Sysv.c so as to make them exported globals 2) declared them as extern sqInt in gcc-cointerp.c 3) instrumented forceInterruptCheck() so as to emit a warning and/or set a breakpoint:
if ( getsp() > methodZoneBase && getsp() <= limitAddress ) { printf("pb in forceInterruptCheck\n"); }
Unlike what I said, forceInterruptCheck() does nothing special and certainly not invoke Smalltalk code nor JITTED code. The interruptCheckChain function that it could invoke is generally null (except for debug/simulation purpose), and always null in the crash i observed. The signal handler also generally operate on same stack (unless specifically asked via sigaction() - which is not the case). My earler diagnostic was driven by the false assumption that handler would use the interrupt stack pointer (IST), and thus that only special stack pointer manipulation happening in trampoline could corrupt it, hmmm that was late night blurred minded ;)
What comes to my mind:
- the signal SIGIO might happen while we are manipulating the stack
pointer, hence the heisenbug
- the stack pointer could be offseted by some incorrect code (but why no
error would happen before SIGIO then?)
I don't have clear understanding of all spur machinery details that would help further diagnosis Eliot, doesn't it sound interesting?
Le jeu. 3 oct. 2019 à 08:10, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Maybe I draw conclusions too quickly. But clearly, the signal handler has a stack pointer pointing in methodZone.
(lldb) p/x $rsp (unsigned long) $211 = 0x0000000107c92da0 (lldb) p/x methodZoneBase (sqInt) $212 = 0x0000000107c01d30 (lldb) p/x limitAddress (usqInt) $213 = 0x0000000107d5e000
It remains to see how that happened.
Le jeu. 3 oct. 2019 à 05:49, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Hi all, So I finally got it! That's our specific handling of SIGIO with forceInterruptCheck. Don't you JIT the signal handler!
The signal handler is using a different stack pointer so as to not interfere with the program. Until Cog decide to jump into JITTED code from within the signal handler. It now share a stack pointer pointing somewhere in cog methodZone, and anything it does with it's stack (call a function, save registers...) will destroy the JITTED code. It even happened when I used the debugger which shares the interrupt handler stack pointer!
Why does it not happen in linux, and why only Mac OS 64, I have no clue... But we can't decently live very long with such handling.
Le mer. 2 oct. 2019 à 23:29, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Hi Eliot, so I have set watchpoint to 4 previous crash and got this:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 98150 exited with status = 9 (0x00000009) Process 98164 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/SqueakDebug.app/Contents/MacOS/Squeak' (x86_64) Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint
1.1 frame #0: 0x00000001001618ec Squeak`-[sqSqueakMainApplication setupMenus](self=0x000000010076e1a0, _cmd="setupMenus") at sqSqueakMainApplication.m:123 120 121 - (void) setupMenus { 122 // nothing to do so far since the menu is setup in the MainMenu.nib file -> 123 } 124 125 - (void) setupTimers { 126 extern void SetUpTimers(void); Target 0: (Squeak) stopped. (lldb) wa se e -s 1 -- 0x107c992b4 Watchpoint created: Watchpoint 35: addr = 0x107c992b4 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x107c929e0 Watchpoint created: Watchpoint 34: addr = 0x107c929e0 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c945df Watchpoint created: Watchpoint 36: addr = 0x107c945df size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c92da3 Watchpoint created: Watchpoint 37: addr = 0x107c92da3 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) c Process 98164 resuming
Watchpoint 34 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
34 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689348814741910324 new value: -3689912642157150198 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
37 frame #0: 0x000000010012de84 Squeak`fillInMethodHeadersizeselector(method=0x0000000107c92da0, size=1240, selector=4435376352) at cogitX64SysV.c:6321 6318 6319 (method->cmType = CMMethod); 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); -> 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 6324 /* If the method has already been cogged (e.g. Newspeak accessors) then Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 36 hit: new value: -3689348814741910520 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
36 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="\b???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=8) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 35 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
35 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689912642157150198 new value: -5066732099722018553 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
37 frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at gcc3x-cointerp.c:64151 64148 /* StackInterpreter>>#forceInterruptCheck */ 64149 sqInt 64150 forceInterruptCheck(void) -> 64151 { DECL_MAYBE_SQ_GLOBAL_STRUCT 64152 void (*iccFunc)(); 64153 StackPage *thePage; 64154 Target 0: (Squeak) stopped. (lldb) bt
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
37
- frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at
gcc3x-cointerp.c:64151 frame #1: 0x00007fff6ab9af5a libsystem_platform.dylib`_sigtramp + 26 frame #2: 0x0000000107c10699 frame #3: 0x0000000107c9646f frame #4: 0x0000000107c9633e frame #5: 0x0000000107c962ac frame #6: 0x0000000107c961e9 frame #7: 0x0000000107c8fcf8 frame #8: 0x0000000107c98605 frame #9: 0x0000000107c8bbfe frame #10: 0x0000000107c1100e frame #11: 0x0000000107c8c4ad frame #12: 0x0000000107c0cef6 frame #13: 0x0000000107c8c2a5 frame #14: 0x0000000107c8bb4c frame #15: 0x0000000107c0cef6 frame #16: 0x0000000107c8ba05 frame #17: 0x0000000107c8b85b frame #18: 0x0000000107c1100e frame #19: 0x0000000107c8b7e1 frame #20: 0x0000000107c57018 frame #21: 0x0000000107c018c8 (lldb) call printCallStack()
Watchpoint 34 hit: old value: -3689348814741910456 new value: 4435376368 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation. (lldb) call printCogMethodFor((void*)0x0000000107c10699)
Watchpoint 34 hit: old value: 4435376368 new value: 0 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation.
I don't know how to proceed, but maybe there's enough clue for you...
Le lun. 30 sept. 2019 à 01:28, Eliot Miranda eliot.miranda@gmail.com a écrit :
Hi Nicolas,
On Wed, Sep 25, 2019 at 12:32 AM Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
> > I noticed that many crash happen in testSendTimeout. > This test is organizing a race: > - Smalltalk fill the socket send buffer > - the OS tries to drain > > The other thing I noticed is that [SocketTest suite run] takes 7 to > 8s on macos while only 2 to 3s on ubuntu and windows > So it might be that some test times out on macos, while it doesn't > on other OS. > > This would explain that many crash also happen in JITted Timer loop, > and that we cannot observe it in other OSes. >
Yes, that is, I think, an important clue. The other one is that, given that the crash always happens only early in a run, plus the address at which it crashes in machine code, suggests to me that the crash occurs when the interpreter invokes a block in the JITted startTimerLoop.
What worries me is that the JITted code for startTimerLoop looks completely corrupted, not merely moved or overwritten (the method header looks OK up to the selector field, which is the last field in the header). So I'm wondering whether it could be a bug in support code overwriting machine code, and hence that not might be related to a 32-bit vs 64-bit code issue. I was hoping your recent sqUnixSocket.c changes might work but I'm sure they have no bearing.
The access to OS ressources and the race may also explain some > randomization of the crash... > > So one idea would be to make the test timeout in linux too, see if > we can make it crash then try using rr. > Maybe it's possible with a huge Socket buffer, a smaller image side > buffer (we could reduce the size from 1000 to 1 so as to increase overhead). >
Yes, that's a good idea. But I do think we need to set up a reliable remote server, som thing that acts like an infinite sink for receiving a par4ticular large Monticello package, so that ev very time the example runs the same package is indeed copied to a remote server. Would it be possible to construct such a fake Monticello server?
What I'm wanting the server to do is a) take a while to respond to the Monticello package contents request b) to never include a particular package which the example would always try and upload
The test case would be to copy a particular version of VMMaker from package-cache to the remote "fake" server
> > It's just theories I tried to construct while biking, not in front > of a keyboard, and I won't be able to verify until this evening... > But the watchpoint strategy is too much vain shooting in the dark, > we need more ellaborated plans ;) >
Does anyone see the crash on macOS in 32 bits?
Le mar. 24 sept. 2019 à 00:28, Nicolas Cellier < > nicolas.cellier.aka.nice@gmail.com> a écrit : > >> >> >> Le mar. 24 sept. 2019 à 00:25, Nicolas Cellier < >> nicolas.cellier.aka.nice@gmail.com> a écrit : >> >>> Follow up: so we have a problem of repeatability... >>> I found nothing like rr for macos, and CPU watchpoint capability >>> is too restricted to be usefull. >>> >>> So what I did is save a crash.image attempting to crash at >>> resumption time: >>> Smalltalk saveAs: 'crash'. >>> Smalltalk snapshot: true andQuit: true. >>> 5 timesRepeat: [SocketTest suite run]. >>> >>> Then I launched the vm under lldb and gathered $pc stats to see if >>> I could observe some reproducibility >>> (lldb) p/x $pc - methodZoneBase >>> 0x00000000000a0e94 >>> 0x0000000000090f70 >>> 0x000000000009f824 >>> 0x000000000009bfc8 >>> 0x000000000009d030 >>> 0x000000000009c7c0 >>> 0x00000000000951e0 >>> 0x00000000000949c8 <- once >>> 0x0000000000094418 >>> 0x00000000000949c8 <— twice >>> 0x000000000009b848 >>> 0x00000000000a0e94 <- twice >>> 0x000000000009a176 >>> 0x000000006dfff2e0 ??? >>> 0x00000000000996d1 >>> 0x00000000000949c8 <- thrice good candidate for watchpoint set >>> 0x000000000009ae70 >>> 0x000000000009ae70 >>> 0x00000000000a18e7 >>> 0x0000000000095470 >>> 0x000000000009c458 >>> 0x00000000000a0a7c >>> 0x0000000000093e89 >>> 0x000000000009beb0 >>> 0x0000000000095a37 >>> 0x0000000000095408 >>> 0x00000000000a06bc >>> 0x000000000009cd48 >>> 0x0000000000095408 <- twice >>> 0x000000000009ba41 >>> 0x0000000000095408 <- thrice new candidate for watchpoint set >>> 0x0000000000045258 <— smallest ever >>> 0x000000000009c7c0 >>> 0x0000000000095408 <- fourth ! the watch point was set here, why >>> didn’t it stop??? argh methodZoneBase was incremented after I set the >>> watchpoint!!! >>> 0x00000000000a12dc >>> 0x00000000000951e0 >>> 0x00000000000a0e24 >>> 0x000000000009ff8c >>> 0x000000000009c7f6 >>> 0x000000000009a626 >>> >>> Then I have chosen to watch the most frequent crash address, >>> that is methodZoneBase+0x0000000000095408, >>> Unfortunately, s commented above, I missed one (I set the >>> watchpoint too early, before methodZoneBase took its definitive address). >>> >>> And after a few (a bunch of) trials, here is what I found: >>> >>> (lldb) run ../../image/crash.image >>> There is a running process, kill it and restart?: [Y/n] >>> Process 28175 exited with status = 9 (0x00000009) >>> Process 28185 launched: >>> '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/Squeak.app/Contents/MacOS/Squeak' >>> (x86_64) >>> Squeak was compiled with optimization - stepping may behave oddly; >>> variables may not be available. >>> Process 28185 stopped >>> * thread #1, queue = 'com.apple.main-thread', stop reason = >>> breakpoint 6.1 >>> frame #0: 0x000000010007bc22 Squeak`-[sqSqueakOSXApplication >>> doHeadlessSetup](self=0x000000010066fc00, _cmd="doHeadlessSetup") at >>> sqSqueakOSXApplication.m:125 [opt] >>> 122 //No super call here as we've already set headless from >>> command line or info.plist >>> 123 extern BOOL gSqueakHeadless; >>> 124 // Notice that setActivationPolicy: is available in OSX >>> 10.6 and later >>> -> 125 if ([NSApp respondsToSelector:@selector(setActivationPolicy:)]) >>> { >>> 126 if (gSqueakHeadless) { >>> 127 [NSApp >>> setActivationPolicy:NSApplicationActivationPolicyProhibited]; >>> 128 } else { >>> Target 0: (Squeak) stopped. >>> (lldb) wa s e -s 1 -- methodZoneBase+0x0000000000095408 >>> Watchpoint created: Watchpoint 33: addr = 0x107c58138 size = 1 >>> state = enabled type = w >>> new value: -3689348814741910324 >>> (lldb) cont >>> Process 28185 resuming >>> >>> Watchpoint 33 hit: >>> old value: -3689348814741910324 >>> new value: -3689348814741910836 >>> Process 28185 stopped >>> * thread #1, queue = 'com.apple.main-thread', stop reason = >>> watchpoint 33 >>> frame #0: 0x00000001000731dc >>> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, >>> selector=4435118304) at cogitX64SysV.c:6320 [opt] >>> 6317 sqInt rawHeader; >>> 6318 >>> 6319 (method->cmType = CMMethod); >>> -> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); >>> 6321 (method->blockSize = size); >>> 6322 (method->methodObject = methodObj); >>> 6323 >>> Target 0: (Squeak) stopped. >>> (lldb) >>> Process 28185 resuming >>> >>> Watchpoint 33 hit: >>> old value: -3689348814741910836 >>> new value: -3689571654825360894 >>> Process 28185 stopped >>> * thread #1, queue = 'com.apple.main-thread', stop reason = >>> watchpoint 33 >>> frame #0: 0x0000000100073255 >>> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, >>> selector=4435118304) at cogitX64SysV.c:6339 [opt] >>> 6336 (method->selector = selector); >>> 6337 (method->cmNumArgs = >>> argumentCountOfMethodHeader(methodHeader)); >>> 6338 (method->cmHasMovableLiteral = hasMovableLiteral); >>> -> 6339 if ((method->cmRefersToYoung = hasYoungReferent)) { >>> 6340 addToYoungReferrers(method); >>> 6341 } >>> 6342 (method->cmUsageCount = initialMethodUsageCount()); >>> Target 0: (Squeak) stopped. >>> (lldb) bt >>> * thread #1, queue = 'com.apple.main-thread', stop reason = >>> watchpoint 33 >>> * frame #0: 0x0000000100073255 >>> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, >>> selector=4435118304) at cogitX64SysV.c:6339 [opt] >>> frame #1: 0x000000010004f90c Squeak`compileCogMethod [inlined] >>> generateCogMethod(selector=<unavailable>) at cogitX64SysV.c:7098 [opt] >>> frame #2: 0x000000010004f7a5 >>> Squeak`compileCogMethod(selector=4435118304) at cogitX64SysV.c:27780 [opt] >>> frame #3: 0x000000010004bc54 >>> Squeak`cogselector(aMethodObj=4468481712, aSelectorOop=4435118304) at >>> cogitX64SysV.c:4833 [opt] >>> frame #4: 0x0000000100013d4d Squeak`primitiveClosureValue at >>> gcc3x-cointerp.c:27237 [opt] >>> frame #5: 0x0000000107bd1925 >>> frame #6: 0x000000010000236b Squeak`interpret at >>> gcc3x-cointerp.c:2767 [opt] >>> frame #7: 0x000000010008ed55 Squeak`-[sqSqueakMainApplication >>> runSqueak](self=0x000000010066fc00, _cmd=<unavailable>) at >>> sqSqueakMainApplication.m:201 [opt] >>> frame #8: 0x00007fff44b077b8 >>> Foundation`__NSFirePerformWithOrder + 360 >>> frame #9: 0x00007fff4297cf57 >>> CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ >>> + 23 >>> frame #10: 0x00007fff4297ce7f >>> CoreFoundation`__CFRunLoopDoObservers + 527 >>> frame #11: 0x00007fff4295f3f8 CoreFoundation`__CFRunLoopRun + >>> 1240 >>> frame #12: 0x00007fff4295ec93 >>> CoreFoundation`CFRunLoopRunSpecific + 483 >>> frame #13: 0x00007fff41c49d96 >>> HIToolbox`RunCurrentEventLoopInMode + 286 >>> frame #14: 0x00007fff41c49a0f HIToolbox`ReceiveNextEventCommon >>> + 366 >>> frame #15: 0x00007fff41c49884 >>> HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 >>> frame #16: 0x00007fff3fef9a73 AppKit`_DPSNextEvent + 2085 >>> frame #17: 0x00007fff4068fe34 AppKit`-[NSApplication(NSEvent) >>> _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 >>> frame #18: 0x00007fff3feee885 AppKit`-[NSApplication run] + 764 >>> frame #19: 0x00007fff3febda72 AppKit`NSApplicationMain + 804 >>> frame #20: 0x00007fff6a88c015 libdyld.dylib`start + 1 >>> (lldb) >>> >>> Here I don't know how to proceed, because I have no knowledge of >>> this part of the VM, it's also quite late now... >>> Eliot, does it ring some bell? >>> >>> >> Ah, forget it, I thought it was overwriting, but it is the first >> write on yet virgin methodeZone... >> (it's just that the scenario was different from all the previous >> run) >> >> Le dim. 22 sept. 2019 à 11:21, Jan Vrany jan.vrany@fit.cvut.cz a >>> écrit : >>> >>>> >>>> On 2019-09-22 10:57, Nicolas Cellier wrote: >>>> > Hi Jan, >>>> > Thanks, a recording debugger would indeed be a tremendous help! >>>> > Unfortunately, I'm unable to reproduce the bug under linux, only >>>> > macosx... >>>> > Do you know something equivalent to rr for mac? >>>> >>>> No, I don't. There might be something though, I never had a mac... >>>> >>>> Jan >>>> >>>> > >>>> > Le dim. 22 sept. 2019 à 09:55, Jan Vrany jan.vrany@fit.cvut.cz >>>> a >>>> > écrit : >>>> > >>>> >> Hi, >>>> >> >>>> >>> >>>> >>> - can anyone think of any other strategies I might take to >>>> try and >>>> >>> reproduce this? >>>> >>> >>>> >> at the risk of stating the obvious, in cases similar to this >>>> rr (+ >>>> >> VDB) >>>> >> helped >>>> >> me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE >>>> >> >>>> >> Never used with opensmalltalk VM, though. >>>> >> >>>> >> HTH, Jan >>>> >>>
-- _,,,^..^,,,_ best, Eliot
Hi Nicolas,
On Thu, Oct 3, 2019 at 1:10 PM Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
I confirm that each crash follows a SIGIO -> forceInterruptCheck() with corrupted stack pointer pointing to methodZone (that is executable zone so program counter). I did:
- remove static declaration of methodZoneBase and limitAddress in
cogitX64Sysv.c so as to make them exported globals 2) declared them as extern sqInt in gcc-cointerp.c 3) instrumented forceInterruptCheck() so as to emit a warning and/or set a breakpoint:
if ( getsp() > methodZoneBase && getsp() <= limitAddress ) { printf("pb in forceInterruptCheck\n"); }
Unlike what I said, forceInterruptCheck() does nothing special and certainly not invoke Smalltalk code nor JITTED code. The interruptCheckChain function that it could invoke is generally null (except for debug/simulation purpose), and always null in the crash i observed. The signal handler also generally operate on same stack (unless specifically asked via sigaction() - which is not the case). My earler diagnostic was driven by the false assumption that handler would use the interrupt stack pointer (IST), and thus that only special stack pointer manipulation happening in trampoline could corrupt it, hmmm that was late night blurred minded ;)
What comes to my mind:
- the signal SIGIO might happen while we are manipulating the stack
pointer, hence the heisenbug
- the stack pointer could be offseted by some incorrect code (but why no
error would happen before SIGIO then?)
I don't have clear understanding of all spur machinery details that would help further diagnosis Eliot, doesn't it sound interesting?
It looks to me like the right cause. The corruption I was seeing in the methodSZOne doesn't;t look like a bug in the JIT; it looks like overwriting. I had thought we were using sigaltstack to ensure the signal handler runs on a dedicated stack, but it appears we don't. If I'm right that we're not using sigaltstack then this may be very easy to fix :-)
[I have to mention ALSA and its awful SIGIO sign al handling code; see platforms/unix//vm-sound-ALSA/sqUnixSoundALSA.c. I don't think this affects the Mac, but it could be an issue on Linux]
Le jeu. 3 oct. 2019 à 08:10, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Maybe I draw conclusions too quickly. But clearly, the signal handler has a stack pointer pointing in methodZone.
(lldb) p/x $rsp (unsigned long) $211 = 0x0000000107c92da0 (lldb) p/x methodZoneBase (sqInt) $212 = 0x0000000107c01d30 (lldb) p/x limitAddress (usqInt) $213 = 0x0000000107d5e000
It remains to see how that happened.
Le jeu. 3 oct. 2019 à 05:49, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Hi all, So I finally got it! That's our specific handling of SIGIO with forceInterruptCheck. Don't you JIT the signal handler!
The signal handler is using a different stack pointer so as to not interfere with the program. Until Cog decide to jump into JITTED code from within the signal handler. It now share a stack pointer pointing somewhere in cog methodZone, and anything it does with it's stack (call a function, save registers...) will destroy the JITTED code. It even happened when I used the debugger which shares the interrupt handler stack pointer!
Why does it not happen in linux, and why only Mac OS 64, I have no clue... But we can't decently live very long with such handling.
Le mer. 2 oct. 2019 à 23:29, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> a écrit :
Hi Eliot, so I have set watchpoint to 4 previous crash and got this:
(lldb) run ../../image/crash.image There is a running process, kill it and restart?: [Y/n] Process 98150 exited with status = 9 (0x00000009) Process 98164 launched: '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/SqueakDebug.app/Contents/MacOS/Squeak' (x86_64) Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint
1.1 frame #0: 0x00000001001618ec Squeak`-[sqSqueakMainApplication setupMenus](self=0x000000010076e1a0, _cmd="setupMenus") at sqSqueakMainApplication.m:123 120 121 - (void) setupMenus { 122 // nothing to do so far since the menu is setup in the MainMenu.nib file -> 123 } 124 125 - (void) setupTimers { 126 extern void SetUpTimers(void); Target 0: (Squeak) stopped. (lldb) wa se e -s 1 -- 0x107c992b4 Watchpoint created: Watchpoint 35: addr = 0x107c992b4 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x107c929e0 Watchpoint created: Watchpoint 34: addr = 0x107c929e0 size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c945df Watchpoint created: Watchpoint 36: addr = 0x107c945df size = 1 state = enabled type = w (lldb) wa se e -s 1 -- 0x0000000107c92da3 Watchpoint created: Watchpoint 37: addr = 0x107c92da3 size = 1 state = enabled type = w new value: -3689348814741910324 (lldb) c Process 98164 resuming
Watchpoint 34 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
34 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689348814741910324 new value: -3689912642157150198 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
37 frame #0: 0x000000010012de84 Squeak`fillInMethodHeadersizeselector(method=0x0000000107c92da0, size=1240, selector=4435376352) at cogitX64SysV.c:6321 6318 6319 (method->cmType = CMMethod); 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); -> 6321 (method->blockSize = size); 6322 (method->methodObject = methodObj); 6323 6324 /* If the method has already been cogged (e.g. Newspeak accessors) then Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 36 hit: new value: -3689348814741910520 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
36 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="\b???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=8) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 35 hit: new value: -3689348814741910456 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
35 frame #0: 0x0000000100102067 Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, val=72) at sqMemoryAccess.h:136 133 /* Use static inline functions when the compiler produces efficient code for small accessors. 134 These are preferred because static type checking will prevent inadvertent confusion of pointers and oops. */ 135 static inline sqInt byteAtPointer(char *ptr) { return (sqInt)(*((unsigned char *)ptr)); } -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } 137 static inline sqInt shortAtPointer(char *ptr) { return (sqInt)(*((short *)ptr)); } 138 static inline sqInt shortAtPointerput(char *ptr, int val) { return (sqInt)(*((short *)ptr)= (short)val); } 139 static inline sqInt intAtPointer(char *ptr) { return (sqInt)(*((int *)ptr)); } Target 0: (Squeak) stopped. (lldb) Process 98164 resuming
Watchpoint 37 hit: old value: -3689912642157150198 new value: -5066732099722018553 Process 98164 stopped
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
37 frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at gcc3x-cointerp.c:64151 64148 /* StackInterpreter>>#forceInterruptCheck */ 64149 sqInt 64150 forceInterruptCheck(void) -> 64151 { DECL_MAYBE_SQ_GLOBAL_STRUCT 64152 void (*iccFunc)(); 64153 StackPage *thePage; 64154 Target 0: (Squeak) stopped. (lldb) bt
- thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint
37
- frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at
gcc3x-cointerp.c:64151 frame #1: 0x00007fff6ab9af5a libsystem_platform.dylib`_sigtramp + 26 frame #2: 0x0000000107c10699 frame #3: 0x0000000107c9646f frame #4: 0x0000000107c9633e frame #5: 0x0000000107c962ac frame #6: 0x0000000107c961e9 frame #7: 0x0000000107c8fcf8 frame #8: 0x0000000107c98605 frame #9: 0x0000000107c8bbfe frame #10: 0x0000000107c1100e frame #11: 0x0000000107c8c4ad frame #12: 0x0000000107c0cef6 frame #13: 0x0000000107c8c2a5 frame #14: 0x0000000107c8bb4c frame #15: 0x0000000107c0cef6 frame #16: 0x0000000107c8ba05 frame #17: 0x0000000107c8b85b frame #18: 0x0000000107c1100e frame #19: 0x0000000107c8b7e1 frame #20: 0x0000000107c57018 frame #21: 0x0000000107c018c8 (lldb) call printCallStack()
Watchpoint 34 hit: old value: -3689348814741910456 new value: 4435376368 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation. (lldb) call printCogMethodFor((void*)0x0000000107c10699)
Watchpoint 34 hit: old value: 4435376368 new value: 0 error: Execution was interrupted, reason: watchpoint 34. The process has been returned to the state before expression evaluation.
I don't know how to proceed, but maybe there's enough clue for you...
Le lun. 30 sept. 2019 à 01:28, Eliot Miranda eliot.miranda@gmail.com a écrit :
Hi Nicolas,
On Wed, Sep 25, 2019 at 12:32 AM Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
I noticed that many crash happen in testSendTimeout. This test is organizing a race:
- Smalltalk fill the socket send buffer
- the OS tries to drain
The other thing I noticed is that [SocketTest suite run] takes 7 to 8s on macos while only 2 to 3s on ubuntu and windows So it might be that some test times out on macos, while it doesn't on other OS.
This would explain that many crash also happen in JITted Timer loop, and that we cannot observe it in other OSes.
Yes, that is, I think, an important clue. The other one is that, given that the crash always happens only early in a run, plus the address at which it crashes in machine code, suggests to me that the crash occurs when the interpreter invokes a block in the JITted startTimerLoop.
What worries me is that the JITted code for startTimerLoop looks completely corrupted, not merely moved or overwritten (the method header looks OK up to the selector field, which is the last field in the header). So I'm wondering whether it could be a bug in support code overwriting machine code, and hence that not might be related to a 32-bit vs 64-bit code issue. I was hoping your recent sqUnixSocket.c changes might work but I'm sure they have no bearing.
The access to OS ressources and the race may also explain some
randomization of the crash...
So one idea would be to make the test timeout in linux too, see if we can make it crash then try using rr. Maybe it's possible with a huge Socket buffer, a smaller image side buffer (we could reduce the size from 1000 to 1 so as to increase overhead).
Yes, that's a good idea. But I do think we need to set up a reliable remote server, som thing that acts like an infinite sink for receiving a par4ticular large Monticello package, so that ev very time the example runs the same package is indeed copied to a remote server. Would it be possible to construct such a fake Monticello server?
What I'm wanting the server to do is a) take a while to respond to the Monticello package contents request b) to never include a particular package which the example would always try and upload
The test case would be to copy a particular version of VMMaker from package-cache to the remote "fake" server
It's just theories I tried to construct while biking, not in front of a keyboard, and I won't be able to verify until this evening... But the watchpoint strategy is too much vain shooting in the dark, we need more ellaborated plans ;)
Does anyone see the crash on macOS in 32 bits?
Le mar. 24 sept. 2019 à 00:28, Nicolas Cellier <
nicolas.cellier.aka.nice@gmail.com> a écrit :
> > > Le mar. 24 sept. 2019 à 00:25, Nicolas Cellier < > nicolas.cellier.aka.nice@gmail.com> a écrit : > >> Follow up: so we have a problem of repeatability... >> I found nothing like rr for macos, and CPU watchpoint capability is >> too restricted to be usefull. >> >> So what I did is save a crash.image attempting to crash at >> resumption time: >> Smalltalk saveAs: 'crash'. >> Smalltalk snapshot: true andQuit: true. >> 5 timesRepeat: [SocketTest suite run]. >> >> Then I launched the vm under lldb and gathered $pc stats to see if >> I could observe some reproducibility >> (lldb) p/x $pc - methodZoneBase >> 0x00000000000a0e94 >> 0x0000000000090f70 >> 0x000000000009f824 >> 0x000000000009bfc8 >> 0x000000000009d030 >> 0x000000000009c7c0 >> 0x00000000000951e0 >> 0x00000000000949c8 <- once >> 0x0000000000094418 >> 0x00000000000949c8 <— twice >> 0x000000000009b848 >> 0x00000000000a0e94 <- twice >> 0x000000000009a176 >> 0x000000006dfff2e0 ??? >> 0x00000000000996d1 >> 0x00000000000949c8 <- thrice good candidate for watchpoint set >> 0x000000000009ae70 >> 0x000000000009ae70 >> 0x00000000000a18e7 >> 0x0000000000095470 >> 0x000000000009c458 >> 0x00000000000a0a7c >> 0x0000000000093e89 >> 0x000000000009beb0 >> 0x0000000000095a37 >> 0x0000000000095408 >> 0x00000000000a06bc >> 0x000000000009cd48 >> 0x0000000000095408 <- twice >> 0x000000000009ba41 >> 0x0000000000095408 <- thrice new candidate for watchpoint set >> 0x0000000000045258 <— smallest ever >> 0x000000000009c7c0 >> 0x0000000000095408 <- fourth ! the watch point was set here, why >> didn’t it stop??? argh methodZoneBase was incremented after I set the >> watchpoint!!! >> 0x00000000000a12dc >> 0x00000000000951e0 >> 0x00000000000a0e24 >> 0x000000000009ff8c >> 0x000000000009c7f6 >> 0x000000000009a626 >> >> Then I have chosen to watch the most frequent crash address, >> that is methodZoneBase+0x0000000000095408, >> Unfortunately, s commented above, I missed one (I set the >> watchpoint too early, before methodZoneBase took its definitive address). >> >> And after a few (a bunch of) trials, here is what I found: >> >> (lldb) run ../../image/crash.image >> There is a running process, kill it and restart?: [Y/n] >> Process 28175 exited with status = 9 (0x00000009) >> Process 28185 launched: >> '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/Squeak.app/Contents/MacOS/Squeak' >> (x86_64) >> Squeak was compiled with optimization - stepping may behave oddly; >> variables may not be available. >> Process 28185 stopped >> * thread #1, queue = 'com.apple.main-thread', stop reason = >> breakpoint 6.1 >> frame #0: 0x000000010007bc22 Squeak`-[sqSqueakOSXApplication >> doHeadlessSetup](self=0x000000010066fc00, _cmd="doHeadlessSetup") at >> sqSqueakOSXApplication.m:125 [opt] >> 122 //No super call here as we've already set headless from >> command line or info.plist >> 123 extern BOOL gSqueakHeadless; >> 124 // Notice that setActivationPolicy: is available in OSX >> 10.6 and later >> -> 125 if ([NSApp respondsToSelector:@selector(setActivationPolicy:)]) >> { >> 126 if (gSqueakHeadless) { >> 127 [NSApp >> setActivationPolicy:NSApplicationActivationPolicyProhibited]; >> 128 } else { >> Target 0: (Squeak) stopped. >> (lldb) wa s e -s 1 -- methodZoneBase+0x0000000000095408 >> Watchpoint created: Watchpoint 33: addr = 0x107c58138 size = 1 >> state = enabled type = w >> new value: -3689348814741910324 >> (lldb) cont >> Process 28185 resuming >> >> Watchpoint 33 hit: >> old value: -3689348814741910324 >> new value: -3689348814741910836 >> Process 28185 stopped >> * thread #1, queue = 'com.apple.main-thread', stop reason = >> watchpoint 33 >> frame #0: 0x00000001000731dc >> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, >> selector=4435118304) at cogitX64SysV.c:6320 [opt] >> 6317 sqInt rawHeader; >> 6318 >> 6319 (method->cmType = CMMethod); >> -> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); >> 6321 (method->blockSize = size); >> 6322 (method->methodObject = methodObj); >> 6323 >> Target 0: (Squeak) stopped. >> (lldb) >> Process 28185 resuming >> >> Watchpoint 33 hit: >> old value: -3689348814741910836 >> new value: -3689571654825360894 >> Process 28185 stopped >> * thread #1, queue = 'com.apple.main-thread', stop reason = >> watchpoint 33 >> frame #0: 0x0000000100073255 >> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, >> selector=4435118304) at cogitX64SysV.c:6339 [opt] >> 6336 (method->selector = selector); >> 6337 (method->cmNumArgs = >> argumentCountOfMethodHeader(methodHeader)); >> 6338 (method->cmHasMovableLiteral = hasMovableLiteral); >> -> 6339 if ((method->cmRefersToYoung = hasYoungReferent)) { >> 6340 addToYoungReferrers(method); >> 6341 } >> 6342 (method->cmUsageCount = initialMethodUsageCount()); >> Target 0: (Squeak) stopped. >> (lldb) bt >> * thread #1, queue = 'com.apple.main-thread', stop reason = >> watchpoint 33 >> * frame #0: 0x0000000100073255 >> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, >> selector=4435118304) at cogitX64SysV.c:6339 [opt] >> frame #1: 0x000000010004f90c Squeak`compileCogMethod [inlined] >> generateCogMethod(selector=<unavailable>) at cogitX64SysV.c:7098 [opt] >> frame #2: 0x000000010004f7a5 >> Squeak`compileCogMethod(selector=4435118304) at cogitX64SysV.c:27780 [opt] >> frame #3: 0x000000010004bc54 >> Squeak`cogselector(aMethodObj=4468481712, aSelectorOop=4435118304) at >> cogitX64SysV.c:4833 [opt] >> frame #4: 0x0000000100013d4d Squeak`primitiveClosureValue at >> gcc3x-cointerp.c:27237 [opt] >> frame #5: 0x0000000107bd1925 >> frame #6: 0x000000010000236b Squeak`interpret at >> gcc3x-cointerp.c:2767 [opt] >> frame #7: 0x000000010008ed55 Squeak`-[sqSqueakMainApplication >> runSqueak](self=0x000000010066fc00, _cmd=<unavailable>) at >> sqSqueakMainApplication.m:201 [opt] >> frame #8: 0x00007fff44b077b8 >> Foundation`__NSFirePerformWithOrder + 360 >> frame #9: 0x00007fff4297cf57 >> CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ >> + 23 >> frame #10: 0x00007fff4297ce7f >> CoreFoundation`__CFRunLoopDoObservers + 527 >> frame #11: 0x00007fff4295f3f8 CoreFoundation`__CFRunLoopRun + >> 1240 >> frame #12: 0x00007fff4295ec93 >> CoreFoundation`CFRunLoopRunSpecific + 483 >> frame #13: 0x00007fff41c49d96 >> HIToolbox`RunCurrentEventLoopInMode + 286 >> frame #14: 0x00007fff41c49a0f HIToolbox`ReceiveNextEventCommon >> + 366 >> frame #15: 0x00007fff41c49884 >> HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 >> frame #16: 0x00007fff3fef9a73 AppKit`_DPSNextEvent + 2085 >> frame #17: 0x00007fff4068fe34 AppKit`-[NSApplication(NSEvent) >> _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 >> frame #18: 0x00007fff3feee885 AppKit`-[NSApplication run] + 764 >> frame #19: 0x00007fff3febda72 AppKit`NSApplicationMain + 804 >> frame #20: 0x00007fff6a88c015 libdyld.dylib`start + 1 >> (lldb) >> >> Here I don't know how to proceed, because I have no knowledge of >> this part of the VM, it's also quite late now... >> Eliot, does it ring some bell? >> >> > Ah, forget it, I thought it was overwriting, but it is the first > write on yet virgin methodeZone... > (it's just that the scenario was different from all the previous run) > > Le dim. 22 sept. 2019 à 11:21, Jan Vrany jan.vrany@fit.cvut.cz a >> écrit : >> >>> >>> On 2019-09-22 10:57, Nicolas Cellier wrote: >>> > Hi Jan, >>> > Thanks, a recording debugger would indeed be a tremendous help! >>> > Unfortunately, I'm unable to reproduce the bug under linux, only >>> > macosx... >>> > Do you know something equivalent to rr for mac? >>> >>> No, I don't. There might be something though, I never had a mac... >>> >>> Jan >>> >>> > >>> > Le dim. 22 sept. 2019 à 09:55, Jan Vrany jan.vrany@fit.cvut.cz >>> a >>> > écrit : >>> > >>> >> Hi, >>> >> >>> >>> >>> >>> - can anyone think of any other strategies I might take to try >>> and >>> >>> reproduce this? >>> >>> >>> >> at the risk of stating the obvious, in cases similar to this rr >>> (+ >>> >> VDB) >>> >> helped >>> >> me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE >>> >> >>> >> Never used with opensmalltalk VM, though. >>> >> >>> >> HTH, Jan >>> >>
-- _,,,^..^,,,_ best, Eliot
Hi Jan,
this could be very helpful. Care to hold my hand? Does it work on macOS? (we're only seeing the crash on macOS, and then only sporadically).
On Sun, Sep 22, 2019 at 12:55 AM Jan Vrany jan.vrany@fit.cvut.cz wrote:
Hi,
- can anyone think of any other strategies I might take to try and
reproduce this?
at the risk of stating the obvious, in cases similar to this rr (+ VDB) helped me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE
Never used with opensmalltalk VM, though.
HTH, Jan
On Sun, 2019-09-29 at 15:49 -0700, Eliot Miranda wrote:
Hi Jan,
this could be very helpful. Care to hold my hand?
Sure!
Does it work on macOS?
Nope. rr is linux-only. Nicolas made some research and found nothing similar, see his previous emails.
(we're only seeing the crash on macOS, and then only sporadically).
On Sun, Sep 22, 2019 at 12:55 AM Jan Vrany jan.vrany@fit.cvut.cz wrote:
Hi,
- can anyone think of any other strategies I might take to try and
reproduce this?
at the risk of stating the obvious, in cases similar to this rr (+ VDB) helped me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE
Never used with opensmalltalk VM, though.
HTH, Jan
vm-dev@lists.squeakfoundation.org