[Vm-dev] [squeak-dev] Difficult to debug VM crash with full blocks and Sista V1
Nicolas Cellier
nicolas.cellier.aka.nice at gmail.com
Sat Sep 21 06:59:17 UTC 2019
Hi Eliot,
thanks for the heads up. Also remember that I was not able to produce such
crash on Windows nor linux spur64, and I don't remember of anyone reporting
it...
This make it not obvious that we can produce it in the Simulator which is
yet another machine...
I also never saw such problem in spur32.
On the low level side (lldb/gdb) it is possible to set watchpoint on this
code zone
watchpoint set expression --size 10 -- 0x10cefe70a
w s e -s 2 -- 0x10cefe70a
Don't know if feasible (is there any compaction in cog method zone? is the
debugger fast enough with the instrumentation?)
Le ven. 20 sept. 2019 à 21:38, Eliot Miranda <eliot.miranda at gmail.com> a
écrit :
>
> Hi All,
>
> two further reports. One is that attempting to simulate SocketTest
> suite run with a small code zone to provoke code comp-actions did not
> produce a crash. Sad, because I left it running over night. There's still
> more work to do with simulation here though because the simulation is far
> from perfect; only two out of 14 tests pass. At least they run ;-)
>
> 14 run, 2 passes, 0 expected failures, 6 failures, 6 errors, 0 unexpected
> passes
> squeak>
>
> The other report is that I have much more information on the nature of the
> crash and reason to suspect something to do with code zone management. The
> JITted version of the Delay class>>#startTimerEventLoop method is badly
> corrupted from its last field on:
>
> (lldb) call printCogMethodFor((void *)0x10cefe70a)
> 0x10cefe288 <-> 0x10cefe760: method: 0x1117a96a8
> selector: 0x7fff58e9af6d " is no where obvious"
>
> *(CogMethod *)0x10cefe288
> (CogMethod) $14 = {
> objectHeader = 0x008000000a000035
> cmNumArgs = 0
> cmType = 2 CMMethod
> cmRefersToYoung = false
> cpicHasMNUCaseOrCMIsFullBlock = false
> cmUsageCount = 1
> cmUsesPenultimateLit = false
> cbUsesInstVars = false
> cmHasMovableLiteral = true
> cmUnusedFlag = 0x00000001
> stackCheckOffset = 116/0x74
> blockSize = 1240/0x4d8
> blockEntryOffset = 1154/0x482 (1149/0x479 in simulator)
> methodObject = 0x1117a96a8
> methodHeader = 0x00000000000000b1
> selector = 0x00007fff58e9af6d
> }
>
> Everything is fine up to the selector field which is well out of range of
> the heap, an odd value ending in 16rD, and something followed by garbage
> instructions:
>
> (lldb) print/x (CogMethod *)0x10cefe288 + 1
> (CogMethod *) $24 = 0x000000010cefe2b0
> (lldb) x/10i 0x000000010cefe2b0
> 0x10cefe2b0: d0 0c bd ef fe 7f 00 rorb 0x7ffeef(,%rdi,4)
> 0x10cefe2b7: 00 48 89 addb %cl, -0x77(%rax)
> 0x10cefe2ba: d0 48 83 rorb -0x7d(%rax)
> 0x10cefe2bd: e0 07 loopne 0x10cefe2c6
> 0x10cefe2bf: 75 09 jne 0x10cefe2ca
> 0x10cefe2c1: 48 8b 02 movq (%rdx), %rax
> 0x10cefe2c4: 48 25 ff ff 3f 00 andq $0x3fffff, %rax ; imm
> = 0x3FFFFF
> 0x10cefe2ca: 48 39 c8 cmpq %rcx, %rax
> 0x10cefe2cd: 75 e4 jne 0x10cefe2b3
> 0x10cefe2cf: 4c 8b de movq %rsi, %r11
>
> the crash occurs I believe when the sort block for SuspendedDelays created
> in Delay class>>#startTimerEventLoop is activated via the code
> in CoInterpreter>>#executeCogBlock:closure:mayContextSwitch: which jumps to
> the blockEntry code for the method (code that dispatches between the two
> blocks in startTimerEventLoop). This dispatch code is also corrupted:
>
> (lldb) print/x 0x10cefe288 + 0x482 (CogMethod for startTimerEventLoop +
> its blockEntryOffset)
> (long) $36 = 0x000000010cefe70a
>
> (lldb) x/10i 0x10cefe70a
> 0x10cefe70a: e7 ef outl %eax, $0xef
> 0x10cefe70c: 0c 01 orb $0x1, %al
> 0x10cefe70e: 00 00 addb %al, (%rax)
> 0x10cefe710: 00 00 addb %al, (%rax)
> 0x10cefe712: 00 00 addb %al, (%rax)
>
> whereas it should look like
> 00002c74: xorq %r9, %r9 : 4D 31 C9
> 00002c77: jmp .+0x3 (0x2c7c=startTimerEventLoop at 47C) : EB 03
> blockEntry:
> 00002c79: movq %rdx, %r9 : 49 89 D1
> 00002c7c: movq methodDict at 16(%rdx), %rax : 48 8B 42 10
> 00002c80: cmpq $0x661, %rax : 48 3D 61 06 00 00
> 00002c86: jle .-0xE4 (0x2ba8=startTimerEventLoop at 3A8) : 0F 8E 1C FF FF FF
> block startpc: CB/659
> 00002c8c: jmp .-0x66 (0x2c28=startTimerEventLoop at 428) : EB 9A
> block startpc: DF/6F9
>
> So my current best guess is that JITted version of the Delay
> class>>#startTimerEventLoop method is corrupted by some bug in code zone
> compaction.
>
> A second route is indicated by the prim trace log which strangely shows
> two stack overflow events, not one, very soon before the crash, which
> could indicate a bug in the interpreter/jit control flow on context switch:
>
> (lldb) call dumpPrimTraceLog()
>
> …
>
> forceDisplayUpdate
>
> utcMicrosecondClock
>
> utcMicrosecondClock
>
> findNextUnwindContextUpTo:
>
> tempAt:
>
> tempAt:put:
>
> tempAt:
>
> terminateTo:
>
> tempAt:put:
>
> findNextUnwindContextUpTo:
>
> terminateTo:
>
> basicNew:
>
> primSocketConnectionStatus:
>
> utcMicrosecondClock
>
> primSocketSendDone:
>
> primSocket:sendData:startIndex:count:
>
> utcMicrosecondClock
>
> primSocketSendDone:
>
> primSocketConnectionStatus:
>
> utcMicrosecondClock
>
> **StackOverflow**
>
> **StackOverflow**
>
> wait
>
> signal
>
> utcMicrosecondClock(newMethod -> 0x10cec3230 : 0x111f565a8)
>
> This log is showing non-jitted primitive invocations and context
> switch/code zone reclamation events outside of JITted code (for obvious
> performance reasons we don't record such events in JITted code).
>
> So, still more digging to narrow this down, and possibly fixing simulation
> of the suite is worth-while. But what I really need is a more reproducible
> case in the real VM. Sigh :-)
> _,,,^..^,,,_
> best, Eliot
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20190921/908b5c9a/attachment-0001.html>
More information about the Vm-dev
mailing list