[Vm-dev] [squeak-dev] Difficult to debug VM crash with full blocks and Sista V1
Eliot Miranda
eliot.miranda at gmail.com
Fri Sep 20 19:37:52 UTC 2019
Hi All,
two further reports. One is that attempting to simulate SocketTest
suite run with a small code zone to provoke code comp-actions did not
produce a crash. Sad, because I left it running over night. There's still
more work to do with simulation here though because the simulation is far
from perfect; only two out of 14 tests pass. At least they run ;-)
14 run, 2 passes, 0 expected failures, 6 failures, 6 errors, 0 unexpected
passes
squeak>
The other report is that I have much more information on the nature of the
crash and reason to suspect something to do with code zone management. The
JITted version of the Delay class>>#startTimerEventLoop method is badly
corrupted from its last field on:
(lldb) call printCogMethodFor((void *)0x10cefe70a)
0x10cefe288 <-> 0x10cefe760: method: 0x1117a96a8
selector: 0x7fff58e9af6d " is no where obvious"
*(CogMethod *)0x10cefe288
(CogMethod) $14 = {
objectHeader = 0x008000000a000035
cmNumArgs = 0
cmType = 2 CMMethod
cmRefersToYoung = false
cpicHasMNUCaseOrCMIsFullBlock = false
cmUsageCount = 1
cmUsesPenultimateLit = false
cbUsesInstVars = false
cmHasMovableLiteral = true
cmUnusedFlag = 0x00000001
stackCheckOffset = 116/0x74
blockSize = 1240/0x4d8
blockEntryOffset = 1154/0x482 (1149/0x479 in simulator)
methodObject = 0x1117a96a8
methodHeader = 0x00000000000000b1
selector = 0x00007fff58e9af6d
}
Everything is fine up to the selector field which is well out of range of
the heap, an odd value ending in 16rD, and something followed by garbage
instructions:
(lldb) print/x (CogMethod *)0x10cefe288 + 1
(CogMethod *) $24 = 0x000000010cefe2b0
(lldb) x/10i 0x000000010cefe2b0
0x10cefe2b0: d0 0c bd ef fe 7f 00 rorb 0x7ffeef(,%rdi,4)
0x10cefe2b7: 00 48 89 addb %cl, -0x77(%rax)
0x10cefe2ba: d0 48 83 rorb -0x7d(%rax)
0x10cefe2bd: e0 07 loopne 0x10cefe2c6
0x10cefe2bf: 75 09 jne 0x10cefe2ca
0x10cefe2c1: 48 8b 02 movq (%rdx), %rax
0x10cefe2c4: 48 25 ff ff 3f 00 andq $0x3fffff, %rax ; imm =
0x3FFFFF
0x10cefe2ca: 48 39 c8 cmpq %rcx, %rax
0x10cefe2cd: 75 e4 jne 0x10cefe2b3
0x10cefe2cf: 4c 8b de movq %rsi, %r11
the crash occurs I believe when the sort block for SuspendedDelays created
in Delay class>>#startTimerEventLoop is activated via the code
in CoInterpreter>>#executeCogBlock:closure:mayContextSwitch: which jumps to
the blockEntry code for the method (code that dispatches between the two
blocks in startTimerEventLoop). This dispatch code is also corrupted:
(lldb) print/x 0x10cefe288 + 0x482 (CogMethod for startTimerEventLoop + its
blockEntryOffset)
(long) $36 = 0x000000010cefe70a
(lldb) x/10i 0x10cefe70a
0x10cefe70a: e7 ef outl %eax, $0xef
0x10cefe70c: 0c 01 orb $0x1, %al
0x10cefe70e: 00 00 addb %al, (%rax)
0x10cefe710: 00 00 addb %al, (%rax)
0x10cefe712: 00 00 addb %al, (%rax)
whereas it should look like
00002c74: xorq %r9, %r9 : 4D 31 C9
00002c77: jmp .+0x3 (0x2c7c=startTimerEventLoop at 47C) : EB 03
blockEntry:
00002c79: movq %rdx, %r9 : 49 89 D1
00002c7c: movq methodDict at 16(%rdx), %rax : 48 8B 42 10
00002c80: cmpq $0x661, %rax : 48 3D 61 06 00 00
00002c86: jle .-0xE4 (0x2ba8=startTimerEventLoop at 3A8) : 0F 8E 1C FF FF FF
block startpc: CB/659
00002c8c: jmp .-0x66 (0x2c28=startTimerEventLoop at 428) : EB 9A
block startpc: DF/6F9
So my current best guess is that JITted version of the Delay
class>>#startTimerEventLoop method is corrupted by some bug in code zone
compaction.
A second route is indicated by the prim trace log which strangely shows two
stack overflow events, not one, very soon before the crash, which could
indicate a bug in the interpreter/jit control flow on context switch:
(lldb) call dumpPrimTraceLog()
…
forceDisplayUpdate
utcMicrosecondClock
utcMicrosecondClock
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
tempAt:put:
findNextUnwindContextUpTo:
terminateTo:
basicNew:
primSocketConnectionStatus:
utcMicrosecondClock
primSocketSendDone:
primSocket:sendData:startIndex:count:
utcMicrosecondClock
primSocketSendDone:
primSocketConnectionStatus:
utcMicrosecondClock
**StackOverflow**
**StackOverflow**
wait
signal
utcMicrosecondClock(newMethod -> 0x10cec3230 : 0x111f565a8)
This log is showing non-jitted primitive invocations and context
switch/code zone reclamation events outside of JITted code (for obvious
performance reasons we don't record such events in JITted code).
So, still more digging to narrow this down, and possibly fixing simulation
of the suite is worth-while. But what I really need is a more reproducible
case in the real VM. Sigh :-)
_,,,^..^,,,_
best, Eliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20190920/c4a0581b/attachment.html>
More information about the Vm-dev
mailing list