[Vm-dev] [squeak-dev] Difficult to debug VM crash with full blocks and Sista V1

Eliot Miranda eliot.miranda at gmail.com
Fri Sep 20 19:37:52 UTC 2019


Hi All,

    two further reports.  One is that attempting to simulate SocketTest
suite run with a small code zone to provoke code comp-actions did not
produce a crash.  Sad, because I left it running over night.  There's still
more work to do with simulation here though because the simulation is far
from perfect; only two out of 14 tests pass.  At least they run ;-)

14 run, 2 passes, 0 expected failures, 6 failures, 6 errors, 0 unexpected
passes
squeak>

The other report is that I have much more information on the nature of the
crash and reason to suspect something to do with code zone management.  The
JITted version of the Delay class>>#startTimerEventLoop method is badly
corrupted from its last field on:

(lldb) call printCogMethodFor((void *)0x10cefe70a)
       0x10cefe288 <->        0x10cefe760: method:        0x1117a96a8
selector:     0x7fff58e9af6d " is no where obvious"

*(CogMethod *)0x10cefe288
(CogMethod) $14 = {
  objectHeader = 0x008000000a000035
  cmNumArgs = 0
  cmType = 2 CMMethod
  cmRefersToYoung = false
  cpicHasMNUCaseOrCMIsFullBlock = false
  cmUsageCount = 1
  cmUsesPenultimateLit = false
  cbUsesInstVars = false
  cmHasMovableLiteral = true
  cmUnusedFlag = 0x00000001
  stackCheckOffset = 116/0x74
  blockSize = 1240/0x4d8
  blockEntryOffset = 1154/0x482 (1149/0x479 in simulator)
  methodObject = 0x1117a96a8
  methodHeader = 0x00000000000000b1
  selector = 0x00007fff58e9af6d
}

Everything is fine up to the selector field which is well out of range of
the heap, an odd value ending in 16rD, and something followed by garbage
instructions:

(lldb) print/x (CogMethod *)0x10cefe288 + 1
(CogMethod *) $24 = 0x000000010cefe2b0
(lldb) x/10i 0x000000010cefe2b0
    0x10cefe2b0: d0 0c bd ef fe 7f 00 rorb   0x7ffeef(,%rdi,4)
    0x10cefe2b7: 00 48 89 addb   %cl, -0x77(%rax)
    0x10cefe2ba: d0 48 83 rorb   -0x7d(%rax)
    0x10cefe2bd: e0 07 loopne 0x10cefe2c6
    0x10cefe2bf: 75 09 jne    0x10cefe2ca
    0x10cefe2c1: 48 8b 02 movq   (%rdx), %rax
    0x10cefe2c4: 48 25 ff ff 3f 00 andq   $0x3fffff, %rax           ; imm =
0x3FFFFF
    0x10cefe2ca: 48 39 c8 cmpq   %rcx, %rax
    0x10cefe2cd: 75 e4 jne    0x10cefe2b3
    0x10cefe2cf: 4c 8b de movq   %rsi, %r11

the crash occurs I believe when the sort block for SuspendedDelays created
in Delay class>>#startTimerEventLoop is activated via the code
in CoInterpreter>>#executeCogBlock:closure:mayContextSwitch: which jumps to
the blockEntry code for the method (code that dispatches between the two
blocks in startTimerEventLoop).  This dispatch code is also corrupted:

(lldb) print/x 0x10cefe288 + 0x482 (CogMethod for startTimerEventLoop + its
blockEntryOffset)
(long) $36 = 0x000000010cefe70a

(lldb) x/10i 0x10cefe70a
    0x10cefe70a: e7 ef  outl   %eax, $0xef
    0x10cefe70c: 0c 01  orb    $0x1, %al
    0x10cefe70e: 00 00  addb   %al, (%rax)
    0x10cefe710: 00 00  addb   %al, (%rax)
    0x10cefe712: 00 00  addb   %al, (%rax)

whereas it should look like
00002c74: xorq %r9, %r9 : 4D 31 C9
00002c77: jmp .+0x3 (0x2c7c=startTimerEventLoop at 47C) : EB 03
blockEntry:
00002c79: movq %rdx, %r9 : 49 89 D1
00002c7c: movq methodDict at 16(%rdx), %rax : 48 8B 42 10
00002c80: cmpq $0x661, %rax : 48 3D 61 06 00 00
00002c86: jle .-0xE4 (0x2ba8=startTimerEventLoop at 3A8) : 0F 8E 1C FF FF FF
block startpc: CB/659
00002c8c: jmp .-0x66 (0x2c28=startTimerEventLoop at 428) : EB 9A
block startpc: DF/6F9

So my current best guess is that JITted version of the Delay
class>>#startTimerEventLoop method is corrupted by some bug in code zone
compaction.

A second route is indicated by the prim trace log which strangely shows two
stack overflow events, not one, very soon before the  crash, which could
indicate a bug in the interpreter/jit control flow on context switch:

(lldb) call dumpPrimTraceLog()

…

forceDisplayUpdate

utcMicrosecondClock

utcMicrosecondClock

findNextUnwindContextUpTo:

tempAt:

tempAt:put:

tempAt:

terminateTo:

tempAt:put:

findNextUnwindContextUpTo:

terminateTo:

basicNew:

primSocketConnectionStatus:

utcMicrosecondClock

primSocketSendDone:

primSocket:sendData:startIndex:count:

utcMicrosecondClock

primSocketSendDone:

primSocketConnectionStatus:

utcMicrosecondClock

**StackOverflow**

**StackOverflow**

wait

signal

utcMicrosecondClock(newMethod -> 0x10cec3230 : 0x111f565a8)

This log is showing non-jitted primitive invocations and context
switch/code zone reclamation events outside of JITted code (for obvious
performance reasons we don't record such events in JITted code).

So, still more digging to narrow this down, and possibly fixing simulation
of the suite is worth-while.  But what I really need is a more reproducible
case in the real VM.  Sigh :-)
_,,,^..^,,,_
best, Eliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20190920/c4a0581b/attachment.html>


More information about the Vm-dev mailing list