[Vm-dev] [squeak-dev] Difficult to debug VM crash with full blocks and Sista V1

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Sat Sep 21 06:59:17 UTC 2019


Hi Eliot,
thanks for the heads up. Also remember that I was not able to produce such
crash on Windows nor linux spur64, and I don't remember of anyone reporting
it...
This make it not obvious that we can produce it in the Simulator which is
yet another machine...
I also never saw such problem in spur32.

On the low level side (lldb/gdb) it is possible to set watchpoint on this
code zone
watchpoint set expression --size 10 -- 0x10cefe70a
w s e -s 2 -- 0x10cefe70a
Don't know if feasible (is there any compaction in cog method zone? is the
debugger fast enough with the instrumentation?)

Le ven. 20 sept. 2019 à 21:38, Eliot Miranda <eliot.miranda at gmail.com> a
écrit :

>
> Hi All,
>
>     two further reports.  One is that attempting to simulate SocketTest
> suite run with a small code zone to provoke code comp-actions did not
> produce a crash.  Sad, because I left it running over night.  There's still
> more work to do with simulation here though because the simulation is far
> from perfect; only two out of 14 tests pass.  At least they run ;-)
>
> 14 run, 2 passes, 0 expected failures, 6 failures, 6 errors, 0 unexpected
> passes
> squeak>
>
> The other report is that I have much more information on the nature of the
> crash and reason to suspect something to do with code zone management.  The
> JITted version of the Delay class>>#startTimerEventLoop method is badly
> corrupted from its last field on:
>
> (lldb) call printCogMethodFor((void *)0x10cefe70a)
>        0x10cefe288 <->        0x10cefe760: method:        0x1117a96a8
> selector:     0x7fff58e9af6d " is no where obvious"
>
> *(CogMethod *)0x10cefe288
> (CogMethod) $14 = {
>   objectHeader = 0x008000000a000035
>   cmNumArgs = 0
>   cmType = 2 CMMethod
>   cmRefersToYoung = false
>   cpicHasMNUCaseOrCMIsFullBlock = false
>   cmUsageCount = 1
>   cmUsesPenultimateLit = false
>   cbUsesInstVars = false
>   cmHasMovableLiteral = true
>   cmUnusedFlag = 0x00000001
>   stackCheckOffset = 116/0x74
>   blockSize = 1240/0x4d8
>   blockEntryOffset = 1154/0x482 (1149/0x479 in simulator)
>   methodObject = 0x1117a96a8
>   methodHeader = 0x00000000000000b1
>   selector = 0x00007fff58e9af6d
> }
>
> Everything is fine up to the selector field which is well out of range of
> the heap, an odd value ending in 16rD, and something followed by garbage
> instructions:
>
> (lldb) print/x (CogMethod *)0x10cefe288 + 1
> (CogMethod *) $24 = 0x000000010cefe2b0
> (lldb) x/10i 0x000000010cefe2b0
>     0x10cefe2b0: d0 0c bd ef fe 7f 00 rorb   0x7ffeef(,%rdi,4)
>     0x10cefe2b7: 00 48 89 addb   %cl, -0x77(%rax)
>     0x10cefe2ba: d0 48 83 rorb   -0x7d(%rax)
>     0x10cefe2bd: e0 07 loopne 0x10cefe2c6
>     0x10cefe2bf: 75 09 jne    0x10cefe2ca
>     0x10cefe2c1: 48 8b 02 movq   (%rdx), %rax
>     0x10cefe2c4: 48 25 ff ff 3f 00 andq   $0x3fffff, %rax           ; imm
> = 0x3FFFFF
>     0x10cefe2ca: 48 39 c8 cmpq   %rcx, %rax
>     0x10cefe2cd: 75 e4 jne    0x10cefe2b3
>     0x10cefe2cf: 4c 8b de movq   %rsi, %r11
>
> the crash occurs I believe when the sort block for SuspendedDelays created
> in Delay class>>#startTimerEventLoop is activated via the code
> in CoInterpreter>>#executeCogBlock:closure:mayContextSwitch: which jumps to
> the blockEntry code for the method (code that dispatches between the two
> blocks in startTimerEventLoop).  This dispatch code is also corrupted:
>
> (lldb) print/x 0x10cefe288 + 0x482 (CogMethod for startTimerEventLoop +
> its blockEntryOffset)
> (long) $36 = 0x000000010cefe70a
>
> (lldb) x/10i 0x10cefe70a
>     0x10cefe70a: e7 ef  outl   %eax, $0xef
>     0x10cefe70c: 0c 01  orb    $0x1, %al
>     0x10cefe70e: 00 00  addb   %al, (%rax)
>     0x10cefe710: 00 00  addb   %al, (%rax)
>     0x10cefe712: 00 00  addb   %al, (%rax)
>
> whereas it should look like
> 00002c74: xorq %r9, %r9 : 4D 31 C9
> 00002c77: jmp .+0x3 (0x2c7c=startTimerEventLoop at 47C) : EB 03
> blockEntry:
> 00002c79: movq %rdx, %r9 : 49 89 D1
> 00002c7c: movq methodDict at 16(%rdx), %rax : 48 8B 42 10
> 00002c80: cmpq $0x661, %rax : 48 3D 61 06 00 00
> 00002c86: jle .-0xE4 (0x2ba8=startTimerEventLoop at 3A8) : 0F 8E 1C FF FF FF
> block startpc: CB/659
> 00002c8c: jmp .-0x66 (0x2c28=startTimerEventLoop at 428) : EB 9A
> block startpc: DF/6F9
>
> So my current best guess is that JITted version of the Delay
> class>>#startTimerEventLoop method is corrupted by some bug in code zone
> compaction.
>
> A second route is indicated by the prim trace log which strangely shows
> two stack overflow events, not one, very soon before the  crash, which
> could indicate a bug in the interpreter/jit control flow on context switch:
>
> (lldb) call dumpPrimTraceLog()
>
>>
> forceDisplayUpdate
>
> utcMicrosecondClock
>
> utcMicrosecondClock
>
> findNextUnwindContextUpTo:
>
> tempAt:
>
> tempAt:put:
>
> tempAt:
>
> terminateTo:
>
> tempAt:put:
>
> findNextUnwindContextUpTo:
>
> terminateTo:
>
> basicNew:
>
> primSocketConnectionStatus:
>
> utcMicrosecondClock
>
> primSocketSendDone:
>
> primSocket:sendData:startIndex:count:
>
> utcMicrosecondClock
>
> primSocketSendDone:
>
> primSocketConnectionStatus:
>
> utcMicrosecondClock
>
> **StackOverflow**
>
> **StackOverflow**
>
> wait
>
> signal
>
> utcMicrosecondClock(newMethod -> 0x10cec3230 : 0x111f565a8)
>
> This log is showing non-jitted primitive invocations and context
> switch/code zone reclamation events outside of JITted code (for obvious
> performance reasons we don't record such events in JITted code).
>
> So, still more digging to narrow this down, and possibly fixing simulation
> of the suite is worth-while.  But what I really need is a more reproducible
> case in the real VM.  Sigh :-)
> _,,,^..^,,,_
> best, Eliot
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20190921/908b5c9a/attachment-0001.html>


More information about the Vm-dev mailing list