[Vm-dev] [squeak-dev] Difficult to debug VM crash with full blocks and Sista V1

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Sat Sep 21 17:52:39 UTC 2019


So there is compaction of cog method zone of course... Now that I have a
VMMaker image under the eyes it's clear:
cogit compactCogCompiledCode.

However, I did put a break,
(lldb) br se -n compactCogCompiledCode
then ran my favourite SocketTest suite run,
and got a EXC_BAD_INSTRUCTION error before any compaction occurred!

So watching the cogMethodZone sounds feasible...
I'll give it a try.


Le sam. 21 sept. 2019 à 08:59, Nicolas Cellier <
nicolas.cellier.aka.nice at gmail.com> a écrit :

> Hi Eliot,
> thanks for the heads up. Also remember that I was not able to produce such
> crash on Windows nor linux spur64, and I don't remember of anyone reporting
> it...
> This make it not obvious that we can produce it in the Simulator which is
> yet another machine...
> I also never saw such problem in spur32.
>
> On the low level side (lldb/gdb) it is possible to set watchpoint on this
> code zone
> watchpoint set expression --size 10 -- 0x10cefe70a
> w s e -s 2 -- 0x10cefe70a
> Don't know if feasible (is there any compaction in cog method zone? is the
> debugger fast enough with the instrumentation?)
>
> Le ven. 20 sept. 2019 à 21:38, Eliot Miranda <eliot.miranda at gmail.com> a
> écrit :
>
>>
>> Hi All,
>>
>>     two further reports.  One is that attempting to simulate SocketTest
>> suite run with a small code zone to provoke code comp-actions did not
>> produce a crash.  Sad, because I left it running over night.  There's still
>> more work to do with simulation here though because the simulation is far
>> from perfect; only two out of 14 tests pass.  At least they run ;-)
>>
>> 14 run, 2 passes, 0 expected failures, 6 failures, 6 errors, 0 unexpected
>> passes
>> squeak>
>>
>> The other report is that I have much more information on the nature of
>> the crash and reason to suspect something to do with code zone management.
>> The JITted version of the Delay class>>#startTimerEventLoop method is badly
>> corrupted from its last field on:
>>
>> (lldb) call printCogMethodFor((void *)0x10cefe70a)
>>        0x10cefe288 <->        0x10cefe760: method:        0x1117a96a8
>> selector:     0x7fff58e9af6d " is no where obvious"
>>
>> *(CogMethod *)0x10cefe288
>> (CogMethod) $14 = {
>>   objectHeader = 0x008000000a000035
>>   cmNumArgs = 0
>>   cmType = 2 CMMethod
>>   cmRefersToYoung = false
>>   cpicHasMNUCaseOrCMIsFullBlock = false
>>   cmUsageCount = 1
>>   cmUsesPenultimateLit = false
>>   cbUsesInstVars = false
>>   cmHasMovableLiteral = true
>>   cmUnusedFlag = 0x00000001
>>   stackCheckOffset = 116/0x74
>>   blockSize = 1240/0x4d8
>>   blockEntryOffset = 1154/0x482 (1149/0x479 in simulator)
>>   methodObject = 0x1117a96a8
>>   methodHeader = 0x00000000000000b1
>>   selector = 0x00007fff58e9af6d
>> }
>>
>> Everything is fine up to the selector field which is well out of range of
>> the heap, an odd value ending in 16rD, and something followed by garbage
>> instructions:
>>
>> (lldb) print/x (CogMethod *)0x10cefe288 + 1
>> (CogMethod *) $24 = 0x000000010cefe2b0
>> (lldb) x/10i 0x000000010cefe2b0
>>     0x10cefe2b0: d0 0c bd ef fe 7f 00 rorb   0x7ffeef(,%rdi,4)
>>     0x10cefe2b7: 00 48 89 addb   %cl, -0x77(%rax)
>>     0x10cefe2ba: d0 48 83 rorb   -0x7d(%rax)
>>     0x10cefe2bd: e0 07 loopne 0x10cefe2c6
>>     0x10cefe2bf: 75 09 jne    0x10cefe2ca
>>     0x10cefe2c1: 48 8b 02 movq   (%rdx), %rax
>>     0x10cefe2c4: 48 25 ff ff 3f 00 andq   $0x3fffff, %rax           ; imm
>> = 0x3FFFFF
>>     0x10cefe2ca: 48 39 c8 cmpq   %rcx, %rax
>>     0x10cefe2cd: 75 e4 jne    0x10cefe2b3
>>     0x10cefe2cf: 4c 8b de movq   %rsi, %r11
>>
>> the crash occurs I believe when the sort block for SuspendedDelays
>> created in Delay class>>#startTimerEventLoop is activated via the code
>> in CoInterpreter>>#executeCogBlock:closure:mayContextSwitch: which jumps to
>> the blockEntry code for the method (code that dispatches between the two
>> blocks in startTimerEventLoop).  This dispatch code is also corrupted:
>>
>> (lldb) print/x 0x10cefe288 + 0x482 (CogMethod for startTimerEventLoop +
>> its blockEntryOffset)
>> (long) $36 = 0x000000010cefe70a
>>
>> (lldb) x/10i 0x10cefe70a
>>     0x10cefe70a: e7 ef  outl   %eax, $0xef
>>     0x10cefe70c: 0c 01  orb    $0x1, %al
>>     0x10cefe70e: 00 00  addb   %al, (%rax)
>>     0x10cefe710: 00 00  addb   %al, (%rax)
>>     0x10cefe712: 00 00  addb   %al, (%rax)
>>
>> whereas it should look like
>> 00002c74: xorq %r9, %r9 : 4D 31 C9
>> 00002c77: jmp .+0x3 (0x2c7c=startTimerEventLoop at 47C) : EB 03
>> blockEntry:
>> 00002c79: movq %rdx, %r9 : 49 89 D1
>> 00002c7c: movq methodDict at 16(%rdx), %rax : 48 8B 42 10
>> 00002c80: cmpq $0x661, %rax : 48 3D 61 06 00 00
>> 00002c86: jle .-0xE4 (0x2ba8=startTimerEventLoop at 3A8) : 0F 8E 1C FF FF
>> FF
>> block startpc: CB/659
>> 00002c8c: jmp .-0x66 (0x2c28=startTimerEventLoop at 428) : EB 9A
>> block startpc: DF/6F9
>>
>> So my current best guess is that JITted version of the Delay
>> class>>#startTimerEventLoop method is corrupted by some bug in code zone
>> compaction.
>>
>> A second route is indicated by the prim trace log which strangely shows
>> two stack overflow events, not one, very soon before the  crash, which
>> could indicate a bug in the interpreter/jit control flow on context switch:
>>
>> (lldb) call dumpPrimTraceLog()
>>
>>>>
>> forceDisplayUpdate
>>
>> utcMicrosecondClock
>>
>> utcMicrosecondClock
>>
>> findNextUnwindContextUpTo:
>>
>> tempAt:
>>
>> tempAt:put:
>>
>> tempAt:
>>
>> terminateTo:
>>
>> tempAt:put:
>>
>> findNextUnwindContextUpTo:
>>
>> terminateTo:
>>
>> basicNew:
>>
>> primSocketConnectionStatus:
>>
>> utcMicrosecondClock
>>
>> primSocketSendDone:
>>
>> primSocket:sendData:startIndex:count:
>>
>> utcMicrosecondClock
>>
>> primSocketSendDone:
>>
>> primSocketConnectionStatus:
>>
>> utcMicrosecondClock
>>
>> **StackOverflow**
>>
>> **StackOverflow**
>>
>> wait
>>
>> signal
>>
>> utcMicrosecondClock(newMethod -> 0x10cec3230 : 0x111f565a8)
>>
>> This log is showing non-jitted primitive invocations and context
>> switch/code zone reclamation events outside of JITted code (for obvious
>> performance reasons we don't record such events in JITted code).
>>
>> So, still more digging to narrow this down, and possibly fixing
>> simulation of the suite is worth-while.  But what I really need is a more
>> reproducible case in the real VM.  Sigh :-)
>> _,,,^..^,,,_
>> best, Eliot
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20190921/56f30eee/attachment.html>


More information about the Vm-dev mailing list