Hi Levente,
I directed stdout to a file and prior to the freeze, did:
pkill -USR1 -n -x squeak
a few times and noticed "stack overflow" mentioned twice. After the freeze, pkill sent nothing else to stdout. Maybe something in stderr? I'll try again and concatenate stderr to the end of stdout. Meanwhile, the file with the stack overflows is attached.
Thanks for the help.
- Dan
On Sat, 28 Jan 2017,*Levente Uzonyi* wrote:
Hi Dan,
You can send the USR1 signal to the VM process to make it write some debug information to the console. With this information you can easily tell what's happening. I suspect it's stuck in a long GC.
Levente
On Fri, 27 Jan 2017, Dan Norton wrote:
//>>/On Debian8, CogSpur64 5.0-201612221637, Cuis 5.0 3043... />>//>>/after lots of user interaction, views opening and closing, and />>/animation, the image becomes unresponsive. Cmd+. does nothing and the />>/clock in the Cuis taskbar no longer updates. The length of time to />>/produce this varies from 5 to 20 minutes. />>//>>/There seems to be no dump and no log. 'Smalltalk garbageCollectMost' />>/reports 25550736 to 30517872 over 34 samples. />>//>>/Sorry to be so vague. This has occurred with several images - sometimes />>/scrolling through a senders list, or stepping through a debugger, but />>/repeatably with one of my images. I can supply this image, warmed up so />>/that it might not take so long to reproduce the problem if you would />>/like. Or give me a hint as to how to narrow down the problem /
Highly speculative idea... from a shell could you log your system time to see if anything there correlates.
cheers -ben
On Sun, Jan 29, 2017 at 11:40 AM, Dan Norton dnorton@mindspring.com wrote:
Hi Levente,
I directed stdout to a file and prior to the freeze, did:
pkill -USR1 -n -x squeak
a few times and noticed "stack overflow" mentioned twice. After the freeze, pkill sent nothing else to stdout. Maybe something in stderr? I'll try again and concatenate stderr to the end of stdout. Meanwhile, the file with the stack overflows is attached.
Thanks for the help.
- Dan
On Sat, 28 Jan 2017, Levente Uzonyi wrote:
Hi Dan,
You can send the USR1 signal to the VM process to make it write some debug information to the console. With this information you can easily tell what's happening. I suspect it's stuck in a long GC.
Levente
On Fri, 27 Jan 2017, Dan Norton wrote:
On Debian8, CogSpur64 5.0-201612221637, Cuis 5.0 3043...
after lots of user interaction, views opening and closing, and animation, the image becomes unresponsive. Cmd+. does nothing and the clock in the Cuis taskbar no longer updates. The length of time to produce this varies from 5 to 20 minutes.
There seems to be no dump and no log. 'Smalltalk garbageCollectMost' reports 25550736 to 30517872 over 34 samples.
Sorry to be so vague. This has occurred with several images - sometimes scrolling through a senders list, or stepping through a debugger, but repeatably with one of my images. I can supply this image, warmed up so that it might not take so long to reproduce the problem if you would like. Or give me a hint as to how to narrow down the problem
On 01/29/2017 12:11 AM, Ben Coman wrote:
Highly speculative idea... from a shell could you log your system time to see if anything there correlates.
Hi Ben,
This is what I did:
# tail -f /var/log/messages > /home/dan/systime.log
Up to a freeze, this was written:
Jan 29 09:52:51 debian kernel: [ 12.957276] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready Jan 29 09:52:55 debian org.a11y.Bus[1050]: Activating service name='org.a11y.atspi.Registry' Jan 29 09:52:55 debian org.a11y.Bus[1050]: Successfully activated service 'org.a11y.atspi.Registry' Jan 29 09:52:55 debian org.a11y.atspi.Registry[1056]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry Jan 29 09:53:08 debian org.a11y.Bus[1127]: Activating service name='org.a11y.atspi.Registry' Jan 29 09:53:08 debian org.a11y.Bus[1127]: Successfully activated service 'org.a11y.atspi.Registry' Jan 29 09:53:08 debian org.a11y.atspi.Registry[1182]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry Jan 29 09:57:50 debian rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="859" x-info="http://www.rsyslog.com"] rsyslogd was HUPed Jan 29 10:07:44 debian rsyslogd0: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/0 ] Jan 29 10:07:44 debian rsyslogd-2359: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/2359 ]
- Dan
cheers -ben
On Sun, Jan 29, 2017 at 11:40 AM, Dan Norton dnorton@mindspring.com wrote:
Hi Levente,
I directed stdout to a file and prior to the freeze, did:
pkill -USR1 -n -x squeak
a few times and noticed "stack overflow" mentioned twice. After the freeze, pkill sent nothing else to stdout. Maybe something in stderr? I'll try again and concatenate stderr to the end of stdout. Meanwhile, the file with the stack overflows is attached.
Thanks for the help.
- Dan
On Sat, 28 Jan 2017, Levente Uzonyi wrote:
Hi Dan, You can send the USR1 signal to the VM process to make it write some debug information to the console. With this information you can easily tell what's happening. I suspect it's stuck in a long GC. Levente On Fri, 27 Jan 2017, Dan Norton wrote:
On Debian8, CogSpur64 5.0-201612221637, Cuis 5.0 3043...
after lots of user interaction, views opening and closing, and animation, the image becomes unresponsive. Cmd+. does nothing and the clock in the Cuis taskbar no longer updates. The length of time to produce this varies from 5 to 20 minutes.
There seems to be no dump and no log. 'Smalltalk garbageCollectMost' reports 25550736 to 30517872 over 34 samples.
Sorry to be so vague. This has occurred with several images - sometimes scrolling through a senders list, or stepping through a debugger, but repeatably with one of my images. I can supply this image, warmed up so that it might not take so long to reproduce the problem if you would like. Or give me a hint as to how to narrow down the problem
btw, I was looking for time jumping backwards. Seems okay. cheers -ben
On Mon, Jan 30, 2017 at 2:56 AM, Dan Norton dnorton@mindspring.com wrote:
On 01/29/2017 12:11 AM, Ben Coman wrote:
Highly speculative idea... from a shell could you log your system time to see if anything there correlates.
Hi Ben,
This is what I did:
# tail -f /var/log/messages > /home/dan/systime.log
Up to a freeze, this was written:
Jan 29 09:52:51 debian kernel: [ 12.957276] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready Jan 29 09:52:55 debian org.a11y.Bus[1050]: Activating service name='org.a11y.atspi.Registry' Jan 29 09:52:55 debian org.a11y.Bus[1050]: Successfully activated service 'org.a11y.atspi.Registry' Jan 29 09:52:55 debian org.a11y.atspi.Registry[1056]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry Jan 29 09:53:08 debian org.a11y.Bus[1127]: Activating service name='org.a11y.atspi.Registry' Jan 29 09:53:08 debian org.a11y.Bus[1127]: Successfully activated service 'org.a11y.atspi.Registry' Jan 29 09:53:08 debian org.a11y.atspi.Registry[1182]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry Jan 29 09:57:50 debian rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="859" x-info="http://www.rsyslog.com"] rsyslogd was HUPed Jan 29 10:07:44 debian rsyslogd0: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/0 ] Jan 29 10:07:44 debian rsyslogd-2359: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/2359 ]
- Dan
cheers -ben
On Sun, Jan 29, 2017 at 11:40 AM, Dan Norton dnorton@mindspring.com wrote:
Hi Levente,
I directed stdout to a file and prior to the freeze, did:
pkill -USR1 -n -x squeak
a few times and noticed "stack overflow" mentioned twice. After the freeze, pkill sent nothing else to stdout. Maybe something in stderr? I'll try again and concatenate stderr to the end of stdout. Meanwhile, the file with the stack overflows is attached.
Thanks for the help.
- Dan
On Sat, 28 Jan 2017, Levente Uzonyi wrote:
Hi Dan,
You can send the USR1 signal to the VM process to make it write some debug information to the console. With this information you can easily tell what's happening. I suspect it's stuck in a long GC. Levente On Fri, 27 Jan 2017, Dan Norton wrote:
On Debian8, CogSpur64 5.0-201612221637, Cuis 5.0 3043...
after lots of user interaction, views opening and closing, and animation, the image becomes unresponsive. Cmd+. does nothing and the clock in the Cuis taskbar no longer updates. The length of time to produce this varies from 5 to 20 minutes.
There seems to be no dump and no log. 'Smalltalk garbageCollectMost' reports 25550736 to 30517872 over 34 samples.
Sorry to be so vague. This has occurred with several images - sometimes scrolling through a senders list, or stepping through a debugger, but repeatably with one of my images. I can supply this image, warmed up so that it might not take so long to reproduce the problem if you would like. Or give me a hint as to how to narrow down the problem
OK - I was not sure how to "log my system time".
- Dan
On 01/29/2017 06:29 PM, Ben Coman wrote:
btw, I was looking for time jumping backwards. Seems okay. cheers -ben
On Mon, Jan 30, 2017 at 2:56 AM, Dan Norton <dnorton@mindspring.com mailto:dnorton@mindspring.com> wrote:
On 01/29/2017 12:11 AM, Ben Coman wrote: Highly speculative idea... from a shell could you log your system time to see if anything there correlates. Hi Ben, This is what I did: # tail -f /var/log/messages > /home/dan/systime.log Up to a freeze, this was written: Jan 29 09:52:51 debian kernel: [ 12.957276] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready Jan 29 09:52:55 debian org.a11y.Bus[1050]: Activating service name='org.a11y.atspi.Registry' Jan 29 09:52:55 debian org.a11y.Bus[1050]: Successfully activated service 'org.a11y.atspi.Registry' Jan 29 09:52:55 debian org.a11y.atspi.Registry[1056]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry Jan 29 09:53:08 debian org.a11y.Bus[1127]: Activating service name='org.a11y.atspi.Registry' Jan 29 09:53:08 debian org.a11y.Bus[1127]: Successfully activated service 'org.a11y.atspi.Registry' Jan 29 09:53:08 debian org.a11y.atspi.Registry[1182]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry Jan 29 09:57:50 debian rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="859" x-info="http://www.rsyslog.com"] rsyslogd was HUPed Jan 29 10:07:44 debian rsyslogd0: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/0 ] Jan 29 10:07:44 debian rsyslogd-2359: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/2359 ] - Dan cheers -ben On Sun, Jan 29, 2017 at 11:40 AM, Dan Norton <dnorton@mindspring.com <mailto:dnorton@mindspring.com>> wrote: Hi Levente, I directed stdout to a file and prior to the freeze, did: pkill -USR1 -n -x squeak a few times and noticed "stack overflow" mentioned twice. After the freeze, pkill sent nothing else to stdout. Maybe something in stderr? I'll try again and concatenate stderr to the end of stdout. Meanwhile, the file with the stack overflows is attached. Thanks for the help. - Dan On Sat, 28 Jan 2017, Levente Uzonyi wrote: Hi Dan, You can send the USR1 signal to the VM process to make it write some debug information to the console. With this information you can easily tell what's happening. I suspect it's stuck in a long GC. Levente On Fri, 27 Jan 2017, Dan Norton wrote: On Debian8, CogSpur64 5.0-201612221637, Cuis 5.0 3043... after lots of user interaction, views opening and closing, and animation, the image becomes unresponsive. Cmd+. does nothing and the clock in the Cuis taskbar no longer updates. The length of time to produce this varies from 5 to 20 minutes. There seems to be no dump and no log. 'Smalltalk garbageCollectMost' reports 25550736 to 30517872 over 34 samples. Sorry to be so vague. This has occurred with several images - sometimes scrolling through a senders list, or stepping through a debugger, but repeatably with one of my images. I can supply this image, warmed up so that it might not take so long to reproduce the problem if you would like. Or give me a hint as to how to narrow down the problem
On Mon, Jan 30, 2017 at 9:22 AM, Dan Norton dnorton@mindspring.com wrote:
OK - I was not sure how to "log my system time".
Something like this would do...
#!/bin/bash while true do before=$now now=`date +%H:%M:%S` echo "$now" b=`date --date="$before" +%s` n=`date --date="$now" +%s` if [ $b -gt $n ]; then echo "!!!" fi sleep 1 done
cheers -ben
Hi Dan,
The expected output is in the file you attached. There seem to be no active processes to run, so only the idle process is being executed:
All Smalltalk process stacks (active first): Process 0x1c9a9c8 priority 10 0x7ffc012e81c0 M ProcessorScheduler class>idleProcess 0x15d2a30: a(n) ProcessorScheduler class 0x7ffc012e8200 I [] in ProcessorScheduler class>startUp 0x15d2a30: a(n) ProcessorScheduler class 0x7ffc012e8240 I [] in BlockClosure>newProcess 0x1d1aa40: a(n) BlockClosure
Most recent primitives relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: ...
Among the suspended processes is the UI process, which is waiting on a Delay. So, there may be issues on the image side with Delays.
Process 0x1c9a790 priority 40 0x7ffc012e60d8 M [] in Delay>wait 0x1c9ab48: a(n) Delay 0x7ffc012e6118 M BlockClosure>ifCurtailed: 0x11b1228: a(n) BlockClosure 0x7ffc012e6150 M Delay>wait 0x1c9ab48: a(n) Delay 0x7ffc012e6190 M WorldState>doOneCycleFor: 0x15cd6e0: a(n) WorldState 0x7ffc012e61c8 M PasteUpMorph>doOneCycle 0x165f890: a(n) PasteUpMorph 0x7ffc012e6200 M [] in ProjectX class>spawnNewMorphicProcessFor: 0x15d7900: a(n) ProjectX class 0x7ffc012e6240 I [] in BlockClosure>newProcess 0x1d1a940: a(n) BlockClosure
I don't see any sign of stack overflow you mentioned.
Levente
On Sun, Jan 29, 2017 at 7:02 PM, Levente Uzonyi leves@caesar.elte.hu wrote:
Hi Dan,
The expected output is in the file you attached. There seem to be no active processes to run, so only the idle process is being executed:
All Smalltalk process stacks (active first): Process 0x1c9a9c8 priority 10 0x7ffc012e81c0 M ProcessorScheduler class>idleProcess 0x15d2a30: a(n) ProcessorScheduler class 0x7ffc012e8200 I [] in ProcessorScheduler class>startUp 0x15d2a30: a(n) ProcessorScheduler class 0x7ffc012e8240 I [] in BlockClosure>newProcess 0x1d1aa40: a(n) BlockClosure
Most recent primitives relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: ...
Among the suspended processes is the UI process, which is waiting on a Delay. So, there may be issues on the image side with Delays.
Process 0x1c9a790 priority 40 0x7ffc012e60d8 M [] in Delay>wait 0x1c9ab48: a(n) Delay 0x7ffc012e6118 M BlockClosure>ifCurtailed: 0x11b1228: a(n) BlockClosure 0x7ffc012e6150 M Delay>wait 0x1c9ab48: a(n) Delay 0x7ffc012e6190 M WorldState>doOneCycleFor: 0x15cd6e0: a(n) WorldState
So as part of exploring the problem, an awful hack would be deleting that #wait from #dowOneCycleFor:.
cheers -ben
0x7ffc012e61c8 M PasteUpMorph>doOneCycle 0x165f890: a(n) PasteUpMorph 0x7ffc012e6200 M [] in ProjectX class>spawnNewMorphicProcessFor:
0x15d7900: a(n) ProjectX class 0x7ffc012e6240 I [] in BlockClosure>newProcess 0x1d1a940: a(n) BlockClosure
I don't see any sign of stack overflow you mentioned.
Levente
On 01/29/2017 07:48 AM, Ben Coman wrote:
On Sun, Jan 29, 2017 at 7:02 PM, Levente Uzonyi <leves@caesar.elte.hu mailto:leves@caesar.elte.hu> wrote:
Hi Dan, The expected output is in the file you attached. There seem to be no active processes to run, so only the idle process is being executed: All Smalltalk process stacks (active first): Process 0x1c9a9c8 priority 10 0x7ffc012e81c0 M ProcessorScheduler class>idleProcess 0x15d2a30: a(n) ProcessorScheduler class 0x7ffc012e8200 I [] in ProcessorScheduler class>startUp 0x15d2a30: a(n) ProcessorScheduler class 0x7ffc012e8240 I [] in BlockClosure>newProcess 0x1d1aa40: a(n) BlockClosure Most recent primitives relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: ... Among the suspended processes is the UI process, which is waiting on a Delay. So, there may be issues on the image side with Delays. Process 0x1c9a790 priority 40 0x7ffc012e60d8 M [] in Delay>wait 0x1c9ab48: a(n) Delay 0x7ffc012e6118 M BlockClosure>ifCurtailed: 0x11b1228: a(n) BlockClosure 0x7ffc012e6150 M Delay>wait 0x1c9ab48: a(n) Delay 0x7ffc012e6190 M WorldState>doOneCycleFor: 0x15cd6e0: a(n) WorldState
So as part of exploring the problem, an awful hack would be deleting that #wait from #dowOneCycleFor:.
Hi Ben,
I started the image by doing this, in effect:
/home/dan/cuis/cogspur64/squeak /home/dan/cuis/Cuis-Smalltalk-Dev/a64.image > freeze.log 2>&1
After loading packages containing my code I deleted:
(Delay for seconds: ... ) wait.
and ran for a while, did the 'pkill -USR1 -n -x squeak';
ran some more until it froze, did another 'pkill'.
The stdout looked like the other runs. There was no sign of stderr output.
- Dan
cheers -ben
0x7ffc012e61c8 M PasteUpMorph>doOneCycle 0x165f890: a(n) PasteUpMorph 0x7ffc012e6200 M [] in ProjectX class>spawnNewMorphicProcessFor: 0x15d7900: a(n) ProjectX class 0x7ffc012e6240 I [] in BlockClosure>newProcess 0x1d1a940: a(n) BlockClosure I don't see any sign of stack overflow you mentioned. Levente
I'd rather check if the delay would be longer than a second. If it would, then log the value and trim it to a second.
Levente
On Sun, Jan 29, 2017 at 6:35 PM, Levente Uzonyi leves@caesar.elte.hu wrote:
I'd rather check if the delay would be longer than a second. If it would, then log the value and trim it to a second.
Does Cmd-. normally work on this VM+image if the UI process is waiting on a Delay?
- Bert -
Hi Levente and Bert,
The arg to #forSeconds: is 0.4 and Cmd-. does not interrupt this image during a Delay.
- Dan
On 01/29/2017 12:58 PM, Bert Freudenberg wrote:
On Sun, Jan 29, 2017 at 6:35 PM, Levente Uzonyi <leves@caesar.elte.hu mailto:leves@caesar.elte.hu> wrote:
I'd rather check if the delay would be longer than a second. If it would, then log the value and trim it to a second.
Does Cmd-. normally work on this VM+image if the UI process is waiting on a Delay?
- Bert -
On Sun, Jan 29, 2017 at 8:27 PM, Dan Norton dnorton@mindspring.com wrote:
Cmd-. does not interrupt this image during a Delay.
Hmm, that might be worth investigating. It works fine with Cuis5.0-3040-spur-64 on a Mac VM.
- Bert -
On 01/29/2017 03:49 PM, Bert Freudenberg wrote:
On Sun, Jan 29, 2017 at 8:27 PM, Dan Norton <dnorton@mindspring.com mailto:dnorton@mindspring.com> wrote:
Cmd-. does not interrupt this image during a Delay.
Hmm, that might be worth investigating. It works fine with Cuis5.0-3040-spur-64 on a Mac VM.
Correction: fat fingers here was holding down the wrong key. Cmd-. (as in alt-.) does interrupt this image.
- Dan
On Sun, Jan 29, 2017 at 10:31 PM, Dan Norton dnorton@mindspring.com wrote:
On 01/29/2017 03:49 PM, Bert Freudenberg wrote:
On Sun, Jan 29, 2017 at 8:27 PM, Dan Norton dnorton@mindspring.com wrote:
Cmd-. does not interrupt this image during a Delay.
Hmm, that might be worth investigating. It works fine with Cuis5.0-3040-spur-64 on a Mac VM.
Correction: fat fingers here was holding down the wrong key. Cmd-. (as in alt-.) does interrupt this image.
- Dan
Good. At least a modest amount of sanity prevails ;)
But you're saying that in the abnormal case, alt-. does not work, yes?
- Bert -
On 01/29/2017 06:13 PM, Bert Freudenberg wrote:
On Sun, Jan 29, 2017 at 10:31 PM, Dan Norton <dnorton@mindspring.com mailto:dnorton@mindspring.com> wrote:
On 01/29/2017 03:49 PM, Bert Freudenberg wrote:
On Sun, Jan 29, 2017 at 8:27 PM, Dan Norton <dnorton@mindspring.com <mailto:dnorton@mindspring.com>> wrote: Cmd-. does not interrupt this image during a Delay. Hmm, that might be worth investigating. It works fine with Cuis5.0-3040-spur-64 on a Mac VM.
Correction: fat fingers here was holding down the wrong key. Cmd-. (as in alt-.) does interrupt this image. - Dan
Good. At least a modest amount of sanity prevails ;)
But you're saying that in the abnormal case, alt-. does not work, yes?
When the freeze happens, alt-. has no effect.
- Dan
The plot thickens, without clarity...
On two occasions when the freeze occurred, things locked up to the extent that the system pointer in Debian could not be moved. Some keyboard events were accepted:
ctl-esc followed by arrow keys
alt-space followed by arrow keys
I noticed that 'pkill -USR1 -n -x squeak' will append to crash.dmp in the Cuis-Smalltalk-Dev directory. No redirection of stdout needed. In the last run there were 4 stack overflows. Are these of any relevance?
copyBits @ @ @ @ @ @ @ @ @ @ fractionPart truncated fractionPart truncated @ **StackOverflow**
copyBits @ @ @ @ @ @ @ @ @ @ fractionPart truncated fractionPart truncated @ **StackOverflow**
scanCharactersFrom:to:in:rightX:stopConditions:kern: @ fractionPart truncated fractionPart truncated @ **StackOverflow**
scanCharactersFrom:to:in:rightX:stopConditions:kern: @ fractionPart truncated fractionPart truncated @ **StackOverflow**
- Dan
On 01/29/2017 08:33 PM, Dan Norton wrote:
The plot thickens, without clarity...
On two occasions when the freeze occurred, things locked up to the extent that the system pointer in Debian could not be moved. Some keyboard events were accepted:
ctl-esc followed by arrow keys
alt-space followed by arrow keys
The system pointer problem was due to a defective mouse. Unfortunately replacing it did not solve the freeze. After a freeze, I find that pkill will cause output to crash.dmp and the latest is attached.
- Dan
On Mon, Jan 30, 2017 at 5:48 PM, Dan Norton dnorton@mindspring.com wrote:
After a freeze, I find that pkill will cause output to crash.dmp and the latest is attached.
That looks benign.
How about attaching strace when frozen?
sudo strace -p <pid>
This should log all system calls the VM is still doing. Then press alt-. and see if the VM processes that.
And if all else fails, you may have to whip out gdb to see what exactly is going on.
- Bert -
On 01/30/2017 11:58 AM, Bert Freudenberg wrote:
On Mon, Jan 30, 2017 at 5:48 PM, Dan Norton <dnorton@mindspring.com mailto:dnorton@mindspring.com> wrote:
After a freeze, I find that pkill will cause output to crash.dmp and the latest is attached.
That looks benign.
How about attaching strace when frozen?
sudo strace -p <pid>
This should log all system calls the VM is still doing. Then press alt-. and see if the VM processes that.
# strace -p 2475
produces lots of... nanosleep({0, 0}, 0x7fff9b9b6060) = 0 select(4, [3], [], [3], {0, 0}) = 0 (Timeout) recvmsg(3, 0x7fff9b9b5df0, 0) = -1 EAGAIN (Resource temporarily unavailable) select(4, [3], [], [3], {0, 0}) = 0 (Timeout) nanosleep({0, 0}, 0x7fff9b9b6060) = 0 select(4, [3], [], [3], {0, 0}) = 0 (Timeout) recvmsg(3, 0x7fff9b9b5df0, 0) = -1 EAGAIN (Resource temporarily unavailable) select(4, [3], [], [3], {0, 0}) = 0 (Timeout) ^Cnanosleep({0, 0}, Process 2475 detached <detached ...>
... before my packages are loaded as well as after the freeze. Identical AFAICT. alt-. is not processed after the freeze.
- Dan
When a freeze occurs, it's usually(tm) during an animation, probably because the movement of an ellipse is done in small increments of 92 per hop. This is within a block like so:
self when: #morphicStep evaluate: [ :delta | <movement> ].
self startSteppingStepTime: step.
How can the breakpoint be set in this block for gdb? We need source for some C module.
- Dan
On 01/30/2017 02:58 PM, Dan Norton wrote:
On 01/30/2017 11:58 AM, Bert Freudenberg wrote:
On Mon, Jan 30, 2017 at 5:48 PM, Dan Norton <dnorton@mindspring.com mailto:dnorton@mindspring.com> wrote:
After a freeze, I find that pkill will cause output to crash.dmp and the latest is attached.
That looks benign.
How about attaching strace when frozen?
sudo strace -p <pid>
This should log all system calls the VM is still doing. Then press alt-. and see if the VM processes that.
# strace -p 2475
produces lots of... nanosleep({0, 0}, 0x7fff9b9b6060) = 0 select(4, [3], [], [3], {0, 0}) = 0 (Timeout) recvmsg(3, 0x7fff9b9b5df0, 0) = -1 EAGAIN (Resource temporarily unavailable) select(4, [3], [], [3], {0, 0}) = 0 (Timeout) nanosleep({0, 0}, 0x7fff9b9b6060) = 0 select(4, [3], [], [3], {0, 0}) = 0 (Timeout) recvmsg(3, 0x7fff9b9b5df0, 0) = -1 EAGAIN (Resource temporarily unavailable) select(4, [3], [], [3], {0, 0}) = 0 (Timeout) ^Cnanosleep({0, 0}, Process 2475 detached <detached ...>
... before my packages are loaded as well as after the freeze. Identical AFAICT. alt-. is not processed after the freeze.
- Dan
On Mon, Jan 30, 2017 at 9:43 PM, Dan Norton dnorton@mindspring.com wrote:
When a freeze occurs, it's usually(tm) during an animation, probably because the movement of an ellipse is done in small increments of 92 per hop. This is within a block like so:
self when: #morphicStep evaluate: [ :delta | <movement> ].
self startSteppingStepTime: step.
How can the breakpoint be set in this block for gdb? We need source for some C module.
With gdb you can't debug at the Smalltalk level (that's what alt-. is for). You can only attach to the process, break into it and figure out what's going on. Which may be very hard. Also, you may have to compile your own VM with debug info (or do we have pre-built ones somewhrere?)
- Bert -
Attached is a few lines of code which can reproduce the problem. File it in then do:
Tokenish start
in a Cuis workspace. You may have to let it run for a while, the time to failure is not predictable. Starting multiples of them may bring the freeze sooner.
- Dan
On 01/30/2017 04:23 PM, Bert Freudenberg wrote:
On Mon, Jan 30, 2017 at 9:43 PM, Dan Norton <dnorton@mindspring.com mailto:dnorton@mindspring.com> wrote:
When a freeze occurs, it's usually(tm) during an animation, probably because the movement of an ellipse is done in small increments of 92 per hop. This is within a block like so: self when: #morphicStep evaluate: [ :delta | <movement> ]. self startSteppingStepTime: step. How can the breakpoint be set in this block for gdb? We need source for some C module.
With gdb you can't debug at the Smalltalk level (that's what alt-. is for). You can only attach to the process, break into it and figure out what's going on. Which may be very hard. Also, you may have to compile your own VM with debug info (or do we have pre-built ones somewhrere?)
- Bert -
Hi Folks,
On Debian8, using cog_linux64x64_squeak.cog.spur_201702021058 and Cuis5.0-3040-spur-64 updated to 3043, everything will stop, including the clock in the Cuis task bar. Neither mouse nor keyboard events are processed. Trying to interrupt with alt-. is unsuccessful. CogSpur was re-cloned and the test case was re-run with the same result.
On Windows 10, using cog_win32x86_squeak.cog.spur_201702021058, and Cuis 5.0 3043, the same test case runs for hours with no problem.
Before the freeze, %CPU as reported by 'top' is about 4.
When the freeze occurs, %CPU as reported by 'top' rises to 11.3 which reminds me of a loop.
Why would the problem show up on Debian8 but not Win10?
- Dan
On 01/30/2017 08:35 PM, Dan Norton wrote:
Attached is a few lines of code which can reproduce the problem. File it in then do:
Tokenish start
in a Cuis workspace. You may have to let it run for a while, the time to failure is not predictable. Starting multiples of them may bring the freeze sooner.
- Dan
On 01/30/2017 04:23 PM, Bert Freudenberg wrote:
On Mon, Jan 30, 2017 at 9:43 PM, Dan Norton <dnorton@mindspring.com mailto:dnorton@mindspring.com> wrote:
When a freeze occurs, it's usually(tm) during an animation, probably because the movement of an ellipse is done in small increments of 92 per hop. This is within a block like so: self when: #morphicStep evaluate: [ :delta | <movement> ]. self startSteppingStepTime: step. How can the breakpoint be set in this block for gdb? We need source for some C module.
With gdb you can't debug at the Smalltalk level (that's what alt-. is for). You can only attach to the process, break into it and figure out what's going on. Which may be very hard. Also, you may have to compile your own VM with debug info (or do we have pre-built ones somewhrere?)
- Bert -
On Sat, Feb 4, 2017 at 7:09 PM, Dan Norton dnorton@mindspring.com wrote:
Hi Folks,
On Debian8, using cog_linux64x64_squeak.cog.spur_201702021058 and Cuis5.0-3040-spur-64 updated to 3043, everything will stop, including the clock in the Cuis task bar. Neither mouse nor keyboard events are processed. Trying to interrupt with alt-. is unsuccessful. CogSpur was re-cloned and the test case was re-run with the same result.
On Windows 10, using cog_win32x86_squeak.cog.spur_201702021058, and Cuis 5.0 3043, the same test case runs for hours with no problem.
Freezes on cog_macos64x64_squeak.cog.spur_201702021058 too. Maybe try the 64 bit windows VM?
- Bert -
On 06/02/2017 01:17 p.m., Bert Freudenberg wrote:
On Sat, Feb 4, 2017 at 7:09 PM, Dan Norton <dnorton@mindspring.com mailto:dnorton@mindspring.com> wrote:
Hi Folks, On Debian8, using cog_linux64x64_squeak.cog.spur_201702021058 and Cuis5.0-3040-spur-64 updated to 3043, everything will stop, including the clock in the Cuis task bar. Neither mouse nor keyboard events are processed. Trying to interrupt with alt-. is unsuccessful. CogSpur was re-cloned and the test case was re-run with the same result. On Windows 10, using cog_win32x86_squeak.cog.spur_201702021058, and Cuis 5.0 3043, the same test case runs for hours with no problem.
Freezes on cog_macos64x64_squeak.cog.spur_201702021058 too. Maybe try the 64 bit windows VM?
- Bert -
Yes, we'd try that. I'm out of town, and will not be able to do it this week.
If anyone wants to try, when I tried Spur64 in Windows the image would crash. The workaround is to clear the World background (an ivar in WorldState). There seems to be some problem with jpeg decompression on that VM.
Thanks,
2017-02-07 1:42 GMT+01:00 Juan Vuletich JuanVuletich@zoho.com:
On 06/02/2017 01:17 p.m., Bert Freudenberg wrote:
On Sat, Feb 4, 2017 at 7:09 PM, Dan Norton dnorton@mindspring.com wrote:
Hi Folks,
On Debian8, using cog_linux64x64_squeak.cog.spur_201702021058 and Cuis5.0-3040-spur-64 updated to 3043, everything will stop, including the clock in the Cuis task bar. Neither mouse nor keyboard events are processed. Trying to interrupt with alt-. is unsuccessful. CogSpur was re-cloned and the test case was re-run with the same result.
On Windows 10, using cog_win32x86_squeak.cog.spur_201702021058, and Cuis 5.0 3043, the same test case runs for hours with no problem.
Freezes on cog_macos64x64_squeak.cog.spur_201702021058 too. Maybe try the 64 bit windows VM?
- Bert -
Yes, we'd try that. I'm out of town, and will not be able to do it this week.
If anyone wants to try, when I tried Spur64 in Windows the image would crash. The workaround is to clear the World background (an ivar in WorldState). There seems to be some problem with jpeg decompression on that VM.
Thanks,
Yes, jpeg is broken in Win64, but works fine in MacOSX 64 and Linux 64 with same sources. I bet that it's related to LLP64... I tired to update to latest available code for this jpeg library, but it does not fix anything (and would introduce controversial deviations to the standard). We could comment out the generation of jpeg plugin in the interim
-- Juan Vuletichwww.cuis-smalltalk.orghttps://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev @JuanVuletich
2017-02-07 11:09 GMT+01:00 Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com>:
2017-02-07 1:42 GMT+01:00 Juan Vuletich JuanVuletich@zoho.com:
On 06/02/2017 01:17 p.m., Bert Freudenberg wrote:
On Sat, Feb 4, 2017 at 7:09 PM, Dan Norton dnorton@mindspring.com wrote:
Hi Folks,
On Debian8, using cog_linux64x64_squeak.cog.spur_201702021058 and Cuis5.0-3040-spur-64 updated to 3043, everything will stop, including the clock in the Cuis task bar. Neither mouse nor keyboard events are processed. Trying to interrupt with alt-. is unsuccessful. CogSpur was re-cloned and the test case was re-run with the same result.
On Windows 10, using cog_win32x86_squeak.cog.spur_201702021058, and Cuis 5.0 3043, the same test case runs for hours with no problem.
Freezes on cog_macos64x64_squeak.cog.spur_201702021058 too. Maybe try the 64 bit windows VM?
- Bert -
Yes, we'd try that. I'm out of town, and will not be able to do it this week.
If anyone wants to try, when I tried Spur64 in Windows the image would crash. The workaround is to clear the World background (an ivar in WorldState). There seems to be some problem with jpeg decompression on that VM.
Thanks,
Yes, jpeg is broken in Win64, but works fine in MacOSX 64 and Linux 64 with same sources. I bet that it's related to LLP64... I tired to update to latest available code for this jpeg library, but it does not fix anything (and would introduce controversial deviations to the standard). We could comment out the generation of jpeg plugin in the interim
Follow-up: jpeg should be fixed for Win64 in latest builds on the server. It was not related to LLP64 but rather a problem of jump_buf alignment. https://github.com/OpenSmalltalk/opensmalltalk-vm/issues/119
--
Juan Vuletichwww.cuis-smalltalk.orghttps://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev @JuanVuletich
Hi Folks,
It seems we found a problem in Spur-64 on Linux. Eliot & VM developers and experts, please, we need your help.
Cuis users have been experiencing this problem since we started using Spur-64. At random times, the image just stops working. No Smalltalk process is run, and the VM needs to be killed from the OS.
In my experience, most of the time it takes several hours for the problem to happen. Dan found a simple way to make it happen rather quickly. No FFI is involved, just pretty basic Smalltalk. I can confirm that the problem does not happen using Spur-32 on the same platform.
Thanks,
On 30/01/2017 10:35 p.m., Dan Norton wrote:
Attached is a few lines of code which can reproduce the problem. File it in then do:
Tokenish start
in a Cuis workspace. You may have to let it run for a while, the time to failure is not predictable. Starting multiples of them may bring the freeze sooner.
- Dan
On 01/30/2017 04:23 PM, Bert Freudenberg wrote:
On Mon, Jan 30, 2017 at 9:43 PM, Dan Norton <dnorton@mindspring.com mailto:dnorton@mindspring.com> wrote:
When a freeze occurs, it's usually(tm) during an animation, probably because the movement of an ellipse is done in small increments of 92 per hop. This is within a block like so: self when: #morphicStep evaluate: [ :delta | <movement> ]. self startSteppingStepTime: step. How can the breakpoint be set in this block for gdb? We need source for some C module.
With gdb you can't debug at the Smalltalk level (that's what alt-. is for). You can only attach to the process, break into it and figure out what's going on. Which may be very hard. Also, you may have to compile your own VM with debug info (or do we have pre-built ones somewhrere?)
- Bert -
On Sun, Feb 05, 2017 at 04:09:39PM -0300, Juan Vuletich wrote:
Hi Folks,
It seems we found a problem in Spur-64 on Linux. Eliot & VM developers and experts, please, we need your help.
Cuis users have been experiencing this problem since we started using Spur-64. At random times, the image just stops working. No Smalltalk process is run, and the VM needs to be killed from the OS.
In my experience, most of the time it takes several hours for the problem to happen. Dan found a simple way to make it happen rather quickly. No FFI is involved, just pretty basic Smalltalk. I can confirm that the problem does not happen using Spur-32 on the same platform.
I can confirm this on my Ubuntu laptop running Cuis5.0-3040-spur-64.image.
I filed in the Tokenish.st demo, and started several Tokenish morphs that bounced along cheerfully until the image hung up.
The VM is still running and using CPU. Top shows about 5% CPU, which several times greater than for an idle image, but certainly not abnormally high.
The actual VM process shows a "waiting" state in /proc/<pid>/status, which is exactly like what I would expect for a normal running VM. Here is a sample of what I see for the hung image/VM:
lewis@lewis-Gazelle-Pro:/proc/32609$ cat /proc/32609/status Name: squeak State: S (sleeping) Tgid: 32609 Ngid: 0 Pid: 32609 PPid: 2779 TracerPid: 0 Uid: 1000 1000 1000 1000 Gid: 1000 1000 1000 1000 FDSize: 256 Groups: 4 24 27 30 46 115 131 1000 VmPeak: 147456 kB VmSize: 131156 kB VmLck: 0 kB VmPin: 0 kB VmHWM: 40468 kB VmRSS: 40464 kB VmData: 57496 kB VmStk: 1740 kB VmExe: 1304 kB VmLib: 5916 kB VmPTE: 240 kB VmSwap: 0 kB Threads: 2 SigQ: 0/31381 SigPnd: 0000000000000000 ShdPnd: 0000000000000000 SigBlk: 0000000000000000 SigIgn: 0000000000001000 SigCgt: 0000000190000648 CapInh: 0000000000000000 CapPrm: 0000000000000000 CapEff: 0000000000000000 CapBnd: 0000003fffffffff Seccomp: 0 Cpus_allowed: ff Cpus_allowed_list: 0-7 Mems_allowed: 00000000,00000001 Mems_allowed_list: 0 voluntary_ctxt_switches: 14946247 nonvoluntary_ctxt_switches: 2443 lewis@lewis-Gazelle-Pro:/proc/32609$
The VM that I am using is:
Virtual Machine --------------- /usr/local/lib/squeak/5.0-201608171728-64/squeak Croquet Closure Cog[Spur] VM [CoInterpreterPrimitives VMMaker.oscog-cb.1919] Unix built on Sep 25 2016 15:59:10 Compiler: 4.6.3 platform sources revision VM: 201608171728 https://github.com/OpenSmalltalk/opensmalltalk-vm.git $ Date: Wed Aug 17 10:28:01 2016 -0700 $ Plugins: 201608171728 https://github.com/OpenSmalltalk/opensmalltalk-vm.git $ CoInterpreter VMMaker.oscog-cb.1919 uuid: 00a8dd2a-bc8d-4552-b400-be781c8aabec Sep 25 2016 StackToRegisterMappingCogit VMMaker.oscog-cb.1919 uuid: 00a8dd2a-bc8d-4552-b400-be781c8aabec Sep 25 2016
I cannot see what is wrong, but maybe this will prompt some ideas.
Dave
When the hang-up occurs, a #morphicStep is involved. I've never seen it otherwise.
Attached is what may be the minimal test case: an empty step loop. File in the attachment and in a workspace do:
Minish start
- Dan
On 02/05/2017 03:42 PM, David T. Lewis wrote:
On Sun, Feb 05, 2017 at 04:09:39PM -0300, Juan Vuletich wrote:
Hi Folks,
It seems we found a problem in Spur-64 on Linux. Eliot & VM developers and experts, please, we need your help.
Cuis users have been experiencing this problem since we started using Spur-64. At random times, the image just stops working. No Smalltalk process is run, and the VM needs to be killed from the OS.
In my experience, most of the time it takes several hours for the problem to happen. Dan found a simple way to make it happen rather quickly. No FFI is involved, just pretty basic Smalltalk. I can confirm that the problem does not happen using Spur-32 on the same platform.
I can confirm this on my Ubuntu laptop running Cuis5.0-3040-spur-64.image.
I filed in the Tokenish.st demo, and started several Tokenish morphs that bounced along cheerfully until the image hung up.
The VM is still running and using CPU. Top shows about 5% CPU, which several times greater than for an idle image, but certainly not abnormally high.
The actual VM process shows a "waiting" state in /proc/<pid>/status, which is exactly like what I would expect for a normal running VM. Here is a sample of what I see for the hung image/VM:
lewis@lewis-Gazelle-Pro:/proc/32609$ cat /proc/32609/status Name: squeak State: S (sleeping) Tgid: 32609 Ngid: 0 Pid: 32609 PPid: 2779 TracerPid: 0 Uid: 1000 1000 1000 1000 Gid: 1000 1000 1000 1000 FDSize: 256 Groups: 4 24 27 30 46 115 131 1000 VmPeak: 147456 kB VmSize: 131156 kB VmLck: 0 kB VmPin: 0 kB VmHWM: 40468 kB VmRSS: 40464 kB VmData: 57496 kB VmStk: 1740 kB VmExe: 1304 kB VmLib: 5916 kB VmPTE: 240 kB VmSwap: 0 kB Threads: 2 SigQ: 0/31381 SigPnd: 0000000000000000 ShdPnd: 0000000000000000 SigBlk: 0000000000000000 SigIgn: 0000000000001000 SigCgt: 0000000190000648 CapInh: 0000000000000000 CapPrm: 0000000000000000 CapEff: 0000000000000000 CapBnd: 0000003fffffffff Seccomp: 0 Cpus_allowed: ff Cpus_allowed_list: 0-7 Mems_allowed: 00000000,00000001 Mems_allowed_list: 0 voluntary_ctxt_switches: 14946247 nonvoluntary_ctxt_switches: 2443 lewis@lewis-Gazelle-Pro:/proc/32609$
The VM that I am using is:
Virtual Machine
/usr/local/lib/squeak/5.0-201608171728-64/squeak Croquet Closure Cog[Spur] VM [CoInterpreterPrimitives VMMaker.oscog-cb.1919] Unix built on Sep 25 2016 15:59:10 Compiler: 4.6.3 platform sources revision VM: 201608171728 https://github.com/OpenSmalltalk/opensmalltalk-vm.git $ Date: Wed Aug 17 10:28:01 2016 -0700 $ Plugins: 201608171728 https://github.com/OpenSmalltalk/opensmalltalk-vm.git $ CoInterpreter VMMaker.oscog-cb.1919 uuid: 00a8dd2a-bc8d-4552-b400-be781c8aabec Sep 25 2016 StackToRegisterMappingCogit VMMaker.oscog-cb.1919 uuid: 00a8dd2a-bc8d-4552-b400-be781c8aabec Sep 25 2016
I cannot see what is wrong, but maybe this will prompt some ideas.
Dave
On Sun, Feb 05, 2017 at 04:51:06PM -0500, Dan Norton wrote:
When the hang-up occurs, a #morphicStep is involved. I've never seen it otherwise.
Attached is what may be the minimal test case: an empty step loop. File in the attachment and in a workspace do:
Minish start
- Dan
So far I have not been able to reproduce the hangup with the Minish example. I've have about 20 of them open for about a half hour or so. Maybe it's just really intermittent.
Dave
On Sun, Feb 05, 2017 at 08:34:02PM -0500, David T. Lewis wrote:
On Sun, Feb 05, 2017 at 04:51:06PM -0500, Dan Norton wrote:
When the hang-up occurs, a #morphicStep is involved. I've never seen it otherwise.
Attached is what may be the minimal test case: an empty step loop. File in the attachment and in a workspace do:
Minish start
- Dan
So far I have not been able to reproduce the hangup with the Minish example. I've have about 20 of them open for about a half hour or so. Maybe it's just really intermittent.
I ran for over an hour with the Minish morphs going, no problem. But I can get the image to hang in less than a minute by opening a bunch of bouncing Tokenish morphs.
It would be interesting to try reproducing this in a Squeak image, but I have not quite figured out how to make the Tokenish example work in Squeak. Maybe someone with more experience in Morphic can try it.
Levente's suggestion of starting up a REPL loop to retain access to the image is a good idea too.
Dave
On 02/05/2017 09:15 PM, David T. Lewis wrote:
On Sun, Feb 05, 2017 at 08:34:02PM -0500, David T. Lewis wrote:
On Sun, Feb 05, 2017 at 04:51:06PM -0500, Dan Norton wrote:
When the hang-up occurs, a #morphicStep is involved. I've never seen it otherwise.
Attached is what may be the minimal test case: an empty step loop. File in the attachment and in a workspace do:
Minish start
- Dan
So far I have not been able to reproduce the hangup with the Minish example. I've have about 20 of them open for about a half hour or so. Maybe it's just really intermittent.
I ran for over an hour with the Minish morphs going, no problem. But I can get the image to hang in less than a minute by opening a bunch of bouncing Tokenish morphs.
Hi Dave,
You called it right: really intermittent with Minish. The first time I ran it, it hung in less than one minute - right after a print-it for "Time now print24" which I added to the workspace. But after that, I ran 7 of them for hours with no hang.
Attached is a package of test cases I've been using to try to determine what does and does not hang - starting with Tokenish and down to Minish, each one taking something out of the loop. Maybe something in there will be useful.
- Dan
On Mon, Feb 6, 2017 at 3:15 AM, David T. Lewis lewis@mail.msen.com wrote:
It would be interesting to try reproducing this in a Squeak image, but I have not quite figured out how to make the Tokenish example work in Squeak.
See attachment. Did not freeze on Mac with 10 Tokenish jumping. - Bert -
Were you usig Spur64? On Debian it does not freeze with CogV3 or Spur32. Only with Spur64.
Thanks, Juan Sent using Zoho Mail
---- On Mon, 06 Feb 2017 09:19:48 -0300 bert@freudenbergs.de wrote ----
On Mon, Feb 6, 2017 at 3:15 AM, David T. Lewis lewis@mail.msen.com wrote: It would be interesting to try reproducing this in a Squeak image, but I have not quite figured out how to make the Tokenish example work in Squeak.
See attachment. Did not freeze on Mac with 10 Tokenish jumping. - Bert -
Yes, I tried with Spur64. Have not seen it freeze yet on Mac in Squeak, but we possibly have not replicated the test well enough yet.
It *did* freeze in Cuis Spur64 on Mac too.
- Bert -
On Mon, Feb 6, 2017 at 2:08 PM, Juan Vuletich juanvuletich@zoho.com wrote:
Were you usig Spur64? On Debian it does not freeze with CogV3 or Spur32. Only with Spur64.
Thanks, Juan
Sent using Zoho Mail https://www.zoho.com/mail/
---- On Mon, 06 Feb 2017 09:19:48 -0300 * bert@freudenbergs.de bert@freudenbergs.de * wrote ----
On Mon, Feb 6, 2017 at 3:15 AM, David T. Lewis lewis@mail.msen.com wrote:
It would be interesting to try reproducing this in a Squeak image, but I have not quite figured out how to make the Tokenish example work in Squeak.
See attachment. Did not freeze on Mac with 10 Tokenish jumping.
- Bert -
Thanks Bert. So the problem is not Linux specific, but indeed Spur64 and Cuis specific.
Unfortunately (well, actually fortunately!) I'm on vacations and I don't have any x64 machine available so I can't do any testing right now. I'll be back home in about a week.
Thanks,
Yes, I tried with Spur64. Have not seen it freeze yet on Mac in Squeak, but we possibly have not replicated the test well enough yet.
It *did* freeze in Cuis Spur64 on Mac too.
- Bert -
Hi Folks,
Today, I was able to get closer to the origin of the problem. The problem is when Morphic tries to do a Delay (inter-cycle-pause) of 1 or two milliseconds. (Usually they are quite larger, and the problem does not appear). With the 64-bit VM, sometimes very short delays hang the complete system. I don't know why, but this:
waitDelay _ Delay forMilliseconds: 50. [ true ] whileTrue: [ 1000 atRandom print. waitDelay setDelay: 1; wait ].
is enough to hang Cuis in a short time. When trying to reproduce the problem in Squeak I had mixed results. It doesn't seem to hang there, but if I let this run for a couple of minutes and then try to halt with alt-. , then the hang occurs, and Squeak becomes irresponsibe. As a workaround, in Cuis in the inter cycle pause, I'm avoiding very short delays.
I'm not sure if the bug is in the VM, in the image, or both, but it is still there, and it also affects Squeak. I haven't tried it, but Pharo might also have the problem (if not, it can show a possible solution).
BTW, Bert, I guess Squeak never does short delays in #interCyclePause: ...
Thanks,
On 2/6/2017 12:28 PM, Bert Freudenberg wrote:
Yes, I tried with Spur64. Have not seen it freeze yet on Mac in Squeak, but we possibly have not replicated the test well enough yet.
It *did* freeze in Cuis Spur64 on Mac too.
- Bert -
On Mon, Feb 6, 2017 at 2:08 PM, Juan Vuletich <juanvuletich@zoho.com mailto:juanvuletich@zoho.com> wrote:
Were you usig Spur64? On Debian it does not freeze with CogV3 or Spur32. Only with Spur64. Thanks, Juan Sent using Zoho Mail <https://www.zoho.com/mail/> ---- On Mon, 06 Feb 2017 09:19:48 -0300 *bert@freudenbergs.de <mailto:bert@freudenbergs.de> * wrote ---- On Mon, Feb 6, 2017 at 3:15 AM, David T. Lewis <lewis@mail.msen.com <mailto:lewis@mail.msen.com>> wrote: It would be interesting to try reproducing this in a Squeak image, but I have not quite figured out how to make the Tokenish example work in Squeak. See attachment. Did not freeze on Mac with 10 Tokenish jumping. - Bert -
I finally understood what was happening. The problem with is primitive 136:
primSignal: aSemaphore atMilliseconds: aSmallInteger "Signal the semaphore when the millisecond clock reaches the value of the second argument. Fail if the first argument is neither a Semaphore nor nil. Essential. See Object documentation whatIsAPrimitive." <primitive: 136> ^self primitiveFailed
If the requested tick is already in the past, all the VMs except for Spur-64 will signal the semaphore anyway. Spur-64 won't. Taking care of this in the Cuis image was easy enough.
Squeak seems not to be affected because it uses a newer primitive, that takes the large integer microsecond ticker, that might have the older behavior (signaling anyway). I'm not really sure, and maybe someone could check the primitive behavior and the corresponding assumptions in Squeak.
Older Squeak images are not affected, as they are all 32-bit. So maybe there's no action required. In any case, making the behavior consistent across VMs and in both primitives would be nice.
Cheers,
On 3/3/2017 3:55 PM, Juan Vuletich wrote:
Hi Folks,
Today, I was able to get closer to the origin of the problem. The problem is when Morphic tries to do a Delay (inter-cycle-pause) of 1 or two milliseconds. (Usually they are quite larger, and the problem does not appear). With the 64-bit VM, sometimes very short delays hang the complete system. I don't know why, but this:
waitDelay _ Delay forMilliseconds: 50. [ true ] whileTrue: [ 1000 atRandom print. waitDelay setDelay: 1; wait ].
is enough to hang Cuis in a short time. When trying to reproduce the problem in Squeak I had mixed results. It doesn't seem to hang there, but if I let this run for a couple of minutes and then try to halt with alt-. , then the hang occurs, and Squeak becomes irresponsibe. As a workaround, in Cuis in the inter cycle pause, I'm avoiding very short delays.
I'm not sure if the bug is in the VM, in the image, or both, but it is still there, and it also affects Squeak. I haven't tried it, but Pharo might also have the problem (if not, it can show a possible solution).
BTW, Bert, I guess Squeak never does short delays in #interCyclePause: ...
Thanks,
On 2/6/2017 12:28 PM, Bert Freudenberg wrote:
Yes, I tried with Spur64. Have not seen it freeze yet on Mac in Squeak, but we possibly have not replicated the test well enough yet.
It *did* freeze in Cuis Spur64 on Mac too.
- Bert -
On Mon, Feb 6, 2017 at 2:08 PM, Juan Vuletich <juanvuletich@zoho.com mailto:juanvuletich@zoho.com> wrote:
Were you usig Spur64? On Debian it does not freeze with CogV3 or Spur32. Only with Spur64. Thanks, Juan Sent using Zoho Mail <https://www.zoho.com/mail/> ---- On Mon, 06 Feb 2017 09:19:48 -0300 *bert@freudenbergs.de <mailto:bert@freudenbergs.de> * wrote ---- On Mon, Feb 6, 2017 at 3:15 AM, David T. Lewis <lewis@mail.msen.com <mailto:lewis@mail.msen.com>> wrote: It would be interesting to try reproducing this in a Squeak image, but I have not quite figured out how to make the Tokenish example work in Squeak. See attachment. Did not freeze on Mac with 10 Tokenish jumping. - Bert -
-- Juan Vuletich www.cuis-smalltalk.org https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev @JuanVuletich
In the VM, the millisecond clock wraps within the 32 bit integer range:
#define MillisecondClockMask 0x1FFFFFFF
In the Cuis image, Delay class>>handleTimerEvent does this:
nextTick := nextTick min: SmallInteger maxVal.
On a 64-bit Spur image, SmallInteger maxVal is 16rFFFFFFFFFFFFFFF, but on a 32-bit image it is 16r3FFFFFFF.
Could that be it?
I don't really know how to test in Squeak. As you say, Squeak is now using the microsecond clock in #handleTimerEvent. I do not see anything in primitiveSignalAtMilliseconds that would behave any differently on a 64 bit versus 32 bit image or VM, but I do not know how to test to be sure.
Dave
On Mon, Mar 06, 2017 at 11:06:22AM -0300, Juan Vuletich wrote:
I finally understood what was happening. The problem with is primitive 136:
primSignal: aSemaphore atMilliseconds: aSmallInteger "Signal the semaphore when the millisecond clock reaches the value of the second argument. Fail if the first argument is neither a Semaphore nor nil. Essential. See Object documentation whatIsAPrimitive." <primitive: 136> ^self primitiveFailed
If the requested tick is already in the past, all the VMs except for Spur-64 will signal the semaphore anyway. Spur-64 won't. Taking care of this in the Cuis image was easy enough.
Squeak seems not to be affected because it uses a newer primitive, that takes the large integer microsecond ticker, that might have the older behavior (signaling anyway). I'm not really sure, and maybe someone could check the primitive behavior and the corresponding assumptions in Squeak.
Older Squeak images are not affected, as they are all 32-bit. So maybe there's no action required. In any case, making the behavior consistent across VMs and in both primitives would be nice.
Cheers,
On 3/3/2017 3:55 PM, Juan Vuletich wrote:
Hi Folks,
Today, I was able to get closer to the origin of the problem. The problem is when Morphic tries to do a Delay (inter-cycle-pause) of 1 or two milliseconds. (Usually they are quite larger, and the problem does not appear). With the 64-bit VM, sometimes very short delays hang the complete system. I don't know why, but this:
waitDelay _ Delay forMilliseconds: 50. [ true ] whileTrue: [ 1000 atRandom print. waitDelay setDelay: 1; wait ].
is enough to hang Cuis in a short time. When trying to reproduce the problem in Squeak I had mixed results. It doesn't seem to hang there, but if I let this run for a couple of minutes and then try to halt with alt-. , then the hang occurs, and Squeak becomes irresponsibe. As a workaround, in Cuis in the inter cycle pause, I'm avoiding very short delays.
I'm not sure if the bug is in the VM, in the image, or both, but it is still there, and it also affects Squeak. I haven't tried it, but Pharo might also have the problem (if not, it can show a possible solution).
BTW, Bert, I guess Squeak never does short delays in #interCyclePause: ...
Thanks,
On 2/6/2017 12:28 PM, Bert Freudenberg wrote:
Yes, I tried with Spur64. Have not seen it freeze yet on Mac in Squeak, but we possibly have not replicated the test well enough yet.
It *did* freeze in Cuis Spur64 on Mac too.
- Bert -
On Mon, Feb 6, 2017 at 2:08 PM, Juan Vuletich <juanvuletich@zoho.com mailto:juanvuletich@zoho.com> wrote:
Were you usig Spur64? On Debian it does not freeze with CogV3 or Spur32. Only with Spur64.
Thanks, Juan
Sent using Zoho Mail https://www.zoho.com/mail/
---- On Mon, 06 Feb 2017 09:19:48 -0300 *bert@freudenbergs.de <mailto:bert@freudenbergs.de> * wrote ---- On Mon, Feb 6, 2017 at 3:15 AM, David T. Lewis <lewis@mail.msen.com <mailto:lewis@mail.msen.com>> wrote: It would be interesting to try reproducing this in a Squeak image, but I have not quite figured out how to make the Tokenish example work in Squeak. See attachment. Did not freeze on Mac with 10 Tokenish jumping. - Bert -
-- Juan Vuletich www.cuis-smalltalk.org https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev @JuanVuletich
-- Juan Vuletich www.cuis-smalltalk.org https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev @JuanVuletich
Hi Dave,
Thanks for answering. Inline.
On 3/6/2017 10:50 PM, David T. Lewis wrote:
In the VM, the millisecond clock wraps within the 32 bit integer range:
#define MillisecondClockMask 0x1FFFFFFF
In the Cuis image, Delay class>>handleTimerEvent does this:
nextTick := nextTick min: SmallInteger maxVal.
On a 64-bit Spur image, SmallInteger maxVal is 16rFFFFFFFFFFFFFFF, but on a 32-bit image it is 16r3FFFFFFF.
Could that be it?
I wasn't aware of that, and had assumed that millisecond timer would use the whole SmallInteger range. This might introduce a bug, that would only appear at timer rollover, i.e. about 6 days after image startup. I'll fix this. Thanks.
But this is a completely separated issue. The problem we saw, the semaphore never being signaled if deadline is in the past, happens immediately after image startup.
I don't really know how to test in Squeak. As you say, Squeak is now using the microsecond clock in #handleTimerEvent. I do not see anything in primitiveSignalAtMilliseconds that would behave any differently on a 64 bit versus 32 bit image or VM, but I do not know how to test to be sure.
Dave
Well, what follows is a way to test VM behavior. Tested in Cuis, but should be trivial to reproduce in Squeak, as it is a VM issue. I Cuis add (copied from Squeak):
!Time class methodsFor: 'general inquiries' stamp: 'jmv 3/7/2017 08:58:12'! utcMicrosecondClock "Answer the UTC microseconds since the Smalltalk epoch (January 1st 1901, the start of the 20th century). The value is derived from the Posix epoch with a constant offset corresponding to elapsed microseconds between the two epochs according to RFC 868." <primitive: 240> ^0! !
!Delay class methodsFor: 'primitives' stamp: 'jmv 3/7/2017 08:57:45'! primSignal: aSemaphore atUTCMicroseconds: anInteger "Signal the semaphore when the UTC microsecond clock reaches the value of the second argument. Fail if the first argument is neither a Semaphore nor nil, or if the second argument is not an integer. Essential. See Object documentation whatIsAPrimitive." <primitive: 242> ^self primitiveFailed! !
Then, in a Workspace, try the following 4 doits:
s _ Semaphore new. Delay primSignal: s atUTCMicroseconds: Time utcMicrosecondClock + 10. s wait. 'Ok' print.
s _ Semaphore new. Delay primSignal: s atMilliseconds: Time millisecondClockValue + 10. s wait. 'Ok' print.
s _ Semaphore new. Delay primSignal: s atUTCMicroseconds: Time utcMicrosecondClock - 10. s wait. 'Ok' print.
s _ Semaphore new. Delay primSignal: s atMilliseconds: Time millisecondClockValue - 10. s wait. 'Not OK at all' print.
On Spur32, all 4 finish immediately. On Spur64, the first 3 also finish immediately, but the fourth freezes the image. The difference in behavior between Spur32 and Spur64 (on Linux) is indeed there.
Ok. Also tried Squeak (note that instead of #millisecondClockValue in Squeak it is #primMillisecondClock) :
s _ Semaphore new. Delay primSignal: s atUTCMicroseconds: Time utcMicrosecondClock + 10. s wait. 'Ok'.
s _ Semaphore new. Delay primSignal: s atMilliseconds: Time primMillisecondClock + 10. s wait. 'Ok'.
s _ Semaphore new. Delay primSignal: s atUTCMicroseconds: Time utcMicrosecondClock - 10. s wait. 'Ok'.
s _ Semaphore new. Delay primSignal: s atMilliseconds: Time primMillisecondClock - 10. s wait. 'Not OK at all'.
Exactly the same behavior.
I just took a look at static void primitiveSignalAtMilliseconds(void) in https://raw.githubusercontent.com/OpenSmalltalk/opensmalltalk-vm/Cog/src/vm/... The only thing I see is that msecs is an usqInt and deltaMsecs is an sqInt. But I'm not good enough at gcc subtleties to say if this matters at all. I mean, it looks as if 'if (deltaMsecs < 0) {' was true on Spur64 and false on Spur32... Or maybe the difference is in the handling of nextWakeupUsecs ...
In any case, it looks like deadlines in the past are not supported (as code assumes they are because of rollover...)
Thanks,
Hi Juan,
I confirm your Cuis test results on Squeak using several image/VM combinations, details below.
On Tue, Mar 07, 2017 at 09:49:53AM -0300, Juan Vuletich wrote:
Hi Dave,
Thanks for answering. Inline.
On 3/6/2017 10:50 PM, David T. Lewis wrote:
In the VM, the millisecond clock wraps within the 32 bit integer range:
#define MillisecondClockMask 0x1FFFFFFF
In the Cuis image, Delay class>>handleTimerEvent does this:
nextTick := nextTick min: SmallInteger maxVal.
On a 64-bit Spur image, SmallInteger maxVal is 16rFFFFFFFFFFFFFFF, but on a 32-bit image it is 16r3FFFFFFF.
Could that be it?
I wasn't aware of that, and had assumed that millisecond timer would use the whole SmallInteger range. This might introduce a bug, that would only appear at timer rollover, i.e. about 6 days after image startup. I'll fix this. Thanks.
But this is a completely separated issue. The problem we saw, the semaphore never being signaled if deadline is in the past, happens immediately after image startup.
I don't really know how to test in Squeak. As you say, Squeak is now using the microsecond clock in #handleTimerEvent. I do not see anything in primitiveSignalAtMilliseconds that would behave any differently on a 64 bit versus 32 bit image or VM, but I do not know how to test to be sure.
Dave
Well, what follows is a way to test VM behavior. Tested in Cuis, but should be trivial to reproduce in Squeak, as it is a VM issue. I Cuis add (copied from Squeak):
I tried this with 4 different Squeak image/VM combinations:
- Squeak 3.8 with interpreter VM (an older image that uses millisecond clock for Delay)
- Squeak trunk V3 image with interpreter VM (latest version image, but non-Spur, updated via www.squeaksource.com/TrunkUpdateStreamV3)
- Squeak trunk 32 bit Spur
- Squeak trunk 64 bit Spur
!Time class methodsFor: 'general inquiries' stamp: 'jmv 3/7/2017 08:58:12'! utcMicrosecondClock "Answer the UTC microseconds since the Smalltalk epoch (January 1st 1901, the start of the 20th century). The value is derived from the Posix epoch with a constant offset corresponding to elapsed microseconds between the two epochs according to RFC 868." <primitive: 240> ^0! !
!Delay class methodsFor: 'primitives' stamp: 'jmv 3/7/2017 08:57:45'! primSignal: aSemaphore atUTCMicroseconds: anInteger "Signal the semaphore when the UTC microsecond clock reaches the value of the second argument. Fail if the first argument is neither a Semaphore nor nil, or if the second argument is not an integer. Essential. See Object documentation whatIsAPrimitive." <primitive: 242> ^self primitiveFailed! !
I tried adding these to my Squeak 3.8 image for the test. It does not work properly because the primitive table was different back then, and the interpreter VM is automatically adjusting for this so not calling primitive 240 (actually it calls the old #primitiveSerialPortWrite rather than #primitiveUTCMicrosecondClock that later replaced it).
Nevertheless, the primSignal:atMilliseconds: works, and there is no problem with a -10 parameter, so these are included marked in the results below.
I also note that I locked up the Squeak 3.8 image a couple of times while running various tests with bad input parameters. It is not reproduceable, but there may be something bad about calling #primSignal:atMilliseconds: in an image that is also using it for the Delay mechanism.
I also locked up a Spur 32 image when calling primSignal:atUTCMicroseconds: so this may be the same problem, it may not be safe to call this when the same method is being used for Delay handling.
Then, in a Workspace, try the following 4 doits:
s _ Semaphore new. Delay primSignal: s atUTCMicroseconds: Time utcMicrosecondClock + 10. s wait. 'Ok' print.
s _ Semaphore new. Delay primSignal: s atMilliseconds: Time millisecondClockValue + 10. s wait. 'Ok' print.
s _ Semaphore new. Delay primSignal: s atUTCMicroseconds: Time utcMicrosecondClock - 10. s wait. 'Ok' print.
s _ Semaphore new. Delay primSignal: s atMilliseconds: Time millisecondClockValue - 10. s wait. 'Not OK at all' print.
On Spur32, all 4 finish immediately. On Spur64, the first 3 also finish immediately, but the fourth freezes the image. The difference in behavior between Spur32 and Spur64 (on Linux) is indeed there.
Ok. Also tried Squeak (note that instead of #millisecondClockValue in Squeak it is #primMillisecondClock) :
Test results for my four Squeak image/VM combinations are added below.
s _ Semaphore new. Delay primSignal: s atUTCMicroseconds: Time utcMicrosecondClock + 10. s wait. 'Ok'.
Squeak 3.8 => OK Squeak trunk V3 interpreter => OK Squeak trunk Spur 32 => OK Squeak trunk Spur 64 => OK
s _ Semaphore new. Delay primSignal: s atMilliseconds: Time primMillisecondClock + 10. s wait. 'Ok'.
Squeak 3.8 => OK Squeak trunk V3 interpreter => OK Squeak trunk Spur 32 => OK Squeak trunk Spur 64 => OK
s _ Semaphore new. Delay primSignal: s atUTCMicroseconds: Time utcMicrosecondClock - 10. s wait. 'Ok'.
Squeak 3.8 => primitive failed (but see note above for Squeak 3.8 using different primitive table) Squeak trunk V3 interpreter => OK Squeak trunk Spur 32 => OK Squeak trunk Spur 64 => OK
s _ Semaphore new. Delay primSignal: s atMilliseconds: Time primMillisecondClock - 10. s wait. 'Not OK at all'.
Squeak 3.8 => OK Squeak trunk V3 interpreter => OK Squeak trunk Spur 32 => OK Squeak trunk Spur 64 => Not OK at all, hangs image
Exactly the same behavior.
Confirmed.
I just took a look at static void primitiveSignalAtMilliseconds(void) in https://raw.githubusercontent.com/OpenSmalltalk/opensmalltalk-vm/Cog/src/vm/... The only thing I see is that msecs is an usqInt and deltaMsecs is an sqInt. But I'm not good enough at gcc subtleties to say if this matters at all. I mean, it looks as if 'if (deltaMsecs < 0) {' was true on Spur64 and false on Spur32... Or maybe the difference is in the handling of nextWakeupUsecs ...
I see that ioMSecs() is declared as signed long (32 bits), but it is used in expression with a 64 bit usqInt. So maybe it needs a cast, or maybe the variables like msecs and deltaMsecs in primitiveSignalAtMilliseconds should be declared as 32 bit long and unsigned long to match the actual usage.
Unfortunately I cannot easily recompile to verify (build problems on my Ubuntu for Cog/Spur, sorry), but maybe someone else can take a look at this?
In any case, it looks like deadlines in the past are not supported (as code assumes they are because of rollover...)
I agree this looks like a bug in the 64 bit VMs. But I do not yet see the reason for it.
Dave
Hi David,
On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis lewis@mail.msen.com wrote:
In the VM, the millisecond clock wraps within the 32 bit integer range:
#define MillisecondClockMask 0x1FFFFFFF
In the Cuis image, Delay class>>handleTimerEvent does this:
nextTick := nextTick min: SmallInteger maxVal.
On a 64-bit Spur image, SmallInteger maxVal is 16rFFFFFFFFFFFFFFF, but on a 32-bit image it is 16r3FFFFFFF.
Could that be it?
I don't really know how to test in Squeak. As you say, Squeak is now using the microsecond clock in #handleTimerEvent. I do not see anything in primitiveSignalAtMilliseconds that would behave any differently on a 64 bit versus 32 bit image or VM, but I do not know how to test to be sure.
I bet that the following code from primitiveSignalAtMilliseconds ends up wrapping when given Delay primSignal: s atMilliseconds: Time primMillisecondClock - 10.
deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask). deltaMsecs < 0 ifTrue: [deltaMsecs := deltaMsecs + MillisecondClockMask + 1]. nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
and I bet you'll find that the VM will wake up in about 6 days and 5 hours ;-)
I suppose we could fix this, but I'm *much* happier to simply not use primitiveSignalAtMilliseconds and stay with the simpler and wrapping-immune primitiveSignalAtUTCMicroseconds
Dave
On Mon, Mar 06, 2017 at 11:06:22AM -0300, Juan Vuletich wrote:
I finally understood what was happening. The problem with is primitive
136:
primSignal: aSemaphore atMilliseconds: aSmallInteger "Signal the semaphore when the millisecond clock reaches the value of the second argument. Fail if the first argument is neither a Semaphore nor nil. Essential. See Object documentation whatIsAPrimitive." <primitive: 136> ^self primitiveFailed
If the requested tick is already in the past, all the VMs except for Spur-64 will signal the semaphore anyway. Spur-64 won't. Taking care of this in the Cuis image was easy enough.
Squeak seems not to be affected because it uses a newer primitive, that takes the large integer microsecond ticker, that might have the older behavior (signaling anyway). I'm not really sure, and maybe someone could check the primitive behavior and the corresponding assumptions in Squeak.
Older Squeak images are not affected, as they are all 32-bit. So maybe there's no action required. In any case, making the behavior consistent across VMs and in both primitives would be nice.
Cheers,
On 3/3/2017 3:55 PM, Juan Vuletich wrote:
Hi Folks,
Today, I was able to get closer to the origin of the problem. The problem is when Morphic tries to do a Delay (inter-cycle-pause) of 1 or two milliseconds. (Usually they are quite larger, and the problem does not appear). With the 64-bit VM, sometimes very short delays hang the complete system. I don't know why, but this:
waitDelay _ Delay forMilliseconds: 50. [ true ] whileTrue: [ 1000 atRandom print. waitDelay setDelay: 1; wait ].
is enough to hang Cuis in a short time. When trying to reproduce the problem in Squeak I had mixed results. It doesn't seem to hang there, but if I let this run for a couple of minutes and then try to halt with alt-. , then the hang occurs, and Squeak becomes irresponsibe. As a workaround, in Cuis in the inter cycle pause, I'm avoiding very short delays.
I'm not sure if the bug is in the VM, in the image, or both, but it is still there, and it also affects Squeak. I haven't tried it, but Pharo might also have the problem (if not, it can show a possible solution).
BTW, Bert, I guess Squeak never does short delays in #interCyclePause:
...
Thanks,
On 2/6/2017 12:28 PM, Bert Freudenberg wrote:
Yes, I tried with Spur64. Have not seen it freeze yet on Mac in Squeak, but we possibly have not replicated the test well enough yet.
It *did* freeze in Cuis Spur64 on Mac too.
- Bert -
On Mon, Feb 6, 2017 at 2:08 PM, Juan Vuletich <juanvuletich@zoho.com mailto:juanvuletich@zoho.com> wrote:
Were you usig Spur64? On Debian it does not freeze with CogV3 or Spur32. Only with Spur64.
Thanks, Juan
Sent using Zoho Mail https://www.zoho.com/mail/
---- On Mon, 06 Feb 2017 09:19:48 -0300 *bert@freudenbergs.de <mailto:bert@freudenbergs.de> * wrote ---- On Mon, Feb 6, 2017 at 3:15 AM, David T. Lewis <lewis@mail.msen.com <mailto:lewis@mail.msen.com>> wrote: It would be interesting to try reproducing this in a Squeak image, but I have not quite figured out how to make the Tokenish example work in Squeak. See attachment. Did not freeze on Mac with 10 Tokenish jumping. - Bert -
-- Juan Vuletich www.cuis-smalltalk.org https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev @JuanVuletich
-- Juan Vuletich www.cuis-smalltalk.org https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev @JuanVuletich
On Thu, Mar 09, 2017 at 12:41:44PM -0800, Eliot Miranda wrote:
Hi David,
On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis lewis@mail.msen.com wrote:
In the VM, the millisecond clock wraps within the 32 bit integer range:
#define MillisecondClockMask 0x1FFFFFFF
In the Cuis image, Delay class>>handleTimerEvent does this:
nextTick := nextTick min: SmallInteger maxVal.
On a 64-bit Spur image, SmallInteger maxVal is 16rFFFFFFFFFFFFFFF, but on a 32-bit image it is 16r3FFFFFFF.
Could that be it?
I don't really know how to test in Squeak. As you say, Squeak is now using the microsecond clock in #handleTimerEvent. I do not see anything in primitiveSignalAtMilliseconds that would behave any differently on a 64 bit versus 32 bit image or VM, but I do not know how to test to be sure.
I bet that the following code from primitiveSignalAtMilliseconds ends up wrapping when given Delay primSignal: s atMilliseconds: Time primMillisecondClock - 10.
deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask). deltaMsecs < 0 ifTrue: [deltaMsecs := deltaMsecs + MillisecondClockMask + 1]. nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
and I bet you'll find that the VM will wake up in about 6 days and 5 hours ;-)
No. The failure is specific to the 64 bit VM. Source code for the primitive is the same in either case.
I suppose we could fix this, but I'm *much* happier to simply not use primitiveSignalAtMilliseconds and stay with the simpler and wrapping-immune primitiveSignalAtUTCMicroseconds
Fair enough, but given that there is a demonstrated bug that affects only the 64-bit VM, and given that it expresses itself intermittently and in ways that affect only someone who is attempting to migrate their existing V3 image to Spur, then I would say that it makes very good sense to take the time to fix it if we are able to do so. After all, there may be other people who will want to migrate V3 images to Spur, and there is no point in making the process needlessly difficult.
I do not have the solution, but maybe someone else can help. So I am asking for help here. Can someone with a working 64-bit build environment please check and see if what I said in an earlier email might make a difference:
I see that ioMSecs() is declared as signed long (32 bits), but it is used in expression with a 64 bit usqInt. So maybe it needs a cast, or maybe the variables like msecs and deltaMsecs in primitiveSignalAtMilliseconds should be declared as 32 bit long and unsigned long to match the actual usage.
Thanks,
Dave
2017-03-10 3:47 GMT+01:00 David T. Lewis lewis@mail.msen.com:
On Thu, Mar 09, 2017 at 12:41:44PM -0800, Eliot Miranda wrote:
Hi David,
On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis lewis@mail.msen.com
wrote:
In the VM, the millisecond clock wraps within the 32 bit integer range:
#define MillisecondClockMask 0x1FFFFFFF
In the Cuis image, Delay class>>handleTimerEvent does this:
nextTick := nextTick min: SmallInteger maxVal.
On a 64-bit Spur image, SmallInteger maxVal is 16rFFFFFFFFFFFFFFF, but
on
a 32-bit image it is 16r3FFFFFFF.
Could that be it?
I don't really know how to test in Squeak. As you say, Squeak is now using the microsecond clock in #handleTimerEvent. I do not see anything in primitiveSignalAtMilliseconds that would behave any differently on a 64 bit versus 32 bit image or VM, but I do not know how to test to be sure.
I bet that the following code from primitiveSignalAtMilliseconds ends up wrapping when given Delay primSignal: s atMilliseconds: Time primMillisecondClock - 10.
deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask). deltaMsecs < 0 ifTrue: [deltaMsecs := deltaMsecs + MillisecondClockMask + 1]. nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
and I bet you'll find that the VM will wake up in about 6 days and 5
hours
;-)
No. The failure is specific to the 64 bit VM. Source code for the primitive is the same in either case.
I suppose we could fix this, but I'm *much* happier to simply not use primitiveSignalAtMilliseconds and stay with the simpler and
wrapping-immune
primitiveSignalAtUTCMicroseconds
Fair enough, but given that there is a demonstrated bug that affects only the 64-bit VM, and given that it expresses itself intermittently and in ways that affect only someone who is attempting to migrate their existing V3 image to Spur, then I would say that it makes very good sense to take the time to fix it if we are able to do so. After all, there may be other people who will want to migrate V3 images to Spur, and there is no point in making the process needlessly difficult.
I do not have the solution, but maybe someone else can help. So I am asking for help here. Can someone with a working 64-bit build environment please check and see if what I said in an earlier email might make a difference:
I see that ioMSecs() is declared as signed long (32 bits), but it is used in expression with a 64 bit usqInt. So maybe it needs a cast, or maybe the variables like msecs and deltaMsecs in primitiveSignalAtMilliseconds should be declared as 32 bit long and unsigned long to match the actual usage.
Thanks,
Dave
Hi David I just reviewed the generated code for primitiveSignalAtMilliseconds andhere is what i found:
The primitive will simply fail if fed with a negative integer, thanks to:
msecs = positive32BitValueOf(msecsObj);
msecs is allways positive, since declared as usqInt msecs;
So far, so good.
Then deltaMsecs = msecs - ((ioMSecs()) & MillisecondClockMask); if (deltaMsecs < 0) { deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; } GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000);
Due to the 29 bits Mask, the bitAnd: (&) operation will result into a positive int. 0 <= ((ioMSecs()) & MillisecondClockMask) <= 0x1FFFFFFF
However, when we do the Roll-over thing: deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; Then we create a giant delay (the 6 days mentionned by Eliot) which is HIGHLY QUESTIONNABLE whatever 32 or 64 bits!
The difference between 32 and 64 bits is in the next line: GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000); Because sqInt deltaMsecs;
is 32 bits for a 32bits spur, 64bits for 64bits spur, then (deltaMsecs * 1000) will do two different things: - in 32 bits it will overflow leading to a negative delay and a nextWakeupUsecs in the past (but that's invoking Undefined Behavior) - in 64 bits it won't overflow leading to he 6 days delay
Does that explain?
if (deltaMsecs < 0) I would either fail the primitive or set the deltaMsecs to zero. What do you think?
Nicolas
2017-03-10 9:45 GMT+01:00 Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com>:
2017-03-10 3:47 GMT+01:00 David T. Lewis lewis@mail.msen.com:
On Thu, Mar 09, 2017 at 12:41:44PM -0800, Eliot Miranda wrote:
Hi David,
On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis lewis@mail.msen.com
wrote:
In the VM, the millisecond clock wraps within the 32 bit integer
range:
#define MillisecondClockMask 0x1FFFFFFF
In the Cuis image, Delay class>>handleTimerEvent does this:
nextTick := nextTick min: SmallInteger maxVal.
On a 64-bit Spur image, SmallInteger maxVal is 16rFFFFFFFFFFFFFFF,
but on
a 32-bit image it is 16r3FFFFFFF.
Could that be it?
I don't really know how to test in Squeak. As you say, Squeak is now using the microsecond clock in #handleTimerEvent. I do not see
anything
in primitiveSignalAtMilliseconds that would behave any differently on a 64 bit versus 32 bit image or VM, but I do not know how to test to be sure.
I bet that the following code from primitiveSignalAtMilliseconds ends up wrapping when given Delay primSignal: s atMilliseconds: Time primMillisecondClock - 10.
deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask). deltaMsecs < 0 ifTrue: [deltaMsecs := deltaMsecs + MillisecondClockMask + 1]. nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
and I bet you'll find that the VM will wake up in about 6 days and 5
hours
;-)
No. The failure is specific to the 64 bit VM. Source code for the primitive is the same in either case.
I suppose we could fix this, but I'm *much* happier to simply not use primitiveSignalAtMilliseconds and stay with the simpler and
wrapping-immune
primitiveSignalAtUTCMicroseconds
Fair enough, but given that there is a demonstrated bug that affects only the 64-bit VM, and given that it expresses itself intermittently and in ways that affect only someone who is attempting to migrate their existing V3 image to Spur, then I would say that it makes very good sense to take the time to fix it if we are able to do so. After all, there may be other people who will want to migrate V3 images to Spur, and there is no point in making the process needlessly difficult.
I do not have the solution, but maybe someone else can help. So I am asking for help here. Can someone with a working 64-bit build environment please check and see if what I said in an earlier email might make a difference:
I see that ioMSecs() is declared as signed long (32 bits), but it is used in expression with a 64 bit usqInt. So maybe it needs a cast, or maybe the variables like msecs and deltaMsecs in primitiveSignalAtMilliseconds should be declared as 32 bit long and unsigned long to match the actual usage.
Thanks,
Dave
Hi David I just reviewed the generated code for primitiveSignalAtMilliseconds andhere is what i found:
The primitive will simply fail if fed with a negative integer, thanks to:
msecs = positive32BitValueOf(msecsObj);
msecs is allways positive, since declared as usqInt msecs;
So far, so good.
Then deltaMsecs = msecs - ((ioMSecs()) & MillisecondClockMask); if (deltaMsecs < 0) { deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; } GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000);
Due to the 29 bits Mask, the bitAnd: (&) operation will result into a positive int. 0 <= ((ioMSecs()) & MillisecondClockMask) <= 0x1FFFFFFF
However, when we do the Roll-over thing: deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; Then we create a giant delay (the 6 days mentionned by Eliot) which is HIGHLY QUESTIONNABLE whatever 32 or 64 bits!
The difference between 32 and 64 bits is in the next line: GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000); Because sqInt deltaMsecs;
is 32 bits for a 32bits spur, 64bits for 64bits spur, then (deltaMsecs * 1000) will do two different things:
- in 32 bits it will overflow leading to a negative delay and a
nextWakeupUsecs in the past (but that's invoking Undefined Behavior)
- in 64 bits it won't overflow leading to he 6 days delay
Does that explain?
if (deltaMsecs < 0) I would either fail the primitive or set the deltaMsecs to zero. What do you think?
Nicolas
Anyway, the RollOver protection sounds strange: if clock was high, and if we don't apply the MillisecondClockMask to (clock+delay), then (clock+delay) won't overflow (thanks to largeIntegers). Eventually, for delay > 2^32-2^29 msecs, the primitive will fail thru (positive32BitValueOf). That's something like 43 days delay max, more than enough...
Either the code should rather be something like this with roll-over protection deltaMsecs = (msecs - ioMSecs()) & MillisecondClockMask; or we should leave deltaMsecs as is and reject negative delays
Also note this in 32 bits: Since *1000 consumes 10bits, then delays above 21 bits msecs will overflow and be signalled immediately. Tha's about 2e6 msecs, 2e3 secs. Thus the maximum delay supported by this primitive is a bit less than 36 minutes (16r80000000/1000/1000/60). So maybe we should also reject large positive "unreasonable" delays, that isanthing above a few minutes and document it.
On Fri, Mar 10, 2017 at 5:19 PM, Nicolas Cellier nicolas.cellier.aka.nice@gmail.com wrote:
reject negative delays
Yes. I'll trade you your negative delay for my perpetual motion machine.
cheers -ben
On 10-03-2017, at 7:13 AM, Ben Coman btc@openInWorld.com wrote:
On Fri, Mar 10, 2017 at 5:19 PM, Nicolas Cellier nicolas.cellier.aka.nice@gmail.com wrote:
reject negative delays
Yes. I'll trade you your negative delay for my perpetual motion machine.
Hah! You can’t fool *me*! Negative delays would *make* a perpetual motion machine, at least according to this Ancient Scroll I got from my local friendly Ascended Master.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Oxymorons: Alone together
On Fri, Mar 10, 2017 at 10:19:42AM +0100, Nicolas Cellier wrote:
2017-03-10 9:45 GMT+01:00 Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com>:
2017-03-10 3:47 GMT+01:00 David T. Lewis lewis@mail.msen.com:
On Thu, Mar 09, 2017 at 12:41:44PM -0800, Eliot Miranda wrote:
Hi David,
On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis lewis@mail.msen.com
wrote:
In the VM, the millisecond clock wraps within the 32 bit integer
range:
#define MillisecondClockMask 0x1FFFFFFF
In the Cuis image, Delay class>>handleTimerEvent does this:
nextTick := nextTick min: SmallInteger maxVal.
On a 64-bit Spur image, SmallInteger maxVal is 16rFFFFFFFFFFFFFFF,
but on
a 32-bit image it is 16r3FFFFFFF.
Could that be it?
I don't really know how to test in Squeak. As you say, Squeak is now using the microsecond clock in #handleTimerEvent. I do not see
anything
in primitiveSignalAtMilliseconds that would behave any differently on a 64 bit versus 32 bit image or VM, but I do not know how to test to be sure.
I bet that the following code from primitiveSignalAtMilliseconds ends up wrapping when given Delay primSignal: s atMilliseconds: Time primMillisecondClock - 10.
deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask). deltaMsecs < 0 ifTrue: [deltaMsecs := deltaMsecs + MillisecondClockMask + 1]. nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
and I bet you'll find that the VM will wake up in about 6 days and 5
hours
;-)
No. The failure is specific to the 64 bit VM. Source code for the primitive is the same in either case.
I suppose we could fix this, but I'm *much* happier to simply not use primitiveSignalAtMilliseconds and stay with the simpler and
wrapping-immune
primitiveSignalAtUTCMicroseconds
Fair enough, but given that there is a demonstrated bug that affects only the 64-bit VM, and given that it expresses itself intermittently and in ways that affect only someone who is attempting to migrate their existing V3 image to Spur, then I would say that it makes very good sense to take the time to fix it if we are able to do so. After all, there may be other people who will want to migrate V3 images to Spur, and there is no point in making the process needlessly difficult.
I do not have the solution, but maybe someone else can help. So I am asking for help here. Can someone with a working 64-bit build environment please check and see if what I said in an earlier email might make a difference:
I see that ioMSecs() is declared as signed long (32 bits), but it is used in expression with a 64 bit usqInt. So maybe it needs a cast, or maybe the variables like msecs and deltaMsecs in primitiveSignalAtMilliseconds should be declared as 32 bit long and unsigned long to match the actual usage.
Thanks,
Dave
Hi David I just reviewed the generated code for primitiveSignalAtMilliseconds andhere is what i found:
The primitive will simply fail if fed with a negative integer, thanks to:
msecs = positive32BitValueOf(msecsObj);
msecs is allways positive, since declared as usqInt msecs;
So far, so good.
Then deltaMsecs = msecs - ((ioMSecs()) & MillisecondClockMask); if (deltaMsecs < 0) { deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; } GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000);
Due to the 29 bits Mask, the bitAnd: (&) operation will result into a positive int. 0 <= ((ioMSecs()) & MillisecondClockMask) <= 0x1FFFFFFF
However, when we do the Roll-over thing: deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; Then we create a giant delay (the 6 days mentionned by Eliot) which is HIGHLY QUESTIONNABLE whatever 32 or 64 bits!
The difference between 32 and 64 bits is in the next line: GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000); Because sqInt deltaMsecs;
is 32 bits for a 32bits spur, 64bits for 64bits spur, then (deltaMsecs * 1000) will do two different things:
- in 32 bits it will overflow leading to a negative delay and a
nextWakeupUsecs in the past (but that's invoking Undefined Behavior)
- in 64 bits it won't overflow leading to he 6 days delay
Does that explain?
if (deltaMsecs < 0) I would either fail the primitive or set the deltaMsecs to zero. What do you think?
Nicolas
Anyway, the RollOver protection sounds strange: if clock was high, and if we don't apply the MillisecondClockMask to (clock+delay), then (clock+delay) won't overflow (thanks to largeIntegers). Eventually, for delay > 2^32-2^29 msecs, the primitive will fail thru (positive32BitValueOf). That's something like 43 days delay max, more than enough...
Either the code should rather be something like this with roll-over protection deltaMsecs = (msecs - ioMSecs()) & MillisecondClockMask; or we should leave deltaMsecs as is and reject negative delays
Also note this in 32 bits: Since *1000 consumes 10bits, then delays above 21 bits msecs will overflow and be signalled immediately. Tha's about 2e6 msecs, 2e3 secs. Thus the maximum delay supported by this primitive is a bit less than 36 minutes (16r80000000/1000/1000/60). So maybe we should also reject large positive "unreasonable" delays, that isanthing above a few minutes and document it.
I think I understand what you are saying now. Let me see if I can restate the problem:
Before the change to microsecond clock, the interpreter VM did the rollover check only in #checkForInterrupts, and the arithmetic did not require a 64 bit microsecond range. Now we have rollover logic in #primitiveSignalAtMilliseconds, which leads to two issues, one of which masks the other on 32 bits.
Issue #1: If we pass to the primitive a millisecond count that is slightly in the past (possibly due to clock skew or other timing issues), then the primitive will test for deltaMsecs less than zero, and will incorrectly assume that a millisecond clock rollover has happened.
Issue #2: When adjusting for clock rollover on 32 bits, deltaMsecs is declared as 32 bit sqInt, which results in overflow when converting to microseconds. For 64 bit sqInt, there is no overflow. This error masks the first problem of Issue #1, which makes the primiitive appear to work correctly on 32 bits, even though the clock rollover logic is wrong.
I do not think that we can simply fail the primitive, or set deltaMSeconds to zero, in the case of deltaMSeconds being less than zero. We still need to be able to schedule delays over a time period when the millisecond clock rolls over, so that needs to work.
We do need to fix the arithmetic overflow condition, which probably just requires declaring deltaMsecs as usqLong. This matches the declaration for nextWakeupUsecs and should prevent arithmetic overflow. But fixing this for issue #2 will unmask issue #1.
For issue #1, the solution probably requires a heuristic. If deltaMsecs is a big negative value, then the clock has rolled over. If it is a small negative value (maybe less than a second, just to pick a number), then it is probably clock skew and the MillisecondClockMask + 1 adjustment should not be applied.
Does this sound right?
Dave
As an aside, in the mists of time there was a question about whether the activedelay should be handled in the VM as a delay rather than an absolute time.
cheers -ben
On Sat, Mar 11, 2017 at 12:49 PM, David T. Lewis lewis@mail.msen.com wrote:
On Fri, Mar 10, 2017 at 10:19:42AM +0100, Nicolas Cellier wrote:
2017-03-10 9:45 GMT+01:00 Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com>:
2017-03-10 3:47 GMT+01:00 David T. Lewis lewis@mail.msen.com:
On Thu, Mar 09, 2017 at 12:41:44PM -0800, Eliot Miranda wrote:
Hi David,
On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis lewis@mail.msen.com
wrote:
In the VM, the millisecond clock wraps within the 32 bit integer
range:
#define MillisecondClockMask 0x1FFFFFFF
In the Cuis image, Delay class>>handleTimerEvent does this:
nextTick := nextTick min: SmallInteger maxVal.
On a 64-bit Spur image, SmallInteger maxVal is 16rFFFFFFFFFFFFFFF,
but on
a 32-bit image it is 16r3FFFFFFF.
Could that be it?
I don't really know how to test in Squeak. As you say, Squeak is now using the microsecond clock in #handleTimerEvent. I do not see
anything
in primitiveSignalAtMilliseconds that would behave any differently on a 64 bit versus 32 bit image or VM, but I do not know how to test to be sure.
I bet that the following code from primitiveSignalAtMilliseconds ends up wrapping when given Delay primSignal: s atMilliseconds: Time primMillisecondClock - 10.
deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask). deltaMsecs < 0 ifTrue: [deltaMsecs := deltaMsecs + MillisecondClockMask + 1]. nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
and I bet you'll find that the VM will wake up in about 6 days and 5
hours
;-)
No. The failure is specific to the 64 bit VM. Source code for the primitive is the same in either case.
I suppose we could fix this, but I'm *much* happier to simply not use primitiveSignalAtMilliseconds and stay with the simpler and
wrapping-immune
primitiveSignalAtUTCMicroseconds
Fair enough, but given that there is a demonstrated bug that affects only the 64-bit VM, and given that it expresses itself intermittently and in ways that affect only someone who is attempting to migrate their existing V3 image to Spur, then I would say that it makes very good sense to take the time to fix it if we are able to do so. After all, there may be other people who will want to migrate V3 images to Spur, and there is no point in making the process needlessly difficult.
I do not have the solution, but maybe someone else can help. So I am asking for help here. Can someone with a working 64-bit build environment please check and see if what I said in an earlier email might make a difference:
I see that ioMSecs() is declared as signed long (32 bits), but it is used in expression with a 64 bit usqInt. So maybe it needs a cast, or maybe the variables like msecs and deltaMsecs in primitiveSignalAtMilliseconds should be declared as 32 bit long and unsigned long to match the actual usage.
Thanks,
Dave
Hi David I just reviewed the generated code for primitiveSignalAtMilliseconds andhere is what i found:
The primitive will simply fail if fed with a negative integer, thanks to:
msecs = positive32BitValueOf(msecsObj);
msecs is allways positive, since declared as usqInt msecs;
So far, so good.
Then deltaMsecs = msecs - ((ioMSecs()) & MillisecondClockMask); if (deltaMsecs < 0) { deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; } GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000);
Due to the 29 bits Mask, the bitAnd: (&) operation will result into a positive int. 0 <= ((ioMSecs()) & MillisecondClockMask) <= 0x1FFFFFFF
However, when we do the Roll-over thing: deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; Then we create a giant delay (the 6 days mentionned by Eliot) which is HIGHLY QUESTIONNABLE whatever 32 or 64 bits!
The difference between 32 and 64 bits is in the next line: GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000); Because sqInt deltaMsecs;
is 32 bits for a 32bits spur, 64bits for 64bits spur, then (deltaMsecs * 1000) will do two different things:
- in 32 bits it will overflow leading to a negative delay and a
nextWakeupUsecs in the past (but that's invoking Undefined Behavior)
- in 64 bits it won't overflow leading to he 6 days delay
Does that explain?
if (deltaMsecs < 0) I would either fail the primitive or set the deltaMsecs to zero. What do you think?
Nicolas
Anyway, the RollOver protection sounds strange: if clock was high, and if we don't apply the MillisecondClockMask to (clock+delay), then (clock+delay) won't overflow (thanks to largeIntegers). Eventually, for delay > 2^32-2^29 msecs, the primitive will fail thru (positive32BitValueOf). That's something like 43 days delay max, more than enough...
Either the code should rather be something like this with roll-over protection deltaMsecs = (msecs - ioMSecs()) & MillisecondClockMask; or we should leave deltaMsecs as is and reject negative delays
Also note this in 32 bits: Since *1000 consumes 10bits, then delays above 21 bits msecs will overflow and be signalled immediately. Tha's about 2e6 msecs, 2e3 secs. Thus the maximum delay supported by this primitive is a bit less than 36 minutes (16r80000000/1000/1000/60). So maybe we should also reject large positive "unreasonable" delays, that isanthing above a few minutes and document it.
I think I understand what you are saying now. Let me see if I can restate the problem:
Before the change to microsecond clock, the interpreter VM did the rollover check only in #checkForInterrupts, and the arithmetic did not require a 64 bit microsecond range. Now we have rollover logic in #primitiveSignalAtMilliseconds, which leads to two issues, one of which masks the other on 32 bits.
Issue #1: If we pass to the primitive a millisecond count that is slightly in the past (possibly due to clock skew or other timing issues), then the primitive will test for deltaMsecs less than zero, and will incorrectly assume that a millisecond clock rollover has happened.
Issue #2: When adjusting for clock rollover on 32 bits, deltaMsecs is declared as 32 bit sqInt, which results in overflow when converting to microseconds. For 64 bit sqInt, there is no overflow. This error masks the first problem of Issue #1, which makes the primiitive appear to work correctly on 32 bits, even though the clock rollover logic is wrong.
I do not think that we can simply fail the primitive, or set deltaMSeconds to zero, in the case of deltaMSeconds being less than zero. We still need to be able to schedule delays over a time period when the millisecond clock rolls over, so that needs to work.
We do need to fix the arithmetic overflow condition, which probably just requires declaring deltaMsecs as usqLong. This matches the declaration for nextWakeupUsecs and should prevent arithmetic overflow. But fixing this for issue #2 will unmask issue #1.
For issue #1, the solution probably requires a heuristic. If deltaMsecs is a big negative value, then the clock has rolled over. If it is a small negative value (maybe less than a second, just to pick a number), then it is probably clock skew and the MillisecondClockMask + 1 adjustment should not be applied.
Does this sound right?
Dave
On Fri, Mar 10, 2017 at 11:49:43PM -0500, David T. Lewis wrote:
On Fri, Mar 10, 2017 at 10:19:42AM +0100, Nicolas Cellier wrote:
2017-03-10 9:45 GMT+01:00 Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com>:
2017-03-10 3:47 GMT+01:00 David T. Lewis lewis@mail.msen.com:
On Thu, Mar 09, 2017 at 12:41:44PM -0800, Eliot Miranda wrote:
Hi David,
On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis lewis@mail.msen.com
wrote:
In the VM, the millisecond clock wraps within the 32 bit integer
range:
#define MillisecondClockMask 0x1FFFFFFF
In the Cuis image, Delay class>>handleTimerEvent does this:
nextTick := nextTick min: SmallInteger maxVal.
On a 64-bit Spur image, SmallInteger maxVal is 16rFFFFFFFFFFFFFFF,
but on
a 32-bit image it is 16r3FFFFFFF.
Could that be it?
I don't really know how to test in Squeak. As you say, Squeak is now using the microsecond clock in #handleTimerEvent. I do not see
anything
in primitiveSignalAtMilliseconds that would behave any differently on a 64 bit versus 32 bit image or VM, but I do not know how to test to be sure.
I bet that the following code from primitiveSignalAtMilliseconds ends up wrapping when given Delay primSignal: s atMilliseconds: Time primMillisecondClock - 10.
deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask). deltaMsecs < 0 ifTrue: [deltaMsecs := deltaMsecs + MillisecondClockMask + 1]. nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
and I bet you'll find that the VM will wake up in about 6 days and 5
hours
;-)
No. The failure is specific to the 64 bit VM. Source code for the primitive is the same in either case.
I suppose we could fix this, but I'm *much* happier to simply not use primitiveSignalAtMilliseconds and stay with the simpler and
wrapping-immune
primitiveSignalAtUTCMicroseconds
Fair enough, but given that there is a demonstrated bug that affects only the 64-bit VM, and given that it expresses itself intermittently and in ways that affect only someone who is attempting to migrate their existing V3 image to Spur, then I would say that it makes very good sense to take the time to fix it if we are able to do so. After all, there may be other people who will want to migrate V3 images to Spur, and there is no point in making the process needlessly difficult.
I do not have the solution, but maybe someone else can help. So I am asking for help here. Can someone with a working 64-bit build environment please check and see if what I said in an earlier email might make a difference:
I see that ioMSecs() is declared as signed long (32 bits), but it is used in expression with a 64 bit usqInt. So maybe it needs a cast, or maybe the variables like msecs and deltaMsecs in primitiveSignalAtMilliseconds should be declared as 32 bit long and unsigned long to match the actual usage.
Thanks,
Dave
Hi David I just reviewed the generated code for primitiveSignalAtMilliseconds andhere is what i found:
The primitive will simply fail if fed with a negative integer, thanks to:
msecs = positive32BitValueOf(msecsObj);
msecs is allways positive, since declared as usqInt msecs;
So far, so good.
Then deltaMsecs = msecs - ((ioMSecs()) & MillisecondClockMask); if (deltaMsecs < 0) { deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; } GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000);
Due to the 29 bits Mask, the bitAnd: (&) operation will result into a positive int. 0 <= ((ioMSecs()) & MillisecondClockMask) <= 0x1FFFFFFF
However, when we do the Roll-over thing: deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; Then we create a giant delay (the 6 days mentionned by Eliot) which is HIGHLY QUESTIONNABLE whatever 32 or 64 bits!
The difference between 32 and 64 bits is in the next line: GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000); Because sqInt deltaMsecs;
is 32 bits for a 32bits spur, 64bits for 64bits spur, then (deltaMsecs * 1000) will do two different things:
- in 32 bits it will overflow leading to a negative delay and a
nextWakeupUsecs in the past (but that's invoking Undefined Behavior)
- in 64 bits it won't overflow leading to he 6 days delay
Does that explain?
if (deltaMsecs < 0) I would either fail the primitive or set the deltaMsecs to zero. What do you think?
Nicolas
Anyway, the RollOver protection sounds strange: if clock was high, and if we don't apply the MillisecondClockMask to (clock+delay), then (clock+delay) won't overflow (thanks to largeIntegers). Eventually, for delay > 2^32-2^29 msecs, the primitive will fail thru (positive32BitValueOf). That's something like 43 days delay max, more than enough...
Either the code should rather be something like this with roll-over protection deltaMsecs = (msecs - ioMSecs()) & MillisecondClockMask; or we should leave deltaMsecs as is and reject negative delays
Also note this in 32 bits: Since *1000 consumes 10bits, then delays above 21 bits msecs will overflow and be signalled immediately. Tha's about 2e6 msecs, 2e3 secs. Thus the maximum delay supported by this primitive is a bit less than 36 minutes (16r80000000/1000/1000/60). So maybe we should also reject large positive "unreasonable" delays, that isanthing above a few minutes and document it.
I think I understand what you are saying now. Let me see if I can restate the problem:
Before the change to microsecond clock, the interpreter VM did the rollover check only in #checkForInterrupts, and the arithmetic did not require a 64 bit microsecond range. Now we have rollover logic in #primitiveSignalAtMilliseconds, which leads to two issues, one of which masks the other on 32 bits.
Issue #1: If we pass to the primitive a millisecond count that is slightly in the past (possibly due to clock skew or other timing issues), then the primitive will test for deltaMsecs less than zero, and will incorrectly assume that a millisecond clock rollover has happened.
Issue #2: When adjusting for clock rollover on 32 bits, deltaMsecs is declared as 32 bit sqInt, which results in overflow when converting to microseconds. For 64 bit sqInt, there is no overflow. This error masks the first problem of Issue #1, which makes the primiitive appear to work correctly on 32 bits, even though the clock rollover logic is wrong.
I do not think that we can simply fail the primitive, or set deltaMSeconds to zero, in the case of deltaMSeconds being less than zero. We still need to be able to schedule delays over a time period when the millisecond clock rolls over, so that needs to work.
We do need to fix the arithmetic overflow condition, which probably just requires declaring deltaMsecs as usqLong. This matches the declaration for nextWakeupUsecs and should prevent arithmetic overflow. But fixing this for issue #2 will unmask issue #1.
For issue #1, the solution probably requires a heuristic. If deltaMsecs is a big negative value, then the clock has rolled over. If it is a small negative value (maybe less than a second, just to pick a number), then it is probably clock skew and the MillisecondClockMask + 1 adjustment should not be applied.
Does this sound right?
I now have a working Cog/Spur build on my Ubuntu box, so I was able to try a few things. It turns out that the arithmetic overflow that Nicolas spotted seems to fully account for the problem that Juan originally reported on the 64 but Cuis image. The overflow happened here:
nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
Declaring deltaMsecs as a 64 bit unsigned to match the declaration of nextWakeupUsecs resolved the image lockup problem that Juan identified. I committed the fix in VMMaker.oscog-dtl.2147.
Dave
Thanks David!
On Sun, Mar 12, 2017 at 3:52 PM, David T. Lewis lewis@mail.msen.com wrote:
On Fri, Mar 10, 2017 at 11:49:43PM -0500, David T. Lewis wrote:
On Fri, Mar 10, 2017 at 10:19:42AM +0100, Nicolas Cellier wrote:
2017-03-10 9:45 GMT+01:00 Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com>:
2017-03-10 3:47 GMT+01:00 David T. Lewis lewis@mail.msen.com:
On Thu, Mar 09, 2017 at 12:41:44PM -0800, Eliot Miranda wrote:
Hi David,
On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis <
lewis@mail.msen.com>
wrote:
> > In the VM, the millisecond clock wraps within the 32 bit integer
range:
> > #define MillisecondClockMask 0x1FFFFFFF > > In the Cuis image, Delay class>>handleTimerEvent does this: > > nextTick := nextTick min: SmallInteger maxVal. > > On a 64-bit Spur image, SmallInteger maxVal is
16rFFFFFFFFFFFFFFF,
but on
> a 32-bit image it is 16r3FFFFFFF. > > Could that be it? > > I don't really know how to test in Squeak. As you say, Squeak
is now
> using the microsecond clock in #handleTimerEvent. I do not see
anything
> in primitiveSignalAtMilliseconds that would behave any
differently on
> a 64 bit versus 32 bit image or VM, but I do not know how to
test to
> be sure. >
I bet that the following code from primitiveSignalAtMilliseconds
ends up
wrapping when given Delay primSignal: s atMilliseconds: Time primMillisecondClock
deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask). deltaMsecs < 0 ifTrue: [deltaMsecs := deltaMsecs + MillisecondClockMask + 1]. nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
and I bet you'll find that the VM will wake up in about 6 days
and 5
hours
;-)
No. The failure is specific to the 64 bit VM. Source code for the primitive is the same in either case.
I suppose we could fix this, but I'm *much* happier to simply not
use
primitiveSignalAtMilliseconds and stay with the simpler and
wrapping-immune
primitiveSignalAtUTCMicroseconds
Fair enough, but given that there is a demonstrated bug that
affects only
the 64-bit VM, and given that it expresses itself intermittently
and in
ways that affect only someone who is attempting to migrate their
existing V3
image to Spur, then I would say that it makes very good sense to take the
time
to fix it if we are able to do so. After all, there may be other
people
who will want to migrate V3 images to Spur, and there is no point in
making
the process needlessly difficult.
I do not have the solution, but maybe someone else can help. So I am asking for help here. Can someone with a working 64-bit build environment
please
check and see if what I said in an earlier email might make a
difference:
I see that ioMSecs() is declared as signed long (32 bits), but it
is
used in expression with a 64 bit usqInt. So maybe it needs a cast, or
maybe
the variables like msecs and deltaMsecs in
primitiveSignalAtMilliseconds
should be declared as 32 bit long and unsigned long to match the
actual
usage.
Thanks,
Dave
Hi David I just reviewed the generated code for primitiveSignalAtMilliseconds andhere is what i found:
The primitive will simply fail if fed with a negative integer,
thanks to:
msecs = positive32BitValueOf(msecsObj);
msecs is allways positive, since declared as usqInt msecs;
So far, so good.
Then deltaMsecs = msecs - ((ioMSecs()) & MillisecondClockMask); if (deltaMsecs < 0) { deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; } GIV(nextWakeupUsecs) = (ioUTCMicroseconds())
(deltaMsecs * 1000);
Due to the 29 bits Mask, the bitAnd: (&) operation will result into a positive int. 0 <= ((ioMSecs()) & MillisecondClockMask) <= 0x1FFFFFFF
However, when we do the Roll-over thing: deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; Then we create a giant delay (the 6 days mentionned by Eliot) which
is
HIGHLY QUESTIONNABLE whatever 32 or 64 bits!
The difference between 32 and 64 bits is in the next line: GIV(nextWakeupUsecs) = (ioUTCMicroseconds())
(deltaMsecs * 1000); Because sqInt deltaMsecs;
is 32 bits for a 32bits spur, 64bits for 64bits spur, then
(deltaMsecs *
- will do two different things:
- in 32 bits it will overflow leading to a negative delay and a
nextWakeupUsecs in the past (but that's invoking Undefined Behavior)
- in 64 bits it won't overflow leading to he 6 days delay
Does that explain?
if (deltaMsecs < 0) I would either fail the primitive or set the deltaMsecs to zero. What do you think?
Nicolas
Anyway, the RollOver protection sounds strange: if clock was high, and
if
we don't apply the MillisecondClockMask to (clock+delay), then (clock+delay) won't overflow (thanks to largeIntegers).
Eventually,
for delay > 2^32-2^29 msecs, the primitive will fail thru (positive32BitValueOf). That's something like 43 days delay max, more than enough...
Either the code should rather be something like this with roll-over protection deltaMsecs = (msecs - ioMSecs()) & MillisecondClockMask; or we should leave deltaMsecs as is and reject negative delays
Also note this in 32 bits: Since *1000 consumes 10bits, then delays above 21 bits msecs will
overflow
and be signalled immediately. Tha's about 2e6 msecs, 2e3 secs. Thus the maximum delay supported by
this
primitive is a bit less than 36 minutes (16r80000000/1000/1000/60). So maybe we should also reject large positive "unreasonable" delays,
that
isanthing above a few minutes and document it.
I think I understand what you are saying now. Let me see if I can restate the problem:
Before the change to microsecond clock, the interpreter VM did the rollover check only in #checkForInterrupts, and the arithmetic did not require a 64 bit microsecond range. Now we have rollover logic in #primitiveSignalAtMilliseconds, which leads to two issues, one of which masks the other on 32 bits.
Issue #1: If we pass to the primitive a millisecond count that is
slightly
in the past (possibly due to clock skew or other timing issues), then the primitive will test for deltaMsecs less than zero, and will incorrectly assume that a millisecond clock rollover has happened.
Issue #2: When adjusting for clock rollover on 32 bits, deltaMsecs is declared as 32 bit sqInt, which results in overflow when converting to microseconds. For 64 bit sqInt, there is no overflow. This error masks the first problem of Issue #1, which makes the primiitive appear to work correctly on 32 bits, even though the clock rollover logic is wrong.
I do not think that we can simply fail the primitive, or set
deltaMSeconds
to zero, in the case of deltaMSeconds being less than zero. We still need to be able to schedule delays over a time period when the millisecond clock rolls over, so that needs to work.
We do need to fix the arithmetic overflow condition, which probably just requires declaring deltaMsecs as usqLong. This matches the declaration for nextWakeupUsecs and should prevent arithmetic overflow. But fixing this for issue #2 will unmask issue #1.
For issue #1, the solution probably requires a heuristic. If deltaMsecs is a big negative value, then the clock has rolled over. If it is a small negative value (maybe less than a second, just to pick a number), then it is probably clock skew and the MillisecondClockMask + 1 adjustment should not be applied.
Does this sound right?
I now have a working Cog/Spur build on my Ubuntu box, so I was able to try a few things. It turns out that the arithmetic overflow that Nicolas spotted seems to fully account for the problem that Juan originally reported on the 64 but Cuis image. The overflow happened here:
nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
Declaring deltaMsecs as a 64 bit unsigned to match the declaration of nextWakeupUsecs resolved the image lockup problem that Juan identified. I committed the fix in VMMaker.oscog-dtl.2147.
Dave
Sorry to resurrect such an old thread but now that we declare usqLong deltaMsecs;
the compiler barks (and will eliminate dead code):
../../spurstacksrc/vm/gcc3x-interp.c:67892:19: warning: comparison of unsigned expression < 0 is always false [-Wtautological-compare] if (deltaMsecs < 0) {
So we might have to come back to this one.
2017-03-13 2:47 GMT+01:00 Eliot Miranda eliot.miranda@gmail.com:
Thanks David!
On Sun, Mar 12, 2017 at 3:52 PM, David T. Lewis lewis@mail.msen.com wrote:
On Fri, Mar 10, 2017 at 11:49:43PM -0500, David T. Lewis wrote:
On Fri, Mar 10, 2017 at 10:19:42AM +0100, Nicolas Cellier wrote:
2017-03-10 9:45 GMT+01:00 Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com>:
2017-03-10 3:47 GMT+01:00 David T. Lewis lewis@mail.msen.com:
On Thu, Mar 09, 2017 at 12:41:44PM -0800, Eliot Miranda wrote: > > Hi David, > > On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis <
lewis@mail.msen.com>
wrote: > > > > > In the VM, the millisecond clock wraps within the 32 bit
integer
range: > > > > #define MillisecondClockMask 0x1FFFFFFF > > > > In the Cuis image, Delay class>>handleTimerEvent does this: > > > > nextTick := nextTick min: SmallInteger maxVal. > > > > On a 64-bit Spur image, SmallInteger maxVal is
16rFFFFFFFFFFFFFFF,
but on > > a 32-bit image it is 16r3FFFFFFF. > > > > Could that be it? > > > > I don't really know how to test in Squeak. As you say, Squeak
is now
> > using the microsecond clock in #handleTimerEvent. I do not see anything > > in primitiveSignalAtMilliseconds that would behave any
differently on
> > a 64 bit versus 32 bit image or VM, but I do not know how to
test to
> > be sure. > > > > I bet that the following code from primitiveSignalAtMilliseconds
ends up
> wrapping when given > Delay primSignal: s atMilliseconds: Time
primMillisecondClock - 10.
> > deltaMsecs := msecs - (self ioMSecs bitAnd:
MillisecondClockMask).
> deltaMsecs < 0 ifTrue: > [deltaMsecs := deltaMsecs + MillisecondClockMask + 1]. > nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000). > > and I bet you'll find that the VM will wake up in about 6 days
and 5
hours > ;-) >
No. The failure is specific to the 64 bit VM. Source code for the primitive is the same in either case.
> I suppose we could fix this, but I'm *much* happier to simply
not use
> primitiveSignalAtMilliseconds and stay with the simpler and wrapping-immune > primitiveSignalAtUTCMicroseconds
Fair enough, but given that there is a demonstrated bug that
affects only
the 64-bit VM, and given that it expresses itself intermittently
and in
ways that affect only someone who is attempting to migrate their
existing V3
image to Spur, then I would say that it makes very good sense to take
the time
to fix it if we are able to do so. After all, there may be other
people
who will want to migrate V3 images to Spur, and there is no point in
making
the process needlessly difficult.
I do not have the solution, but maybe someone else can help. So I
am
asking for help here. Can someone with a working 64-bit build environment
please
check and see if what I said in an earlier email might make a
difference:
I see that ioMSecs() is declared as signed long (32 bits), but
it is
used in expression with a 64 bit usqInt. So maybe it needs a cast, or
maybe
the variables like msecs and deltaMsecs in
primitiveSignalAtMilliseconds
should be declared as 32 bit long and unsigned long to match the
actual
usage.
Thanks,
Dave
Hi David I just reviewed the generated code for primitiveSignalAtMilliseconds andhere is what i found:
The primitive will simply fail if fed with a negative integer,
thanks to:
msecs = positive32BitValueOf(msecsObj);
msecs is allways positive, since declared as usqInt msecs;
So far, so good.
Then deltaMsecs = msecs - ((ioMSecs()) & MillisecondClockMask); if (deltaMsecs < 0) { deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; } GIV(nextWakeupUsecs) =
(ioUTCMicroseconds()) +
(deltaMsecs * 1000);
Due to the 29 bits Mask, the bitAnd: (&) operation will result into
a
positive int. 0 <= ((ioMSecs()) & MillisecondClockMask) <= 0x1FFFFFFF
However, when we do the Roll-over thing: deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; Then we create a giant delay (the 6 days mentionned by Eliot) which
is
HIGHLY QUESTIONNABLE whatever 32 or 64 bits!
The difference between 32 and 64 bits is in the next line: GIV(nextWakeupUsecs) =
(ioUTCMicroseconds()) +
(deltaMsecs * 1000); Because sqInt deltaMsecs;
is 32 bits for a 32bits spur, 64bits for 64bits spur, then
(deltaMsecs *
- will do two different things:
- in 32 bits it will overflow leading to a negative delay and a
nextWakeupUsecs in the past (but that's invoking Undefined Behavior)
- in 64 bits it won't overflow leading to he 6 days delay
Does that explain?
if (deltaMsecs < 0) I would either fail the primitive or set the deltaMsecs to zero. What do you think?
Nicolas
Anyway, the RollOver protection sounds strange: if clock was high,
and if
we don't apply the MillisecondClockMask to (clock+delay), then (clock+delay) won't overflow (thanks to largeIntegers).
Eventually,
for delay > 2^32-2^29 msecs, the primitive will fail thru (positive32BitValueOf). That's something like 43 days delay max, more than enough...
Either the code should rather be something like this with roll-over protection deltaMsecs = (msecs - ioMSecs()) & MillisecondClockMask; or we should leave deltaMsecs as is and reject negative delays
Also note this in 32 bits: Since *1000 consumes 10bits, then delays above 21 bits msecs will
overflow
and be signalled immediately. Tha's about 2e6 msecs, 2e3 secs. Thus the maximum delay supported by
this
primitive is a bit less than 36 minutes (16r80000000/1000/1000/60). So maybe we should also reject large positive "unreasonable" delays,
that
isanthing above a few minutes and document it.
I think I understand what you are saying now. Let me see if I can
restate
the problem:
Before the change to microsecond clock, the interpreter VM did the rollover check only in #checkForInterrupts, and the arithmetic did not require a 64 bit microsecond range. Now we have rollover logic in #primitiveSignalAtMilliseconds, which leads to two issues, one of which masks the other on 32 bits.
Issue #1: If we pass to the primitive a millisecond count that is
slightly
in the past (possibly due to clock skew or other timing issues), then
the
primitive will test for deltaMsecs less than zero, and will incorrectly assume that a millisecond clock rollover has happened.
Issue #2: When adjusting for clock rollover on 32 bits, deltaMsecs is declared as 32 bit sqInt, which results in overflow when converting to microseconds. For 64 bit sqInt, there is no overflow. This error masks the first problem of Issue #1, which makes the primiitive appear to work correctly on 32 bits, even though the clock rollover logic is wrong.
I do not think that we can simply fail the primitive, or set
deltaMSeconds
to zero, in the case of deltaMSeconds being less than zero. We still
need
to be able to schedule delays over a time period when the millisecond clock rolls over, so that needs to work.
We do need to fix the arithmetic overflow condition, which probably just requires declaring deltaMsecs as usqLong. This matches the declaration for nextWakeupUsecs and should prevent arithmetic overflow. But fixing this for issue #2 will unmask issue #1.
For issue #1, the solution probably requires a heuristic. If deltaMsecs is a big negative value, then the clock has rolled over. If it is a
small
negative value (maybe less than a second, just to pick a number), then it is probably clock skew and the MillisecondClockMask + 1 adjustment should not be applied.
Does this sound right?
I now have a working Cog/Spur build on my Ubuntu box, so I was able to try a few things. It turns out that the arithmetic overflow that Nicolas spotted seems to fully account for the problem that Juan originally reported on the 64 but Cuis image. The overflow happened here:
nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
Declaring deltaMsecs as a 64 bit unsigned to match the declaration of nextWakeupUsecs resolved the image lockup problem that Juan identified. I committed the fix in VMMaker.oscog-dtl.2147.
Dave
-- _,,,^..^,,,_ best, Eliot
On Fri, Mar 10, 2017 at 09:45:07AM +0100, Nicolas Cellier wrote:
2017-03-10 3:47 GMT+01:00 David T. Lewis lewis@mail.msen.com:
On Thu, Mar 09, 2017 at 12:41:44PM -0800, Eliot Miranda wrote:
Hi David,
On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis lewis@mail.msen.com
wrote:
In the VM, the millisecond clock wraps within the 32 bit integer range:
#define MillisecondClockMask 0x1FFFFFFF
In the Cuis image, Delay class>>handleTimerEvent does this:
nextTick := nextTick min: SmallInteger maxVal.
On a 64-bit Spur image, SmallInteger maxVal is 16rFFFFFFFFFFFFFFF, but
on
a 32-bit image it is 16r3FFFFFFF.
Could that be it?
I don't really know how to test in Squeak. As you say, Squeak is now using the microsecond clock in #handleTimerEvent. I do not see anything in primitiveSignalAtMilliseconds that would behave any differently on a 64 bit versus 32 bit image or VM, but I do not know how to test to be sure.
I bet that the following code from primitiveSignalAtMilliseconds ends up wrapping when given Delay primSignal: s atMilliseconds: Time primMillisecondClock - 10.
deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask). deltaMsecs < 0 ifTrue: [deltaMsecs := deltaMsecs + MillisecondClockMask + 1]. nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
and I bet you'll find that the VM will wake up in about 6 days and 5
hours
;-)
No. The failure is specific to the 64 bit VM. Source code for the primitive is the same in either case.
I suppose we could fix this, but I'm *much* happier to simply not use primitiveSignalAtMilliseconds and stay with the simpler and
wrapping-immune
primitiveSignalAtUTCMicroseconds
Fair enough, but given that there is a demonstrated bug that affects only the 64-bit VM, and given that it expresses itself intermittently and in ways that affect only someone who is attempting to migrate their existing V3 image to Spur, then I would say that it makes very good sense to take the time to fix it if we are able to do so. After all, there may be other people who will want to migrate V3 images to Spur, and there is no point in making the process needlessly difficult.
I do not have the solution, but maybe someone else can help. So I am asking for help here. Can someone with a working 64-bit build environment please check and see if what I said in an earlier email might make a difference:
I see that ioMSecs() is declared as signed long (32 bits), but it is used in expression with a 64 bit usqInt. So maybe it needs a cast, or maybe the variables like msecs and deltaMsecs in primitiveSignalAtMilliseconds should be declared as 32 bit long and unsigned long to match the actual usage.
Thanks,
Dave
Hi David I just reviewed the generated code for primitiveSignalAtMilliseconds andhere is what i found:
The primitive will simply fail if fed with a negative integer, thanks to:
msecs = positive32BitValueOf(msecsObj);
msecs is allways positive, since declared as usqInt msecs;
So far, so good.
Then deltaMsecs = msecs - ((ioMSecs()) & MillisecondClockMask); if (deltaMsecs < 0) { deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; } GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000);
Due to the 29 bits Mask, the bitAnd: (&) operation will result into a positive int. 0 <= ((ioMSecs()) & MillisecondClockMask) <= 0x1FFFFFFF
However, when we do the Roll-over thing: deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; Then we create a giant delay (the 6 days mentionned by Eliot) which is HIGHLY QUESTIONNABLE whatever 32 or 64 bits!
The difference between 32 and 64 bits is in the next line: GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000); Because sqInt deltaMsecs;
is 32 bits for a 32bits spur, 64bits for 64bits spur, then (deltaMsecs * 1000) will do two different things:
- in 32 bits it will overflow leading to a negative delay and a
nextWakeupUsecs in the past (but that's invoking Undefined Behavior)
- in 64 bits it won't overflow leading to he 6 days delay
Does that explain?
if (deltaMsecs < 0) I would either fail the primitive or set the deltaMsecs to zero. What do you think?
Hi Nicolas,
Thanks very much for looking into this and for the explanation. I am away right now so I cannot look at it now, but it does seem likely to be a sign extension issue at least in part.
If you have a chance to try recompiling with changes to the generated code, Juan's test condition is just this:
s := Semaphore new. Delay primSignal: s atMilliseconds: Time primMillisecondClock - 10. s wait.
This locks the image on 64-bits, and not on 32-bits.
Thanks a lot! Dave
2017-03-10 15:14 GMT+01:00 David T. Lewis lewis@mail.msen.com:
On Fri, Mar 10, 2017 at 09:45:07AM +0100, Nicolas Cellier wrote:
2017-03-10 3:47 GMT+01:00 David T. Lewis lewis@mail.msen.com:
On Thu, Mar 09, 2017 at 12:41:44PM -0800, Eliot Miranda wrote:
Hi David,
On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis lewis@mail.msen.com
wrote:
In the VM, the millisecond clock wraps within the 32 bit integer
range:
#define MillisecondClockMask 0x1FFFFFFF
In the Cuis image, Delay class>>handleTimerEvent does this:
nextTick := nextTick min: SmallInteger maxVal.
On a 64-bit Spur image, SmallInteger maxVal is 16rFFFFFFFFFFFFFFF,
but
on
a 32-bit image it is 16r3FFFFFFF.
Could that be it?
I don't really know how to test in Squeak. As you say, Squeak is
now
using the microsecond clock in #handleTimerEvent. I do not see
anything
in primitiveSignalAtMilliseconds that would behave any differently
on
a 64 bit versus 32 bit image or VM, but I do not know how to test
to
be sure.
I bet that the following code from primitiveSignalAtMilliseconds
ends up
wrapping when given Delay primSignal: s atMilliseconds: Time primMillisecondClock -
deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask). deltaMsecs < 0 ifTrue: [deltaMsecs := deltaMsecs + MillisecondClockMask + 1]. nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
and I bet you'll find that the VM will wake up in about 6 days and 5
hours
;-)
No. The failure is specific to the 64 bit VM. Source code for the
primitive
is the same in either case.
I suppose we could fix this, but I'm *much* happier to simply not use primitiveSignalAtMilliseconds and stay with the simpler and
wrapping-immune
primitiveSignalAtUTCMicroseconds
Fair enough, but given that there is a demonstrated bug that affects
only
the 64-bit VM, and given that it expresses itself intermittently and in ways that affect only someone who is attempting to migrate their existing V3 image to Spur, then I would say that it makes very good sense to take the
time
to fix it if we are able to do so. After all, there may be other
people who
will want to migrate V3 images to Spur, and there is no point in
making the
process needlessly difficult.
I do not have the solution, but maybe someone else can help. So I am
asking
for help here. Can someone with a working 64-bit build environment
please
check and see if what I said in an earlier email might make a
difference:
I see that ioMSecs() is declared as signed long (32 bits), but it is
used
in expression with a 64 bit usqInt. So maybe it needs a cast, or
maybe
the variables like msecs and deltaMsecs in
primitiveSignalAtMilliseconds
should be declared as 32 bit long and unsigned long to match the
actual
usage.
Thanks,
Dave
Hi David I just reviewed the generated code for primitiveSignalAtMilliseconds andhere is what i found:
The primitive will simply fail if fed with a negative integer, thanks to:
msecs = positive32BitValueOf(msecsObj);
msecs is allways positive, since declared as usqInt msecs;
So far, so good.
Then deltaMsecs = msecs - ((ioMSecs()) & MillisecondClockMask); if (deltaMsecs < 0) { deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; } GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000);
Due to the 29 bits Mask, the bitAnd: (&) operation will result into a positive int. 0 <= ((ioMSecs()) & MillisecondClockMask) <= 0x1FFFFFFF
However, when we do the Roll-over thing: deltaMsecs = (deltaMsecs + MillisecondClockMask) + 1; Then we create a giant delay (the 6 days mentionned by Eliot) which is HIGHLY QUESTIONNABLE whatever 32 or 64 bits!
The difference between 32 and 64 bits is in the next line: GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) + (deltaMsecs * 1000); Because sqInt deltaMsecs;
is 32 bits for a 32bits spur, 64bits for 64bits spur, then (deltaMsecs * 1000) will do two different things:
- in 32 bits it will overflow leading to a negative delay and a
nextWakeupUsecs in the past (but that's invoking Undefined Behavior)
- in 64 bits it won't overflow leading to he 6 days delay
Does that explain?
if (deltaMsecs < 0) I would either fail the primitive or set the
deltaMsecs
to zero. What do you think?
Hi Nicolas,
Thanks very much for looking into this and for the explanation. I am away right now so I cannot look at it now, but it does seem likely to be a sign extension issue at least in part.
If you have a chance to try recompiling with changes to the generated code, Juan's test condition is just this:
s := Semaphore new. Delay primSignal: s atMilliseconds: Time primMillisecondClock - 10. s wait.
This locks the image on 64-bits, and not on 32-bits.
Thanks a lot! Dave
Hi Dave, No, it's not a sign extension problem.
The code is broken. The rollover protection is undue because we inject (clock+delay) and then subtract next_clock, In case of clock rollover, next_clock falls to zero and we will have a big delay, not a negative one. The rollover protection also transform a small negative delay into big delay.
Things seems to work in 32 bits just because we have an integer overflow when we convert msecs to usecs. big_delay_in_msec * 1000 => overflow => negative_delay_in_usecs
Hi Juan, Hi All,
On Mon, Mar 6, 2017 at 6:06 AM, Juan Vuletich JuanVuletich@zoho.com wrote:
I finally understood what was happening. The problem with is primitive 136:
primSignal: aSemaphore atMilliseconds: aSmallInteger "Signal the semaphore when the millisecond clock reaches the value of the second argument. Fail if the first argument is neither a Semaphore nor nil. Essential. See Object documentation whatIsAPrimitive." <primitive: 136> ^self primitiveFailed
Yes, but primitive 136 *should not be used* :-). You should use primitive 242: primSignal: aSemaphore atUTCMicroseconds: anInteger "Signal the semaphore when the UTC microsecond clock reaches the value of the second argument. Fail if the first argument is neither a Semaphore nor nil, or if the second argument is not an integer. Essential. See Object documentation whatIsAPrimitive." <primitive: 242> ^self primitiveFailed
The use of the UTC microsecond count eliminates any and all wrapping issues for about 50,000 years [puhleaze Mr Pruitt can you try and do your bit to make it possible that we'll make it that far. hic. ed.]. The old primitive 136 primSignal:atMilliseconds: is subject to wrapping problems. I see that primSignal:atMilliseconds: is still used in Squeak 5.x but only to turn off signaling:
Delay class>>shutDown "Suspend the active delay, if any, before snapshotting. It will be reactived when the snapshot is resumed." "Details: This prevents a timer interrupt from waking up the active delay in the midst snapshoting, since the active delay will be restarted when resuming the snapshot and we don't want to process the delay twice."
AccessProtect wait. self primSignal: nil atMilliseconds: 0. self saveResumptionTimes. DelaySuspended := true.
Delay class>>stopTimerInterruptWatcher "Reset the class variables that keep track of active Delays and re-start the timer interrupt watcher process. Any currently scheduled delays are forgotten." "Delay startTimerInterruptWatcher" self primSignal: nil atMilliseconds: 0. TimingSemaphore ifNotNil:[TimingSemaphore terminateProcess].
I will change this now.
HTH
If the requested tick is already in the past, all the VMs except for Spur-64 will signal the semaphore anyway. Spur-64 won't. Taking care of this in the Cuis image was easy enough.
Squeak seems not to be affected because it uses a newer primitive, that takes the large integer microsecond ticker, that might have the older behavior (signaling anyway). I'm not really sure, and maybe someone could check the primitive behavior and the corresponding assumptions in Squeak.
Older Squeak images are not affected, as they are all 32-bit. So maybe there's no action required. In any case, making the behavior consistent across VMs and in both primitives would be nice.
Cheers,
On 3/3/2017 3:55 PM, Juan Vuletich wrote:
Hi Folks,
Today, I was able to get closer to the origin of the problem. The problem is when Morphic tries to do a Delay (inter-cycle-pause) of 1 or two milliseconds. (Usually they are quite larger, and the problem does not appear). With the 64-bit VM, sometimes very short delays hang the complete system. I don't know why, but this:
waitDelay _ Delay forMilliseconds: 50. [ true ] whileTrue: [ 1000 atRandom print. waitDelay setDelay: 1; wait ].
is enough to hang Cuis in a short time. When trying to reproduce the problem in Squeak I had mixed results. It doesn't seem to hang there, but if I let this run for a couple of minutes and then try to halt with alt-. , then the hang occurs, and Squeak becomes irresponsibe. As a workaround, in Cuis in the inter cycle pause, I'm avoiding very short delays.
I'm not sure if the bug is in the VM, in the image, or both, but it is still there, and it also affects Squeak. I haven't tried it, but Pharo might also have the problem (if not, it can show a possible solution).
BTW, Bert, I guess Squeak never does short delays in #interCyclePause: ...
Thanks,
On 2/6/2017 12:28 PM, Bert Freudenberg wrote:
Yes, I tried with Spur64. Have not seen it freeze yet on Mac in Squeak, but we possibly have not replicated the test well enough yet.
It *did* freeze in Cuis Spur64 on Mac too.
- Bert -
On Mon, Feb 6, 2017 at 2:08 PM, Juan Vuletich juanvuletich@zoho.com wrote:
Were you usig Spur64? On Debian it does not freeze with CogV3 or Spur32. Only with Spur64.
Thanks, Juan
Sent using Zoho Mail https://www.zoho.com/mail/
---- On Mon, 06 Feb 2017 09:19:48 -0300 * bert@freudenbergs.de bert@freudenbergs.de * wrote ----
On Mon, Feb 6, 2017 at 3:15 AM, David T. Lewis lewis@mail.msen.com wrote:
It would be interesting to try reproducing this in a Squeak image, but I have not quite figured out how to make the Tokenish example work in Squeak.
See attachment. Did not freeze on Mac with 10 Tokenish jumping.
- Bert -
-- Juan Vuletichwww.cuis-smalltalk.orghttps://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev @JuanVuletich
-- Juan Vuletichwww.cuis-smalltalk.orghttps://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev @JuanVuletich
On Mon, Feb 06, 2017 at 01:19:48PM +0100, Bert Freudenberg wrote:
On Mon, Feb 6, 2017 at 3:15 AM, David T. Lewis lewis@mail.msen.com wrote:
It would be interesting to try reproducing this in a Squeak image, but I have not quite figured out how to make the Tokenish example work in Squeak.
See attachment. Did not freeze on Mac with 10 Tokenish jumping.
- Bert -
I am running it now on Ubuntu 64-bit Spur Squeak, with the same VM that I used with Cuis. I have been running a dozen Tokenish morphs for about an hour, so far no freeze. The morphs are bouncing much more slowly on Squeak than on Cuis, so I think that the test is loading Cuis much more heavily.
I cannot look further now, but I'll take another look in about 12 hours and see if I can speed up the Squeak animation and possibly freeze the image.
Dave
On Mon, Feb 6, 2017 at 3:11 PM, David T. Lewis lewis@mail.msen.com wrote:
On Mon, Feb 06, 2017 at 01:19:48PM +0100, Bert Freudenberg wrote:
On Mon, Feb 6, 2017 at 3:15 AM, David T. Lewis lewis@mail.msen.com
wrote:
It would be interesting to try reproducing this in a Squeak image, but
I
have not quite figured out how to make the Tokenish example work in
Squeak.
See attachment. Did not freeze on Mac with 10 Tokenish jumping.
- Bert -
I am running it now on Ubuntu 64-bit Spur Squeak, with the same VM that I used with Cuis. I have been running a dozen Tokenish morphs for about an hour, so far no freeze. The morphs are bouncing much more slowly on Squeak than on Cuis, so I think that the test is loading Cuis much more heavily.
I cannot look further now, but I'll take another look in about 12 hours and see if I can speed up the Squeak animation and possibly freeze the image.
With a single Tokenish morph and enabling the higherPerformance preference, the animation speed looks pretty much the same in Squeak. But with multiple ones animating, perf goes way down in Squeak :(
- Bert -
On 06/02/2017 11:11 a.m., David T. Lewis wrote:
On Mon, Feb 06, 2017 at 01:19:48PM +0100, Bert Freudenberg wrote:
On Mon, Feb 6, 2017 at 3:15 AM, David T. Lewislewis@mail.msen.com wrote:
It would be interesting to try reproducing this in a Squeak image, but I have not quite figured out how to make the Tokenish example work in Squeak.
See attachment. Did not freeze on Mac with 10 Tokenish jumping.
- Bert -
I am running it now on Ubuntu 64-bit Spur Squeak, with the same VM that I used with Cuis. I have been running a dozen Tokenish morphs for about an hour, so far no freeze. The morphs are bouncing much more slowly on Squeak than on Cuis, so I think that the test is loading Cuis much more heavily.
I cannot look further now, but I'll take another look in about 12 hours and see if I can speed up the Squeak animation and possibly freeze the image.
Dave
Thanks Dave. This is useful information.
Cheers,
At this point, I would simply install a REPL and would try to connect to the image after the morphic freeze to see what happened. I'm still not convinced about this being a VM bug. It might as well be an image side bug, which simply doesn't happen on windows due to VM differences.
Levente
On 02/05/2017 06:31 PM, Levente Uzonyi wrote:
At this point, I would simply install a REPL and would try to connect to the image after the morphic freeze to see what happened. I'm still not convinced about this being a VM bug. It might as well be an image side bug, which simply doesn't happen on windows due to VM differences.
In looking for REPL (it's in the Cuis base as Transcripter > readEvalPrint) I found a new way to get a hang-up. In a workspace do:
(Transcripter newInFrame: (0@0 extent: 100@200)) nextPutAll: 'Hello there'; endEntry; newLine; print: 355.0/113; endEntry; readEvalPrint.
...and the clock stops. 'top' reports 100 %CPU for squeak.
- Dan
On Mon, Feb 6, 2017 at 7:10 PM, Dan Norton dnorton@mindspring.com wrote:
In looking for REPL (it's in the Cuis base as Transcripter > readEvalPrint) I found a new way to get a hang-up. In a workspace do:
(Transcripter newInFrame: (0@0 extent: 100@200)) nextPutAll: 'Hello there'; endEntry; newLine; print: 355.0/113; endEntry; readEvalPrint.
...and the clock stops. 'top' reports 100 %CPU for squeak.
That is normal behavior - the REPL is not Morphic. Just type "quit" when you're done.
- Bert -
On 02/06/2017 01:16 PM, Bert Freudenberg wrote:
On Mon, Feb 6, 2017 at 7:10 PM, Dan Norton <dnorton@mindspring.com mailto:dnorton@mindspring.com> wrote:
In looking for REPL (it's in the Cuis base as Transcripter > readEvalPrint) I found a new way to get a hang-up. In a workspace do: (Transcripter newInFrame: (0@0 extent: 100@200)) nextPutAll: 'Hello there'; endEntry; newLine; print: 355.0/113; endEntry; readEvalPrint. ...and the clock stops. 'top' reports 100 %CPU for squeak.
That is normal behavior - the REPL is not Morphic. Just type "quit" when you're done.
OK, I'll try to apply it. Meanwhile the attached screen shot shows a hang-up which occurred as I tried to open a browser. The only things done at this point were to open a workspace, pin the Windows menu, load the Freeze package, and then try to open a browser.
- Dan
On Mon, Jan 30, 2017 at 8:58 PM, Dan Norton dnorton@mindspring.com wrote:
# strace -p 2475
produces lots of... nanosleep({0, 0}, 0x7fff9b9b6060) = 0 select(4, [3], [], [3], {0, 0}) = 0 (Timeout) recvmsg(3, 0x7fff9b9b5df0, 0) = -1 EAGAIN (Resource temporarily unavailable) select(4, [3], [], [3], {0, 0}) = 0 (Timeout) nanosleep({0, 0}, 0x7fff9b9b6060) = 0 select(4, [3], [], [3], {0, 0}) = 0 (Timeout) recvmsg(3, 0x7fff9b9b5df0, 0) = -1 EAGAIN (Resource temporarily unavailable) select(4, [3], [], [3], {0, 0}) = 0 (Timeout) ^Cnanosleep({0, 0}, Process 2475 detached <detached ...>
... before my packages are loaded as well as after the freeze. Identical AFAICT. alt-. is not processed after the freeze.
Normally I'd expect the VM to do a read() call after pressing a key. The select() call is to see if something is available, and if so, it should be received. Not sure why it's calling recvmsg() even though select() timed out ... Is this the normal behavior? I don't have Linux running to check.
When not frozen, pressing a key should result in recvmsg() retrieving that key event.
- Bert -
On 01/29/2017 06:02 AM, Levente Uzonyi wrote:
Hi Dan,
The expected output is in the file you attached. There seem to be no active processes to run, so only the idle process is being executed:
All Smalltalk process stacks (active first): Process 0x1c9a9c8 priority 10 0x7ffc012e81c0 M ProcessorScheduler class>idleProcess 0x15d2a30: a(n) ProcessorScheduler class 0x7ffc012e8200 I [] in ProcessorScheduler class>startUp 0x15d2a30: a(n) ProcessorScheduler class 0x7ffc012e8240 I [] in BlockClosure>newProcess 0x1d1aa40: a(n) BlockClosure
Most recent primitives relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: relinquishProcessorForMicroseconds: ...
Among the suspended processes is the UI process, which is waiting on a Delay. So, there may be issues on the image side with Delays.
Process 0x1c9a790 priority 40 0x7ffc012e60d8 M [] in Delay>wait 0x1c9ab48: a(n) Delay 0x7ffc012e6118 M BlockClosure>ifCurtailed: 0x11b1228: a(n) BlockClosure 0x7ffc012e6150 M Delay>wait 0x1c9ab48: a(n) Delay 0x7ffc012e6190 M WorldState>doOneCycleFor: 0x15cd6e0: a(n) WorldState 0x7ffc012e61c8 M PasteUpMorph>doOneCycle 0x165f890: a(n) PasteUpMorph 0x7ffc012e6200 M [] in ProjectX class>spawnNewMorphicProcessFor: 0x15d7900: a(n) ProjectX class 0x7ffc012e6240 I [] in BlockClosure>newProcess 0x1d1a940: a(n) BlockClosure
I don't see any sign of stack overflow you mentioned.
There are 3 sets of USR1 outputs. The overflows are at lines 600 and 622 in the second(?) output.
- Dan
vm-dev@lists.squeakfoundation.org