[Vm-dev] We need help from VM experts. Re: Freeze after Morph& nbsp; & nbsp; & nbsp; & nbsp; Activity

Eliot Miranda eliot.miranda at gmail.com
Mon Mar 13 01:47:55 UTC 2017


Thanks David!

On Sun, Mar 12, 2017 at 3:52 PM, David T. Lewis <lewis at mail.msen.com> wrote:

>
> On Fri, Mar 10, 2017 at 11:49:43PM -0500, David T. Lewis wrote:
> >
> > On Fri, Mar 10, 2017 at 10:19:42AM +0100, Nicolas Cellier wrote:
> > >
> > > 2017-03-10 9:45 GMT+01:00 Nicolas Cellier <
> > > nicolas.cellier.aka.nice at gmail.com>:
> > >
> > > >
> > > >
> > > > 2017-03-10 3:47 GMT+01:00 David T. Lewis <lewis at mail.msen.com>:
> > > >
> > > >>
> > > >> On Thu, Mar 09, 2017 at 12:41:44PM -0800, Eliot Miranda wrote:
> > > >> >
> > > >> > Hi David,
> > > >> >
> > > >> > On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis <
> lewis at mail.msen.com>
> > > >> wrote:
> > > >> >
> > > >> > >
> > > >> > > In the VM, the millisecond clock wraps within the 32 bit integer
> > > >> range:
> > > >> > >
> > > >> > >   #define MillisecondClockMask 0x1FFFFFFF
> > > >> > >
> > > >> > > In the Cuis image, Delay class>>handleTimerEvent does this:
> > > >> > >
> > > >> > >   nextTick := nextTick min: SmallInteger maxVal.
> > > >> > >
> > > >> > > On a 64-bit Spur image, SmallInteger maxVal is
> 16rFFFFFFFFFFFFFFF,
> > > >> but on
> > > >> > > a 32-bit image it is 16r3FFFFFFF.
> > > >> > >
> > > >> > > Could that be it?
> > > >> > >
> > > >> > > I don't really know how to test in Squeak. As you say, Squeak
> is now
> > > >> > > using the microsecond clock in #handleTimerEvent. I do not see
> > > >> anything
> > > >> > > in primitiveSignalAtMilliseconds that would behave any
> differently on
> > > >> > > a 64 bit versus 32 bit image or VM, but I do not know how to
> test to
> > > >> > > be sure.
> > > >> > >
> > > >> >
> > > >> > I bet that the following code from primitiveSignalAtMilliseconds
> ends up
> > > >> > wrapping when given
> > > >> >     Delay primSignal: s atMilliseconds: Time primMillisecondClock
> - 10.
> > > >> >
> > > >> > deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask).
> > > >> > deltaMsecs < 0 ifTrue:
> > > >> > [deltaMsecs := deltaMsecs + MillisecondClockMask + 1].
> > > >> > nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
> > > >> >
> > > >> > and I bet you'll find that the VM will wake up in about 6 days
> and 5
> > > >> hours
> > > >> > ;-)
> > > >> >
> > > >>
> > > >> No. The failure is specific to the 64 bit VM. Source code for the
> > > >> primitive
> > > >> is the same in either case.
> > > >>
> > > >>
> > > > >
> > > >> > I suppose we could fix this, but I'm *much* happier to simply not
> use
> > > >> > primitiveSignalAtMilliseconds and stay with the simpler and
> > > >> wrapping-immune
> > > >> > primitiveSignalAtUTCMicroseconds
> > > >>
> > > >> Fair enough, but given that there is a demonstrated bug that
> affects only
> > > >> the 64-bit VM, and given that it expresses itself intermittently
> and in
> > > >> ways
> > > >> that affect only someone who is attempting to migrate their
> existing V3
> > > >> image
> > > >> to Spur, then I would say that it makes very good sense to take the
> time
> > > >> to fix it if we are able to do so. After all, there may be other
> people
> > > >> who
> > > >> will want to migrate V3 images to Spur, and there is no point in
> making
> > > >> the
> > > >> process needlessly difficult.
> > > >>
> > > >> I do not have the solution, but maybe someone else can help. So I am
> > > >> asking
> > > >> for help here. Can someone with a working 64-bit build environment
> please
> > > >> check and see if what I said in an earlier email might make a
> difference:
> > > >>
> > > >>   I see that ioMSecs() is declared as signed long (32 bits), but it
> is
> > > >> used
> > > >>   in expression with a 64 bit usqInt. So maybe it needs a cast, or
> maybe
> > > >>   the variables like msecs and deltaMsecs in
> primitiveSignalAtMilliseconds
> > > >>   should be declared as 32 bit long and unsigned long to match the
> actual
> > > >> usage.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Dave
> > > >>
> > > >>
> > > >
> > > > Hi David
> > > > I just reviewed the generated code for primitiveSignalAtMilliseconds
> > > > andhere is what i found:
> > > >
> > > > The primitive will simply fail if fed with a negative integer,
> thanks to:
> > > >
> > > >         msecs = positive32BitValueOf(msecsObj);
> > > >
> > > > msecs is allways positive, since declared as
> > > >     usqInt msecs;
> > > >
> > > > So far, so good.
> > > >
> > > > Then
> > > >                         deltaMsecs = msecs - ((ioMSecs()) &
> > > > MillisecondClockMask);
> > > >                         if (deltaMsecs < 0) {
> > > >                                 deltaMsecs = (deltaMsecs +
> > > > MillisecondClockMask) + 1;
> > > >                         }
> > > >                         GIV(nextWakeupUsecs) = (ioUTCMicroseconds())
> +
> > > > (deltaMsecs * 1000);
> > > >
> > > > Due to the 29 bits Mask, the bitAnd: (&) operation will result into a
> > > > positive int.
> > > > 0 <= ((ioMSecs()) & MillisecondClockMask) <= 0x1FFFFFFF
> > > >
> > > > However, when we do the Roll-over thing:
> > > >                                 deltaMsecs = (deltaMsecs +
> > > > MillisecondClockMask) + 1;
> > > > Then we create a giant delay (the 6 days mentionned by Eliot) which
> is
> > > > HIGHLY QUESTIONNABLE whatever 32 or 64 bits!
> > > >
> > > > The difference between 32 and 64 bits is in the next line:
> > > >                         GIV(nextWakeupUsecs) = (ioUTCMicroseconds())
> +
> > > > (deltaMsecs * 1000);
> > > > Because
> > > >     sqInt deltaMsecs;
> > > >
> > > > is 32 bits for a 32bits spur, 64bits for 64bits spur, then
> (deltaMsecs *
> > > > 1000) will do two different things:
> > > > - in 32 bits it will overflow leading to a negative delay and a
> > > > nextWakeupUsecs in the past (but that's invoking Undefined Behavior)
> > > > - in 64 bits it won't overflow leading to he 6 days delay
> > > >
> > > > Does that explain?
> > > >
> > > > if (deltaMsecs < 0) I would either fail the primitive or set the
> > > > deltaMsecs to zero.
> > > > What do you think?
> > > >
> > > >  Nicolas
> > > >
> > > >
> > >
> > > Anyway, the RollOver protection sounds strange: if clock was high, and
> if
> > > we don't apply the MillisecondClockMask to (clock+delay),
> > > then (clock+delay) won't overflow (thanks to largeIntegers).
> Eventually,
> > > for delay > 2^32-2^29 msecs, the primitive will fail thru
> > > (positive32BitValueOf).
> > > That's something like 43 days delay max, more than enough...
> > >
> > > Either the code should rather be something like this with roll-over
> > > protection
> > >                         deltaMsecs = (msecs - ioMSecs()) &
> > > MillisecondClockMask;
> > > or we should leave deltaMsecs as is and reject negative delays
> > >
> > > Also note this in 32 bits:
> > > Since *1000 consumes 10bits, then delays above 21 bits msecs will
> overflow
> > > and be signalled immediately.
> > > Tha's about 2e6 msecs, 2e3 secs. Thus the maximum delay supported by
> this
> > > primitive is a bit less than 36 minutes (16r80000000/1000/1000/60).
> > > So maybe we should also reject large positive "unreasonable" delays,
> that
> > > isanthing above a few minutes and document it.
> >
> > I think I understand what you are saying now. Let me see if I can restate
> > the problem:
> >
> > Before the change to microsecond clock, the interpreter VM did the
> > rollover check only in #checkForInterrupts, and the arithmetic did not
> > require a 64 bit microsecond range. Now we have rollover logic in
> > #primitiveSignalAtMilliseconds, which leads to two issues, one of which
> > masks the other on 32 bits.
> >
> > Issue #1: If we pass to the primitive a millisecond count that is
> slightly
> > in the past (possibly due to clock skew or other timing issues), then the
> > primitive will test for deltaMsecs less than zero, and will incorrectly
> > assume that a millisecond clock rollover has happened.
> >
> > Issue #2: When adjusting for clock rollover on 32 bits, deltaMsecs is
> > declared as 32 bit sqInt, which results in overflow when converting to
> > microseconds. For 64 bit sqInt, there is no overflow. This error masks
> > the first problem of Issue #1, which makes the primiitive appear to work
> > correctly on 32 bits, even though the clock rollover logic is wrong.
> >
> > I do not think that we can simply fail the primitive, or set
> deltaMSeconds
> > to zero, in the case of deltaMSeconds being less than zero. We still need
> > to be able to schedule delays over a time period when the millisecond
> > clock rolls over, so that needs to work.
> >
> > We do need to fix the arithmetic overflow condition, which probably just
> > requires declaring deltaMsecs as usqLong. This matches the declaration
> > for nextWakeupUsecs and should prevent arithmetic overflow. But fixing
> > this for issue #2 will unmask issue #1.
> >
> > For issue #1, the solution probably requires a heuristic. If deltaMsecs
> > is a big negative value, then the clock has rolled over. If it is a small
> > negative value (maybe less than a second, just to pick a number), then
> > it is probably clock skew and the MillisecondClockMask + 1 adjustment
> > should not be applied.
> >
> > Does this sound right?
> >
>
> I now have a working Cog/Spur build on my Ubuntu box, so I was able to
> try a few things. It turns out that the arithmetic overflow that Nicolas
> spotted seems to fully account for the problem that Juan originally
> reported on the 64 but Cuis image. The overflow happened here:
>
>   nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
>
> Declaring deltaMsecs as a 64 bit unsigned to match the declaration of
> nextWakeupUsecs resolved the image lockup problem that Juan identified.
> I committed the fix in VMMaker.oscog-dtl.2147.
>
> Dave
>
>


-- 
_,,,^..^,,,_
best, Eliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20170312/c09969ac/attachment-0001.html>


More information about the Vm-dev mailing list