[Vm-dev] We need help from VM experts. Re: Freeze after Morph& nbsp; & nbsp; & nbsp; & nbsp; Activity

David T. Lewis lewis at mail.msen.com
Sat Mar 11 04:49:43 UTC 2017


On Fri, Mar 10, 2017 at 10:19:42AM +0100, Nicolas Cellier wrote:
>  
> 2017-03-10 9:45 GMT+01:00 Nicolas Cellier <
> nicolas.cellier.aka.nice at gmail.com>:
> 
> >
> >
> > 2017-03-10 3:47 GMT+01:00 David T. Lewis <lewis at mail.msen.com>:
> >
> >>
> >> On Thu, Mar 09, 2017 at 12:41:44PM -0800, Eliot Miranda wrote:
> >> >
> >> > Hi David,
> >> >
> >> > On Mon, Mar 6, 2017 at 5:50 PM, David T. Lewis <lewis at mail.msen.com>
> >> wrote:
> >> >
> >> > >
> >> > > In the VM, the millisecond clock wraps within the 32 bit integer
> >> range:
> >> > >
> >> > >   #define MillisecondClockMask 0x1FFFFFFF
> >> > >
> >> > > In the Cuis image, Delay class>>handleTimerEvent does this:
> >> > >
> >> > >   nextTick := nextTick min: SmallInteger maxVal.
> >> > >
> >> > > On a 64-bit Spur image, SmallInteger maxVal is 16rFFFFFFFFFFFFFFF,
> >> but on
> >> > > a 32-bit image it is 16r3FFFFFFF.
> >> > >
> >> > > Could that be it?
> >> > >
> >> > > I don't really know how to test in Squeak. As you say, Squeak is now
> >> > > using the microsecond clock in #handleTimerEvent. I do not see
> >> anything
> >> > > in primitiveSignalAtMilliseconds that would behave any differently on
> >> > > a 64 bit versus 32 bit image or VM, but I do not know how to test to
> >> > > be sure.
> >> > >
> >> >
> >> > I bet that the following code from primitiveSignalAtMilliseconds ends up
> >> > wrapping when given
> >> >     Delay primSignal: s atMilliseconds: Time primMillisecondClock - 10.
> >> >
> >> > deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask).
> >> > deltaMsecs < 0 ifTrue:
> >> > [deltaMsecs := deltaMsecs + MillisecondClockMask + 1].
> >> > nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
> >> >
> >> > and I bet you'll find that the VM will wake up in about 6 days and 5
> >> hours
> >> > ;-)
> >> >
> >>
> >> No. The failure is specific to the 64 bit VM. Source code for the
> >> primitive
> >> is the same in either case.
> >>
> >>
> > >
> >> > I suppose we could fix this, but I'm *much* happier to simply not use
> >> > primitiveSignalAtMilliseconds and stay with the simpler and
> >> wrapping-immune
> >> > primitiveSignalAtUTCMicroseconds
> >>
> >> Fair enough, but given that there is a demonstrated bug that affects only
> >> the 64-bit VM, and given that it expresses itself intermittently and in
> >> ways
> >> that affect only someone who is attempting to migrate their existing V3
> >> image
> >> to Spur, then I would say that it makes very good sense to take the time
> >> to fix it if we are able to do so. After all, there may be other people
> >> who
> >> will want to migrate V3 images to Spur, and there is no point in making
> >> the
> >> process needlessly difficult.
> >>
> >> I do not have the solution, but maybe someone else can help. So I am
> >> asking
> >> for help here. Can someone with a working 64-bit build environment please
> >> check and see if what I said in an earlier email might make a difference:
> >>
> >>   I see that ioMSecs() is declared as signed long (32 bits), but it is
> >> used
> >>   in expression with a 64 bit usqInt. So maybe it needs a cast, or maybe
> >>   the variables like msecs and deltaMsecs in primitiveSignalAtMilliseconds
> >>   should be declared as 32 bit long and unsigned long to match the actual
> >> usage.
> >>
> >> Thanks,
> >>
> >> Dave
> >>
> >>
> >
> > Hi David
> > I just reviewed the generated code for primitiveSignalAtMilliseconds
> > andhere is what i found:
> >
> > The primitive will simply fail if fed with a negative integer, thanks to:
> >
> >         msecs = positive32BitValueOf(msecsObj);
> >
> > msecs is allways positive, since declared as
> >     usqInt msecs;
> >
> > So far, so good.
> >
> > Then
> >                         deltaMsecs = msecs - ((ioMSecs()) &
> > MillisecondClockMask);
> >                         if (deltaMsecs < 0) {
> >                                 deltaMsecs = (deltaMsecs +
> > MillisecondClockMask) + 1;
> >                         }
> >                         GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) +
> > (deltaMsecs * 1000);
> >
> > Due to the 29 bits Mask, the bitAnd: (&) operation will result into a
> > positive int.
> > 0 <= ((ioMSecs()) & MillisecondClockMask) <= 0x1FFFFFFF
> >
> > However, when we do the Roll-over thing:
> >                                 deltaMsecs = (deltaMsecs +
> > MillisecondClockMask) + 1;
> > Then we create a giant delay (the 6 days mentionned by Eliot) which is
> > HIGHLY QUESTIONNABLE whatever 32 or 64 bits!
> >
> > The difference between 32 and 64 bits is in the next line:
> >                         GIV(nextWakeupUsecs) = (ioUTCMicroseconds()) +
> > (deltaMsecs * 1000);
> > Because
> >     sqInt deltaMsecs;
> >
> > is 32 bits for a 32bits spur, 64bits for 64bits spur, then (deltaMsecs *
> > 1000) will do two different things:
> > - in 32 bits it will overflow leading to a negative delay and a
> > nextWakeupUsecs in the past (but that's invoking Undefined Behavior)
> > - in 64 bits it won't overflow leading to he 6 days delay
> >
> > Does that explain?
> >
> > if (deltaMsecs < 0) I would either fail the primitive or set the
> > deltaMsecs to zero.
> > What do you think?
> >
> >  Nicolas
> >
> >
> 
> Anyway, the RollOver protection sounds strange: if clock was high, and if
> we don't apply the MillisecondClockMask to (clock+delay),
> then (clock+delay) won't overflow (thanks to largeIntegers). Eventually,
> for delay > 2^32-2^29 msecs, the primitive will fail thru
> (positive32BitValueOf).
> That's something like 43 days delay max, more than enough...
> 
> Either the code should rather be something like this with roll-over
> protection
>                         deltaMsecs = (msecs - ioMSecs()) &
> MillisecondClockMask;
> or we should leave deltaMsecs as is and reject negative delays
> 
> Also note this in 32 bits:
> Since *1000 consumes 10bits, then delays above 21 bits msecs will overflow
> and be signalled immediately.
> Tha's about 2e6 msecs, 2e3 secs. Thus the maximum delay supported by this
> primitive is a bit less than 36 minutes (16r80000000/1000/1000/60).
> So maybe we should also reject large positive "unreasonable" delays, that
> isanthing above a few minutes and document it.

I think I understand what you are saying now. Let me see if I can restate
the problem:

Before the change to microsecond clock, the interpreter VM did the
rollover check only in #checkForInterrupts, and the arithmetic did not
require a 64 bit microsecond range. Now we have rollover logic in
#primitiveSignalAtMilliseconds, which leads to two issues, one of which
masks the other on 32 bits.

Issue #1: If we pass to the primitive a millisecond count that is slightly
in the past (possibly due to clock skew or other timing issues), then the
primitive will test for deltaMsecs less than zero, and will incorrectly
assume that a millisecond clock rollover has happened.

Issue #2: When adjusting for clock rollover on 32 bits, deltaMsecs is
declared as 32 bit sqInt, which results in overflow when converting to
microseconds. For 64 bit sqInt, there is no overflow. This error masks
the first problem of Issue #1, which makes the primiitive appear to work
correctly on 32 bits, even though the clock rollover logic is wrong. 

I do not think that we can simply fail the primitive, or set deltaMSeconds
to zero, in the case of deltaMSeconds being less than zero. We still need
to be able to schedule delays over a time period when the millisecond
clock rolls over, so that needs to work.

We do need to fix the arithmetic overflow condition, which probably just
requires declaring deltaMsecs as usqLong. This matches the declaration
for nextWakeupUsecs and should prevent arithmetic overflow. But fixing
this for issue #2 will unmask issue #1.

For issue #1, the solution probably requires a heuristic. If deltaMsecs
is a big negative value, then the clock has rolled over. If it is a small
negative value (maybe less than a second, just to pick a number), then
it is probably clock skew and the MillisecondClockMask + 1 adjustment
should not be applied.

Does this sound right?

Dave



More information about the Vm-dev mailing list