[Vm-dev] VM Maker: VMMaker.oscog-nice.2572.mcz

commits at source.squeak.org commits at source.squeak.org
Fri Oct 25 22:12:24 UTC 2019


Nicolas Cellier uploaded a new version of VMMaker to project VM Maker:
http://source.squeak.org/VMMaker/VMMaker.oscog-nice.2572.mcz

==================== Summary ====================

Name: VMMaker.oscog-nice.2572
Author: nice
Time: 26 October 2019, 12:11:24.062926 am
UUID: 3489e554-cf3d-44fb-bfac-638c04ecdcf2
Ancestors: VMMaker.oscog-eem.2571

Fix the problem with small delays in old images, which occasionally break OpenSmalltalk CI as reported in issue 436

https://github.com/OpenSmalltalk/opensmalltalk-vm/issues/436

The problem has been observed on a macos32x86/squeak.cog.v3 VM + Squeak-4.6 image still using millisecond clock to handle delays (and thus #primitiveSignalAtMilliseconds).
It happens with small delays that may be seen as slightly negative inside the primitive, and erroneously confused with a clock roll-over.

This should not happen in newer image that use microsecond clock, though I didn't perform any related analysis.
The problem can easily be triggered by this snippet (save you work first!):

	s := Semaphore new.
	Delay primSignal: s atMilliseconds: Time primMillisecondClock - 2.
	s wait.

Now, the detailed explanations:

---------------------------------------------------

At image side, the absolute time msecs at which to wake up is usually formed by adding:

	msecs := absolute_millisecond_clock1 + relative_delay_duration.

In #primitiveSignalAtMilliseconds, some time has elapsed since, and we subtract millisecond clock to retrieve the remaining duration to wait:

	deltaMsecs := msecs - absolute_millisecond_clock2.

If ever the millisecond clock roll-over in between the two calls clock1 and clock2,
then we will have clock1 close to clock2 + MillisecondClockMask, and:

	deltaMsecs := relative_delay_duration - already_elapsed_time + MillisecondClockMask

that is a huge delay, 1<<29 milliseconds, about 6 days!

A negative deltaMsecs means an already expired delay, it cannot happen because of rollover
(unless the VM hangs more than 3 days, case which we do not have to handle: the delay is then expired because we cannot handle such huge delays - see below).

Technically, the usage of #positive32BitValueOf: would allow passing msecs upTo: 1<<32-1, that is a huge delay (at least 1<<32 - (1<<29), about 48 days).

But how to distinguish a huge relative_delay_duration from a rollover?
We can't really. We might check if clock2 is small, and declare it's a rollover if under a threshold, else a big delay.
But there will always be edge cases when such oracle would give the wrong answer.
We could try and pass 3 arguments instead of 2, semaphore, clock1 , relativeDelay.
But are we going to need it? Not really.
With a clock that roll-over like this, the only sane thing is to cap the delay at midway:

	limit := MillisecondClockMask >> 1.

Then reduce a positive deltaMsecs with centered modulo:

	deltaMsecs > limit ifTrue: [deltaMsecs := deltaMsecs - MillisecondClockMask].
	
Only a strictly positive deltaMsecs need then to be honored as the next wake up time (nextWakeupUsecs).

---------------------------------------------------

But wait, this is only one face of the problem!

Due to current implementation of ioRelinquishProcessorForMicroseconds(),
the microsecond increment (deltaMsecs * 1000) must fit in a long.
If ever it overflows, and wrap over negated values, then bad consequences will happen via aioSleepForUsecs() and aioPoll().
This is including unresponsive runaway images spitting out errno 22; select: invalid argument in the console, as observed in bug report.
On 32bits unix VM, that further limit the maximum delay to about 2147 seconds!

IMO, it's urgent to protect ioRelinquishProcessorForMicroseconds() from such overflow,
But that will deserve another commit (it's platforms source code, not Smalltalk code).

=============== Diff against VMMaker.oscog-eem.2571 ===============

Item was changed:
  ----- Method: StackInterpreterPrimitives>>primitiveSignalAtMilliseconds (in category 'system control primitives') -----
  primitiveSignalAtMilliseconds
  	"Cause the time semaphore, if one has been registered, to be
  	 signalled when the microsecond clock is greater than or equal to
  	 the given tick value. A tick value of zero turns off timer interrupts."
+ 	| msecsObj msecs deltaMsecs sema limit |
- 	| msecsObj msecs deltaMsecs sema |
  	<var: #msecs type: #usqInt>
  	<var: #deltaMsecs type: #sqLong>
+ 	<var: #limit type: #sqLong>
  	msecsObj := self stackTop.
  	sema := self stackValue: 1.
  	msecs := self positive32BitValueOf: msecsObj.
  	
  	self successful ifTrue:
  		[(objectMemory isSemaphoreOop: sema) ifTrue:
  			[objectMemory splObj: TheTimerSemaphore put: sema.
  			 deltaMsecs := msecs - (self ioMSecs bitAnd: MillisecondClockMask).
+ 			 limit := MillisecondClockMask >> 1.
+ 			 "Handle a roll-over that could happen in between image invocation of ioMSecs and this invocation.
+ 			  This will limit the maximum relative duration to MillisecondClockMask/2, about 3 days currently.
+ 			  Every delay longer than that limit may lead to undefined behavior (shorten delay, or no delay at all).
+ 			  The maximum delay might be further limited by platform dependent nextWakeupUsecs handling."
+ 			 deltaMsecs > limit ifTrue: [deltaMsecs := deltaMsecs - MillisecondClockMask].
+ 			 deltaMsecs > 0
+ 				ifTrue:
+ 					[nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000)].
- 			 deltaMsecs < 0 ifTrue:
- 				[deltaMsecs := deltaMsecs + MillisecondClockMask + 1].
- 			 nextWakeupUsecs := self ioUTCMicroseconds + (deltaMsecs * 1000).
  			 ^self pop: 2].
  		 sema = objectMemory nilObject ifTrue:
  			[objectMemory
  				storePointer: TheTimerSemaphore
  				ofObject: objectMemory specialObjectsOop
  				withValue: objectMemory nilObject.
  			 nextWakeupUsecs := 0.
  			 ^self pop: 2]].
  	self primitiveFailFor: PrimErrBadArgument!



More information about the Vm-dev mailing list