[Vm-dev] VM Maker: VMMaker.oscog-dtl.572.mcz

Mon Dec 30 18:29:20 UTC 2013

David T. Lewis uploaded a new version of VMMaker to project VM Maker:
http://source.squeak.org/VMMaker/VMMaker.oscog-dtl.572.mcz

==================== Summary ====================

Name: VMMaker.oscog-dtl.572
Author: dtl
Time: 30 December 2013, 1:22:13.502 pm
UUID: 5131e418-62ae-4a1c-b6a2-d6a4989574f1
Ancestors: VMMaker.oscog-tpr.571

Nicolas Cellier bitblt speedups
Reference Mantis issues 7802 and 7803

Mantis 7802: Fast-up BitBlt rgbAdd rule

Current implementation is naive:
- adds component by component after shifting and masking
- test each component for overflow
- recompose the result by shifting/bit-oring

It is possible to process the components all at once instead of looping and it is possible to avoid the test, making a branch free algorithm:

- in order to prevent overflowing over the next component, the highest bit of each component is cleared
- the last bit of each component is then obtained by overflow free operation (bitXor: instead of +)
- overflow condition is tested by some bitops, leaving a 1 on each highest which would overflow, 0 otherwise, which appropriately shifted and multiplied by component mask provide a mask to bitOr: for saturating without testing. 

Each component has nBits.
The componentMask is 1<<nBits-1.
The carryOverflowMask is a mask with high bit of each component set to 1.

For example with 4 components of 3 bits,
noting that 2r1001001001 * 2r111 -> 2r111111111111, we see that we can obtain such carryOverflowMask by a division 2r111111111111 / 2r111 -> 2r1001001001 and a shift << (nBits -1) -> 2r100100100100 

Mantis 7803: Fast-up BitBlt alpha blending rules

The alpha blending rules do apply on each ARGB component:
- shift and mask to isolate the component
- blend: ((1-alpha)*dst+(alpha*src))/255 (replace src by 16rFF for alpha channel)
- or (1-alpha)*dst>>8+src in case of pre-multiplied source
- saturate component at 16rFF in case of overflow for pre-multiplied source
  (this normally does not happen if source is correctly pre-multiplied)
- shift and bitOr: to recompose

Each component is on a byte.
Since multiplication spans over 2 bytes, it's possible to multiply 2 non adjacent components at once without overflowing on the other.
Thus, algorithm is applied on _R_B and A_G_>>8.

Division by 255 can be obtained by (x+1+(x>>8))>>8 for 0<=x<=16rFFFF.
This can easily be extended on two non adjacent bytes.

A reduction of about -25% of run time has been measured on an interpreter VM.

Note that very same kind of bit trick has been applied on accelerated ARM BitBlt.

See also VM-dev thread [Vm-dev] Little speed up of BitBlt alpha-blending
http://lists.squeakfoundation.org/pipermail/vm-dev/2013-December/014342.html

=============== Diff against VMMaker.oscog-tpr.571 ===============

Item was changed:
  ----- Method: BitBltSimulation>>alphaBlend:with: (in category 'combination rules') -----
  alphaBlend: sourceWord with: destinationWord
  	"Blend sourceWord with destinationWord, assuming both are 32-bit pixels.
  	The source is assumed to have 255*alpha in the high 8 bits of each pixel,
  	while the high 8 bits of the destinationWord will be ignored.
  	The blend produced is alpha*source + (1-alpha)*dest, with
  	the computation being performed independently on each color
  	component.  The high byte of the result will be 0."
+ 	| alpha unAlpha result blendRB blendAG |
- 	| alpha unAlpha colorMask result blend shift |
  	<inline: false>
+ 	<return: 'unsigned int'>
+ 	<var: #sourceWord type: 'unsigned int'>
+ 	<var: #destinationWord type: 'unsigned int'>
+ 	<var: #blendRB type: 'unsigned int'>
+ 	<var: #blendAG type: 'unsigned int'>
+ 	<var: #result type: 'unsigned int'>
+ 	<var: #alpha type: 'unsigned int'>
+ 	<var: #unAlpha type: 'unsigned int'>
  	alpha := sourceWord >> 24.  "High 8 bits of source pixel"
  	alpha = 0 ifTrue: [ ^ destinationWord ].
  	alpha = 255 ifTrue: [ ^ sourceWord ].
  	unAlpha := 255 - alpha.
- 	colorMask := 16rFF.
- 	result := 0.

+ 	blendRB := ((sourceWord bitAnd: 16rFF00FF) * alpha) +
+ 				((destinationWord bitAnd: 16rFF00FF) * unAlpha)
+ 				+ 16rFF00FF.	"blend red and blue"
+ 
+ 	blendAG := (((sourceWord>> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha) +
+ 				((destinationWord>>8 bitAnd: 16rFF00FF) * unAlpha)
+ 				+ 16rFF00FF.	"blend alpha and green"
+ 
+ 	blendRB := blendRB + (blendRB - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8 bitAnd: 16rFF00FF.	"divide by 255"
+ 	blendAG := blendAG + (blendAG - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8 bitAnd: 16rFF00FF.
+ 	result := blendRB bitOr: blendAG<<8.
- 	"red"
- 	shift := 0.
- 	blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
- 				((destinationWord>>shift bitAnd: colorMask) * unAlpha)
- 			 	+ 254 // 255 bitAnd: colorMask.
- 	result := result bitOr: blend << shift.
- 	"green"
- 	shift := 8.
- 	blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
- 				((destinationWord>>shift bitAnd: colorMask) * unAlpha)
- 			 	+ 254 // 255 bitAnd: colorMask.
- 	result := result bitOr: blend << shift.
- 	"blue"
- 	shift := 16.
- 	blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
- 				((destinationWord>>shift bitAnd: colorMask) * unAlpha)
- 			 	+ 254 // 255 bitAnd: colorMask.
- 	result := result bitOr: blend << shift.
- 	"alpha (pre-multiplied)"
- 	shift := 24.
- 	blend := (alpha * 255) +
- 				((destinationWord>>shift bitAnd: colorMask) * unAlpha)
- 			 	+ 254 // 255 bitAnd: colorMask.
- 	result := result bitOr: blend << shift.
  	^ result
  !

Item was changed:
  ----- Method: BitBltSimulation>>alphaBlendConst:with:paintMode: (in category 'combination rules') -----
  alphaBlendConst: sourceWord with: destinationWord paintMode: paintMode
  	"Blend sourceWord with destinationWord using a constant alpha.
  	Alpha is encoded as 0 meaning 0.0, and 255 meaning 1.0.
  	The blend produced is alpha*source + (1.0-alpha)*dest, with the
  	computation being performed independently on each color component.
  	This function could eventually blend into any depth destination,
  	using the same color averaging and mapping as warpBlt.
  	paintMode = true means do nothing if the source pixel value is zero."

  	"This first implementation works with dest depths of 16 and 32 bits only.
  	Normal color mapping will allow sources of lower depths in this case,
  	and results can be mapped directly by truncation, so no extra color maps are needed.
  	To allow storing into any depth will require subsequent addition of two other
  	colormaps, as is the case with WarpBlt."

+ 	| pixMask destShifted sourceShifted destPixVal rgbMask sourcePixVal unAlpha result pixBlend shift blend maskShifted bitsPerColor blendAG blendRB |
- 	| pixMask destShifted sourceShifted destPixVal rgbMask sourcePixVal unAlpha result pixBlend shift blend maskShifted bitsPerColor |
  	<inline: false>
+ 	<return: 'unsigned int'>
+ 	<var: #sourceWord type: 'unsigned int'>
+ 	<var: #destinationWord type: 'unsigned int'>
+ 	<var: #blendRB type: 'unsigned int'>
+ 	<var: #blendAG type: 'unsigned int'>
+ 	<var: #result type: 'unsigned int'>
+ 	<var: #sourceAlpha type: 'unsigned int'>
+ 	<var: #unAlpha type: 'unsigned int'>
  	destDepth < 16 ifTrue: [^ destinationWord "no-op"].
  	unAlpha := 255 - sourceAlpha.
- 	pixMask := maskTable at: destDepth.
- 	destDepth = 16 
- 		ifTrue: [bitsPerColor := 5]
- 		ifFalse:[bitsPerColor := 8].
- 	rgbMask := (1<<bitsPerColor) - 1.
- 	maskShifted := destMask.
- 	destShifted := destinationWord.
- 	sourceShifted := sourceWord.
  	result := destinationWord.
  	destPPW = 1 ifTrue:["32bpp blends include alpha"
  		paintMode & (sourceWord = 0)  "painting a transparent pixel" ifFalse:[
+ 
+ 				blendRB := ((sourceWord bitAnd: 16rFF00FF) * sourceAlpha) +
+ 						((destinationWord bitAnd: 16rFF00FF) * unAlpha) + 16rFF00FF.	"blendRB red and blue"
+ 
+ 				blendAG := ((sourceWord>> 8 bitAnd: 16rFF00FF) * sourceAlpha) +
+ 						((destinationWord>>8 bitAnd: 16rFF00FF) * unAlpha) + 16rFF00FF.	"blendRB alpha and green"
+ 
+ 				blendRB := blendRB + (blendRB - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8 bitAnd: 16rFF00FF.	"divide by 255"
+ 				blendAG := blendAG + (blendAG - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8 bitAnd: 16rFF00FF.
+ 				result := blendRB bitOr: blendAG<<8.
- 			result := 0.
- 			1 to: 4 do:[:i|
- 				shift := (i-1)*8.
- 				blend := (((sourceWord>>shift bitAnd: rgbMask) * sourceAlpha)
- 							+ ((destinationWord>>shift bitAnd: rgbMask) * unAlpha))
- 					 	+ 254 // 255 bitAnd: rgbMask.
- 				result := result bitOr: blend<<shift].
  		].
  	] ifFalse:[
+ 		pixMask := maskTable at: destDepth.
+ 		bitsPerColor := 5.
+ 		rgbMask := 16r1F.
+ 		maskShifted := destMask.
+ 		destShifted := destinationWord.
+ 		sourceShifted := sourceWord.
  		1 to: destPPW do:[:j |
  			sourcePixVal := sourceShifted bitAnd: pixMask.
  			((maskShifted bitAnd: pixMask) = 0  "no effect if outside of dest rectangle"
  				or: [paintMode & (sourcePixVal = 0)  "or painting a transparent pixel"])
  			ifFalse:
  				[destPixVal := destShifted bitAnd: pixMask.
  				pixBlend := 0.
  				1 to: 3 do:
  					[:i | shift := (i-1)*bitsPerColor.
  					blend := (((sourcePixVal>>shift bitAnd: rgbMask) * sourceAlpha)
  								+ ((destPixVal>>shift bitAnd: rgbMask) * unAlpha))
  						 	+ 254 // 255 bitAnd: rgbMask.
  					pixBlend := pixBlend bitOr: blend<<shift].
+ 				result := (result bitAnd: (pixMask << (j-1*16)) bitInvert32)
+ 								bitOr: pixBlend << (j-1*16)].
- 				destDepth = 16
- 					ifTrue: [result := (result bitAnd: (pixMask << (j-1*16)) bitInvert32)
- 										bitOr: pixBlend << (j-1*16)]
- 					ifFalse: [result := pixBlend]].
  			maskShifted := maskShifted >> destDepth.
  			sourceShifted := sourceShifted >> destDepth.
  			destShifted := destShifted >> destDepth].
  	].
  	^ result
  !

Item was changed:
  ----- Method: BitBltSimulation>>alphaBlendScaled:with: (in category 'combination rules') -----
  alphaBlendScaled: sourceWord with: destinationWord
  	"Blend sourceWord with destinationWord using the alpha value from sourceWord.
  	Alpha is encoded as 0 meaning 0.0, and 255 meaning 1.0.
  	In contrast to alphaBlend:with: the color produced is

  		srcColor + (1-srcAlpha) * dstColor

  	e.g., it is assumed that the source color is already scaled."
+ 	| unAlpha rb ag |
- 	| unAlpha dstMask srcMask b g r a |
  	<inline: false>	"Do NOT inline this into optimized loops"
+ 	<var: #sourceWord type: 'unsigned int'>
+ 	<var: #destinationWord type: 'unsigned int'>
+ 	<var: #rb type: 'unsigned int'>
+ 	<var: #ag type: 'unsigned int'>
+ 	<var: #unAlpha type: 'unsigned int'>
+ 	unAlpha := 255 - (sourceWord >> 24).  "High 8 bits of source pixel is source opacity (ARGB format)"
+ 	rb := ((destinationWord bitAnd: 16rFF00FF) * unAlpha >> 8 bitAnd: 16rFF00FF) + (sourceWord bitAnd: 16rFF00FF). "blend red and blue components"
+ 	ag := ((destinationWord>>8 bitAnd: 16rFF00FF) * unAlpha >> 8 bitAnd: 16rFF00FF) + (sourceWord>>8 bitAnd: 16rFF00FF). "blend alpha and green components"
+ 	rb := (rb bitAnd: 16rFF00FF) bitOr: (rb bitAnd: 16r1000100) * 16rFF >> 8. "saturate red and blue components if there is a carry"
+ 	ag := (ag bitAnd: 16rFF00FF) << 8 bitOr: (ag bitAnd: 16r1000100) * 16rFF. "saturate alpha and green components if there is a carry"
+ 	^ag bitOr: rb "recompose"!
- 	unAlpha := 255 - (sourceWord >> 24).  "High 8 bits of source pixel"
- 	dstMask := destinationWord.
- 	srcMask := sourceWord.
- 	b := (dstMask bitAnd: 255) * unAlpha >> 8 + (srcMask bitAnd: 255).
- 	b > 255 ifTrue:[b := 255].
- 	dstMask := dstMask >> 8.
- 	srcMask := srcMask >> 8.
- 	g := (dstMask bitAnd: 255) * unAlpha >> 8 + (srcMask bitAnd: 255).
- 	g > 255 ifTrue:[g := 255].
- 	dstMask := dstMask >> 8.
- 	srcMask := srcMask >> 8.
- 	r := (dstMask bitAnd: 255) * unAlpha >> 8 + (srcMask bitAnd: 255).
- 	r > 255 ifTrue:[r := 255].
- 	dstMask := dstMask >> 8.
- 	srcMask := srcMask >> 8.
- 	a := (dstMask bitAnd: 255) * unAlpha >> 8 + (srcMask bitAnd: 255).
- 	a > 255 ifTrue:[a := 255].
- 	^(((((a << 8) + r) << 8) + g) << 8) + b!

Item was added:
+ ----- Method: BitBltSimulation>>partitionedAdd:to:nBits:componentMask:carryOverflowMask: (in category 'combination rules') -----
+ partitionedAdd: word1 to: word2 nBits: nBits componentMask: componentMask carryOverflowMask: carryOverflowMask
+ 	"Add word1 to word2 as nParts partitions of nBits each.
+ 	This is useful for packed pixels, or packed colors"
+ 	| carryOverflow sum w1 w2 |
+ 	"Use unsigned int everywhere because it has a well known arithmetic model without undefined behavior w.r.t. overflow and shifts"
+ 	 <var: #word1 type: 'unsigned int'>
+ 	<var: #word2 type: 'unsigned int'>
+ 	 <var: #w1 type: 'unsigned int'>
+ 	<var: #w2 type: 'unsigned int'>
+ 	<var: #componentMask type: 'unsigned int'>
+ 	<var: #carryOverflowMask type: 'unsigned int'>
+ 	<var: #carryOverflow type: 'unsigned int'>
+ 	<var: #sum type: 'unsigned int'>
+ 	w1 := word1 bitAnd: carryOverflowMask. "mask to remove high bit of each component"
+ 	w2 := word2 bitAnd: carryOverflowMask.
+ 	sum := (word1 bitXor: w1)+(word2 bitXor: w2). "sum without high bit to avoid overflowing over next component"
+ 	carryOverflow := (w1 bitAnd: w2) bitOr: ((w1 bitOr: w2) bitAnd: sum). "detect overflow condition for saturating"
+ 	^((sum bitXor: w1)bitXor:w2) "sum high bit without overflow"
+ 		bitOr: carryOverflow>>(nBits-1) * componentMask "saturate in case of overflow"!

Item was removed:
- ----- Method: BitBltSimulation>>partitionedAdd:to:nBits:nPartitions: (in category 'combination rules') -----
- partitionedAdd: word1 to: word2 nBits: nBits nPartitions: nParts
- 	"Add word1 to word2 as nParts partitions of nBits each.
- 	This is useful for packed pixels, or packed colors"
- 	| mask sum result maskedWord1 |
- 	"In C, most arithmetic operations answer the same bit pattern regardless of the operands being signed or unsigned ints
- 	(this is due to the way 2's complement numbers work). However, comparisions might fail. Add the proper declaration of
- 	words as unsigned int in those cases where comparisions are done (jmv)"
- 	<var: #word1 type: 'unsigned int'>
- 	<var: #word2 type: 'unsigned int'>
- 	<var: #mask type: 'unsigned int'>
- 	<var: #sum type: 'unsigned int'>
- 	<var: #result type: 'unsigned int'>
- 	<var: #maskedWord1 type: 'unsigned int'>
- 	mask := maskTable at: nBits.  "partition mask starts at the right"
- 	result := 0.
- 	1 to: nParts do:
- 		[:i |
- 		maskedWord1 := word1 bitAnd: mask.
- 		sum := maskedWord1 + (word2 bitAnd: mask).
- 		(sum <= mask "result must not carry out of partition"
- 				and: [ sum >= maskedWord1 ])	"This is needed because in C, integer arithmetic overflows silently!! (jmv)"
- 			ifTrue: [result := result bitOr: sum]
- 			ifFalse: [result := result bitOr: mask].
- 		mask := mask << nBits  "slide left to next partition"].
- 	^ result
- !

Item was changed:
  ----- Method: BitBltSimulation>>rgbAdd:with: (in category 'combination rules') -----
  rgbAdd: sourceWord with: destinationWord
  	<inline: false>
+ 	<var: #sourceWord type: 'unsigned int'>
+ 	<var: #destinationWord type: 'unsigned int'>
+ 	<var: #carryOverflowMask type: 'unsigned int'>
+ 	<var: #componentMask type: 'unsigned int'>
+ 	| componentMask carryOverflowMask |
  	destDepth < 16 ifTrue:
  		["Add each pixel separately"
+ 		componentMask := 1<<destDepth-1.
+ 		carryOverflowMask := 16rFFFFFFFF//componentMask<<(destDepth-1).
  		^ self partitionedAdd: sourceWord to: destinationWord
+ 						nBits: destDepth componentMask: componentMask carryOverflowMask: carryOverflowMask].
- 						nBits: destDepth nPartitions: destPPW].
  	destDepth = 16 ifTrue:
  		["Add RGB components of each pixel separately"
+ 		componentMask := 16r1F.
+ 		carryOverflowMask := 16r42104210.
+ 		^ (self partitionedAdd: (sourceWord bitAnd: 16r7FFF7FFF) to: (destinationWord bitAnd: 16r7FFF7FFF) "make sure that the unused bit is at 0"
+ 						nBits: 5 componentMask: componentMask carryOverflowMask: carryOverflowMask)]
- 		^ (self partitionedAdd: sourceWord to: destinationWord
- 						nBits: 5 nPartitions: 3)
- 		+ ((self partitionedAdd: sourceWord>>16 to: destinationWord>>16
- 						nBits: 5 nPartitions: 3) << 16)]
  	ifFalse:
  		["Add RGBA components of the pixel separately"
+ 		componentMask := 16rFF.
+ 		carryOverflowMask := 16r80808080.
  		^ self partitionedAdd: sourceWord to: destinationWord
+ 						nBits: 8 componentMask: componentMask carryOverflowMask: carryOverflowMask]!
- 						nBits: 8 nPartitions: 4]!