In depth 1, the resulting bits will always be 0.

It's not a big problem because rgbMul is just a bitAnd operation at this depth.

So a quick workaround would be to detect the case in BitBltSimulation

destDepth = 1 ifTrue: [^self bitAnd: sourceWord with: destinationWord].

That would also accelerate the Bit BLock Transfer operation, so it's a good hack.

But there is more. What we want is multiply ratios in interval [0,1].

dstRatio * srcRatio

Our implementation is scaled ratio (scaled by 1 << nBits - 1):

src := (srcRatio * scale) rounded.
dst := (dstRatio * scale) rounded.

So what we want is:

((dst/scale) * (src/scale) * scale) rounded

that is:

(dst*src / (1<<nBits-1)) rounded

Unfortunately, that's the other grief with the current implementation used for rounding:

(dst+1)*(src+1) - 1 >> nBits

It only equals correctly rounded operation for depths 2 and 4.

For rounding we might use:

(((dst/scale) * (src/scale) + 0.5) * scale) truncated.

that is expressed with truncated division:

dst*src + (scale+1//2) // scale

So here is a nicer formulation for doing the job at any depth (including 5bits rgb channels for 16 bits depth) with correctly rounded division:

aux := src * dst + (1 << (nBits - 1)). "add mid-scale for rounding"
result := aux << (nBits - 1) + aux << (nBits -1). "divide by scale"

This is because instead of dividing by scale, we can multiply by shifted inverse (sort of double precision), then shift right.

(2 to: 32) allSatisfy: [:nBits | (1 << (nBits * 2) / (1 << nBits - 1)) rounded = (1 << nBits + 1)].

Multiplying by this inverse is easy and cheap:

x * (1 << nBits + 1) = (x << nBits + x).

And then applying the right shift >> (2 * nBits) is equivalent to:

x >> nBits + x >> nBits.

We must first add 0.5 (scaled), that is src * dst + (1 << (nBits -1)) - our formulation of aux, and we're done.

We verify:

	{
	(0 to: 1<<20-1) allSatisfy: [:i | (1<<9+i)>>10+ (1<<9+i)>>10 = (i/1023) rounded].
	(0 to: 1<<18-1) allSatisfy: [:i | (1<<8+i)>>9+ (1<<8+i)>>9 = (i/511) rounded].
	(0 to: 1<<16-1) allSatisfy: [:i | (1<<7+i)>>8+ (1<<7+i)>>8 = (i/255) rounded].
	(0 to: 1<<14-1) allSatisfy: [:i | (1<<6+i)>>7+ (1<<6+i)>>7 = (i/127) rounded].
	(0 to: 1<<12-1) allSatisfy: [:i | (1<<5+i)>>6+ (1<<5+i)>>6 = (i/63) rounded].
	(0 to: 1<<10-1) allSatisfy: [:i | (1<<4+i)>>5+ (1<<4+i)>>5 = (i/31) rounded].
	(0 to: 1<<8-1) allSatisfy: [:i |  (1<<3+i)>>4+ (1<<3+i)>>4 = (i/15) rounded].
	(0 to: 1<<6-1) allSatisfy: [:i |  (1<<2+i)>>3+ (1<<2+i)>>3 = (i/7) rounded].
	(0 to: 1<<4-1) allSatisfy: [:i |  (1<<1+i)>>2+ (1<<1+i)>>2 = (i/3) rounded].
} allSatisfy: #yourself.

The nice thing is that above down-scaling operation can be multiplexed.
Suppose that we have p groups of 2*nBits M holding square-scale multiplication of each channel concatenated in a double-Word-Mul.

doubleWordMul = Mp .... M5 M3 M1

Note we arrange to have odd channels in low word, and even channels in high word.

We first form a groupMask on a word with (p+1)/2 groups of nBits alternating all one i and all zero o, oioi...ioi.

channelMask := 1 << nBits - 1.
groupMask := 0.
0 to: wordBits // (2 * nBits) do: [:i |
    groupMask = groupMask << (2 * nBits) + channelMask].

Where wordBits is the number of bits in a word (usually we want to operate on 32 bits words in BitBlt).

We form the doubleGroupMask on a double-word with p groups of 2*nBits oi:

doubleGroupMask := groupMask >> nBits.
doubleGroupMask := doubleGroupMask << wordBits + groupMask.

And we perform the division by scale:

doubleWordMul := (doubleWordMul >> nBits bitAnd: doubleGroupMask) + doubleWord >> nBits bitAnd: doubleGroupMask.

At this stage we obtain a double word containing scaled multiplicands interleaved with groups of nBits zeros:

o mp ... o m3 o m1

Now the final result can be obtained by shifting back:

doubleWordMul >> (wordBits - nBits) + (doubleWordMul bitAnd: groupMask)

The only problem remaining is how to obtain the squared-scale multiplicands. It would be easy to form the alternate even-odd channels for each src and dst operands:

doubleWordSrc := src >> nBits bitAnd: groupMask.
doubleWordSrc := doubleWordSrc << wordBits + (src bitAnd: groupMask).
doubleWordDst := dst >> nBits bitAnd: groupMask.
doubleWordDst := doubleWordDst << wordBits + (dst bitAnd: groupMask).

we now get o sp ... o s3 o s1 and 0 dp ... o d3 o d1, but we would now need a SIMD integer multiplication operating on groups of 2*nBits in parallel... We don't have that, at least in portable C code. So we still have to emulate it with a loop.

half := 1 << (nBits - 1).
shift := 0.
doubleWordMul := 0
0 to: nChannels - 1 do: [:i |
doubleWordMul := doubleWordMul + (((doubleWordSrc >> shift bitAnd: channelMask) * (doubleWordSrc >> shift bitAnd: channelMask) + half) << shift).
shift := shift + nBits + nBits].

We know that each operation cannot overflow on upper neighbour group of 2*nBits, because the maximum value is:

(1<<nBits-1) squared + (1 << (nBits-1)) = 1 << (2nBits) - (2(1<<nBits)) + (1 << (nBits-1)) - 1
< (1 << (2*nBits) - 1)

It remains the odd case of 16 bits depth, which has 3 groups of 5 bits and a leading zero.
I believe that above algorithm works without splitting in two half-words...
To be tested.

We have gathered the pieces for a correctly rounded almost-multiplexed rgbMul.

Somehow have our cake and eat it too.


Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.Message ID: <OpenSmalltalk/opensmalltalk-vm/issues/651@github.com>