[Vm-dev] [OpenSmalltalk/opensmalltalk-vm] BitBlt rgbMul is broken for depth 1 and not correctly rounded above depth 4 (Issue #651)

Thu Aug 25 10:08:55 UTC 2022

In depth 1, the resulting bits will always be 0.<br>
It's not a big problem because rgbMul is just a bitAnd operation at this depth.<br>
So a quick workaround would be to detect the case in BitBltSimulation

    destDepth = 1 ifTrue: [^self bitAnd: sourceWord with: destinationWord].

That would also accelerate the Bit BLock Transfer operation, so it's a good hack.

But there is more. What we want is multiply ratios in interval [0,1].

   dstRatio * srcRatio

Our implementation is scaled ratio (scaled by `1 << nBits - 1`):

    src := (srcRatio * scale) rounded.
    dst := (dstRatio * scale) rounded.

So what we want is:

    ((dst/scale) * (src/scale) * scale) rounded

that is:

    (dst*src / (1<<nBits-1)) rounded

Unfortunately, that's the other grief with the current implementation used for rounding:

    (dst+1)*(src+1) - 1 >> nBits

It only equals correctly rounded operation for depths 2 and 4.

For rounding we might use:

    (((dst/scale) * (src/scale) + 0.5) * scale) truncated.

that is expressed with truncated division:

    dst*src + (scale+1//2) // scale

So here is a nicer formulation for doing the job at any depth (including 5bits rgb channels for 16 bits depth) with correctly rounded division:

    aux := src * dst + (1 << (nBits - 1)). "add mid-scale for rounding"
    result := aux << (nBits - 1) + aux << (nBits -1). "divide by scale"

This is because instead of dividing by scale, we can multiply by shifted inverse (sort of double precision), then shift right.

    (2 to: 32) allSatisfy: [:nBits | (1 << (nBits * 2) / (1 << nBits - 1)) rounded = (1 << nBits + 1)].

Multiplying by this inverse is easy and cheap:

    x * (1 << nBits + 1) = (x << nBits + x).

And then applying the right shift `>> (2 * nBits)` is equivalent to:

    x >> nBits + x >> nBits.

We must first add 0.5 (scaled), that is `src * dst + (1 << (nBits -1))` - our formulation of aux, and we're done.

We verify:

    	{
		(0 to: 1<<20-1) allSatisfy: [:i | (1<<9+i)>>10+ (1<<9+i)>>10 = (i/1023) rounded].
		(0 to: 1<<18-1) allSatisfy: [:i | (1<<8+i)>>9+ (1<<8+i)>>9 = (i/511) rounded].
		(0 to: 1<<16-1) allSatisfy: [:i | (1<<7+i)>>8+ (1<<7+i)>>8 = (i/255) rounded].
		(0 to: 1<<14-1) allSatisfy: [:i | (1<<6+i)>>7+ (1<<6+i)>>7 = (i/127) rounded].
		(0 to: 1<<12-1) allSatisfy: [:i | (1<<5+i)>>6+ (1<<5+i)>>6 = (i/63) rounded].
		(0 to: 1<<10-1) allSatisfy: [:i | (1<<4+i)>>5+ (1<<4+i)>>5 = (i/31) rounded].
		(0 to: 1<<8-1) allSatisfy: [:i |  (1<<3+i)>>4+ (1<<3+i)>>4 = (i/15) rounded].
		(0 to: 1<<6-1) allSatisfy: [:i |  (1<<2+i)>>3+ (1<<2+i)>>3 = (i/7) rounded].
		(0 to: 1<<4-1) allSatisfy: [:i |  (1<<1+i)>>2+ (1<<1+i)>>2 = (i/3) rounded].
	} allSatisfy: #yourself.

The nice thing is that above down-scaling operation can be multiplexed.<b>
Suppose that we have p groups of 2*nBits `M` holding square-scale multiplication of each channel concatenated in a double-Word-Mul.

    doubleWordMul = Mp .... M5 M3 M1

Note we arrange to have odd channels in low word, and even channels in high word.

We first form a `groupMask` on a word with (p+1)/2 groups of nBits alternating all one `i` and all zero `o`, `oioi...ioi`.<br>

    channelMask := 1 << nBits - 1.
    groupMask := 0.
    0 to: wordBits // (2 * nBits) do: [:i |
        groupMask = groupMask << (2 * nBits) + channelMask].

Where wordBits is the number of bits in a word (usually we want to operate on 32 bits words in BitBlt).

We form the `doubleGroupMask` on a double-word with p groups of 2*nBits `oi`:

    doubleGroupMask := groupMask >> nBits.
    doubleGroupMask := doubleGroupMask << wordBits + groupMask.

And we perform the division by scale:

    doubleWordMul := (doubleWordMul >> nBits bitAnd: doubleGroupMask) + doubleWord >> nBits bitAnd: doubleGroupMask.

At this stage we obtain a double word containing scaled multiplicands interleaved with groups of nBits zeros:

   o mp ... o m3 o m1

Now the final result can be obtained by shifting back:

    doubleWordMul >> (wordBits - nBits) + (doubleWordMul bitAnd: groupMask)

The only problem remaining is how to obtain the squared-scale multiplicands. It would be easy to form the alternate even-odd channels for each src and dst operands:

   doubleWordSrc := src >> nBits bitAnd: groupMask.
   doubleWordSrc := doubleWordSrc << wordBits + (src bitAnd: groupMask).
   doubleWordDst := dst >> nBits bitAnd: groupMask.
   doubleWordDst := doubleWordDst << wordBits + (dst bitAnd: groupMask).

we now get `o sp ...  o s3 o s1` and `0 dp ... o d3 o d1`, but we would now need a SIMD integer multiplication operating on groups of 2*nBits in parallel... We don't have that, at least in portable C code. So we still have to emulate it with a loop.

   half := 1 << (nBits - 1).
   shift := 0.
   doubleWordMul  := 0
   0 to: nChannels - 1 do: [:i |
        doubleWordMul := doubleWordMul + (((doubleWordSrc >> shift bitAnd: channelMask) * (doubleWordSrc >> shift bitAnd: channelMask) + half) << shift).
        shift := shift + nBits + nBits].

We know that each operation cannot overflow on upper neighbour group of 2*nBits, because the maximum value is:

   (1<<nBits-1) squared + (1 << (nBits-1)) = 1 << (2*nBits) - (2*(1<<nBits)) + (1 << (nBits-1)) - 1
        < (1 << (2*nBits) - 1)

It remains the odd case of 16 bits depth, which has 3 groups of 5 bits and a leading zero.
I believe that above algorithm works without splitting in two half-words...
To be tested.

We have gathered the pieces for a correctly rounded almost-multiplexed rgbMul.<br>
Somehow have our cake and eat it too.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/OpenSmalltalk/opensmalltalk-vm/issues/651
You are receiving this because you are subscribed to this thread.

Message ID: <OpenSmalltalk/opensmalltalk-vm/issues/651 at github.com>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20220825/92fd233f/attachment.html>