[Vm-dev] Re: Little speed up of BitBlt alpha-blending

Fri Dec 27 17:06:25 UTC 2013

Recorded at http://bugs.squeak.org/view.php?id=7803


2013/12/24 Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com>

> Ah, I'm reading BitBltArmSimdAlphaBlend.s right now. I can't really
> understand ARM assembler, but it looks very much like the same tricks
> were applied:
>
>         AlphaBlend32_32_init
>         MOV     ht_info, #1
>         MOV     ht, #0
>         ORR     ht_info, ht_info, ht_info, LSL #16 ; &10001
>         MEND
>
>         MACRO
>         AlphaBlend32_32_1pixel $src, $dst, $tmp0, $tmp1, $tmp2, $known_not_transp
>       [ "$known_not_transp" = ""
>         MOVS    $tmp2, $src, LSR #24      ; s_a
>         BEQ     %FT09 ; fully transparent - use dst
>       ]
>         TEQ     $tmp2, #&FF
>         BEQ     %FT10 ; fully opaque - use src
>         UXTB    $tmp0, $src, ROR #8       ; s_ag
>         ORR     $tmp0, $tmp0, #&FF0000
>         UXTB16  $tmp1, $src               ; s_rb
>         MUL     $tmp0, $tmp0, $tmp2
>         MUL     $tmp1, $tmp1, $tmp2
>         RSB     $tmp2, $tmp2, #&FF
>         UXTB16  $src, $dst, ROR #8        ; d_ag
>         UXTB16  $dst, $dst                ; d_rb
>         MLA     $src, $src, $tmp2, $tmp0  ; ag
>         MLA     $dst, $dst, $tmp2, $tmp1  ; rb
>         USUB16  $tmp0, $src, ht_info
>         UXTAB16 $src, $src, $src, ROR #8
>         SEL     $tmp1, ht_info, ht
>         UXTAB16 $src, $tmp1, $src, ROR #8
>         USUB16  $tmp0, $dst, ht_info
>         UXTAB16 $dst, $dst, $dst, ROR #8
>         SEL     $tmp1, ht_info, ht
>         UXTAB16 $dst, $tmp1, $dst, ROR #8
>         ORR     $src, $dst, $src, LSL #8  ; recombine
>         B       %FT10
> 09      MOV     $src, $dst
>
> Here is my latest slang version:
>
>
>     alpha := sourceWord >> 24.  "High 8 bits of source pixel"
>     alpha = 0 ifTrue: [ ^ destinationWord ].
>     alpha = 255 ifTrue: [ ^ sourceWord ].
>     unAlpha := 255 - alpha.
>
>     blendRB := ((sourceWord bitAnd: 16rFF00FF) * alpha) +
>                 ((destinationWord bitAnd: 16rFF00FF) * unAlpha)
>                 + 16rFF00FF.    "blendRB red and blue"
>
>     blendAG := (((sourceWord>> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha) +
>                 ((destinationWord>>8 bitAnd: 16rFF00FF) * unAlpha)
>                 + 16rFF00FF.    "blendAG alpha and green"
>
>     blendRB := blendRB + (blendRB - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8
>                 bitAnd: 16rFF00FF.    "divide by 255"
>     blendAG := blendAG + (blendAG - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8
>                 bitAnd: 16rFF00FF.    "divide by 255"
>     result := blendRB bitOr: blendAG<<8.
>     ^ result
>
>
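
The Slang version quoted above can be transcribed into Python to check it against the per-channel formula. This is only a sketch: the names MASK, div255_lanes and alpha_blend_packed are made up for illustration, the inputs are assumed to be unsigned 32-bit pixel values, and Python integers are unbounded, so it says nothing about how the generated C behaves or performs.

```python
MASK = 0x00FF00FF  # selects two non-adjacent 8-bit channels of a 32-bit word


def div255_lanes(x: int) -> int:
    """For each 16-bit lane holding v + 255, return ceil(v / 255) in that lane."""
    # Subtracting 16r10001 removes 1 from each lane (every lane is >= 255,
    # so there is no borrow between lanes).
    return ((x + (((x - 0x00010001) >> 8) & MASK)) >> 8) & MASK


def alpha_blend_packed(source_word: int, destination_word: int) -> int:
    """Blend two 32-bit pixels, processing two channels per multiplication."""
    alpha = (source_word >> 24) & 0xFF      # high 8 bits of the source pixel
    if alpha == 0:
        return destination_word             # fully transparent: keep destination
    if alpha == 255:
        return source_word                  # fully opaque: take source
    un_alpha = 255 - alpha

    # Channels at bits 0-7 and 16-23.
    blend_rb = ((source_word & MASK) * alpha
                + (destination_word & MASK) * un_alpha
                + MASK)
    # Channels at bits 8-15 and 24-31; the top source channel is forced to 255.
    blend_ag = ((((source_word >> 8) | 0xFF0000) & MASK) * alpha
                + ((destination_word >> 8) & MASK) * un_alpha
                + MASK)

    return div255_lanes(blend_rb) | (div255_lanes(blend_ag) << 8)
```

With these definitions, alpha_blend_packed(0x80FF0000, 0xFF0000FF) evaluates to 0xFF80007F, the same value the per-channel formula (c_src * alpha + c_dst * unAlpha + 254) // 255 gives for each byte.
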
> 2013/12/24 Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com>
>
>> I only measured a gain of 25%, not 50%; maybe the division is a bit
>> complex...
>>
>>
>> 2013/12/24 Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com>
>>
>>>
>>> 2013/12/23 Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com>
>>>
>>>>
>>>> 2013/12/23 Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com>
>>>>
>>>>> Currently we use a very clear but naive algorithm:
>>>>>
>>>>>     alpha := sourceWord >> 24.  "High 8 bits of source pixel"
>>>>>     alpha = 0 ifTrue: [ ^ destinationWord ].
>>>>>     alpha = 255 ifTrue: [ ^ sourceWord ].
>>>>>     unAlpha := 255 - alpha.
>>>>>     colorMask := 16rFF.
>>>>>     result := 0.
>>>>>
>>>>>     "red"
>>>>>     shift := 0.
>>>>>     blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
>>>>>                 ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
>>>>>                  + 254 // 255 bitAnd: colorMask.
>>>>>     result := result bitOr: blend << shift.
>>>>>     "green"
>>>>>     shift := 8.
>>>>>     blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
>>>>>                 ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
>>>>>                  + 254 // 255 bitAnd: colorMask.
>>>>>     result := result bitOr: blend << shift.
>>>>>     "blue"
>>>>>     shift := 16.
>>>>>     blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
>>>>>                 ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
>>>>>                  + 254 // 255 bitAnd: colorMask.
>>>>>     result := result bitOr: blend << shift.
>>>>>     "alpha (pre-multiplied)"
>>>>>     shift := 24.
>>>>>     blend := (alpha * 255) +
>>>>>                 ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
>>>>>                  + 254 // 255 bitAnd: colorMask.
>>>>>     result := result bitOr: blend << shift.
>>>>>     ^ result
>>>>>
>>>>>
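
For reference, the per-channel algorithm quoted above can be written as a short Python loop. This is a sketch rather than the VM source: the function name alpha_blend_naive is hypothetical, the four unrolled stanzas are folded into one loop, and the inputs are assumed to be unsigned 32-bit pixel values.

```python
def alpha_blend_naive(source_word: int, destination_word: int) -> int:
    """Blend two 32-bit pixels one 8-bit channel at a time."""
    alpha = (source_word >> 24) & 0xFF      # high 8 bits of the source pixel
    if alpha == 0:
        return destination_word             # fully transparent: keep destination
    if alpha == 255:
        return source_word                  # fully opaque: take source
    un_alpha = 255 - alpha
    color_mask = 0xFF
    result = 0
    for shift in (0, 8, 16, 24):
        # The top channel uses 255 in place of the source byte,
        # matching the (alpha * 255) term in the quoted code.
        if shift == 24:
            src = 255
        else:
            src = (source_word >> shift) & color_mask
        dst = (destination_word >> shift) & color_mask
        # Adding 254 before the integer division rounds the quotient up.
        blend = ((src * alpha + dst * un_alpha + 254) // 255) & color_mask
        result |= blend << shift
    return result
```

Each channel costs one integer division here, which is the cost the two ideas listed further down try to remove.
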
>>>>> Of course, the best improvement would be to use a native OS library,
>>>>> when one exists, on the whole bitmap. I leave that path aside; it can
>>>>> be handled in platform-specific sources, as tim did for the Pi.
>>>>> But even with our own hand-crafted bit operations, we can do better
>>>>> than the current implementation.
>>>>> See
>>>>> http://stackoverflow.com/questions/1102692/how-to-do-alpha-blend-fast
>>>>>
>>>>> Using specific hardware instructions ourselves is not really an
>>>>> option for a portable VM; it's better to call a native library if we
>>>>> want platform-specific optimizations, so I leave SSE instructions aside.
>>>>>
>>>>> But there are two simple ideas we can recycle from the SO reference above:
>>>>>
>>>>> 1) multiplex Red+Blue and Alpha+Green computations
>>>>> 2) avoid division by 255
>>>>>
>>>>> Here it is:
>>>>>
>>>>>     "red and blue"
>>>>>     blend := ((sourceWord bitAnd: 16rFF00FF) * alpha) +
>>>>>                 ((destinationWord bitAnd: 16rFF00FF) * unAlpha) + 16rFE00FE.
>>>>>     "divide by 255"
>>>>>     blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.
>>>>>
>>>> I forgot to protect it with bitAnd: 16rFF00FF, but you get the idea...
>>>>
>>>>
>>>>>      result := blend.
>>>>>
>>>>>     "alpha and green"
>>>>>     blend := (((sourceWord>> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha) +
>>>>>                 ((destinationWord>>8 bitAnd: 16rFF00FF) * unAlpha) + 16rFE00FE.
>>>>>     "divide by 255"
>>>>>     blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.
>>>>>
>>>>
>>>> bitAnd: 16rFF00FF too of course...
>>>>
>>>>
>>>>>     result := result bitOr: blend<<8.
>>>>>     ^ result
>>>>>
>>>>> For bytes B1 and B2 in (0..255), alpha*B1+unAlpha*B2 is in (0..16rFE01),
>>>>> so alpha*B1+unAlpha*B2+254 is in (254..16rFEFF).
>>>>> Each component therefore fits in 16 bits, so when we multiplex
>>>>> non-adjacent components, we're safe from overflow.
>>>>>
>>>>> The division by 255 is also safe: adding 1 gives (255..16rFF00),
>>>>> and adding (blend>>8 bitAnd: 16rFF) gives (255..16rFFFE).
>>>>> We are still free of overflow and can extend the //255 division trick
>>>>> to a 32-bit word (the formula given on SO is for 16 bits only).
>>>>>
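
The range argument above is small enough to check by brute force. The following Python sketch does that for a single 16-bit lane; div255_shift and MAX_NUMERATOR are names chosen here for illustration and appear neither in the VM source nor in the SO answer.

```python
def div255_shift(x: int) -> int:
    """Compute x // 255 with shifts and adds only; valid for 0 <= x < 65535."""
    return (x + 1 + (x >> 8)) >> 8


# Largest per-channel numerator the blend can produce: 255 * 255 + 254 = 16rFEFF.
MAX_NUMERATOR = 255 * 255 + 254

# The shift-and-add form agrees with true division over the whole range.
assert all(div255_shift(x) == x // 255 for x in range(MAX_NUMERATOR + 1))

# The intermediate sum stays below 2**16, so a 16-bit lane never carries
# into the neighbouring lane when two channels share one 32-bit word.
assert max(x + 1 + (x >> 8) for x in range(MAX_NUMERATOR + 1)) < (1 << 16)
```

Both assertions hold, which is what allows the same expression to be applied to two channels at once after masking with 16rFF00FF.
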
>>>>> I expect roughly a x2 factor in throughput, but it's hard to measure.
>>>>> What do you think? Is this interesting?
>>>>>
>>>>
>>>> Find corresponding code attached
>>>
>>
>>
>