[Vm-dev] Re: Little speed up of BitBlt alpha-blending
Nicolas Cellier
nicolas.cellier.aka.nice at gmail.com
Tue Dec 24 21:54:08 UTC 2013
Ah, I'm reading BitBltArmSimdAlphaBlend.s right now. I can't really
read ARM assembler, but it looks very much like the same tricks were
applied:
AlphaBlend32_32_init
MOV ht_info, #1
MOV ht, #0
ORR ht_info, ht_info, ht_info, LSL #16 ; &10001
MEND
MACRO
AlphaBlend32_32_1pixel $src, $dst, $tmp0, $tmp1, $tmp2, $known_not_transp
[ "$known_not_transp" = ""
MOVS $tmp2, $src, LSR #24 ; s_a
BEQ %FT09 ; fully transparent - use dst
]
TEQ $tmp2, #&FF
BEQ %FT10 ; fully opaque - use src
UXTB $tmp0, $src, ROR #8 ; s_ag
ORR $tmp0, $tmp0, #&FF0000
UXTB16 $tmp1, $src ; s_rb
MUL $tmp0, $tmp0, $tmp2
MUL $tmp1, $tmp1, $tmp2
RSB $tmp2, $tmp2, #&FF
UXTB16 $src, $dst, ROR #8 ; d_ag
UXTB16 $dst, $dst ; d_rb
MLA $src, $src, $tmp2, $tmp0 ; ag
MLA $dst, $dst, $tmp2, $tmp1 ; rb
USUB16 $tmp0, $src, ht_info
UXTAB16 $src, $src, $src, ROR #8
SEL $tmp1, ht_info, ht
UXTAB16 $src, $tmp1, $src, ROR #8
USUB16 $tmp0, $dst, ht_info
UXTAB16 $dst, $dst, $dst, ROR #8
SEL $tmp1, ht_info, ht
UXTAB16 $dst, $tmp1, $dst, ROR #8
ORR $src, $dst, $src, LSL #8 ; recombine
B %FT10
09 MOV $src, $dst
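
For readers not fluent in ARM assembler, the tail of the macro (the
USUB16 / UXTAB16 / SEL sequence) can be modelled in Python. This is only a
sketch of my reading of those instructions, not code from the VM: the helper
names uxtab16_ror8, nonzero_lanes and div255_ceil_packed are made up for the
illustration, and each models the instruction named in its docstring on two
16-bit lanes packed in one 32-bit word.

```python
def uxtab16_ror8(rn, rm):
    """Model of UXTAB16 Rd, Rn, Rm, ROR #8 (assumed semantics):
    add byte 1 of rm to the low halfword of rn and byte 3 of rm
    to the high halfword of rn, each sum wrapping at 16 bits."""
    lo = ((rn & 0xFFFF) + ((rm >> 8) & 0xFF)) & 0xFFFF
    hi = ((rn >> 16) + ((rm >> 24) & 0xFF)) & 0xFFFF
    return (hi << 16) | lo


def nonzero_lanes(x):
    """Model of USUB16 tmp, x, #&10001 followed by SEL tmp, #&10001, #0:
    the subtraction only serves to set the GE flags, so the result
    holds 1 in every halfword of x that is >= 1, and 0 elsewhere."""
    lo = 1 if (x & 0xFFFF) >= 1 else 0
    hi = 1 if (x >> 16) >= 1 else 0
    return (hi << 16) | lo


def div255_ceil_packed(x):
    """Divide both 16-bit lanes of x by 255, rounding up.
    Each lane is expected to be in 0..255*255."""
    ones = nonzero_lanes(x)          # USUB16 + SEL
    t = uxtab16_ror8(x, x)           # lane + (lane >> 8)
    return uxtab16_ror8(ones, t)     # (t >> 8) + (lane != 0)
```

Under that reading, each lane ends up holding ceil(lane / 255), which is the
same rounding as the "+ 254 // 255" of the Slang code.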
Here is my latest slang version:
alpha := sourceWord >> 24. "High 8 bits of source pixel"
alpha = 0 ifTrue: [ ^ destinationWord ].
alpha = 255 ifTrue: [ ^ sourceWord ].
unAlpha := 255 - alpha.
blendRB := ((sourceWord bitAnd: 16rFF00FF) * alpha) +
((destinationWord bitAnd: 16rFF00FF) * unAlpha)
+ 16rFF00FF. "blendRB red and blue"
blendAG := (((sourceWord>> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) *
alpha) +
((destinationWord>>8 bitAnd: 16rFF00FF) * unAlpha)
+ 16rFF00FF. "blendAG alpha and green"
blendRB := blendRB + (blendRB - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8
bitAnd: 16rFF00FF. "divide by 255"
blendAG := blendAG + (blendAG - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8
bitAnd: 16rFF00FF.
result := blendRB bitOr: blendAG<<8.
^ result
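
To check that this version gives the same pixels as the naive one quoted
below, here is a minimal Python sketch of both. It is a transcription under
my own naming (blend_naive, blend_packed and MASK are not names from the VM),
and it assumes 32-bit pixel words with alpha in the high 8 bits, as in the
Slang code.

```python
MASK = 0x00FF00FF  # selects two non-adjacent 8-bit components


def blend_naive(src, dst):
    """One component at a time, with a real division by 255 (rounding up)."""
    alpha = src >> 24
    if alpha == 0:
        return dst
    if alpha == 255:
        return src
    un_alpha = 255 - alpha
    result = 0
    for shift in (0, 8, 16):
        c = ((src >> shift) & 0xFF) * alpha + ((dst >> shift) & 0xFF) * un_alpha
        result |= (((c + 254) // 255) & 0xFF) << shift
    # The alpha component uses 255 in place of the source alpha byte.
    a = alpha * 255 + ((dst >> 24) & 0xFF) * un_alpha
    result |= (((a + 254) // 255) & 0xFF) << 24
    return result


def blend_packed(src, dst):
    """Two components per multiply, division by 255 done with shifts."""
    alpha = src >> 24
    if alpha == 0:
        return dst
    if alpha == 255:
        return src
    un_alpha = 255 - alpha
    rb = (src & MASK) * alpha + (dst & MASK) * un_alpha + MASK
    ag = ((((src >> 8) | 0xFF0000) & MASK) * alpha
          + ((dst >> 8) & MASK) * un_alpha + MASK)
    # Per lane: (y + ((y - 1) >> 8)) >> 8, where y = sum + 255.
    rb = ((rb + (((rb - 0x10001) >> 8) & MASK)) >> 8) & MASK
    ag = ((ag + (((ag - 0x10001) >> 8) & MASK)) >> 8) & MASK
    return rb | (ag << 8)
```

For example, blending the source word 16r80FF0000 over the destination word
16rFF0000FF gives 16rFF80007F with either function.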
2013/12/24 Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com>
> I only measured a gain of 25%, not 50%; maybe the division is a bit
> complex...
>
>
> 2013/12/24 Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com>
>
>>
>> 2013/12/23 Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com>
>>
>>>
>>> 2013/12/23 Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com>
>>>
>>>> Currently we use a very clear but naive algorithm:
>>>>
>>>> alpha := sourceWord >> 24. "High 8 bits of source pixel"
>>>> alpha = 0 ifTrue: [ ^ destinationWord ].
>>>> alpha = 255 ifTrue: [ ^ sourceWord ].
>>>> unAlpha := 255 - alpha.
>>>> colorMask := 16rFF.
>>>> result := 0.
>>>>
>>>> "red"
>>>> shift := 0.
>>>> blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
>>>> ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
>>>> + 254 // 255 bitAnd: colorMask.
>>>> result := result bitOr: blend << shift.
>>>> "green"
>>>> shift := 8.
>>>> blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
>>>> ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
>>>> + 254 // 255 bitAnd: colorMask.
>>>> result := result bitOr: blend << shift.
>>>> "blue"
>>>> shift := 16.
>>>> blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
>>>> ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
>>>> + 254 // 255 bitAnd: colorMask.
>>>> result := result bitOr: blend << shift.
>>>> "alpha (pre-multiplied)"
>>>> shift := 24.
>>>> blend := (alpha * 255) +
>>>> ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
>>>> + 254 // 255 bitAnd: colorMask.
>>>> result := result bitOr: blend << shift.
>>>> ^ result
>>>>
>>>>
>>>> Of course, the best we could do to improve it is to use a native OS
>>>> library, when one exists, on the whole bitmap. I leave this path aside; it
>>>> can be handled in platform-specific source, like Tim did for the Pi.
>>>> But still, with our own crafted bits, we could do better than the current
>>>> implementation.
>>>> See
>>>> http://stackoverflow.com/questions/1102692/how-to-do-alpha-blend-fast
>>>>
>>>> Using specific hardware instructions ourselves is not really an
>>>> option for a portable VM; it's better to call a native library if we want
>>>> to have specific optimizations, so I leave SSE instructions aside.
>>>>
>>>> But there are two simple ideas we can recycle from the SO reference above:
>>>>
>>>> 1) multiplex Red+Blue and Alpha+Green computations
>>>> 2) avoid division by 255
>>>>
>>>> Here it is:
>>>>
>>>> "red and blue"
>>>> blend := ((sourceWord bitAnd: 16rFF00FF) * alpha) +
>>>> ((destinationWord bitAnd: 16rFF00FF) * unAlpha) +
>>>> 16rFE00FE.
>>>> "divide by 255"
>>>> blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.
>>>>
>>> I forgot to protect bitAnd: 16rFF00FF but you get the idea...
>>>
>>>
>>>> result := blend.
>>>>
>>>> "alpha and green"
>>>> blend := (((sourceWord>> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) *
>>>> alpha) +
>>>> ((destinationWord>>8 bitAnd: 16rFF00FF) * unAlpha) +
>>>> 16rFE00FE.
>>>> "divide by 255"
>>>> blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.
>>>>
>>>
>>> bitAnd: 16rFF00FF too of course...
>>>
>>>
>>>> result := result bitOr: blend<<8.
>>>> ^ result
>>>>
>>>> For bytes B1 and B2 in (0..255), and since alpha+unAlpha = 255,
>>>> alpha*B1+unAlpha*B2 is in (0..16rFE01), so
>>>> alpha*B1+unAlpha*B2+254 is in (0..16rFEFF).
>>>> So when we multiplex non-adjacent components, we're safe from overflow.
>>>>
>>>> Now for the division by 255 we are also safe: when adding 1 -> (1..16rFF00)
>>>> And when adding (blend>>8 bitAnd: 16rFF) -> (1..16rFFFF)
>>>> We are still free of overflow and can extend the //255 division trick
>>>> to a 32-bit word (the formula given on SO is for 16 bits only).
>>>>
>>>> I expect roughly a x2 factor in throughput, but it's hard to measure.
>>>> What do you think? Is this interesting?
>>>>
>>>
>>> Find corresponding code attached
>>
>
>