<div dir="ltr"><div>Ah, I'm reading BitBltArmSimdAlphaBlend.s right now. I can't really understand ARM assembler, but it looks very much like the same tricks were applied:<br><br> AlphaBlend32_32_init<br> MOV ht_info, #1<br>
MOV ht, #0<br> ORR ht_info, ht_info, ht_info, LSL #16 ; &10001<br> MEND<br><br> MACRO<br> AlphaBlend32_32_1pixel $src, $dst, $tmp0, $tmp1, $tmp2, $known_not_transp<br> [ "$known_not_transp" = ""<br>
MOVS $tmp2, $src, LSR #24 ; s_a<br> BEQ %FT09 ; fully transparent - use dst<br> ]<br> TEQ $tmp2, #&FF<br> BEQ %FT10 ; fully opaque - use src<br> UXTB $tmp0, $src, ROR #8 ; s_ag<br>
ORR $tmp0, $tmp0, #&FF0000<br> UXTB16 $tmp1, $src ; s_rb<br> MUL $tmp0, $tmp0, $tmp2<br> MUL $tmp1, $tmp1, $tmp2<br> RSB $tmp2, $tmp2, #&FF<br> UXTB16 $src, $dst, ROR #8 ; d_ag<br>
UXTB16 $dst, $dst ; d_rb<br> MLA $src, $src, $tmp2, $tmp0 ; ag<br> MLA $dst, $dst, $tmp2, $tmp1 ; rb<br> USUB16 $tmp0, $src, ht_info<br> UXTAB16 $src, $src, $src, ROR #8<br>
SEL $tmp1, ht_info, ht<br> UXTAB16 $src, $tmp1, $src, ROR #8<br> USUB16 $tmp0, $dst, ht_info<br> UXTAB16 $dst, $dst, $dst, ROR #8<br> SEL $tmp1, ht_info, ht<br> UXTAB16 $dst, $tmp1, $dst, ROR #8<br>
ORR $src, $dst, $src, LSL #8 ; recombine<br> B %FT10<br>09 MOV $src, $dst<br><br></div>Here is my latest slang version:<br><br> alpha := sourceWord >> 24. "High 8 bits of source pixel"<br>
alpha = 0 ifTrue: [ ^ destinationWord ].<br> alpha = 255 ifTrue: [ ^ sourceWord ].<br> unAlpha := 255 - alpha.<br><br> blendRB := ((sourceWord bitAnd: 16rFF00FF) * alpha) +<br> ((destinationWord bitAnd: 16rFF00FF) * unAlpha)<br>
+ 16rFF00FF. "blendRB red and blue"<br><br> blendAG := (((sourceWord>> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha) +<br> ((destinationWord>>8 bitAnd: 16rFF00FF) * unAlpha)<br>
+ 16rFF00FF. "blendAG alpha and green"<br><br> blendRB := blendRB + (blendRB - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8 bitAnd: 16rFF00FF. "divide by 255"<br> blendAG := blendAG + (blendAG - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8 bitAnd: 16rFF00FF.<br>
result := blendRB bitOr: blendAG<<8.<br> ^ result<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">2013/12/24 Nicolas Cellier <span dir="ltr"><<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I only measured a gain of 25%, not 50%; maybe the division is a bit complex...<br></div><div class="HOEnZb">
<div class="h5"><div class="gmail_extra"><br><br><div class="gmail_quote">2013/12/24 Nicolas Cellier <span dir="ltr"><<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div><div><br><div class="gmail_quote">2013/12/23 Nicolas Cellier <span dir="ltr"><<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>></span><br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote"><div><div>2013/12/23 Nicolas Cellier <span dir="ltr"><<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>></span><br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div><div><div><div><div><div>Currently we use a very clear but naive algorithm<br>
<br> alpha := sourceWord >> 24. "High 8 bits of source pixel"<br> alpha = 0 ifTrue: [ ^ destinationWord ].<br>
alpha = 255 ifTrue: [ ^ sourceWord ].<br> unAlpha := 255 - alpha.<br> colorMask := 16rFF.<br> result := 0.<br><br> "red"<br> shift := 0.<br> blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +<br>
((destinationWord>>shift bitAnd: colorMask) * unAlpha)<br> + 254 // 255 bitAnd: colorMask.<br> result := result bitOr: blend << shift.<br> "green"<br> shift := 8.<br>
blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +<br> ((destinationWord>>shift bitAnd: colorMask) * unAlpha)<br> + 254 // 255 bitAnd: colorMask.<br> result := result bitOr: blend << shift.<br>
"blue"<br> shift := 16.<br> blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +<br> ((destinationWord>>shift bitAnd: colorMask) * unAlpha)<br> + 254 // 255 bitAnd: colorMask.<br>
result := result bitOr: blend << shift.<br> "alpha (pre-multiplied)"<br> shift := 24.<br> blend := (alpha * 255) +<br> ((destinationWord>>shift bitAnd: colorMask) * unAlpha)<br>
+ 254 // 255 bitAnd: colorMask.<br> result := result bitOr: blend << shift.<br> ^ result<br><br><br>Of course, the best way to improve it would be to call a native OS library, where one exists, on the whole bitmap. I leave that path aside; it can be handled in platform-specific sources, as Tim did for the Pi.<br>
</div>Still, with our own hand-crafted bit tricks, we could do better than the current implementation.<br>See <a href="http://stackoverflow.com/questions/1102692/how-to-do-alpha-blend-fast" target="_blank">http://stackoverflow.com/questions/1102692/how-to-do-alpha-blend-fast</a><br>
<br>Using hardware-specific instructions ourselves is not really an option for a portable VM; it's better to call a native library if we want platform-specific optimizations, so I leave SSE instructions aside.<br><br>
</div>But there are two simple ideas we can recycle from the SO reference above:<br><br></div>1) multiplex the Red+Blue and Alpha+Green computations<br></div>2) avoid the division by 255<br><br></div>Here it is:<br><br> "red and blue"<br>
blend := ((sourceWord bitAnd: 16rFF00FF) * alpha) +<br> ((destinationWord bitAnd: 16rFF00FF) * unAlpha) + 16rFE00FE.<br> "divide by 255"<br> blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.<br>
</div></div></div></div></div></div></div></blockquote></div></div><div>I forgot to protect the result with bitAnd: 16rFF00FF, but you get the idea...<br> <br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr"><div><div><div><div><div><div>
result := blend.<br><br> "alpha and green"<br> blend := (((sourceWord>> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha) +<br> ((destinationWord>>8 bitAnd: 16rFF00FF) * unAlpha) + 16rFE00FE.<br>
"divide by 255"<br> blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.<br></div></div></div></div></div></div></div></blockquote><div><br></div></div><div>bitAnd: 16rFF00FF too of course...<br>
<br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div> result := result bitOr: blend<<8.<br>
^ result<br><br></div>For bytes B1 and B2 in (0..255), alpha*B1+unAlpha*B2 is in (0..16rFE01)<br>
alpha*B1+unAlpha*B2+254 is in (254..16rFEFF)<br></div>So when we multiplex non-adjacent components, each sum fits in its own 16 bits and we're safe from overflow.<br><br></div>Now for the division by 255 we are also safe: adding 1 gives (255..16rFF00)<br></div>
And then adding blend>>8 bitAnd: 16rFF gives (255..16rFFFE)<br></div>We are still free of overflow and can extend the //255 division trick to a 32-bit word (the formula given on SO is for 16 bits only).<br><br></div>I expect roughly a 2x gain in throughput, but it's hard to measure.<br>
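<br>The range argument above can be checked exhaustively for a single 16-bit lane. What follows is only a minimal workspace sketch, assuming a Squeak or Pharo image; the temporary names (mismatches, blend, fast) are made up for illustration and are not taken from the attached code.<br>

```smalltalk
"Compare the shift-based division with a plain // 255 for every value
 v = alpha*B1 + unAlpha*B2 that one colour component can produce."
| mismatches |
mismatches := OrderedCollection new.
0 to: 16rFE01 do: [:v | | blend fast |
    blend := v + 254.                          "rounding bias, one lane of + 16rFE00FE"
    fast := (blend + 1 + (blend >> 8)) >> 8.   "one lane of the divide-by-255 trick"
    fast = (blend // 255) ifFalse: [mismatches add: v]].
mismatches isEmpty  "should answer true if the trick is exact over the whole range"
```

Since the largest intermediate value stays below 16r10000, the same computation on two lanes packed as 16rXXXX00YYYY should not carry from the low lane into the high one.<br>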
<div><div><div><div><div><div>What do you think? Is this interesting?<br></div></div></div></div></div></div></div>
</blockquote></div></div><br></div></div>
</blockquote></div></div></div>Please find the corresponding code attached.</div></div>
</blockquote></div><br></div>
</div></div></blockquote></div><br></div>
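<div>To compare the latest Slang version at the top of this message with the naive per-channel algorithm quoted above, one could run something like the following in a workspace. This is a hedged sketch, assuming a Squeak or Pharo image: the block names naive and fast and the sample pixel values are hypothetical, and the alpha = 0 and alpha = 255 shortcuts are left out because both formulas should reduce to destinationWord and sourceWord in those cases.</div>

```smalltalk
| naive fast samples mismatches |
"Per-channel reference, following the naive algorithm."
naive := [:src :dst | | alpha unAlpha result |
    alpha := src >> 24.
    unAlpha := 255 - alpha.
    result := 0.
    #(0 8 16) do: [:shift | | blend |
        blend := ((src >> shift bitAnd: 16rFF) * alpha)
            + ((dst >> shift bitAnd: 16rFF) * unAlpha) + 254 // 255.
        result := result bitOr: blend << shift].
    "alpha channel: the source alpha is treated as 255"
    result bitOr: ((alpha * 255) + ((dst >> 24) * unAlpha) + 254 // 255) << 24].

"Multiplexed version: red+blue and alpha+green are blended two at a time."
fast := [:src :dst | | alpha unAlpha rb ag |
    alpha := src >> 24.
    unAlpha := 255 - alpha.
    rb := ((src bitAnd: 16rFF00FF) * alpha)
        + ((dst bitAnd: 16rFF00FF) * unAlpha) + 16rFF00FF.
    ag := (((src >> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha)
        + ((dst >> 8 bitAnd: 16rFF00FF) * unAlpha) + 16rFF00FF.
    "divide both lanes by 255 without a division"
    rb := rb + (rb - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8 bitAnd: 16rFF00FF.
    ag := ag + (ag - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8 bitAnd: 16rFF00FF.
    rb bitOr: ag << 8].

"Arbitrary 32-bit ARGB sample pixels, including the extreme alphas."
samples := #(16r00000000 16r01FFFFFF 16r7F123456 16r80808080
             16rC0FFEE42 16rFE00FF00 16rFFFFFFFF).
mismatches := 0.
samples do: [:s |
    samples do: [:d |
        (naive value: s value: d) = (fast value: s value: d)
            ifFalse: [mismatches := mismatches + 1]]].
mismatches  "expected to answer 0 if the two versions agree on these samples"
```

<div>A handful of samples is of course no proof; looping over all alpha values and all byte pairs for one lane would give stronger evidence.</div>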