<div dir="ltr"><div>Ah, I&#39;m reading BitBltArmSimdAlphaBlend.s right now. I can&#39;t really understand ARM assembler, but it looks very much like the same tricks were applied:<br><br>        AlphaBlend32_32_init<br>        MOV     ht_info, #1<br>

        MOV     ht, #0<br>        ORR     ht_info, ht_info, ht_info, LSL #16 ; &amp;10001<br>        MEND<br><br>        MACRO<br>        AlphaBlend32_32_1pixel $src, $dst, $tmp0, $tmp1, $tmp2, $known_not_transp<br>      [ &quot;$known_not_transp&quot; = &quot;&quot;<br>

        MOVS    $tmp2, $src, LSR #24      ; s_a<br>        BEQ     %FT09 ; fully transparent - use dst<br>      ]<br>        TEQ     $tmp2, #&amp;FF<br>        BEQ     %FT10 ; fully opaque - use src<br>        UXTB    $tmp0, $src, ROR #8       ; s_ag<br>

        ORR     $tmp0, $tmp0, #&amp;FF0000<br>        UXTB16  $tmp1, $src               ; s_rb<br>        MUL     $tmp0, $tmp0, $tmp2<br>        MUL     $tmp1, $tmp1, $tmp2<br>        RSB     $tmp2, $tmp2, #&amp;FF<br>        UXTB16  $src, $dst, ROR #8        ; d_ag<br>

        UXTB16  $dst, $dst                ; d_rb<br>        MLA     $src, $src, $tmp2, $tmp0  ; ag<br>        MLA     $dst, $dst, $tmp2, $tmp1  ; rb<br>        USUB16  $tmp0, $src, ht_info<br>        UXTAB16 $src, $src, $src, ROR #8<br>

        SEL     $tmp1, ht_info, ht<br>        UXTAB16 $src, $tmp1, $src, ROR #8<br>        USUB16  $tmp0, $dst, ht_info<br>        UXTAB16 $dst, $dst, $dst, ROR #8<br>        SEL     $tmp1, ht_info, ht<br>        UXTAB16 $dst, $tmp1, $dst, ROR #8<br>

        ORR     $src, $dst, $src, LSL #8  ; recombine<br>        B       %FT10<br>09      MOV     $src, $dst<br><br></div>Here is my latest slang version:<br><br>    alpha := sourceWord &gt;&gt; 24.  &quot;High 8 bits of source pixel&quot;<br>

    alpha = 0 ifTrue: [ ^ destinationWord ].<br>    alpha = 255 ifTrue: [ ^ sourceWord ].<br>    unAlpha := 255 - alpha.<br><br>    blendRB := ((sourceWord bitAnd: 16rFF00FF) * alpha) +<br>                ((destinationWord bitAnd: 16rFF00FF) * unAlpha)<br>

                + 16rFF00FF.    &quot;blendRB red and blue&quot;<br><br>    blendAG := (((sourceWord&gt;&gt; 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha) +<br>                ((destinationWord&gt;&gt;8 bitAnd: 16rFF00FF) * unAlpha)<br>

                + 16rFF00FF.    &quot;blendAG alpha and green&quot;<br><br>    blendRB := blendRB + (blendRB - 16r10001 &gt;&gt; 8 bitAnd: 16rFF00FF) &gt;&gt; 8 bitAnd: 16rFF00FF.    &quot;divide by 255&quot;<br>    blendAG := blendAG + (blendAG - 16r10001 &gt;&gt; 8 bitAnd: 16rFF00FF) &gt;&gt; 8 bitAnd: 16rFF00FF.<br>

    result := blendRB bitOr: blendAG&lt;&lt;8.<br>    ^ result<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">2013/12/24 Nicolas Cellier <span dir="ltr">&lt;<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>&gt;</span><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I only measured a gain of 25%, not 50%; maybe the division is a bit complex...<br></div><div class="HOEnZb">

<div class="h5"><div class="gmail_extra"><br><br><div class="gmail_quote">2013/12/24 Nicolas Cellier <span dir="ltr">&lt;<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>&gt;</span><br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div><div><br><div class="gmail_quote">2013/12/23 Nicolas Cellier <span dir="ltr">&lt;<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>&gt;</span><br>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote"><div><div>2013/12/23 Nicolas Cellier <span dir="ltr">&lt;<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>&gt;</span><br>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div><div><div><div><div><div>Currently we use a very clear but naive algorithm<br>


<br>    alpha := sourceWord &gt;&gt; 24.  &quot;High 8 bits of source pixel&quot;<br>    alpha = 0 ifTrue: [ ^ destinationWord ].<br>

    alpha = 255 ifTrue: [ ^ sourceWord ].<br>    unAlpha := 255 - alpha.<br>    colorMask := 16rFF.<br>    result := 0.<br><br>    &quot;red&quot;<br>    shift := 0.<br>    blend := ((sourceWord &gt;&gt; shift bitAnd: colorMask) * alpha) +<br>


                ((destinationWord&gt;&gt;shift bitAnd: colorMask) * unAlpha)<br>                 + 254 // 255 bitAnd: colorMask.<br>    result := result bitOr: blend &lt;&lt; shift.<br>    &quot;green&quot;<br>    shift := 8.<br>


    blend := ((sourceWord &gt;&gt; shift bitAnd: colorMask) * alpha) +<br>                ((destinationWord&gt;&gt;shift bitAnd: colorMask) * unAlpha)<br>                 + 254 // 255 bitAnd: colorMask.<br>    result := result bitOr: blend &lt;&lt; shift.<br>


    &quot;blue&quot;<br>    shift := 16.<br>    blend := ((sourceWord &gt;&gt; shift bitAnd: colorMask) * alpha) +<br>                ((destinationWord&gt;&gt;shift bitAnd: colorMask) * unAlpha)<br>                 + 254 // 255 bitAnd: colorMask.<br>


    result := result bitOr: blend &lt;&lt; shift.<br>    &quot;alpha (pre-multiplied)&quot;<br>    shift := 24.<br>    blend := (alpha * 255) +<br>                ((destinationWord&gt;&gt;shift bitAnd: colorMask) * unAlpha)<br>


                 + 254 // 255 bitAnd: colorMask.<br>    result := result bitOr: blend &lt;&lt; shift.<br>    ^ result<br><br><br>Of course, the best way to improve it would be to use a native OS library, where one exists, on the whole bitmap. I leave this path aside; it can be handled in platform-specific source, as Tim did for the Pi.<br>


</div>But still, with our own crafted bits, we could do better than the current implementation.<br>See <a href="http://stackoverflow.com/questions/1102692/how-to-do-alpha-blend-fast" target="_blank">http://stackoverflow.com/questions/1102692/how-to-do-alpha-blend-fast</a><br>


<br>Using specific hardware instructions ourselves is not really an option for a portable VM; it&#39;s better to call a native library if we want specific optimizations, so I leave SSE instructions aside.<br><br>


</div>But there are two simple ideas we can recycle from the SO reference above:<br><br></div>1) multiplex Red+Blue and Alpha+Green computations<br></div>2) avoid division by 255<br><br></div>Here it is:<br><br>    &quot;red and blue&quot;<br>


    blend := ((sourceWord bitAnd: 16rFF00FF) * alpha) +<br>                ((destinationWord bitAnd: 16rFF00FF) * unAlpha) + 16rFE00FE.<br>    &quot;divide by 255&quot;<br>    blend := blend + 16r10001 + (blend &gt;&gt; 8 bitAnd: 16rFF00FF) &gt;&gt; 8.<br>


</div></div></div></div></div></div></div></blockquote></div></div><div>I forgot to protect the result with bitAnd: 16rFF00FF, but you get the idea...<br> <br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


<div dir="ltr"><div><div><div><div><div><div>

    result := blend.<br><br>    &quot;alpha and green&quot;<br>    blend := (((sourceWord&gt;&gt; 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha) +<br>                ((destinationWord&gt;&gt;8 bitAnd: 16rFF00FF) * unAlpha) + 16rFE00FE.<br>


    &quot;divide by 255&quot;<br>    blend := blend + 16r10001 + (blend &gt;&gt; 8 bitAnd: 16rFF00FF) &gt;&gt; 8.<br></div></div></div></div></div></div></div></blockquote><div><br></div></div><div>bitAnd: 16rFF00FF too of course...<br>


 <br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div>    result := result bitOr: blend&lt;&lt;8.<br>


    ^ result<br><br></div>For bytes B1 and B2 in (0..255), alpha*B1+unAlpha*B2 is in (0..16rFE01)<br>

alpha*B1+unAlpha*B2+254 is in (0..16rFEFF)<br></div>So when we multiplex non adjacent components, we&#39;re safe from overflow.<br><br></div>Now for division by 255 we are also safe: when adding 1 -&gt; (1..16rFF00)<br></div>


And when adding blend&gt;&gt;8 bitAnd 16rFF -&gt; (1..16rFFFF)<br></div>We are still free of overflow and can extend the //255 division trick to 32bit word (the formula given on SO is for 16bit only).<br><br></div>I expect roughly a x2 factor in throughput, but it&#39;s hard to measure.<br>
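For anyone who wants to check the two ideas above without building a VM, here is a minimal sketch in Python rather than Slang. It is only a stand-in for the real code: the function names are made up for this illustration, and it assumes the final bitAnd: 16rFF00FF protection is applied after the shift. Shifts to the left are written as multiplications to keep the snippet mail-friendly.

```python
# Sketch only: a Python stand-in for the Slang code in this thread.
# All function names here are hypothetical, not part of the VM sources.

def div255_ceil_naive(x):
    """Reference rounding used by the per-channel code: (x + 254) // 255."""
    return (x + 254) // 255


def div255_ceil_trick(x):
    """Same value using only adds and shifts; intended for x in 0..0xFE01."""
    y = x + 254
    return (y + 1 + (y >> 8)) >> 8


def blend_naive(src, dst):
    """One channel at a time, as in the current implementation."""
    alpha = src >> 24
    if alpha == 0:
        return dst
    if alpha == 255:
        return src
    un_alpha = 255 - alpha
    result = 0
    for shift in (0, 8, 16):  # red, green, blue
        s = (src >> shift) & 0xFF
        d = (dst >> shift) & 0xFF
        blend = ((s * alpha + d * un_alpha + 254) // 255) & 0xFF
        result += blend * (2 ** shift)
    # alpha channel: the source alpha byte is treated as 255
    d = (dst >> 24) & 0xFF
    blend = ((alpha * 255 + d * un_alpha + 254) // 255) & 0xFF
    return result + blend * (2 ** 24)


def blend_packed(src, dst):
    """Two channels per multiply (R+B, then A+G), no division."""
    alpha = src >> 24
    if alpha == 0:
        return dst
    if alpha == 255:
        return src
    un_alpha = 255 - alpha
    rb = (src & 0xFF00FF) * alpha + (dst & 0xFF00FF) * un_alpha + 0xFE00FE
    rb = ((rb + 0x10001 + ((rb >> 8) & 0xFF00FF)) >> 8) & 0xFF00FF
    ag = (((src >> 8) | 0xFF0000) & 0xFF00FF) * alpha \
        + ((dst >> 8) & 0xFF00FF) * un_alpha + 0xFE00FE
    ag = ((ag + 0x10001 + ((ag >> 8) & 0xFF00FF)) >> 8) & 0xFF00FF
    return rb | (ag * 256)
```

Comparing blend_packed against blend_naive over a range of pixel pairs is a cheap way to confirm that the multiplexed lanes never carry into each other.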


<div><div><div><div><div><div>What do you think? Is this interesting?<br></div></div></div></div></div></div></div>

</blockquote></div></div><br></div></div>

</blockquote></div></div></div>Find corresponding code attached</div></div>

</blockquote></div><br></div>

</div></div></blockquote></div><br></div>