Little speed up of BitBlt alpha-blending


Nicolas Cellier

23 Dec 2013

4:43 p.m.

Currently we use a very clear but naive algorithm:

alpha := sourceWord >> 24.  "High 8 bits of source pixel"
alpha = 0 ifTrue: [ ^ destinationWord ].
alpha = 255 ifTrue: [ ^ sourceWord ].
unAlpha := 255 - alpha.
colorMask := 16rFF.
result := 0.

"red"
shift := 0.
blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
            ((destinationWord >> shift bitAnd: colorMask) * unAlpha)
             + 254 // 255 bitAnd: colorMask.
result := result bitOr: blend << shift.
"green"
shift := 8.
blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
            ((destinationWord >> shift bitAnd: colorMask) * unAlpha)
             + 254 // 255 bitAnd: colorMask.
result := result bitOr: blend << shift.
"blue"
shift := 16.
blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
            ((destinationWord >> shift bitAnd: colorMask) * unAlpha)
             + 254 // 255 bitAnd: colorMask.
result := result bitOr: blend << shift.
"alpha (pre-multiplied)"
shift := 24.
blend := (alpha * 255) +
            ((destinationWord >> shift bitAnd: colorMask) * unAlpha)
             + 254 // 255 bitAnd: colorMask.
result := result bitOr: blend << shift.
^ result
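For readers who do not read Slang, here is a minimal Python sketch of the per-channel computation above. It is an illustration only, not the VM code; the function name and the sample pixel values are made up for the example. The `+ 254` before the integer division makes each channel round up (a ceiling division by 255).

```python
def alpha_blend_naive(source_word: int, destination_word: int) -> int:
    """Blend one 32-bit source pixel over a destination pixel, one channel at a time."""
    alpha = source_word >> 24              # high 8 bits of the source pixel
    if alpha == 0:
        return destination_word            # fully transparent: keep the destination
    if alpha == 255:
        return source_word                 # fully opaque: take the source
    un_alpha = 255 - alpha
    result = 0
    for shift in (0, 8, 16, 24):
        # For the alpha channel the source value is taken as 255,
        # which matches the (alpha * 255) term in the Slang code.
        src = 255 if shift == 24 else (source_word >> shift) & 0xFF
        dst = (destination_word >> shift) & 0xFF
        blend = (src * alpha + dst * un_alpha + 254) // 255
        result |= (blend & 0xFF) << shift
    return result


# Hypothetical sample: a half-transparent source over an opaque destination.
print(hex(alpha_blend_naive(0x80FF0000, 0xFF0000FF)))  # 0xff80007f
```

This version performs four multiply-add-divide sequences per pixel, which is the cost the rest of this message tries to cut.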

Of course, the best improvement would be to call a native OS library, where one exists, on the whole bitmap. I leave that path aside; it can be handled in platform-specific sources, as tim did for the Pi. Still, with our own hand-crafted bit operations, we can do better than the current implementation. See http://stackoverflow.com/questions/1102692/how-to-do-alpha-blend-fast

Using specific hardware instructions ourselves is not really an option for a portable VM; it's better to call a native library if we want platform-specific optimizations, so I leave SSE instructions aside.

But there are two simple ideas we can reuse from the SO reference above:

1) multiplex the Red+Blue and Alpha+Green computations
2) avoid the division by 255

Here it is:

"red and blue"
blend := ((sourceWord bitAnd: 16rFF00FF) * alpha) +
            ((destinationWord bitAnd: 16rFF00FF) * unAlpha) + 16rFE00FE.
"divide by 255"
blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.
result := blend.

"alpha and green"
blend := (((sourceWord >> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha) +
            ((destinationWord >> 8 bitAnd: 16rFF00FF) * unAlpha) + 16rFE00FE.
"divide by 255"
blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.
result := result bitOr: blend << 8.
^ result
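The two-channels-at-a-time idea can be sketched in Python as follows. This is an illustrative sketch, not the Slang code: the names `MASK`, `div255_pair` and `alpha_blend_multiplexed` are invented here. One deliberate difference from the snippet above: each half is masked with 0xFF00FF after the final shift, because otherwise the low byte of the upper 16-bit lane is left behind in bits 8..15.

```python
MASK = 0xFF00FF  # two 8-bit channels, each with 8 free bits above it


def div255_pair(blend: int) -> int:
    """Divide both 16-bit lanes by 255; each lane already holds x + 254."""
    return ((blend + 0x10001 + ((blend >> 8) & MASK)) >> 8) & MASK


def alpha_blend_multiplexed(source_word: int, destination_word: int) -> int:
    alpha = source_word >> 24
    if alpha == 0:
        return destination_word
    if alpha == 255:
        return source_word
    un_alpha = 255 - alpha
    # Channels at bits 0..7 and 16..23, blended with one multiply per operand.
    rb = ((source_word & MASK) * alpha
          + (destination_word & MASK) * un_alpha + 0xFE00FE)
    # Channels at bits 8..15 and 24..31; the source alpha is forced to 255.
    ag = ((((source_word >> 8) | 0xFF0000) & MASK) * alpha
          + ((destination_word >> 8) & MASK) * un_alpha + 0xFE00FE)
    return div255_pair(rb) | (div255_pair(ag) << 8)


# Same hypothetical sample pixels as a per-channel computation would use.
print(hex(alpha_blend_multiplexed(0x80FF0000, 0xFF0000FF)))  # 0xff80007f
```

Every intermediate value stays below 2**32, so the same arithmetic should carry over to unsigned 32-bit words.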

For bytes B1 and B2 in (0..255), alpha*B1 + unAlpha*B2 is in (0..16rFE01), and alpha*B1 + unAlpha*B2 + 254 is in (0..16rFEFF). Each sum fits in 16 bits, so when we multiplex non-adjacent components, we're safe from overflow.
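These bounds are easy to check numerically. The short Python sketch below (variable names are illustrative) only tries the extreme byte values 0 and 255, which is enough because the sum grows with both B1 and B2.

```python
# Largest value of alpha*B1 + unAlpha*B2, with unAlpha = 255 - alpha.
worst_sum = max(
    alpha * b1 + (255 - alpha) * b2
    for alpha in range(256)
    for b1 in (0, 255)
    for b2 in (0, 255)
)
worst_biased = worst_sum + 254  # after adding the rounding term

# Two such lanes packed 16 bits apart, as in the multiplexed computation.
packed_worst = (worst_biased << 16) | worst_biased

print(hex(worst_sum))          # 0xfe01
print(hex(worst_biased))       # 0xfeff
print(packed_worst < 2 ** 32)  # True
```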

Now for the division by 255 we are also safe: adding 1 gives a value in (1..16rFF00), and adding (blend >> 8 bitAnd: 16rFF) keeps it within (1..16rFFFF). We are still free of overflow and can extend the //255 division trick to a 32-bit word (the formula given on SO is for 16 bits only).

I expect roughly a 2x gain in throughput, but it's hard to measure. What do you think? Is this interesting?
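The division trick itself can be checked exhaustively over the range that matters here. A small Python sketch, with `div255` as an illustrative name:

```python
def div255(v: int) -> int:
    """Shift-and-add replacement for v // 255, intended for 0 <= v <= 0xFEFF."""
    return (v + 1 + (v >> 8)) >> 8


# Exact for every value a biased lane can take.
exact = all(div255(v) == v // 255 for v in range(0xFEFF + 1))

# The intermediate sum never leaves 16 bits, so lanes cannot carry into each other.
largest_intermediate = max(v + 1 + (v >> 8) for v in range(0xFEFF + 1))

print(exact)                      # True
print(hex(largest_intermediate))  # 0xfffe
```

The identity stops holding at v = 65535, which is one more reason the bound of 16rFEFF matters.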


Nicolas Cellier

23 Dec 2013

4:56 p.m.

2013/12/23 Nicolas Cellier nicolas.cellier.aka.nice@gmail.com

...

Here it is:
"red and blue"
blend := ((sourceWord bitAnd: 16rFF00FF) * alpha) +
            ((destinationWord bitAnd: 16rFF00FF) * unAlpha) + 16rFE00FE.
"divide by 255"
blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.

I forgot to protect the result with bitAnd: 16rFF00FF, but you get the idea...

...

result := blend.

"alpha and green"
blend := (((sourceWord >> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha) +
            ((destinationWord >> 8 bitAnd: 16rFF00FF) * unAlpha) + 16rFE00FE.
"divide by 255"
blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.

The bitAnd: 16rFF00FF is needed here too, of course...

...

result := result bitOr: blend<<8.
^ result

Nicolas Cellier

24 Dec 2013

3:58 a.m.

Find the corresponding code attached.

Nicolas Cellier

5:59 a.m.

I only measured a gain of 25%, not 50%; maybe the division is a bit complex...


Nicolas Cellier

10:54 p.m.

Ah, I'm reading BitBltArmSimdAlphaBlend.s right now. I can't really understand ARM assembler, but it looks very much like the same tricks were applied:

    AlphaBlend32_32_init
    MOV     ht_info, #1
    MOV     ht, #0
    ORR     ht_info, ht_info, ht_info, LSL #16 ; &10001
    MEND

    MACRO
    AlphaBlend32_32_1pixel $src, $dst, $tmp0, $tmp1, $tmp2, $known_not_transp
    [ "$known_not_transp" = ""
    MOVS    $tmp2, $src, LSR #24 ; s_a
    BEQ     %FT09 ; fully transparent - use dst
    ]
    TEQ     $tmp2, #&FF
    BEQ     %FT10 ; fully opaque - use src
    UXTB    $tmp0, $src, ROR #8 ; s_ag
    ORR     $tmp0, $tmp0, #&FF0000
    UXTB16  $tmp1, $src ; s_rb
    MUL     $tmp0, $tmp0, $tmp2
    MUL     $tmp1, $tmp1, $tmp2
    RSB     $tmp2, $tmp2, #&FF
    UXTB16  $src, $dst, ROR #8 ; d_ag
    UXTB16  $dst, $dst ; d_rb
    MLA     $src, $src, $tmp2, $tmp0 ; ag
    MLA     $dst, $dst, $tmp2, $tmp1 ; rb
    USUB16  $tmp0, $src, ht_info
    UXTAB16 $src, $src, $src, ROR #8
    SEL     $tmp1, ht_info, ht
    UXTAB16 $src, $tmp1, $src, ROR #8
    USUB16  $tmp0, $dst, ht_info
    UXTAB16 $dst, $dst, $dst, ROR #8
    SEL     $tmp1, ht_info, ht
    UXTAB16 $dst, $tmp1, $dst, ROR #8
    ORR     $src, $dst, $src, LSL #8 ; recombine
    B       %FT10
09  MOV     $src, $dst

Here is my latest Slang version:

alpha := sourceWord >> 24.  "High 8 bits of source pixel"
alpha = 0 ifTrue: [ ^ destinationWord ].
alpha = 255 ifTrue: [ ^ sourceWord ].
unAlpha := 255 - alpha.

blendRB := ((sourceWord bitAnd: 16rFF00FF) * alpha) +
            ((destinationWord bitAnd: 16rFF00FF) * unAlpha)
            + 16rFF00FF.    "blendRB red and blue"

blendAG := (((sourceWord >> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha) +
            ((destinationWord >> 8 bitAnd: 16rFF00FF) * unAlpha)
            + 16rFF00FF.    "blendAG alpha and green"

"divide by 255"
blendRB := blendRB + (blendRB - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8 bitAnd: 16rFF00FF.
blendAG := blendAG + (blendAG - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8 bitAnd: 16rFF00FF.
result := blendRB bitOr: blendAG << 8.
^ result
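A Python transcription of this latest version, for anyone who wants to experiment outside the VM. It is a sketch under the usual caveats: the function and constant names are invented, and Python integers are unbounded, so it says nothing about speed. Compared with the first proposal, the bias per lane is 255 instead of 254, and the `+ 16r10001` becomes a `- 16r10001` inside the shifted term.

```python
LANES = 0xFF00FF  # mask for two 8-bit channels held in 16-bit lanes
ONES = 0x10001    # the value 1 in each lane


def alpha_blend_latest(source_word: int, destination_word: int) -> int:
    alpha = source_word >> 24
    if alpha == 0:
        return destination_word
    if alpha == 255:
        return source_word
    un_alpha = 255 - alpha

    # Adding LANES puts +255 in each lane (the +254 rounding term plus 1).
    blend_rb = ((source_word & LANES) * alpha
                + (destination_word & LANES) * un_alpha + LANES)
    blend_ag = ((((source_word >> 8) | 0xFF0000) & LANES) * alpha
                + ((destination_word >> 8) & LANES) * un_alpha + LANES)

    # Divide each lane by 255. Every lane is at least 255 here,
    # so subtracting ONES never borrows across lanes.
    blend_rb = ((blend_rb + (((blend_rb - ONES) >> 8) & LANES)) >> 8) & LANES
    blend_ag = ((blend_ag + (((blend_ag - ONES) >> 8) & LANES)) >> 8) & LANES
    return blend_rb | (blend_ag << 8)


print(hex(alpha_blend_latest(0x80FF0000, 0xFF0000FF)))  # 0xff80007f
```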


Nicolas Cellier

27 Dec 2013

6:06 p.m.

Recorded at http://bugs.squeak.org/view.php?id=7803



vm-dev@lists.squeakfoundation.org
