Currently we use a very clear but naive algorithm:

    alpha := sourceWord >> 24.  "High 8 bits of source pixel"
    alpha = 0 ifTrue: [ ^ destinationWord ].
    alpha = 255 ifTrue: [ ^ sourceWord ].
    unAlpha := 255 - alpha.
    colorMask := 16rFF.
    result := 0.

    "red"
    shift := 0.
    blend := ((sourceWord >> shift bitAnd: colorMask) * alpha)
        + ((destinationWord >> shift bitAnd: colorMask) * unAlpha)
        + 254 // 255 bitAnd: colorMask.
    result := result bitOr: blend << shift.

    "green"
    shift := 8.
    blend := ((sourceWord >> shift bitAnd: colorMask) * alpha)
        + ((destinationWord >> shift bitAnd: colorMask) * unAlpha)
        + 254 // 255 bitAnd: colorMask.
    result := result bitOr: blend << shift.

    "blue"
    shift := 16.
    blend := ((sourceWord >> shift bitAnd: colorMask) * alpha)
        + ((destinationWord >> shift bitAnd: colorMask) * unAlpha)
        + 254 // 255 bitAnd: colorMask.
    result := result bitOr: blend << shift.

    "alpha (pre-multiplied)"
    shift := 24.
    blend := (alpha * 255)
        + ((destinationWord >> shift bitAnd: colorMask) * unAlpha)
        + 254 // 255 bitAnd: colorMask.
    result := result bitOr: blend << shift.
    ^ result
Of course, the best improvement would be to call a native OS library, where one exists, on the whole bitmap. I leave that path aside; it can be handled in the platform-specific sources, as Tim did for the Pi. Still, with our own hand-crafted bit manipulation, we can do better than the current implementation. See http://stackoverflow.com/questions/1102692/how-to-do-alpha-blend-fast
Using specific hardware instructions ourselves is not really an option for a portable VM; it is better to call a native library if we want platform-specific optimizations, so I leave SSE instructions aside.
But there are two simple ideas we can recycle from the SO reference above:
- multiplex the Red+Blue and Alpha+Green computations
- avoid the division by 255
Here it is:

    "red and blue"
    blend := ((sourceWord bitAnd: 16rFF00FF) * alpha)
        + ((destinationWord bitAnd: 16rFF00FF) * unAlpha)
        + 16rFE00FE.
    "divide by 255"
    blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.
    result := blend.

    "alpha and green"
    blend := (((sourceWord >> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha)
        + ((destinationWord >> 8 bitAnd: 16rFF00FF) * unAlpha)
        + 16rFE00FE.
    "divide by 255"
    blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.
    result := result bitOr: blend << 8.
    ^ result
For bytes B1 and B2 in (0..255):
    alpha*B1 + unAlpha*B2 is in (0..16rFE01)
    alpha*B1 + unAlpha*B2 + 254 is in (0..16rFEFF)
So when we multiplex non-adjacent components, we are safe from overflow.
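The packing argument can be tried outside the VM. The following is a rough Python sketch, not the Slang source: the names `blend_per_channel`, `blend_multiplexed` and `div255_pair` are made up for illustration, pixels are assumed to be 32-bit words with alpha in the high byte, and the alpha = 0 / alpha = 255 shortcuts are left out because they do not change the result. The multiplexed version includes the final 16rFF00FF mask, which the code as posted omits.

```python
MASK = 0xFF00FF  # keeps two non-adjacent bytes, one per 16-bit lane


def blend_per_channel(src: int, dst: int) -> int:
    """One component at a time, rounding each weighted sum up."""
    alpha = src >> 24
    un_alpha = 255 - alpha
    result = 0
    for shift in (0, 8, 16, 24):
        # The source alpha component is taken as 255, as in the posted code.
        s = 255 if shift == 24 else (src >> shift) & 0xFF
        d = (dst >> shift) & 0xFF
        result |= ((s * alpha + d * un_alpha + 254) // 255) << shift
    return result


def div255_pair(blend: int) -> int:
    """Divide both 16-bit lanes by 255 with shifts and adds, then mask."""
    return ((blend + 0x10001 + ((blend >> 8) & MASK)) >> 8) & MASK


def blend_multiplexed(src: int, dst: int) -> int:
    """Two components per multiplication, packed 16 bits apart."""
    alpha = src >> 24
    un_alpha = 255 - alpha
    # Components at shifts 0 and 16, each biased by 254 for rounding up.
    rb = (src & MASK) * alpha + (dst & MASK) * un_alpha + 0xFE00FE
    # Components at shifts 8 and 24; the source alpha lane is forced to 255.
    ag = ((((src >> 8) | 0xFF0000) & MASK) * alpha
          + ((dst >> 8) & MASK) * un_alpha + 0xFE00FE)
    return div255_pair(rb) | (div255_pair(ag) << 8)
```

For example, with src = 0x800000FF and dst = 0xFF000000 both functions return 0xFF000080. Because each lane stays below 16 bits, the two multiplications never carry into the neighbouring component.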
Now for the division by 255 we are also safe:
    when adding 1 -> (1..16rFF00)
    and when adding blend>>8 bitAnd: 16rFF -> (1..16rFFFF)
We are still free of overflow and can extend the //255 division trick to a 32-bit word (the formula given on SO is for 16 bits only).
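These bounds are small enough to check by brute force. A minimal Python sketch, assuming nothing beyond the formulas above (the helper name `div255` and the variable names are illustrative only):

```python
def div255(x: int) -> int:
    """Shift-and-add replacement for x // 255, valid for 0 <= x < 65535."""
    return (x + 1 + (x >> 8)) >> 8


# Largest weighted sum alpha*B1 + unAlpha*B2 with alpha + unAlpha = 255.
# The sum is linear in B1 and B2, so checking the extremes 0 and 255 suffices.
max_sum = max(a * b1 + (255 - a) * b2
              for a in range(256) for b1 in (0, 255) for b2 in (0, 255))
max_biased = max_sum + 254                              # after the rounding bias
max_intermediate = max_biased + 1 + (max_biased >> 8)   # largest value inside div255

# The trick agrees with true integer division on every value below 65535.
exact = all(div255(x) == x // 255 for x in range(65535))
```

Here `max_sum` comes out as 16rFE01, `max_biased` as 16rFEFF and `max_intermediate` as 16rFFFE, so each 16-bit lane stays clear of its neighbour, and `exact` is true over the whole range. The trick first fails at x = 65535, which the bounds above never reach.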
I expect roughly a 2x gain in throughput, but it is hard to measure. What do you think? Is this interesting?
2013/12/23 Nicolas Cellier nicolas.cellier.aka.nice@gmail.com
In the red-and-blue "divide by 255" step I forgot to mask the shifted result with bitAnd: 16rFF00FF, but you get the idea...
The alpha-and-green "divide by 255" step needs the same bitAnd: 16rFF00FF mask too, of course...
Find corresponding code attached
I only measured a gain of 25%, not 50%; maybe the division is a bit complex...
2013/12/24 Nicolas Cellier nicolas.cellier.aka.nice@gmail.com
Ah, I'm reading BitBltArmSimdAlphaBlend.s right now. I can't really understand ARM assembler, but it looks very much like the same tricks were applied:

            AlphaBlend32_32_init
            MOV     ht_info, #1
            MOV     ht, #0
            ORR     ht_info, ht_info, ht_info, LSL #16  ; &10001
            MEND

            MACRO
            AlphaBlend32_32_1pixel $src, $dst, $tmp0, $tmp1, $tmp2, $known_not_transp
          [ "$known_not_transp" = ""
            MOVS    $tmp2, $src, LSR #24        ; s_a
            BEQ     %FT09                       ; fully transparent - use dst
          ]
            TEQ     $tmp2, #&FF
            BEQ     %FT10                       ; fully opaque - use src
            UXTB    $tmp0, $src, ROR #8         ; s_ag
            ORR     $tmp0, $tmp0, #&FF0000
            UXTB16  $tmp1, $src                 ; s_rb
            MUL     $tmp0, $tmp0, $tmp2
            MUL     $tmp1, $tmp1, $tmp2
            RSB     $tmp2, $tmp2, #&FF
            UXTB16  $src, $dst, ROR #8          ; d_ag
            UXTB16  $dst, $dst                  ; d_rb
            MLA     $src, $src, $tmp2, $tmp0    ; ag
            MLA     $dst, $dst, $tmp2, $tmp1    ; rb
            USUB16  $tmp0, $src, ht_info
            UXTAB16 $src, $src, $src, ROR #8
            SEL     $tmp1, ht_info, ht
            UXTAB16 $src, $tmp1, $src, ROR #8
            USUB16  $tmp0, $dst, ht_info
            UXTAB16 $dst, $dst, $dst, ROR #8
            SEL     $tmp1, ht_info, ht
            UXTAB16 $dst, $tmp1, $dst, ROR #8
            ORR     $src, $dst, $src, LSL #8    ; recombine
            B       %FT10
    09      MOV     $src, $dst

Here is my latest Slang version:

    alpha := sourceWord >> 24.  "High 8 bits of source pixel"
    alpha = 0 ifTrue: [ ^ destinationWord ].
    alpha = 255 ifTrue: [ ^ sourceWord ].
    unAlpha := 255 - alpha.

    "blendRB: red and blue"
    blendRB := ((sourceWord bitAnd: 16rFF00FF) * alpha)
        + ((destinationWord bitAnd: 16rFF00FF) * unAlpha)
        + 16rFF00FF.

    "blendAG: alpha and green"
    blendAG := (((sourceWord >> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha)
        + ((destinationWord >> 8 bitAnd: 16rFF00FF) * unAlpha)
        + 16rFF00FF.

    "divide by 255"
    blendRB := blendRB + (blendRB - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8 bitAnd: 16rFF00FF.
    blendAG := blendAG + (blendAG - 16r10001 >> 8 bitAnd: 16rFF00FF) >> 8 bitAnd: 16rFF00FF.
    result := blendRB bitOr: blendAG << 8.
    ^ result
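This version biases each lane by 255 instead of 254 and takes the high byte of (blend - 16r10001), which should give the same rounded-up quotient as the first formulation. A small Python sketch to compare the two on packed pairs; the function names are hypothetical and `weighted` stands for the two weighted sums alpha*B1 + unAlpha*B2 packed 16 bits apart:

```python
MASK = 0xFF00FF


def div255_pair_v1(weighted: int) -> int:
    """First formulation: bias 254 per lane, then add 1 and the high byte."""
    blend = weighted + 0xFE00FE
    return ((blend + 0x10001 + ((blend >> 8) & MASK)) >> 8) & MASK


def div255_pair_v2(weighted: int) -> int:
    """Later formulation: bias 255 per lane, add the high byte of (blend - 1)."""
    blend = weighted + 0xFF00FF
    return ((blend + (((blend - 0x10001) >> 8) & MASK)) >> 8) & MASK


def ceil_div255(v: int) -> int:
    """Reference: v / 255 rounded up, using true integer division."""
    return (v + 254) // 255


def check(values) -> bool:
    """Compare both formulations with the reference on every pair of lanes."""
    return all(
        div255_pair_v1((hi << 16) | lo)
        == div255_pair_v2((hi << 16) | lo)
        == ((ceil_div255(hi) << 16) | ceil_div255(lo))
        for hi in values for lo in values)
```

Calling `check` on lane values between 0 and 16rFE01 (the largest possible weighted sum) exercises both lanes at once. Since each lane is at least 255 after the bias, subtracting 16r10001 never borrows across lanes.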
Recorded at http://bugs.squeak.org/view.php?id=7803
vm-dev@lists.squeakfoundation.org