[Vm-dev] Re: Little speed up of BitBlt alpha-blending

Mon Dec 23 15:56:07 UTC 2013

2013/12/23 Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com>

> Currently we use a very clear but naive algorithm
>
>     alpha := sourceWord >> 24.  "High 8 bits of source pixel"
>     alpha = 0 ifTrue: [ ^ destinationWord ].
>     alpha = 255 ifTrue: [ ^ sourceWord ].
>     unAlpha := 255 - alpha.
>     colorMask := 16rFF.
>     result := 0.
>
>     "red"
>     shift := 0.
>     blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
>                 ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
>                  + 254 // 255 bitAnd: colorMask.
>     result := result bitOr: blend << shift.
>     "green"
>     shift := 8.
>     blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
>                 ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
>                  + 254 // 255 bitAnd: colorMask.
>     result := result bitOr: blend << shift.
>     "blue"
>     shift := 16.
>     blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
>                 ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
>                  + 254 // 255 bitAnd: colorMask.
>     result := result bitOr: blend << shift.
>     "alpha (pre-multiplied)"
>     shift := 24.
>     blend := (alpha * 255) +
>                 ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
>                  + 254 // 255 bitAnd: colorMask.
>     result := result bitOr: blend << shift.
>     ^ result
>
>
> Of course, the best we could do to improve it is using a native OS library
> when it exists on the whole bitmap. I let this path apart, it can be
> handled at platform specific source like tim did for Pi.
> But still, with our own crafted bits, we could do better than current
> implementation.
> See http://stackoverflow.com/questions/1102692/how-to-do-alpha-blend-fast
>
> Using specific hardware instructions by ourselves is not really an option
> for a portable VM, it's better to call a native library if we cant to have
> specific optimizations, so i let SSE instructions apart.
>
> But there are two simple ideas we can recycle from above SO reference:
>
> 1) multiplex Red+Blue and Alpha+Green computations
> 2) avoid division by 255
>
> Here it is:
>
>     "red and blue"
>     blend := ((sourceWord bitAnd: 16rFF00FF) * alpha) +
>                 ((destinationWord bitAnd: 16rFF00FF) * unAlpha) +
> 16rFE00FE.
>     "divide by 255"
>     blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.
>
I forgot to protect  bitAnd: 16rFF00FF but you get the idea...

>     result := blend.
>
>     "alpha and green"
>     blend := (((sourceWord>> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) *
> alpha) +
>                 ((destinationWord>>8 bitAnd: 16rFF00FF) * unAlpha) +
> 16rFE00FE.
>     "divide by 255"
>     blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.
>

bitAnd: 16rFF00FF too of course...

>     result := result bitOr: blend<<8.
>     ^ result
>
> For bytes B1 and B2 in (0..255), alpha*B1+unAlpha*B2 is in (0..16rFE01)
> alpha*B1+unAlpha*B2+254 is in (0..16rFEFF)
> So when we multiplex non adjacent components, we're safe from overflow.
>
> Now for division by 255 we are also safe: when adding 1 -> (1..16rFF00)
> And when adding blend>>8 bitAnd 16rFF -> (1..16rFFFF)
> We are still free of overflow and can extend the //255 division trick to
> 32bit word (the formula given on SO is for 16bit only).
>
> I expect roughly a x2 factor in throughput, but it's hard to measure.
> What do you think? Is this interesting?
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20131223/e4091a01/attachment.htm