[Vm-dev] Little speed up of BitBlt alpha-blending

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Mon Dec 23 15:43:04 UTC 2013


Currently we use a very clear but naive algorithm

    alpha := sourceWord >> 24.  "High 8 bits of source pixel"
    alpha = 0 ifTrue: [ ^ destinationWord ].
    alpha = 255 ifTrue: [ ^ sourceWord ].
    unAlpha := 255 - alpha.
    colorMask := 16rFF.
    result := 0.

    "red"
    shift := 0.
    blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
                ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
                 + 254 // 255 bitAnd: colorMask.
    result := result bitOr: blend << shift.
    "green"
    shift := 8.
    blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
                ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
                 + 254 // 255 bitAnd: colorMask.
    result := result bitOr: blend << shift.
    "blue"
    shift := 16.
    blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
                ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
                 + 254 // 255 bitAnd: colorMask.
    result := result bitOr: blend << shift.
    "alpha (pre-multiplied)"
    shift := 24.
    blend := (alpha * 255) +
                ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
                 + 254 // 255 bitAnd: colorMask.
    result := result bitOr: blend << shift.
    ^ result


Of course, the best we could do to improve it is using a native OS library
when it exists on the whole bitmap. I let this path apart, it can be
handled at platform specific source like tim did for Pi.
But still, with our own crafted bits, we could do better than current
implementation.
See http://stackoverflow.com/questions/1102692/how-to-do-alpha-blend-fast

Using specific hardware instructions by ourselves is not really an option
for a portable VM, it's better to call a native library if we cant to have
specific optimizations, so i let SSE instructions apart.

But there are two simple ideas we can recycle from above SO reference:

1) multiplex Red+Blue and Alpha+Green computations
2) avoid division by 255

Here it is:

    "red and blue"
    blend := ((sourceWord bitAnd: 16rFF00FF) * alpha) +
                ((destinationWord bitAnd: 16rFF00FF) * unAlpha) + 16rFE00FE.
    "divide by 255"
    blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.
    result := blend.

    "alpha and green"
    blend := (((sourceWord>> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) *
alpha) +
                ((destinationWord>>8 bitAnd: 16rFF00FF) * unAlpha) +
16rFE00FE.
    "divide by 255"
    blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.
    result := result bitOr: blend<<8.
    ^ result

For bytes B1 and B2 in (0..255), alpha*B1+unAlpha*B2 is in (0..16rFE01)
alpha*B1+unAlpha*B2+254 is in (0..16rFEFF)
So when we multiplex non adjacent components, we're safe from overflow.

Now for division by 255 we are also safe: when adding 1 -> (1..16rFF00)
And when adding blend>>8 bitAnd 16rFF -> (1..16rFFFF)
We are still free of overflow and can extend the //255 division trick to
32bit word (the formula given on SO is for 16bit only).

I expect roughly a x2 factor in throughput, but it's hard to measure.
What do you think? Is this interesting?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20131223/5c3a8208/attachment.htm


More information about the Vm-dev mailing list