If we wanted to handle 16-bit depth without splitting (thus 6 channels in parallel for each 32-bit word), it would get a bit tricky because of the dead bit. We would need two different masks, because shifting one mask does not produce the other:
    lowMask := 2r00000011111000000111110000011111. "green2-red1-blue1"
    highMask := 2r01111100000111110000001111100000. "red2-blue2-green1"
    highWordShift := 27.
    doubleGroupMask := 2r0000001111100000111110000001111100000011111000000111110000011111. "highMask << highWordShift + lowMask"
    doubleWord1 := word1 bitAnd: highMask.
    doubleWord2 := word2 bitAnd: highMask.
    doubleWord1 := doubleWord1 << highWordShift + (word1 bitAnd: lowMask).
    doubleWord2 := doubleWord2 << highWordShift + (word2 bitAnd: lowMask).

Then the shifts for accessing each component in the double word would be tricky: the step is either 10 or 11 bits in the loop, because highMask << highWordShift puts green1, blue2 and red2 at bits 32, 43 and 53 (offsets 0 10 21 32 43 53).

    extraShift := 2r01110. "bit i set: the step after component i is 11 bits instead of 10"
    shift := 0.
    0 to: 5 do: [:i |
        doubleWordMul := doubleWordMul + (((doubleWordSrc >> shift bitAnd: channelMask) * (doubleWordSrc >> shift bitAnd: channelMask) + half) << shift).
        shift := shift + (2 * nBits) + (extraShift >> i bitAnd: 1)].
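The component offsets can be cross-checked directly from the two masks. What follows is only a workspace sketch, assuming a Squeak/Pharo-style Integer protocol (`<<`, `lowBit`, `bitOr:`) and nBits = 5. The temporaries `remaining`, `offsets` and `offset` are illustrative names, not part of the code above. It peels the 5-bit fields off highMask << highWordShift + lowMask, lowest first, then derives the extra-shift bit table from the distances between them.

```smalltalk
| nBits channelMask lowMask highMask highWordShift doubleGroupMask remaining offsets extraShift |
nBits := 5.
channelMask := (1 << nBits) - 1.
lowMask := 2r00000011111000000111110000011111.    "green2-red1-blue1"
highMask := 2r01111100000111110000001111100000.   "red2-blue2-green1"
highWordShift := 27.
doubleGroupMask := (highMask << highWordShift) + lowMask.

"Peel the 5-bit fields off the combined mask, lowest first, recording each offset."
remaining := doubleGroupMask.
offsets := OrderedCollection new.
[remaining > 0] whileTrue: [
    | offset |
    offset := remaining lowBit - 1.    "lowBit is 1-based"
    offsets add: offset.
    remaining := remaining - (channelMask << offset)].

"Bit i of extraShift is set when the step from field i to field i+1 exceeds 2 * nBits."
extraShift := 0.
1 to: offsets size - 1 do: [:i |
    ((offsets at: i + 1) - (offsets at: i)) > (2 * nBits)
        ifTrue: [extraShift := extraShift bitOr: 1 << (i - 1)]].

offsets asArray.                  "print it: #(0 10 21 32 43 53)"
extraShift = 2r01110.             "print it: true"
(lowMask << nBits) = highMask.    "print it: false"
```

The last expression shows why a single shifted mask is not enough: shifting lowMask by 5 moves red1 to bits 15-19, while blue2 lives at bits 16-20, one bit higher because of the dead bit at position 15.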
The rest should work unchanged.