<div dir="ltr"><div>I have yet another version published at <a href="http://smalltalkhub.com/mc/nice/NiceVMExperiments/main">http://smalltalkhub.com/mc/nice/NiceVMExperiments/main</a> to handle a 2x16bits word in a single pass<br>
</div>Of course, speed up is spectacular on low depth (around x5 at depth 1, x2 at depth 4, only 40% at depth 16 and 30% at depth 32).<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">2013/12/24 Nicolas Cellier <span dir="ltr"><<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">2013/12/24 Nicolas Cellier <span dir="ltr"><<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div class="gmail_extra"><br><div class="gmail_quote">2013/12/24 Nicolas Cellier <span dir="ltr"><<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>></span><br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div>I got an idea while reading this nice blending trick:<br>
<a href="http://stackoverflow.com/questions/20681590/how-does-this-color-blending-trick-work" target="_blank">http://stackoverflow.com/questions/20681590/how-does-this-color-blending-trick-work</a><br>
<br></div>We could do a similar thing when adding components in BitBlt<br>partitionedAdd: word1 to: word2 nBits: nBits nPartitions: nParts<br><br></div>For a simple example, say I have 2 parts of 4 bits<br></div><div>I will start reasonning with extended precision (thus on 16 bits here)</div>
<br></div>"The sum with carry is" sumWithCarry:=word1+word2.<br></div>"But carry must not be propagated past each component."<br></div>"So we must care of what happens at" carryOverflowMask := 2r100010000.<div>
<div><div><div>"The sum without any carry is" sumWithoutCarry:=word1 bitXor: word2.<br><div></div></div><div>"If the sum without carry differ from sum with carry then an overflow occured."<br></div><div>
"We can thus detect presence of a carry overflow:"<br>
</div><div>carryOverflow := (sumWithCarry bitXor: sumWithoutCarry) bitAnd: carryOverflowMask.<br></div><div>"If an undue carry occured, we just removet it:"<br></div><div>sumWithoutUndueCarry := sumWithCarry - carryOverflow.<br>
</div><div>"But in this case, previous component did overflow, we must saturate it at 2r1111, that is:" componentMask:=1<<nBits-1.<br></div><div>"we just have to multiply each carryOverflow bit with this componentMask:"<br>
</div><div>result := sumWithoutUndueCarry bitOr: carryOverflow >> nBit * componentMask.<br><br>-----------<br><br></div><div>"Generalization: note that 2r00010001 * componentMask = 2r11111111"<br></div><div>
"Correlatively, for arbitrary nBits and nParts parameters:" <br>
</div><div>carryOverflowMask := 1<<(nParts*nBits)-1//componentMask<<nBits.<br><br>-----------<br><br></div><div>In BitBlt, we want to operate on 32bits words.<br></div><div>We can use #usqLong (64 bits at least) all the way, and obtain branch free replacement for above method<br>
<br>
<var: #word1 type: 'unsigned int'><br> <var: #word2 type: 'unsigned int'><br> <var: #one type: #usqLong><br> <var: #componentMask type: #usqLong><br> <var: #carryOverflowMask type: #usqLong><br>
<var: #carryOverflow type: #usqLong><br>
<var: #sum type: #usqLong><br> one := 1.<br> componentMask := one<<nBits-1.<br> carryOverflowMask := one<<(nBits*nParts)-1//componentMask<<nBits.<br> sum := word1.<br> sum := sum + word2.<br>
carryOverflow := ((word1 bitXor: word2) bitXor: sum) bitAnd: carryOverflowMask.<br> ^sum-carryOverflow bitOr: carryOverflow>>nBits * componentMask<br><br>-----------<br><br></div><div>But maybe we can do better and express all with native unsigned int operations.<br>
</div><div>We must then look if bits at 2r10001000 could produce a carry.<br></div><div>We have a carry in next bit if at least 2 out of the 3 are true among word1 word2 carry...<br></div><div>That is (word1 bitAnd: word2) bitOr: ((word1 bitOr: word2) bitAnd: carry).<br>
<br>
<var: #word1 type: 'unsigned int'><br> <var: #word2 type: 'unsigned int'><br> <var: #one type: #usqLong> "because we cannot shift <<32 a usqInt in C..."<br> <var: #componentMask type: 'unsigned int'><br>
<var: #carryOverflowMask type: 'unsigned int'><br> <var: #carryOverflow type: 'unsigned int'><br> <var: #carry type: 'unsigned int'><br>
<var: #sum type: 'unsigned int'><br> one := 1.<br> componentMask := one<<nBits-1.<br> carryOverflowMask := one<<(nBits*nParts)-1.<br> carryOverflowMask := carryOverflowMask//componentMask<<(nBits-1).<br>
<div> sum := word1.<br> sum := sum + word2.<br></div><div> carry := (word1 bitXor: word2) bitXor: sum.</div><div>
carryOverflow := ((word1 bitAnd: word2) bitOr: ((word1 bitOr: word2) bitAnd: carry)) bitAnd: carryOverflowMask.<br> ^sum-(carryOverflow<<1) bitOr: carryOverflow>>(nBits-1) * componentMask<br></div><br>
</div><div>Maybe some good soul may help reducing # ops further, but we already have a branch free sum.<br></div><div>I did no performance test.<br></div><div>Does one have good BitBlt benchmark?<br></div><div><br></div>
</div></div></div></div>
</blockquote></div><br></div></div></div><div class="gmail_extra">Attached is what I could come with minimal ops...<br></div><div class="gmail_extra">The carry overflow can be written with this table<br><br></div><div class="gmail_extra">
0 0 0 0 1 1 1 1 word1<br></div><div class="gmail_extra">0 0 1 1 0 0 1 1 word2<br></div><div class="gmail_extra">0 1 0 1 0 1 0 1 carry<br>-------------------<br>0 0 0 1 0 1 0 1 next carry if at least 2 out of 3</div><div class="gmail_extra">
0 1 1 0 1 0 0 1 sum=word1+word2 (with carry)<br></div><div class="gmail_extra">0 0 0 1 0 1 x x (word1 bitOr: word2) bitAnd: sum bitInvert<br></div><div class="gmail_extra">0 0 0 0 0 0 1 1 (word1 bitAnd: word2)<br><br></div>
<div class="gmail_extra">The bit masks are not computed when depths are known constants<br></div></div></blockquote><div><br></div></div></div><div>Arghh, I posted too fast, it was bitInvert32 because that's the only thing that the CCodeGenerator understand by now.<br>
</div><div>Hmm this is wrong, ~x works for other int types than int32/uint32, but well...<br> <br></div></div><br></div></div>
</blockquote></div><br></div>