<div dir="ltr"><div>I have yet another version published at <a href="http://smalltalkhub.com/mc/nice/NiceVMExperiments/main">http://smalltalkhub.com/mc/nice/NiceVMExperiments/main</a> to handle a 2x16bits word in a single pass<br>

</div>Of course, speed up is spectacular on low depth (around x5 at depth 1, x2 at depth 4, only 40% at depth 16 and 30% at depth 32).<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">2013/12/24 Nicolas Cellier <span dir="ltr">&lt;<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>&gt;</span><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">2013/12/24 Nicolas Cellier <span dir="ltr">&lt;<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>&gt;</span><br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div class="gmail_extra"><br><div class="gmail_quote">2013/12/24 Nicolas Cellier <span dir="ltr">&lt;<a href="mailto:nicolas.cellier.aka.nice@gmail.com" target="_blank">nicolas.cellier.aka.nice@gmail.com</a>&gt;</span><br>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div>I got an idea while reading this nice blending trick:<br>


<a href="http://stackoverflow.com/questions/20681590/how-does-this-color-blending-trick-work" target="_blank">http://stackoverflow.com/questions/20681590/how-does-this-color-blending-trick-work</a><br>


<br></div>We could do a similar thing when adding components in BitBlt<br>partitionedAdd: word1 to: word2 nBits: nBits nPartitions: nParts<br><br></div>For a simple example, say I have 2 parts of 4 bits<br></div><div>I will start reasonning with extended precision (thus on 16 bits here)</div>


<br></div>&quot;The sum with carry is&quot; sumWithCarry:=word1+word2.<br></div>&quot;But carry must not be propagated past each component.&quot;<br></div>&quot;So we must care of what happens at&quot; carryOverflowMask := 2r100010000.<div>


<div><div><div>&quot;The sum without any carry is&quot; sumWithoutCarry:=word1 bitXor: word2.<br><div></div></div><div>&quot;If the sum without carry differ from sum with carry then an overflow occured.&quot;<br></div><div>


&quot;We can thus detect presence of a carry overflow:&quot;<br>

</div><div>carryOverflow := (sumWithCarry bitXor: sumWithoutCarry) bitAnd: carryOverflowMask.<br></div><div>&quot;If an undue carry occured, we just removet it:&quot;<br></div><div>sumWithoutUndueCarry := sumWithCarry - carryOverflow.<br>


</div><div>&quot;But in this case, previous component did overflow, we must saturate it at 2r1111, that is:&quot; componentMask:=1&lt;&lt;nBits-1.<br></div><div>&quot;we just have to multiply each carryOverflow bit with this componentMask:&quot;<br>


</div><div>result := sumWithoutUndueCarry bitOr: carryOverflow &gt;&gt; nBit * componentMask.<br><br>-----------<br><br></div><div>&quot;Generalization: note that 2r00010001 * componentMask = 2r11111111&quot;<br></div><div>


&quot;Correlatively, for arbitrary nBits and nParts parameters:&quot; <br>


</div><div>carryOverflowMask := 1&lt;&lt;(nParts*nBits)-1//componentMask&lt;&lt;nBits.<br><br>-----------<br><br></div><div>In BitBlt, we want to operate on 32bits words.<br></div><div>We can use #usqLong (64 bits at least) all the way, and obtain branch free replacement for above method<br>


   <br>

    &lt;var: #word1 type: &#39;unsigned int&#39;&gt;<br>    &lt;var: #word2 type: &#39;unsigned int&#39;&gt;<br>    &lt;var: #one type: #usqLong&gt;<br>    &lt;var: #componentMask type: #usqLong&gt;<br>    &lt;var: #carryOverflowMask type: #usqLong&gt;<br>


    &lt;var: #carryOverflow type: #usqLong&gt;<br>

    &lt;var: #sum type: #usqLong&gt;<br>    one := 1.<br>    componentMask := one&lt;&lt;nBits-1.<br>    carryOverflowMask := one&lt;&lt;(nBits*nParts)-1//componentMask&lt;&lt;nBits.<br>    sum := word1.<br>    sum := sum + word2.<br>


    carryOverflow := ((word1 bitXor: word2) bitXor: sum) bitAnd: carryOverflowMask.<br>    ^sum-carryOverflow bitOr: carryOverflow&gt;&gt;nBits * componentMask<br><br>-----------<br><br></div><div>But maybe we can do better and express all with native unsigned int operations.<br>


</div><div>We must then look if bits at 2r10001000 could produce a carry.<br></div><div>We have a carry in next bit if at least 2 out of the 3 are true among word1 word2 carry...<br></div><div>That is (word1 bitAnd: word2) bitOr: ((word1 bitOr: word2) bitAnd: carry).<br>


<br>

    &lt;var: #word1 type: &#39;unsigned int&#39;&gt;<br>    &lt;var: #word2 type: &#39;unsigned int&#39;&gt;<br>    &lt;var: #one type: #usqLong&gt; &quot;because we cannot shift &lt;&lt;32 a usqInt in C...&quot;<br>    &lt;var: #componentMask type: &#39;unsigned int&#39;&gt;<br>


    &lt;var: #carryOverflowMask type: &#39;unsigned int&#39;&gt;<br>    &lt;var: #carryOverflow type: &#39;unsigned int&#39;&gt;<br>    &lt;var: #carry type: &#39;unsigned int&#39;&gt;<br>

    &lt;var: #sum type: &#39;unsigned int&#39;&gt;<br>    one := 1.<br>    componentMask := one&lt;&lt;nBits-1.<br>    carryOverflowMask := one&lt;&lt;(nBits*nParts)-1.<br>    carryOverflowMask := carryOverflowMask//componentMask&lt;&lt;(nBits-1).<br>


<div>    sum := word1.<br>    sum := sum + word2.<br></div><div>    carry := (word1 bitXor: word2) bitXor: sum.</div><div>

    carryOverflow := ((word1 bitAnd: word2) bitOr: ((word1 bitOr: word2) bitAnd: carry)) bitAnd: carryOverflowMask.<br>    ^sum-(carryOverflow&lt;&lt;1) bitOr: carryOverflow&gt;&gt;(nBits-1) * componentMask<br></div><br>


</div><div>Maybe some good soul may help reducing # ops further, but we already have a branch free sum.<br></div><div>I did no performance test.<br></div><div>Does one have good BitBlt benchmark?<br></div><div><br></div>


</div></div></div></div>

</blockquote></div><br></div></div></div><div class="gmail_extra">Attached is what I could come with minimal ops...<br></div><div class="gmail_extra">The carry overflow can be written with this table<br><br></div><div class="gmail_extra">


0 0 0 0 1 1 1 1 word1<br></div><div class="gmail_extra">0 0 1 1 0 0 1 1 word2<br></div><div class="gmail_extra">0 1 0 1 0 1 0 1 carry<br>-------------------<br>0 0 0 1 0 1 0 1 next carry if at least 2 out of 3</div><div class="gmail_extra">


0 1 1 0 1 0 0 1 sum=word1+word2 (with carry)<br></div><div class="gmail_extra">0 0 0 1 0 1 x x (word1 bitOr: word2) bitAnd: sum bitInvert<br></div><div class="gmail_extra">0 0 0 0 0 0 1 1 (word1 bitAnd: word2)<br><br></div>


<div class="gmail_extra">The bit masks are not computed when depths are known constants<br></div></div></blockquote><div><br></div></div></div><div>Arghh, I posted too fast, it was bitInvert32 because that&#39;s the only thing that the CCodeGenerator understand by now.<br>


</div><div>Hmm this is wrong, ~x works for other int types than int32/uint32, but well...<br> <br></div></div><br></div></div>

</blockquote></div><br></div>