# [Pkg] The Trunk: Kernel-nice.1235.mcz

commits at source.squeak.org commits at source.squeak.org
Thu May 23 19:02:53 UTC 2019

```Nicolas Cellier uploaded a new version of Kernel to project The Trunk:
http://source.squeak.org/trunk/Kernel-nice.1235.mcz

==================== Summary ====================

Name: Kernel-nice.1235
Author: nice
Time: 16 May 2019, 12:08:27.320094 am
UUID: dd74b668-0767-441e-a952-5c73af6327be
Ancestors: Kernel-nice.1234

Round accelerated arithmetic chunks to upper multiple of 4 bytes rather than to lower.

I believe that this marginally improves the performance because it's a tiny bit better to recompose a longer least significant chunk with a shorter most significant chunk.

If someone wants to confirm...
It's better to tune the threshold before benchmarking.
See LargeArithmeticBench from http://www.squeaksource.com/STEM.html and http://smallissimo.blogspot.com/2019/05 blog for details.

=============== Diff against Kernel-nice.1234 ===============

Item was changed:
----- Method: LargePositiveInteger>>digitDiv21: (in category 'private') -----
digitDiv21: anInteger
"This is part of the recursive division algorithm from Burnikel - Ziegler
Divide a two limbs receiver by 1 limb dividend
Each limb is decomposed in two halves of p bytes (8*p bits)
so as to continue the recursion"

| p qr1 qr2 |
+ 	"split in two parts, rounded to upper multiple of 4"
+ 	p := (anInteger digitLength + 7 bitShift: -3) bitShift: 2.
- 	p := (anInteger digitLength + 1 bitShift: -1) bitClear: 2r11.
p < self class thresholdForDiv21 ifTrue: [^(self digitDiv: anInteger neg: false) collect: #normalize].
qr1 := (self butLowestNDigits: p) digitDiv32: anInteger.
qr2 := (self lowestNDigits: p) + (qr1 last bitShift: 8*p) digitDiv32: anInteger.
qr2 at: 1 put: (qr2 at: 1) + ((qr1 at: 1) bitShift: 8*p).
^qr2!

Item was changed:
----- Method: LargePositiveInteger>>digitDiv32: (in category 'private') -----
digitDiv32: anInteger
"This is part of the recursive division algorithm from Burnikel - Ziegler
Divide 3 limb (a2,a1,a0) by 2 limb (b1,b0).
Each limb is made of p bytes (8*p bits).
This step transforms the division problem into multiplication
It must use a fast multiplyByInteger: to be worth the overhead costs."

| a2 b1 d p q qr r |
+ 	"split in two parts, rounded to upper multiple of 4"
+ 	p :=(anInteger digitLength + 7 bitShift: -3) bitShift: 2.
- 	p :=(anInteger digitLength + 1 bitShift: -1) bitClear: 2r11.
(a2 := self butLowestNDigits: 2*p)
< (b1 := anInteger butLowestNDigits: p)
ifTrue:
[qr := (self butLowestNDigits: p) digitDiv21: b1.
q := qr first.
r := qr last]
ifFalse:
[q := (1 bitShift: 8*p) - 1.
r := (self butLowestNDigits: p) - (b1 bitShift: 8*p) + b1].
d := q * (anInteger lowestNDigits: p).
r := (self lowestNDigits: p) + (r bitShift: 8*p) - d.
[r < 0]
whileTrue:
[q := q - 1.
r := r + anInteger].
^Array with: q with: r
!

Item was changed:
----- Method: LargePositiveInteger>>digitMul22: (in category 'private') -----
digitMul22: anInteger
"Multiply after decomposing each operand in two parts, using Karatsuba algorithm.
Karatsuba perform only 3 multiplications, leading to a cost O(n^3 log2)
asymptotically better than super O(n^2) for large number of digits n.
See https://en.wikipedia.org/wiki/Karatsuba_algorithm"

| half xLow xHigh yLow yHigh low mid high |
+ 	"split each in two parts, rounded to upper multiple of 4"
+ 	half := (anInteger digitLength + 7 bitShift: -3) bitShift: 2.
- 	"Divide each integer in two halves"
- 	half := (anInteger digitLength + 1 bitShift: -1)  bitClear: 2r11.
xLow := self lowestNDigits: half.
xHigh := self butLowestNDigits: half.
yLow := anInteger lowestNDigits: half.
yHigh := anInteger butLowestNDigits: half.

"Karatsuba trick: perform with 3 multiplications instead of 4"
low := xLow multiplyByInteger: yLow.
high := xHigh multiplyByInteger: yHigh.
mid := high + low + (xHigh - xLow multiplyByInteger: yLow - yHigh).

"Sum the parts of decomposition"
^(high isZero
ifTrue: [low]
ifFalse: [(high bitShift: 16*half)
+ (mid bitShift: 8*half)!

Item was changed:
----- Method: LargePositiveInteger>>digitMul23: (in category 'private') -----
digitMul23: anInteger
"Multiply after decomposing the receiver in 2 parts, and multiplicand in 3 parts.
Only invoke when anInteger digitLength between: 3/2 and 5/2 self digitLength.
This is a variant of Toom-Cook algorithm (see digitMul33:)"

| half x1 x0 y2 y1 y0 y20 z3 z2 z1 z0 |
+ 	"divide self in 2 and operand in 3 parts, rounded to upper multiple of 4"
+ 	half := ( self digitLength + 7 bitShift: -3) bitShift: 2.
- 	"divide self in 2 and operand in 3 parts"
- 	half := ( self digitLength + 1 bitShift: -1) bitClear: 2r11.
x1 := self butLowestNDigits: half.
x0 := self lowestNDigits: half.
y2 := anInteger butLowestNDigits: half * 2.
y1 := anInteger copyDigitsFrom: half + 1 to: half * 2.
y0 := anInteger lowestNDigits: half.

"Toom trick: 4 multiplications instead of 6"
y20 := y2 + y0.
z3 := x1 multiplyByInteger: y2.
z2 := x0 - x1 multiplyByInteger: y20 - y1.
z1 := x0 + x1 multiplyByInteger: y20 + y1.
z0 := x0 multiplyByInteger: y0.

"Sum the parts of decomposition"
^z0 + ((z1 - z2 bitShift: -1) - z3 bitShift: 8*half)
+ (((z1 + z2 bitShift: -1) - z0) + (z3 bitShift: 8*half) bitShift: 16 * half)!

Item was changed:
----- Method: LargePositiveInteger>>digitMul33: (in category 'private') -----
digitMul33: anInteger
"Multiply after decomposing each operand in 3 parts, using a Toom-Cooke algorithm.
Toom-Cooke is a generalization of Karatsuba divide and conquer algorithm.
See https://en.wikipedia.org/wiki/Toom%E2%80%93Cook_multiplication
Use a Bodrato-Zanoni variant for the choice of interpolation points and matrix inversion
See What about Toom-Cook matrices optimality? - Marco Bodrato, Alberto Zanoni - Oct. 2006

| third x2 x1 x0 y2 y1 y0 y20 z4 z3 z2 z1 z0 x20 |
+ 	"divide both operands in 3 parts, rounded to upper multiple of 4"
+ 	third := anInteger digitLength + 11 // 12 bitShift: 2.
- 	"divide both operands in 3 parts"
- 	third := anInteger digitLength + 2 // 3 bitClear: 2r11.
x2 := self butLowestNDigits: third * 2.
x1 := self copyDigitsFrom: third + 1 to: third * 2.
x0 := self lowestNDigits: third.
y2 := anInteger butLowestNDigits: third * 2.
y1 := anInteger copyDigitsFrom: third + 1 to: third * 2.
y0 := anInteger lowestNDigits: third.

"Toom-3 trick: 5 multiplications instead of 9"
z0 := x0 multiplyByInteger: y0.
z4 := x2 multiplyByInteger: y2.
x20 := x2 + x0.
y20 := y2 + y0.
z1 := x20 + x1 multiplyByInteger: y20 + y1.
x20 := x20 - x1.
y20 := y20 - y1.
z2 := x20 multiplyByInteger: y20.
z3 := (x20 + x2 bitShift: 1) - x0 multiplyByInteger: (y20 + y2 bitShift: 1) - y0.

"Sum the parts of decomposition"
z3 := z3 - z1 quo: 3.
z1 := z1 - z2 bitShift: -1.
z2 := z2 - z0.

z3 := (z2 - z3 bitShift: -1) + (z4 bitShift: 1).
z2 := z2 + z1 - z4.
z1 := z1 - z3.
^z0 + (z1 bitShift: 8*third) + (z2 bitShift: 16*third) + (z3 + (z4 bitShift: 8*third) bitShift: 24*third)!

Item was changed:
----- Method: LargePositiveInteger>>digitMulSplit: (in category 'private') -----
digitMulSplit: anInteger
"multiply digits when self and anInteger have not well balanced digitlength.
in this case, it is better to split the largest (anInteger) in several parts and recompose"

| xLen yLen split q r high mid low sizes |
yLen := anInteger digitLength.
xLen := self digitLength.
+ 	"divide in about 1.5 xLen, rounded to upper multiple of 4"
+ 	split := (xLen * 3 + 7 bitShift: -3) bitShift: 2.
- 	split := (xLen * 3 + 2 bitShift: -1) bitClear: 2r11.

"Arrange to sum non overlapping parts"
q := yLen // split.
q < 3 ifTrue: [^(0 to: yLen - 1 by: split) detectSum: [:yShift | (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split)) bitShift: 8 * yShift]].
r := yLen \\ split.
"allocate enough bytes, but not too much, in order to minimise normalize cost;
we could allocate xLen + yLen for each one as well"
sizes := {q-1*split. q*split. q*split+r}.
low  := Integer new: (sizes atWrap: 0 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
mid := Integer new:  (sizes atWrap: 1 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
high := Integer new: (sizes atWrap: 2 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
0 to: yLen - 1 by: 3 * split do: [:yShift |
low
inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
digitShiftBy: yShift].
split to: yLen - 1 by: 3 * split do: [:yShift |
mid
inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
digitShiftBy: yShift].
split * 2 to: yLen - 1 by: 3 * split do: [:yShift |
high
inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
digitShiftBy: yShift].
^high normalize + mid normalize + low normalize!

Item was changed:
----- Method: LargePositiveInteger>>squaredByFourth (in category 'private') -----
squaredByFourth
"Use a 4-way Toom-Cook divide and conquer algorithm to perform the multiplication.
See Asymmetric Squaring Formulae Jaewook Chung and M. Anwar Hasan
https://www.lirmm.fr/arith18/papers/Chung-Squaring.pdf"

| p a0 a1 a2 a3 a02 a13 s0 s1 s2 s3 s4 s5 s6 t2 t3 |
+ 	"divide in 4 parts, rounded to upper multiple of 4"
+ 	p := (self digitLength + 15 bitShift: -4) bitShift: 2.
- 	"divide in 4 parts"
- 	p := (self digitLength + 3 bitShift: -2) bitClear: 2r11.
a3 := self butLowestNDigits: p * 3.
a2 := self copyDigitsFrom: p * 2 + 1 to: p * 3.
a1 := self copyDigitsFrom: p + 1 to: p * 2.
a0 := self lowestNDigits: p.

"Toom-4 trick: 7 multiplications instead of 16"
a02 := a0 - a2.
a13 := a1 - a3.
s0 := a0 squared.
s1 := (a0 * a1) bitShift: 1.
s2 := (a02 + a13) * (a02 - a13).
s3 := ((a0 + a1) + (a2 + a3)) squared.
s4 := (a02 * a13) bitShift: 1.
s5 := (a3 * a2) bitShift: 1.
s6 := a3 squared.

"Interpolation"
t2 := s1 + s5.
t3 := (s2 + s3 + s4 bitShift: -1) - t2.
s3 := t2 - s4.
s4 := t3 - s0.
s2 := t3 - s2 - s6.

"Sum the parts of decomposition"
^s0 + (s1 bitShift: 8*p) + (s2 + (s3 bitShift: 8*p) bitShift: 16*p)
+(s4 + (s5 bitShift: 8*p) + (s6 bitShift: 16*p) bitShift: 32*p)

"
| a |
a := 770 factorial-1.
a digitLength.
[a * a - a squaredToom4 = 0] assert.
[Smalltalk garbageCollect.
[1000 timesRepeat: [a squaredToom4]] timeToRun] value /
[Smalltalk garbageCollect.
[1000 timesRepeat: [a squaredKaratsuba]] timeToRun] value asFloat
"!

Item was changed:
----- Method: LargePositiveInteger>>squaredByHalf (in category 'private') -----
squaredByHalf
"Use a divide and conquer algorithm to perform the multiplication.
Split in two parts like Karatsuba, but economize 2 additions by using asymetrical product."

| half xHigh xLow low high mid |

+ 	"Divide digits in two halves rounded tp upper multiple of 4"
+ 	half := (self digitLength + 1 bitShift: -3) bitShift: 2.
- 	"Divide digits in two halves"
- 	half := self digitLength + 1 // 2 bitClear: 2r11.
xLow := self lowestNDigits: half.
xHigh := self butLowestNDigits: half.

"eventually use karatsuba"
low := xLow squared.
high := xHigh squared.
mid := xLow multiplyByInteger: xHigh.

"Sum the parts of decomposition"
^(high bitShift: 16*half)
+ (mid bitShift: 8*half+1)

"
| a |
a := 440 factorial-1.
a digitLength.
self assert: a * a - a squaredKaratsuba = 0.
[Smalltalk garbageCollect.
[2000 timesRepeat: [a squaredKaratsuba]] timeToRun] value /
[Smalltalk garbageCollect.
[2000 timesRepeat: [a * a]] timeToRun] value asFloat
"!

Item was changed:
----- Method: LargePositiveInteger>>squaredByThird (in category 'private') -----
squaredByThird
"Use a 3-way Toom-Cook divide and conquer algorithm to perform the multiplication"

| third x0 x1 x2 x20 z0 z1 z2 z3 z4 |
+ 	"divide in 3 parts, rounded to upper multiple of 4"
+ 	third := self digitLength + 11 // 3 bitShift: 2.
- 	"divide in 3 parts"
- 	third := self digitLength + 2 // 3 bitClear: 2r11.
x2 := self butLowestNDigits: third * 2.
x1 := self copyDigitsFrom: third + 1 to: third * 2.
x0 := self lowestNDigits: third.

"Toom-3 trick: 5 multiplications instead of 9"
z0 := x0 squared.
z4 := x2 squared.
x20 := x2 + x0.
z1 := (x20 + x1) squared.
x20 := x20 - x1.
z2 := x20 squared.
z3 := ((x20 + x2 bitShift: 1) - x0) squared.

"Sum the parts of decomposition"
z3 := z3 - z1 quo: 3.
z1 := z1 - z2 bitShift: -1.
z2 := z2 - z0.

z3 := (z2 - z3 bitShift: -1) + (z4 bitShift: 1).
z2 := z2 + z1 - z4.
z1 := z1 - z3.
^z0 + (z1 bitShift: 8*third) + (z2 bitShift: 16*third) + (z3 + (z4 bitShift: 8*third) bitShift: 24*third)

"
| a |
a := 1400 factorial-1.
a digitLength.
self assert: a * a - a squaredToom3 = 0.
[Smalltalk garbageCollect.
[1000 timesRepeat: [a squaredToom3]] timeToRun] value /
[Smalltalk garbageCollect.
[1000 timesRepeat: [a squaredKaratsuba]] timeToRun] value asFloat
"!

```