[squeak-dev] The Inbox: Kernel-nice.1235.mcz

Wed May 15 22:08:37 UTC 2019

Nicolas Cellier uploaded a new version of Kernel to project The Inbox:
http://source.squeak.org/inbox/Kernel-nice.1235.mcz

==================== Summary ====================

Name: Kernel-nice.1235
Author: nice
Time: 16 May 2019, 12:08:27.320094 am
UUID: dd74b668-0767-441e-a952-5c73af6327be
Ancestors: Kernel-nice.1234

Round accelerated arithmetic chunks to upper multiple of 4 bytes rather than to lower.

I believe that this marginally improves the performance because it's a tiny bit better to recompose a longer least significant chunk with a shorter most significant chunk.

If someone wants to confirm this with measurements, it's better to tune the thresholds before benchmarking.
See LargeArithmeticBench from http://www.squeaksource.com/STEM.html and the blog at http://smallissimo.blogspot.com/2019/05 for details.
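
To illustrate the change, here is a small workspace sketch (not part of the commit) comparing the old and new split size for a two-way split. The block names roundDown and roundUp are made up for this example; the two expressions are the ones that appear in the diff for digitMul22:.

```smalltalk
| roundDown roundUp |
"old formula: half the digit length, rounded down to a multiple of 4"
roundDown := [:n | (n + 1 bitShift: -1) bitClear: 2r11].
"new formula: half the digit length, rounded up to a multiple of 4"
roundUp := [:n | (n + 7 bitShift: -3) bitShift: 2].

"For a 10-byte operand the low chunk grows from 4 to 8 bytes,
so the high chunk shrinks from 6 to 2 bytes."
self assert: (roundDown value: 10) = 4.
self assert: (roundUp value: 10) = 8.

"Both formulas agree when the digit length is a multiple of 8."
self assert: (roundDown value: 16) = (roundUp value: 16).

"Answer {digitLength. old split. new split} for a few sizes"
#(8 10 12 13 16 20) collect: [:n | {n. roundDown value: n. roundUp value: n}]
```

The low chunk is never shorter with the new formula, which matches the intent stated above: a longer least significant chunk and a shorter most significant one.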

=============== Diff against Kernel-nice.1234 ===============

Item was changed:
  ----- Method: LargePositiveInteger>>digitDiv21: (in category 'private') -----
  digitDiv21: anInteger
  	"This is part of the recursive division algorithm from Burnikel - Ziegler
  	Divide a two limbs receiver by 1 limb divisor
  	Each limb is decomposed in two halves of p bytes (8*p bits)
  	so as to continue the recursion"

  	| p qr1 qr2 |
+ 	"split in two parts, rounded to upper multiple of 4"
+ 	p := (anInteger digitLength + 7 bitShift: -3) bitShift: 2.
- 	p := (anInteger digitLength + 1 bitShift: -1) bitClear: 2r11.
  	p < self class thresholdForDiv21 ifTrue: [^(self digitDiv: anInteger neg: false) collect: #normalize].
  	qr1 := (self butLowestNDigits: p) digitDiv32: anInteger.
  	qr2 := (self lowestNDigits: p) + (qr1 last bitShift: 8*p) digitDiv32: anInteger.
  	qr2 at: 1 put: (qr2 at: 1) + ((qr1 at: 1) bitShift: 8*p).
  	^qr2!

Item was changed:
  ----- Method: LargePositiveInteger>>digitDiv32: (in category 'private') -----
  digitDiv32: anInteger
  	"This is part of the recursive division algorithm from Burnikel - Ziegler
  	Divide 3 limb (a2,a1,a0) by 2 limb (b1,b0).
  	Each limb is made of p bytes (8*p bits).
  	This step transforms the division problem into multiplication
  	It must use a fast multiplyByInteger: to be worth the overhead costs."

  	| a2 b1 d p q qr r |
+ 	"split in two parts, rounded to upper multiple of 4"
+ 	p :=(anInteger digitLength + 7 bitShift: -3) bitShift: 2.
- 	p :=(anInteger digitLength + 1 bitShift: -1) bitClear: 2r11.
  	(a2 := self butLowestNDigits: 2*p) 
  		< (b1 := anInteger butLowestNDigits: p)
  		ifTrue:
  			[qr := (self butLowestNDigits: p) digitDiv21: b1.
  			q := qr first.
  			r := qr last]
  		ifFalse:
  			[q := (1 bitShift: 8*p) - 1.
  			r := (self butLowestNDigits: p) - (b1 bitShift: 8*p) + b1].
  	d := q * (anInteger lowestNDigits: p).
  	r := (self lowestNDigits: p) + (r bitShift: 8*p) - d.
  	[r < 0]
  		whileTrue:
  			[q := q - 1.
  			r := r + anInteger].
  	^Array with: q with: r
  	!

Item was changed:
  ----- Method: LargePositiveInteger>>digitMul22: (in category 'private') -----
  digitMul22: anInteger
  	"Multiply after decomposing each operand in two parts, using Karatsuba algorithm.
  	Karatsuba performs only 3 multiplications, leading to a cost O(n^(log2 3)), about O(n^1.585),
  	asymptotically better than super O(n^2) for large number of digits n.
  	See https://en.wikipedia.org/wiki/Karatsuba_algorithm"

  	| half xLow xHigh yLow yHigh low mid high |
+ 	"split each in two parts, rounded to upper multiple of 4"
+ 	half := (anInteger digitLength + 7 bitShift: -3) bitShift: 2.
- 	"Divide each integer in two halves"
- 	half := (anInteger digitLength + 1 bitShift: -1)  bitClear: 2r11.
  	xLow := self lowestNDigits: half.
  	xHigh := self butLowestNDigits: half.
  	yLow := anInteger lowestNDigits: half.
  	yHigh := anInteger butLowestNDigits: half.

  	"Karatsuba trick: perform with 3 multiplications instead of 4"
  	low := xLow multiplyByInteger: yLow.
  	high := xHigh multiplyByInteger: yHigh.
  	mid := high + low + (xHigh - xLow multiplyByInteger: yLow - yHigh).

  	"Sum the parts of decomposition"
  	^(high isZero
  		ifTrue: [low]
  		ifFalse: [(high bitShift: 16*half)
  			inplaceAddNonOverlapping: low digitShiftBy: 0])
  		+ (mid bitShift: 8*half)!
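
The three-product recombination used by digitMul22: can be checked in a workspace with plain Integer arithmetic. This is only a sketch under stated assumptions: it uses bitAnd: and bitShift: in place of the private lowestNDigits: and butLowestNDigits: helpers, the operands are arbitrary, and the temporary mask is introduced just for this example.

```smalltalk
| x y half mask xLow xHigh yLow yHigh low high mid |
x := 100 factorial.
y := 90 factorial + 1.

"split size in bytes, rounded to upper multiple of 4 as in the diff"
half := (y digitLength + 7 bitShift: -3) bitShift: 2.
mask := (1 bitShift: 8 * half) - 1.

"low and high parts of each operand"
xLow := x bitAnd: mask.
xHigh := x bitShift: (8 * half) negated.
yLow := y bitAnd: mask.
yHigh := y bitShift: (8 * half) negated.

"Karatsuba: 3 multiplications instead of 4"
low := xLow * yLow.
high := xHigh * yHigh.
mid := high + low + ((xHigh - xLow) * (yLow - yHigh)).

"recomposition must give back the plain product"
self assert: low + (mid bitShift: 8 * half) + (high bitShift: 16 * half) = (x * y).
```

The identity holds for any value of half, so the rounding direction only affects how the work is distributed between the chunks, not the result.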

Item was changed:
  ----- Method: LargePositiveInteger>>digitMul23: (in category 'private') -----
  digitMul23: anInteger
  	"Multiply after decomposing the receiver in 2 parts, and multiplicand in 3 parts.
  	Only invoke when anInteger digitLength between: 3/2 and 5/2 self digitLength.
  	This is a variant of Toom-Cook algorithm (see digitMul33:)"

  	| half x1 x0 y2 y1 y0 y20 z3 z2 z1 z0 |
+ 	"divide self in 2 and operand in 3 parts, rounded to upper multiple of 4"
+ 	half := ( self digitLength + 7 bitShift: -3) bitShift: 2.
- 	"divide self in 2 and operand in 3 parts"
- 	half := ( self digitLength + 1 bitShift: -1) bitClear: 2r11.
  	x1 := self butLowestNDigits: half.
  	x0 := self lowestNDigits: half.
  	y2 := anInteger butLowestNDigits: half * 2.
  	y1 := anInteger copyDigitsFrom: half + 1 to: half * 2.
  	y0 := anInteger lowestNDigits: half.

  	"Toom trick: 4 multiplications instead of 6"
  	y20 := y2 + y0.
  	z3 := x1 multiplyByInteger: y2.
  	z2 := x0 - x1 multiplyByInteger: y20 - y1.
  	z1 := x0 + x1 multiplyByInteger: y20 + y1.
  	z0 := x0 multiplyByInteger: y0.

  	"Sum the parts of decomposition"
  	^z0 + ((z1 - z2 bitShift: -1) - z3 bitShift: 8*half)
  		+ (((z1 + z2 bitShift: -1) - z0) + (z3 bitShift: 8*half) bitShift: 16 * half)!

Item was changed:
  ----- Method: LargePositiveInteger>>digitMul33: (in category 'private') -----
  digitMul33: anInteger
  	"Multiply after decomposing each operand in 3 parts, using a Toom-Cook algorithm.
  	Toom-Cook is a generalization of Karatsuba divide and conquer algorithm.
  	See https://en.wikipedia.org/wiki/Toom%E2%80%93Cook_multiplication
  	Use a Bodrato-Zanoni variant for the choice of interpolation points and matrix inversion
  	See What about Toom-Cook matrices optimality? - Marco Bodrato, Alberto Zanoni - Oct. 2006
  	http://www.bodrato.it/papers/WhatAboutToomCookMatricesOptimality.pdf"

  	| third x2 x1 x0 y2 y1 y0 y20 z4 z3 z2 z1 z0 x20 |
+ 	"divide both operands in 3 parts, rounded to upper multiple of 4"
+ 	third := anInteger digitLength + 11 // 12 bitShift: 2.
- 	"divide both operands in 3 parts"
- 	third := anInteger digitLength + 2 // 3 bitClear: 2r11.
  	x2 := self butLowestNDigits: third * 2.
  	x1 := self copyDigitsFrom: third + 1 to: third * 2.
  	x0 := self lowestNDigits: third.
  	y2 := anInteger butLowestNDigits: third * 2.
  	y1 := anInteger copyDigitsFrom: third + 1 to: third * 2.
  	y0 := anInteger lowestNDigits: third.

  	"Toom-3 trick: 5 multiplications instead of 9"
  	z0 := x0 multiplyByInteger: y0.
  	z4 := x2 multiplyByInteger: y2.
  	x20 := x2 + x0.
  	y20 := y2 + y0.
  	z1 := x20 + x1 multiplyByInteger: y20 + y1.
  	x20 := x20 - x1.
  	y20 := y20 - y1.
  	z2 := x20 multiplyByInteger: y20.
  	z3 := (x20 + x2 bitShift: 1) - x0 multiplyByInteger: (y20 + y2 bitShift: 1) - y0.

  	"Sum the parts of decomposition"
  	z3 := z3 - z1 quo: 3.
  	z1 := z1 - z2 bitShift: -1.
  	z2 := z2 - z0.

  	z3 := (z2 - z3 bitShift: -1) + (z4 bitShift: 1).
  	z2 := z2 + z1 - z4.
  	z1 := z1 - z3.
  	^z0 + (z1 bitShift: 8*third) + (z2 bitShift: 16*third) + (z3 + (z4 bitShift: 8*third) bitShift: 24*third)!
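
For readers who want to convince themselves that the interpolation sequence in digitMul33: is right, here is a workspace sketch that evaluates the product at the points 0, 1, -1, -2 and infinity with plain Integer arithmetic and then applies the same interpolation steps. The block named part and the temporary mask are hypothetical helpers for this example only, and the operands are arbitrary.

```smalltalk
| x y third mask part x0 x1 x2 y0 y1 y2 z0 z1 z2 z3 z4 |
x := 300 factorial - 1.
y := 290 factorial + 1.

"chunk size in bytes, rounded to upper multiple of 4 as in the diff"
third := y digitLength + 11 // 12 bitShift: 2.
mask := (1 bitShift: 8 * third) - 1.

"part value: n value: i answers the i-th chunk of n, counting from 0"
part := [:n :i | (n bitShift: (8 * third * i) negated) bitAnd: mask].
x0 := part value: x value: 0.
x1 := part value: x value: 1.
x2 := x bitShift: (16 * third) negated.
y0 := part value: y value: 0.
y1 := part value: y value: 1.
y2 := y bitShift: (16 * third) negated.

"5 multiplications: evaluation at 0, 1, -1, -2 and infinity"
z0 := x0 * y0.
z1 := (x2 + x1 + x0) * (y2 + y1 + y0).
z2 := (x2 - x1 + x0) * (y2 - y1 + y0).
z3 := ((x2 * 4) - (x1 * 2) + x0) * ((y2 * 4) - (y1 * 2) + y0).
z4 := x2 * y2.

"interpolation, same steps as in digitMul33:"
z3 := z3 - z1 quo: 3.
z1 := z1 - z2 bitShift: -1.
z2 := z2 - z0.
z3 := (z2 - z3 bitShift: -1) + (z4 bitShift: 1).
z2 := z2 + z1 - z4.
z1 := z1 - z3.

self assert: z0 + (z1 bitShift: 8*third) + (z2 bitShift: 16*third) + (z3 + (z4 bitShift: 8*third) bitShift: 24*third) = (x * y).
```

The divisions by 2 and by 3 are exact at each step, which is why bitShift: -1 and quo: can be used even when an intermediate value is negative.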

Item was changed:
  ----- Method: LargePositiveInteger>>digitMulSplit: (in category 'private') -----
  digitMulSplit: anInteger
  	"multiply digits when self and anInteger have not well balanced digitlength.
  	in this case, it is better to split the largest (anInteger) in several parts and recompose"

  	| xLen yLen split q r high mid low sizes |
  	yLen := anInteger digitLength.
  	xLen := self digitLength.
+ 	"divide in about 1.5 xLen, rounded to upper multiple of 4"
+ 	split := (xLen * 3 + 7 bitShift: -3) bitShift: 2.
- 	split := (xLen * 3 + 2 bitShift: -1) bitClear: 2r11.

  	"Arrange to sum non overlapping parts"
  	q := yLen // split.
  	q < 3 ifTrue: [^(0 to: yLen - 1 by: split) detectSum: [:yShift | (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split)) bitShift: 8 * yShift]].
  	r := yLen \\ split.
  	"allocate enough bytes, but not too much, in order to minimise normalize cost;
  	we could allocate xLen + yLen for each one as well"
  	sizes := {q-1*split. q*split. q*split+r}.
  	low  := Integer new: (sizes atWrap: 0 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
  	mid := Integer new:  (sizes atWrap: 1 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
  	high := Integer new: (sizes atWrap: 2 - (q\\3)) + xLen neg: self negative ~~ anInteger negative.
  	0 to: yLen - 1 by: 3 * split do: [:yShift |
  		low
  			inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
  			digitShiftBy: yShift].
  	split to: yLen - 1 by: 3 * split do: [:yShift |
  		mid
  			inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
  			digitShiftBy: yShift].
  	split * 2 to: yLen - 1 by: 3 * split do: [:yShift |
  		high
  			inplaceAddNonOverlapping: (self multiplyByInteger: (anInteger copyDigitsFrom: yShift + 1 to: yShift + split))
  			digitShiftBy: yShift].
  	^high normalize + mid normalize + low normalize!

Item was changed:
  ----- Method: LargePositiveInteger>>squaredByFourth (in category 'private') -----
  squaredByFourth
  	"Use a 4-way Toom-Cook divide and conquer algorithm to perform the multiplication.
  	See Asymmetric Squaring Formulae Jaewook Chung and M. Anwar Hasan
  	https://www.lirmm.fr/arith18/papers/Chung-Squaring.pdf"

  	| p a0 a1 a2 a3 a02 a13 s0 s1 s2 s3 s4 s5 s6 t2 t3 |
+ 	"divide in 4 parts, rounded to upper multiple of 4"
+ 	p := (self digitLength + 15 bitShift: -4) bitShift: 2.
- 	"divide in 4 parts"
- 	p := (self digitLength + 3 bitShift: -2) bitClear: 2r11.
  	a3 := self butLowestNDigits: p * 3.
  	a2 := self copyDigitsFrom: p * 2 + 1 to: p * 3.
  	a1 := self copyDigitsFrom: p + 1 to: p * 2.
  	a0 := self lowestNDigits: p.

  	"Toom-4 trick: 7 multiplications instead of 16"
  	a02 := a0 - a2.
  	a13 := a1 - a3.
  	s0 := a0 squared.
  	s1 := (a0 * a1) bitShift: 1.
  	s2 := (a02 + a13) * (a02 - a13).
  	s3 := ((a0 + a1) + (a2 + a3)) squared.
  	s4 := (a02 * a13) bitShift: 1.
  	s5 := (a3 * a2) bitShift: 1.
  	s6 := a3 squared.

  	"Interpolation"
  	t2 := s1 + s5.
  	t3 := (s2 + s3 + s4 bitShift: -1) - t2.
  	s3 := t2 - s4.
  	s4 := t3 - s0.
  	s2 := t3 - s2 - s6.

  	"Sum the parts of decomposition"
  	^s0 + (s1 bitShift: 8*p) + (s2 + (s3 bitShift: 8*p) bitShift: 16*p)
  	+(s4 + (s5 bitShift: 8*p) + (s6 bitShift: 16*p) bitShift: 32*p)

  "
  | a |
  a := 770 factorial-1.
  a digitLength.
  [a * a - a squaredToom4 = 0] assert.
  [Smalltalk garbageCollect.
  [1000 timesRepeat: [a squaredToom4]] timeToRun] value /
  [Smalltalk garbageCollect.
  [1000 timesRepeat: [a squaredKaratsuba]] timeToRun] value asFloat
  "!

Item was changed:
  ----- Method: LargePositiveInteger>>squaredByHalf (in category 'private') -----
  squaredByHalf
  	"Use a divide and conquer algorithm to perform the multiplication.
  	Split in two parts like Karatsuba, but economize 2 additions by using asymmetrical product."

  	| half xHigh xLow low high mid |

+ 	"Divide digits in two halves rounded to upper multiple of 4"
+ 	half := (self digitLength + 7 bitShift: -3) bitShift: 2.
- 	"Divide digits in two halves"
- 	half := self digitLength + 1 // 2 bitClear: 2r11.
  	xLow := self lowestNDigits: half.
  	xHigh := self butLowestNDigits: half.

  	"possibly use karatsuba"
  	low := xLow squared.
  	high := xHigh squared.
  	mid := xLow multiplyByInteger: xHigh.

  	"Sum the parts of decomposition"
  	^(high bitShift: 16*half)
  		inplaceAddNonOverlapping: low digitShiftBy: 0;
  		+ (mid bitShift: 8*half+1)

  "
  | a |
  a := 440 factorial-1.
  a digitLength.
  self assert: a * a - a squaredKaratsuba = 0.
  [Smalltalk garbageCollect.
  [2000 timesRepeat: [a squaredKaratsuba]] timeToRun] value /
  [Smalltalk garbageCollect.
  [2000 timesRepeat: [a * a]] timeToRun] value asFloat
  "!

Item was changed:
  ----- Method: LargePositiveInteger>>squaredByThird (in category 'private') -----
  squaredByThird
  	"Use a 3-way Toom-Cook divide and conquer algorithm to perform the multiplication"

  	| third x0 x1 x2 x20 z0 z1 z2 z3 z4 |
+ 	"divide in 3 parts, rounded to upper multiple of 4"
+ 	third := self digitLength + 11 // 12 bitShift: 2.
- 	"divide in 3 parts"
- 	third := self digitLength + 2 // 3 bitClear: 2r11.
  	x2 := self butLowestNDigits: third * 2.
  	x1 := self copyDigitsFrom: third + 1 to: third * 2.
  	x0 := self lowestNDigits: third.

  	"Toom-3 trick: 5 multiplications instead of 9"
  	z0 := x0 squared.
  	z4 := x2 squared.
  	x20 := x2 + x0.
  	z1 := (x20 + x1) squared.
  	x20 := x20 - x1.
  	z2 := x20 squared.
  	z3 := ((x20 + x2 bitShift: 1) - x0) squared.

  	"Sum the parts of decomposition"
  	z3 := z3 - z1 quo: 3.
  	z1 := z1 - z2 bitShift: -1.
  	z2 := z2 - z0.

  	z3 := (z2 - z3 bitShift: -1) + (z4 bitShift: 1).
  	z2 := z2 + z1 - z4.
  	z1 := z1 - z3.
  	^z0 + (z1 bitShift: 8*third) + (z2 bitShift: 16*third) + (z3 + (z4 bitShift: 8*third) bitShift: 24*third)

  "
  | a |
  a := 1400 factorial-1.
  a digitLength.
  self assert: a * a - a squaredToom3 = 0.
  [Smalltalk garbageCollect.
  [1000 timesRepeat: [a squaredToom3]] timeToRun] value /
  [Smalltalk garbageCollect.
  [1000 timesRepeat: [a squaredKaratsuba]] timeToRun] value asFloat
  "!