Hi Eliot, (sorry for the long mail, but I think you will enjoy this
discussion; do read it)
Looking deeper into tight-loop performance to get some boost out of Sista,
I have questions and remarks.
The first thing is a comparison of 3 versions of array copies (code at the
bottom of the mail).
Version [1] : copyArray : written in plain Smalltalk
Version [2] : fastCopyArray : sista generated code (only unchecked
operations)
Version [3] : primCopyArray : simply call the machine code primitive
For small arrays everything is fine; the sista code is good. But for large
arrays I get the following results (here a copy of 1000 elements):
1) 85k/second
2) 390k/second
3) 2,000k/second
So the sista version is ~5x faster than pure Smalltalk code, but ~5x slower
than the primitive.
My guess is that the reason for this is that the hand-written primitive
uses a 7-instruction tight loop (2 additions, 1 read, 1 write, 2 jumps,
1 comparison), while the sista-generated code requires 6 temp reads, 1 temp
write and a couple of cheap instructions in addition. We discussed earlier
that register allocation might bring a 20% performance boost, but it seems
tight loops are quite important, and in this case the performance
difference is around 5x (500%).
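To make the temp-traffic point concrete, here is a hedged C sketch (these
are invented illustrative functions, not the actual generated code): one
copy loop whose state lives in registers, close to the hand-written
primitive, and one whose loop state is forced back to memory every
iteration, mimicking the extra temp reads/writes of the current
sista-generated code.

```c
#include <stddef.h>

/* Register-resident loop: the index stays in a register, so each
   iteration is roughly load, store, add(s), compare, branch --
   close to the 7-instruction hand-written primitive loop. */
static void copy_regs(long *dst, const long *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Memory-resident loop: the loop counter lives in a struct marked
   volatile, so the compiler must re-read and re-write it on every
   iteration, mimicking temps kept on the stack. */
struct loop_state { size_t i; };

static void copy_mem(long *dst, const long *src, size_t n)
{
    volatile struct loop_state s = { 0 };
    while (s.i < n) {        /* temp read for the comparison */
        size_t i = s.i;      /* temp read for the body */
        dst[i] = src[i];
        s.i = i + 1;         /* temp write */
    }
}
```

Both produce the same copy; the difference is purely in how often the loop
state round-trips through memory per iteration.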
So the big question is: should I add more macro-instructions (let's call
them stubs) to the sista extended bytecode set, such as bytesEquals, which
bulk-compares the bytes of 2 objects, since the performance difference is
bigger than expected, or should I wait for a better register allocation
algorithm?
I believe a hand-written stub will always be faster than sista-generated
code (since a couple of things can't be expressed at the bytecode level),
but based on multiple benchmarks I think that if the better register
allocator generates code with the same number of memory reads/writes
(including temp accesses on the stack), the difference should be less than
the 20% we expected.
It's not clear how many stubs would be needed. I have a prototype of array
copy; other things such as indexOf or fillObject are needed. Maybe I can
add only 3-4 for now to show massive speed-ups, as we did with
bytesEquals.
Let's note that we could try to share the stubs with the normal primitives
(primStringReplace with the array-copy stub, for example) to avoid too much
maintenance cost. The problem is that a normal primitive's complexity lies
in all the checks at the beginning, followed by a generic case (copying
without knowing anything about the operands), while sista stubs have no
checks, but their complexity is in generating better code when some
operands are constants. So it's not really the same thing.
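As a C sketch of that difference (invented names, not the actual VM code):
the normal primitive pays for validation up front and then runs one fully
generic copy, whereas a sista stub can assume the optimizer already proved
the checks and specialize, for example on a constant length.

```c
#include <stddef.h>
#include <string.h>

/* Normal-primitive shape: validate everything, then a generic copy
   that knows nothing about its operands. Returns 0 on success,
   -1 on a failed check (primitive failure). */
static int prim_replace(long *dst, size_t dst_len,
                        const long *src, size_t src_len,
                        size_t start, size_t count)
{
    if (dst == NULL || src == NULL) return -1;      /* receiver/argument checks */
    if (start + count > src_len) return -1;         /* bounds checks */
    if (count > dst_len) return -1;
    memmove(dst, src + start, count * sizeof *dst); /* generic case */
    return 0;
}

/* Sista-stub shape: no checks (already proven), and here the count is
   a compile-time constant, so the copy can be fully specialized. */
static void stub_copy4(long *dst, const long *src)
{
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}
```

The two bodies overlap, but the surrounding obligations are different
enough that sharing them is not free.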
The second question is about RTL code for tight loops. I looked into V8,
and it seems that for tight loops they try to use a backward conditional
jump instead of a forward conditional + backward unconditional. What do you
think the difference would be if I did this for array copy (the copy is
guaranteed not to be empty at this point; example below, Original/New)?
I wonder if conditional back-jumps are slower than conditional forward
jumps, or something like that (I am always confused by branch performance,
and the performance differs depending on caches, so small benchmarks may
not be relevant). What's your take on this?
And then the question is: should I generate that kind of loop from the
sista back-end (I don't really like it; I prefer to limit the number of
back-jumps, else it makes some things harder)? Otherwise I can do it in
the stubs for the most performance-critical things and ignore it in other
cases.
*Original*
instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.
jumpFinished jmpTarget: (jumpEmpty jmpTarget: cogit genPrimReturn).
Total 7 instructions in the loop
*New*
instr := cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit CmpR: startReg R: stopReg.
cogit JumpGreaterOrEqual: instr.
jumpEmpty jmpTarget: cogit genPrimReturn.
Total 6 instructions in the loop
But a conditional back-jump.
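In C terms (purely an illustration with made-up names, not the generated
code), the Original shape is a test-at-top loop and the New shape is the
rotated do-while that a copy known to be non-empty permits; the RTL
instructions are noted alongside each step.

```c
#include <stddef.h>

/* Original shape: compare + forward conditional exit at the top,
   unconditional back-jump at the bottom -- 7 instructions per trip.
   Copies src[0..stop-start] into dst[start..stop], inclusive. */
static void copy_top_test(long *dst, const long *src,
                          size_t start, size_t stop)
{
    size_t i = start, j = 0;
top:
    if (i > stop) return;   /* CmpR + JumpBelow (forward, conditional) */
    dst[i] = src[j];        /* MoveXwr + MoveR */
    i++;                    /* AddCq */
    j++;                    /* AddCq */
    goto top;               /* Jump (backward, unconditional) */
}

/* New shape: body first, then compare + backward conditional jump --
   6 instructions per trip, valid only because the loop is known to
   run at least once. */
static void copy_rotated(long *dst, const long *src,
                         size_t start, size_t stop)
{
    size_t i = start, j = 0;
    do {
        dst[i] = src[j];
        i++;
        j++;
    } while (i <= stop);    /* CmpR + JumpGreaterOrEqual (backward, conditional) */
}
```

This is exactly the loop-rotation compilers do when they can prove the
body executes at least once; the empty case must be peeled off beforehand,
which the RTL does with jumpEmpty.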
NB: Eliot, I tried incrementing the pointer instead of the index and using
0-displacement reads/writes, and it made no performance difference on
Intel. It saves a register in the loop, but we don't need it. It does make
a difference in the inlined stub, however, since other registers can be
used for other things in the method where it's inlined.
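For reference, the two addressing schemes from the NB as a C sketch
(illustrative functions, not VM code): indexed accesses with a base +
scaled-index address versus bumping the pointers themselves and using
0-displacement accesses.

```c
#include <stddef.h>

/* Indexed form: base + scaled index addressing each iteration;
   needs a register for the index. */
static void copy_indexed(long *dst, const long *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Pointer-bumping form: 0-displacement loads/stores through pointers
   incremented each trip; frees the index register, which only pays
   off when the loop is inlined and registers are scarce. */
static void copy_bumped(long *dst, const long *src, size_t n)
{
    const long *end = src + n;
    while (src < end)
        *dst++ = *src++;
}
```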
*[1]* Smalltalk code
1 to: (y - x + 1) do: [ :i |
array2 at: y2 + i put: (array at: x + i) ].
*[2]* (sista bytecodes of the tight loop)
32 <45> pushTemp: 5
33 <46> pushTemp: 6
34 <F8 F3 87> smiLessOrEqual:
37 <EF 1B> jumpFalse: 66
39 <43> pushTemp: 3
40 <45> pushTemp: 5
41 <40> pushTemp: 0
42 <45> pushTemp: 5
43 <F8 10 88> pointerAt:
46 <F8 B8 8B> pointerAt:put:
49 <D8> pop
50 <45> pushTemp: 5
51 <E1 00 E8 01> pushConstant: 1
55 <F8 D0 87> smiAdd:
58 <D5> popIntoTemp: 5
59 <E1 FF E8 DE> pushConstant: -34
63 <F8 70 97> backjumpNoInterrupt
*[3]* primitive call
array2 replaceFrom: x to: y with: array startingAt: y2
--
Clément Béra
https://clementbera.wordpress.com/
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq
Eliot Miranda uploaded a new version of VMMaker to project VM Maker:
http://source.squeak.org/VMMaker/VMMaker.oscog-eem.2308.mcz
==================== Summary ====================
Name: VMMaker.oscog-eem.2308
Author: eem
Time: 5 January 2018, 12:19:53.562181 am
UUID: 6c41bece-0d81-4a69-a677-bf6e8f275674
Ancestors: VMMaker.oscog-eem.2307
Slang: Make TParseNode>>comment; do what it implies and accept a string argument, instead of insisting clients know the internal contract is a sequence of comment strings.
Tweak the recent inlining comment change; no need to comment if what's being inlined is a named constant.
=============== Diff against VMMaker.oscog-eem.2307 ===============
Item was changed:
----- Method: TMethod>>tryToInlineMethodExpressionsIn: (in category 'inlining') -----
tryToInlineMethodExpressionsIn: aCodeGen
"Expand any (complete) inline methods sent by this method as receivers or parameters.
Answer if anything was inlined."
| sendsToInline |
sendsToInline := Dictionary new: 100.
aCodeGen
pushScope: declarations
while: [parseTree
nodesDo:
[:node|
(self inlineableFunctionCall: node in: aCodeGen) ifTrue:
[(self inlineFunctionCall: node in: aCodeGen) ifNotNil:
[:replacement|
+ (replacement isConstant
+ and: [replacement isDefine not
+ and: [replacement value isNumber
+ and: [replacement comment isNil]]]) ifTrue:
- (replacement isConstant and: [replacement value isNumber and: [replacement comment isNil]]) ifTrue:
[replacement comment: node selector].
sendsToInline at: node put: replacement]]]
unless: "Don't inline the arguments to asserts to keep the asserts readable"
[:node|
node isSend
and: [node selector == #cCode:inSmalltalk:
or: [aCodeGen isAssertSelector: node selector]]]].
sendsToInline isEmpty ifTrue:
[^false].
self replaceNodesIn: sendsToInline.
^true!
Item was changed:
----- Method: TParseNode>>comment: (in category 'accessing') -----
comment: aComment
+ "Ugh,. comment is a sequence of strings."
+ comment := aComment isString ifTrue: [{aComment}] ifFalse: [aComment]!
-
- comment := aComment !