[Vm-dev] Primitives, interpreterProxy, function calls, lto
Levente Uzonyi
leves at caesar.elte.hu
Wed Apr 18 21:54:04 UTC 2018
Hi All,
I'm in the process of rewriting the primitives of MiscPrimitivePlugin
(mostly done, I'm stuck with the tests).
My main goal was to add the necessary checks to make these primitives
safer.
But while I was doing that, I decided to optimize the code a bit by
getting rid of a few repeated function calls, notably stackValue() calls,
because those are pretty much the only ones that are called more than
once with the same argument from these primitives.
This kind of rewrite resulted in a small but measurable speedup (~5-15%),
despite the new checks being in place.
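To illustrate the kind of change meant here, below is a minimal C sketch.
It is not the actual plugin code: mockStack, stackValueCalls and the two
sum functions are made up for the example, and stackValue() is only a
stand-in that counts its own invocations.

```c
#include <stdint.h>

typedef intptr_t sqInt;

/* Hypothetical interpreter state: a tiny stack and a call counter. */
sqInt mockStack[3] = { 30, 20, 10 };
int stackValueCalls = 0;

/* Stand-in for the real stackValue(); counts how often it is called. */
sqInt stackValue(sqInt offset)
{
    stackValueCalls++;
    return mockStack[offset];
}

/* Before: stackValue(0) is called twice with the same argument. */
sqInt sumRepeated(void)
{
    return stackValue(0) + stackValue(1) + stackValue(0);
}

/* After: each stack slot is fetched once and kept in a local variable. */
sqInt sumCached(void)
{
    sqInt top = stackValue(0);
    sqInt next = stackValue(1);
    return top + next + top;
}
```

Both functions return the same value, but sumCached() makes one call fewer.
When stackValue() lives in another translation unit, the compiler generally
cannot remove the duplicate call on its own, so the caching has to be done
by hand.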
Unfortunately, in the generated assembly code almost every interpreterProxy
method ends up as a real function call (callq).
IMHO the best solution would be if these calls were just macros, so that
the compiler could optimize them away, but I don't know how to achieve
that.
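As a rough idea of what such a macro could look like, here is a hedged C
sketch. It assumes that stackValue(n) amounts to reading the n-th word from
a visible stackPointer variable; demoStack, stackValueMacro and
stackValueInline are hypothetical names, and the real VM may lay things out
differently.

```c
#include <stdint.h>

typedef intptr_t sqInt;

/* Hypothetical interpreter state, visible to the code that uses the macro. */
sqInt demoStack[3] = { 111, 222, 333 };
sqInt *stackPointer = demoStack;

/* Macro form: expands to a plain memory load at the point of use. */
#define stackValueMacro(offset) (stackPointer[(offset)])

/* static inline alternative: same effect, but with type checking. */
static inline sqInt stackValueInline(sqInt offset)
{
    return stackPointer[offset];
}

/* Three stack reads with no function call left for the compiler to emit. */
sqInt sumOfThreeArgs(void)
{
    return stackValueMacro(0) + stackValueMacro(1) + stackValueMacro(2);
}
```

Either form only helps where the definition and the interpreter state are
visible at compile time, which is presumably not the case for a plugin that
reaches the VM through function pointers.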
So, I decided to check if the C compiler could do that for us, and yes,
gcc starting from 4.5 supports link time optimization (lto)[1].
The results are promising, performance is better, but it still won't
optimize everything it could[2].
I wrote a small benchmark to see how lto and my rewrite affect the
overhead of primitive calls:
| string collation |
string := ''.
collation := (0 to: 255) asByteArray.
[ 1 to: 10000000 do: [ :i |
	ByteString compare: string with: string collated: collation ] ] timeToRun.
The results on my machine were:
original: 16 function calls - 795 ms
rewrite-no lto: 13 function calls - 762 ms
rewrite-with lto: 10 function calls - 674 ms
So, for now, I suggest we try to add -flto to CFLAGS and LDFLAGS when
compiling with gcc 4.5+ to see how stable it is.
Also, I would be happy if someone could point me in the right direction on
how to generate macros and use them instead of the function calls for
interpreterProxy functions.
Levente
[1] https://gcc.gnu.org/wiki/LinkTimeOptimization
[2] Generated assembly code for comparison
Without lto:
00000000004ed310 <primitiveCompareString>:
4ed310: 41 55 push %r13
4ed312: 31 ff xor %edi,%edi
4ed314: 41 54 push %r12
4ed316: 55 push %rbp
4ed317: 53 push %rbx
4ed318: 48 83 ec 08 sub $0x8,%rsp
4ed31c: e8 2f 19 f8 ff callq 46ec50 <stackValue>
4ed321: bf 01 00 00 00 mov $0x1,%edi
4ed326: 48 89 c3 mov %rax,%rbx
4ed329: e8 22 19 f8 ff callq 46ec50 <stackValue>
4ed32e: bf 02 00 00 00 mov $0x2,%edi
4ed333: 48 89 c5 mov %rax,%rbp
4ed336: e8 15 19 f8 ff callq 46ec50 <stackValue>
4ed33b: 48 89 df mov %rbx,%rdi
4ed33e: 49 89 c4 mov %rax,%r12
4ed341: e8 4a d4 f3 ff callq 42a790 <isBytes>
4ed346: 48 85 c0 test %rax,%rax
4ed349: 74 0d je 4ed358 <primitiveCompareString+0x48>
...
With lto:
00000000004fe4f0 <primitiveCompareString.35928>:
4fe4f0: 41 55 push %r13
4fe4f2: 41 54 push %r12
4fe4f4: 55 push %rbp
4fe4f5: 53 push %rbx
4fe4f6: 48 83 ec 08 sub $0x8,%rsp
4fe4fa: 48 8b 05 77 6b 32 00 mov 0x326b77(%rip),%rax # 825078 <stackPointer.7294>
4fe501: 48 8b 18 mov (%rax),%rbx
4fe504: 48 8b 68 08 mov 0x8(%rax),%rbp
4fe508: 4c 8b 60 10 mov 0x10(%rax),%r12
4fe50c: 48 89 df mov %rbx,%rdi
4fe50f: e8 bc fe fb ff callq 4be3d0 <isBytes>
4fe514: 48 85 c0 test %rax,%rax
4fe517: 74 0d je 4fe526 <primitiveCompareString.35928+0x36>
...
So, gcc could replace the three stackValue() calls with 4 mov
instructions, but failed to inline isBytes() and the rest of the
functions.