[Vm-dev] Primitives, interpreterProxy, function calls, lto

Wed Apr 18 21:54:04 UTC 2018

Hi All,

I'm in the process of rewriting the primitives of MiscPrimitivePlugin 
(mostly done, but I'm stuck on the tests). 
My main goal was to add the necessary checks to make these primitives 
safer.
But while I was doing that, I decided to optimize the code a bit by 
getting rid of a few repeated function calls, notably stackValue() calls, 
because those are pretty much the only ones that are called more than 
once with the same argument from these primitives.
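
As a minimal sketch of the kind of change meant here (not the actual plugin 
code): the stack, the call counter and the function names below are 
hypothetical stand-ins, and stackValue() only mimics the proxy function of 
the same name. The point is that fetching a stack slot once and reusing the 
local saves one call per repeated use.

```c
#include <assert.h>

typedef long sqInt;                  /* stand-in for the VM's word type */

sqInt fakeStack[3] = { 30, 20, 10 }; /* hypothetical stack, index 0 is the top */
int callCount = 0;                   /* counts accessor calls, for illustration only */

/* Hypothetical stand-in for the proxy's stackValue(offset). */
sqInt stackValue(sqInt offset)
{
    callCount++;
    return fakeStack[offset];
}

/* Before: the same slot is fetched twice, so two calls are made. */
sqInt sumTopTwiceRepeated(void)
{
    return stackValue(0) + stackValue(0);
}

/* After: the slot is fetched once and the local is reused. */
sqInt sumTopTwiceCached(void)
{
    sqInt top = stackValue(0);
    return top + top;
}
```

Both variants return the same value; only the number of stackValue() calls 
differs.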

This kind of rewrite resulted in a small but measurable speedup (~5-15%), 
despite the new checks being in place.

Unfortunately, in the generated assembly code almost every interpreterProxy 
method is still a real function call (callq).
IMHO the best solution would be if these calls were just macros, so that 
the compiler could optimize them away, but I don't know how to achieve 
that.
So, I decided to check if the C compiler could do that for us, and yes, 
gcc starting from 4.5 supports link time optimization (lto)[1].
The results are promising, performance is better, but it still won't 
optimize everything it could[2].
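
To make the macro idea above concrete, here is a rough sketch. It assumes 
that stackValue(n) boils down to reading the n-th word from the stack 
pointer and that the stack pointer is visible to the code using the macro; 
STACK_VALUE, stackValueFn and demoStack are hypothetical names, not 
existing VM definitions.

```c
#include <assert.h>

typedef long sqInt;                  /* stand-in for the VM's word type */

sqInt demoStack[3] = { 11, 22, 33 }; /* hypothetical stack contents */
sqInt *stackPointer = demoStack;     /* assumed to point at the top of the stack */

/* Function form: without lto, a caller in another file has to emit a call. */
sqInt stackValueFn(sqInt offset)
{
    return stackPointer[offset];
}

/* Macro form: expands at the call site into a plain memory load. */
#define STACK_VALUE(offset) (stackPointer[(offset)])
```

Both forms read the same slot; the macro simply gives the compiler the 
load directly, so there is nothing left to inline.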

I wrote a small benchmark to see how lto and my rewrite affect the 
overhead of primitive calls:

| string collation |
string := ''.
collation := (0 to: 255) asByteArray.
[ 1 to: 10000000 do: [ :i |
 	ByteString compare: string with: string collated: collation ] ] timeToRun.

The results on my machine were:
original: 16 function calls - 795 ms
rewrite-no lto: 13 function calls - 762 ms
rewrite-with lto: 10 function calls - 674 ms

So, for now, I suggest we try to add -flto to CFLAGS and LDFLAGS when 
compiling with gcc 4.5+ to see how stable it is.
Also, I would be happy if someone could point me in the right direction for 
generating macros and using them instead of the function calls for 
interpreterProxy functions.

Levente

[1] https://gcc.gnu.org/wiki/LinkTimeOptimization
[2] Generated assembly code for comparison

Without lto:
00000000004ed310 <primitiveCompareString>:
   4ed310:       41 55                   push   %r13
   4ed312:       31 ff                   xor    %edi,%edi
   4ed314:       41 54                   push   %r12
   4ed316:       55                      push   %rbp
   4ed317:       53                      push   %rbx
   4ed318:       48 83 ec 08             sub    $0x8,%rsp
   4ed31c:       e8 2f 19 f8 ff          callq  46ec50 <stackValue>
   4ed321:       bf 01 00 00 00          mov    $0x1,%edi
   4ed326:       48 89 c3                mov    %rax,%rbx
   4ed329:       e8 22 19 f8 ff          callq  46ec50 <stackValue>
   4ed32e:       bf 02 00 00 00          mov    $0x2,%edi
   4ed333:       48 89 c5                mov    %rax,%rbp
   4ed336:       e8 15 19 f8 ff          callq  46ec50 <stackValue>
   4ed33b:       48 89 df                mov    %rbx,%rdi
   4ed33e:       49 89 c4                mov    %rax,%r12
   4ed341:       e8 4a d4 f3 ff          callq  42a790 <isBytes>
   4ed346:       48 85 c0                test   %rax,%rax
   4ed349:       74 0d                   je     4ed358 <primitiveCompareString+0x48>
...

With lto:
00000000004fe4f0 <primitiveCompareString.35928>:
   4fe4f0:       41 55                   push   %r13
   4fe4f2:       41 54                   push   %r12
   4fe4f4:       55                      push   %rbp
   4fe4f5:       53                      push   %rbx
   4fe4f6:       48 83 ec 08             sub    $0x8,%rsp
   4fe4fa:       48 8b 05 77 6b 32 00    mov    0x326b77(%rip),%rax        # 825078 <stackPointer.7294>
   4fe501:       48 8b 18                mov    (%rax),%rbx
   4fe504:       48 8b 68 08             mov    0x8(%rax),%rbp
   4fe508:       4c 8b 60 10             mov    0x10(%rax),%r12
   4fe50c:       48 89 df                mov    %rbx,%rdi
   4fe50f:       e8 bc fe fb ff          callq  4be3d0 <isBytes>
   4fe514:       48 85 c0                test   %rax,%rax
   4fe517:       74 0d                   je     4fe526 <primitiveCompareString.35928+0x36>
...

So, gcc could replace the three stackValue() calls with four mov 
instructions, but failed to inline isBytes() and the rest of the 
functions.