[Vm-dev] Primitives, interpreterProxy, function calls, lto
Eliot Miranda
eliot.miranda at gmail.com
Fri Apr 20 17:34:31 UTC 2018
Hi Levente,
On Wed, Apr 18, 2018 at 2:54 PM, Levente Uzonyi <leves at caesar.elte.hu>
wrote:
>
> Hi All,
>
> I'm in the process of rewriting the primitives of MiscPrimitivePlugin
> (mostly done, I'm stuck with the tests). My main goal was to add the
> necessary checks to make these primitives safer.
> But while I was doing that, I decided to optimize the code a bit by
> getting rid of a few repeated function calls, notably stackValue() calls,
> because those are pretty much the only ones that are called more than once
> with the same argument from these primitives.
>
> This kind of rewrite resulted in a small but measurable speedup (~5-15%),
> despite the new checks being in place.
>
This is cool.
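For readers following along, the kind of change described above (fetching each stack argument once instead of calling stackValue() repeatedly) might look roughly like the sketch below. Everything here is a stand-in: the toy stack, the call counter, and the two sum functions are hypothetical, and stackValue() is a local mock rather than the real interpreterProxy function.

```c
#include <assert.h>

typedef long sqInt;   /* stand-in for the VM's sqInt type (assumption) */

/* Toy operand stack, top of stack first. Purely illustrative. */
static sqInt fakeStack[4] = { 30, 20, 10, 0 };

/* Counts calls so the saving is visible. */
static int stackValueCalls = 0;

/* Mock of stackValue(offset); not the real interpreterProxy function. */
static sqInt stackValue(sqInt offset)
{
    assert(offset >= 0 && offset < 4);
    stackValueCalls++;
    return fakeStack[offset];
}

/* Before: the same argument is fetched twice, costing four calls. */
static sqInt sumRepeated(void)
{
    sqInt total = 0;
    if (stackValue(0) > 0)
        total += stackValue(0);
    if (stackValue(1) > 0)
        total += stackValue(1);
    return total;
}

/* After: each argument is fetched once into a local, costing two calls. */
static sqInt sumCached(void)
{
    sqInt a = stackValue(0);
    sqInt b = stackValue(1);
    sqInt total = 0;
    if (a > 0)
        total += a;
    if (b > 0)
        total += b;
    return total;
}
```

Both functions compute the same result; only the number of out-of-line calls differs, which is the effect the rewrite is after.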
> Unfortunately, the generated assembly code for almost all interpreterProxy
> methods will be a function call (callq).
> IMHO the best solution would be if these calls were just macros, so that
> the compiler could optimize them away, but I don't know how to achieve that.
>
Look at
VMPluginCodeGenerator>>#generateInterpreterProxyFunctionDereference:on:indent:.
This has to generate code to
- declare interpreterProxy functions imported from the VM (for builtin
plugins)
- declare function pointers for those functions (for external plugins)
- assign the function pointers from the interpreterProxy in the
plugin's setInterpreter routine.
So you would have to add a third option for builtin plugins which declared
those functions as macros, or generated them as local static inlined
functions.
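As a rough illustration of what such a third option could generate, here is a minimal C sketch. It is not the output of VMPluginCodeGenerator: the names stackValuePtr, stackValueInline and stackValueMacro are hypothetical, the toy stack stands in for VM state, and indexing the stack pointer by offset is an assumption based on the lto disassembly quoted further down.

```c
#include <assert.h>

typedef long sqInt;   /* stand-in for the VM's sqInt type (assumption) */

/* Toy VM state; a builtin plugin would see the VM's own stack pointer. */
static sqInt toyStack[3] = { 111, 222, 333 };
static sqInt *stackPointer = toyStack;

/* Existing style: an out-of-line function reached through a function
   pointer, of the kind assigned in the plugin's setInterpreter routine. */
static sqInt stackValueOutOfLine(sqInt offset)
{
    return stackPointer[offset];
}
static sqInt (*stackValuePtr)(sqInt offset) = stackValueOutOfLine;

/* Hypothetical third option (a): a local static inline function that the
   compiler can fold into the caller without needing lto. */
static inline sqInt stackValueInline(sqInt offset)
{
    return stackPointer[offset];
}

/* Hypothetical third option (b): the same accessor as a macro. */
#define stackValueMacro(offset) (stackPointer[(offset)])

/* A caller using the inline form; no call instruction is required. */
static sqInt sumOfThreeArgs(void)
{
    return stackValueInline(0) + stackValueInline(1) + stackValueInline(2);
}
```

The inline function keeps type checking and is usually the safer choice; the macro form gives the same code but evaluates its argument textually.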
> So, I decided to check if the C compiler could do that for us, and yes,
> gcc starting from 4.5 supports link time optimization (lto)[1].
> The results are promising, performance is better, but it still won't
> optimize everything it could[2].
>
> I wrote a small benchmark to see how lto and my rewrite affects the
> overhead of primitive calls:
>
> | string collation |
> string := ''.
> collation := (0 to: 255) asByteArray.
> [ 1 to: 10000000 do: [ :i |
> ByteString compare: string with: string collated: collation ] ]
> timeToRun.
>
> The results on my machine were:
> original: 16 function calls - 795 ms
> rewrite-no lto: 13 function calls - 762 ms
> rewrite-with lto: 10 function calls - 674 ms
>
> So, for now, I suggest we try to add -flto to CFLAGS and LDFLAGS when
> compiling with gcc 4.5+ to see how stable it is.
> Also, I would be happy if someone could point me in the right direction to
> generate macros and use them instead of the function calls for
> interpreterProxy functions.
>
The easiest thing would be to generate them as local inlined static
functions. But I would only bother to do this for performance-critical
primitives, for example by having plugin classes (such as
LargeIntegersPlugin and MiscPrimitivePlugin) mark themselves as
performance-critical via a class-side method.
> Levente
>
> [1] https://gcc.gnu.org/wiki/LinkTimeOptimization
> [2] Generated assembly code for comparison
>
> Without lto:
> 00000000004ed310 <primitiveCompareString>:
> 4ed310: 41 55 push %r13
> 4ed312: 31 ff xor %edi,%edi
> 4ed314: 41 54 push %r12
> 4ed316: 55 push %rbp
> 4ed317: 53 push %rbx
> 4ed318: 48 83 ec 08 sub $0x8,%rsp
> 4ed31c: e8 2f 19 f8 ff callq 46ec50 <stackValue>
> 4ed321: bf 01 00 00 00 mov $0x1,%edi
> 4ed326: 48 89 c3 mov %rax,%rbx
> 4ed329: e8 22 19 f8 ff callq 46ec50 <stackValue>
> 4ed32e: bf 02 00 00 00 mov $0x2,%edi
> 4ed333: 48 89 c5 mov %rax,%rbp
> 4ed336: e8 15 19 f8 ff callq 46ec50 <stackValue>
> 4ed33b: 48 89 df mov %rbx,%rdi
> 4ed33e: 49 89 c4 mov %rax,%r12
> 4ed341: e8 4a d4 f3 ff callq 42a790 <isBytes>
> 4ed346: 48 85 c0 test %rax,%rax
> 4ed349: 74 0d je 4ed358 <primitiveCompareString+0x48>
> ...
>
> With lto:
> 00000000004fe4f0 <primitiveCompareString.35928>:
> 4fe4f0: 41 55 push %r13
> 4fe4f2: 41 54 push %r12
> 4fe4f4: 55 push %rbp
> 4fe4f5: 53 push %rbx
> 4fe4f6: 48 83 ec 08 sub $0x8,%rsp
> 4fe4fa: 48 8b 05 77 6b 32 00 mov 0x326b77(%rip),%rax # 825078 <stackPointer.7294>
> 4fe501: 48 8b 18 mov (%rax),%rbx
> 4fe504: 48 8b 68 08 mov 0x8(%rax),%rbp
> 4fe508: 4c 8b 60 10 mov 0x10(%rax),%r12
> 4fe50c: 48 89 df mov %rbx,%rdi
> 4fe50f: e8 bc fe fb ff callq 4be3d0 <isBytes>
> 4fe514: 48 85 c0 test %rax,%rax
> 4fe517: 74 0d je 4fe526 <primitiveCompareString.35928+0x36>
> ...
>
> So, gcc could replace the three stackValue() calls with 4 mov
> instructions, but failed to inline isBytes() and the rest of the functions.
>
Very nice.
Also look at CogMethodConstants bindingOf: #PrimCallOnSmalltalkStack. The
idea here is that if a primitive does not have a deep call chain (only
contains local loops and simple argument checking, does not call allocation
or GC, etc) then
- the interpreter can implement it as a simple C function that takes its
arguments as parameters of the function and returns the result object, or 0
on failure (0 not being an oop), and
- the JIT can call it directly from machine code, avoiding the slow stack
switch.
Currently we don't use this at all. We used this only for hash multiply
(see mcprimHashMultiply:), but superseded it with a JIT primitive.
However, it could be used for several MiscPrimitivePlugin primitives. And
of course, this is orthogonal to the improvements you're achieving through
inlining argument marshaling.
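
To make the calling convention concrete, here is a minimal C sketch of a primitive written in that style. It is not the VM's mcprimHashMultiply: code. The function name mcprimToyHashMultiply, the one-bit integer tagging scheme, and the multiplier and mask are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef intptr_t sqInt;   /* stand-in for the VM's sqInt type (assumption) */

/* Toy tagging scheme (assumption): small integers carry a low tag bit of 1,
   so the value 0 can never be a valid oop and is free to signal failure. */
static int isIntegerObject(sqInt oop)
{
    return (oop & 1) == 1;
}

static sqInt integerValueOf(sqInt oop)
{
    return oop >> 1;
}

static sqInt integerObjectOf(sqInt value)
{
    return (sqInt)(((uintptr_t)value << 1) | 1);
}

/* Hypothetical direct-call primitive: the receiver arrives as a C parameter
   rather than on the Smalltalk stack, and the result oop is returned
   directly. Returning 0 reports failure. */
static sqInt mcprimToyHashMultiply(sqInt receiver)
{
    if (!isIntegerObject(receiver))
        return 0;   /* argument check failed */

    uint32_t value = (uint32_t)integerValueOf(receiver);
    /* Illustrative multiplier and 28-bit mask (assumptions). */
    uint32_t hashed = (value * 1664525u) & 0x0FFFFFFFu;
    return integerObjectOf((sqInt)hashed);
}
```

A caller, whether the interpreter or JIT-generated machine code, only has to test the return value against 0 to distinguish success from failure; no stack marshaling is involved.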
_,,,^..^,,,_
best, Eliot