[Vm-dev] Primitives, interpreterProxy, function calls, lto

Eliot Miranda eliot.miranda at gmail.com
Fri Apr 20 17:34:31 UTC 2018


Hi Levente,

On Wed, Apr 18, 2018 at 2:54 PM, Levente Uzonyi <leves at caesar.elte.hu>
wrote:

>
> Hi All,
>
> I'm in the process of rewriting the primitives of MiscPrimitivePlugin
> (mostly done, I'm stuck with the tests). My main goal was to add the
> necessary checks to make these primitives safer.
> But while I was doing that, I decided to optimize the code a bit by
> getting rid of a few repeated function calls, notably stackValue() calls,
> because those are pretty much the only ones that are called more than once
> with the same argument from these primitives.
>
> This kind of rewrite resulted in a small but measurable speedup (~5-15%),
> despite the new checks being in place.
>

This is cool.


> Unfortunately, the generated assembly code for almost all interpreterProxy
> methods will be a function call (callq).
> IMHO the best solution would be if these calls were just macros, so that
> the compiler could optimize them away, but I don't know how to achieve that.
>

Look at
VMPluginCodeGenerator>>#generateInterpreterProxyFunctionDereference:on:indent:.
This has to generate code to
- declare interpreterProxy functions imported from the VM (for builtin
plugins)
- declare function pointers for
- assign the function pointers from the interpreterProxy in the
plugin's setInterpreter routine.

So you would have to add a third option for builtin plugins that declares
those functions as macros, or generates them as local static inlined
functions.
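
A minimal C sketch of the difference between the function-pointer form and
the "local static inlined function" option, for illustration only. Every
name here (sketchStack, SketchProxy, setInterpreterSketch, stackValueInlined,
sumTopThreeInlined) is hypothetical and is not what VMPluginCodeGenerator
actually emits; the body of the accessor assumes stackValue is a plain
indexed load from the stack pointer.

```c
#include <stdint.h>

typedef intptr_t sqInt;

/* Hypothetical stand-in for the interpreter's stack; not real VM state. */
static sqInt sketchStack[3] = { 11, 22, 33 };
static sqInt *sketchStackPointer = sketchStack;

/* Out-of-line accessor: each use from a plugin is a real call. */
sqInt stackValueOutOfLine(sqInt offset)
{
    return sketchStackPointer[offset];
}

/* Function-pointer form: the plugin keeps a pointer that is copied out of
   the proxy structure in its setInterpreter routine. */
struct SketchProxy { sqInt (*stackValue)(sqInt offset); };
static sqInt (*stackValuePtr)(sqInt offset);

static int setInterpreterSketch(struct SketchProxy *proxy)
{
    stackValuePtr = proxy->stackValue;
    return stackValuePtr != 0;
}

/* Third option: a local static inline function. The compiler sees the body
   in the plugin's own translation unit and is free to reduce each use to a
   load, without needing link-time optimization. */
static inline sqInt stackValueInlined(sqInt offset)
{
    return sketchStackPointer[offset];
}

/* A primitive-like caller that reads three stack slots. */
static sqInt sumTopThreeInlined(void)
{
    return stackValueInlined(0) + stackValueInlined(1) + stackValueInlined(2);
}
```

Whether the compiler really inlines the accessor depends on the optimization
level, so the generated assembly would still need checking.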


> So, I decided to check if the C compiler could do that for us, and yes,
> gcc starting from 4.5 supports link time optimization (lto)[1].
> The results are promising, performance is better, but it still won't
> optimize everything it could[2].
>
> I wrote a small benchmark to see how lto and my rewrite affects the
> overhead of primitive calls:
>
> | string collation |
> string := ''.
> collation := (0 to: 255) asByteArray.
> [ 1 to: 10000000 do: [ :i |
>         ByteString compare: string with: string collated: collation ] ]
> timeToRun.
>
> The results on my machine were:
> original: 16 function calls - 795 ms
> rewrite-no lto: 13 function calls - 762 ms
> rewrite-with lto: 10 function calls - 674 ms
>
> So, for now, I suggest we try to add -flto to CFLAGS and LDFLAGS when
> compiling with gcc 4.5+ to see how stable it is.
> Also, I would be happy if someone could point me to the direction to
> generate macros and use them instead of the function calls for
> interpreterProxy functions.
>

The easiest thing would be to generate them as local inlined static
functions.  But I would only bother to do this for performance-critical
primitives, for example by having plugin classes (such as
LargeIntegersPlugin and MiscPrimitivePlugin) mark themselves as
performance-critical via a class-side method.


> Levente
>
> [1] https://gcc.gnu.org/wiki/LinkTimeOptimization
> [2] Generated assembly code for comparison
>
> Without lto:
> 00000000004ed310 <primitiveCompareString>:
>   4ed310:       41 55                   push   %r13
>   4ed312:       31 ff                   xor    %edi,%edi
>   4ed314:       41 54                   push   %r12
>   4ed316:       55                      push   %rbp
>   4ed317:       53                      push   %rbx
>   4ed318:       48 83 ec 08             sub    $0x8,%rsp
>   4ed31c:       e8 2f 19 f8 ff          callq  46ec50 <stackValue>
>   4ed321:       bf 01 00 00 00          mov    $0x1,%edi
>   4ed326:       48 89 c3                mov    %rax,%rbx
>   4ed329:       e8 22 19 f8 ff          callq  46ec50 <stackValue>
>   4ed32e:       bf 02 00 00 00          mov    $0x2,%edi
>   4ed333:       48 89 c5                mov    %rax,%rbp
>   4ed336:       e8 15 19 f8 ff          callq  46ec50 <stackValue>
>   4ed33b:       48 89 df                mov    %rbx,%rdi
>   4ed33e:       49 89 c4                mov    %rax,%r12
>   4ed341:       e8 4a d4 f3 ff          callq  42a790 <isBytes>
>   4ed346:       48 85 c0                test   %rax,%rax
>   4ed349:       74 0d                   je     4ed358 <primitiveCompareString+0x48>
> ...
>
> With lto:
> 00000000004fe4f0 <primitiveCompareString.35928>:
>   4fe4f0:       41 55                   push   %r13
>   4fe4f2:       41 54                   push   %r12
>   4fe4f4:       55                      push   %rbp
>   4fe4f5:       53                      push   %rbx
>   4fe4f6:       48 83 ec 08             sub    $0x8,%rsp
>   4fe4fa:       48 8b 05 77 6b 32 00    mov    0x326b77(%rip),%rax    # 825078 <stackPointer.7294>
>   4fe501:       48 8b 18                mov    (%rax),%rbx
>   4fe504:       48 8b 68 08             mov    0x8(%rax),%rbp
>   4fe508:       4c 8b 60 10             mov    0x10(%rax),%r12
>   4fe50c:       48 89 df                mov    %rbx,%rdi
>   4fe50f:       e8 bc fe fb ff          callq  4be3d0 <isBytes>
>   4fe514:       48 85 c0                test   %rax,%rax
>   4fe517:       74 0d                   je     4fe526 <primitiveCompareString.35928+0x36>
> ...
>
> So, gcc could replace the three stackValue() calls with 4 mov
> instructions, but failed to inline isBytes() and the rest of the functions.
>


V nice.

Also look at CogMethodConstants bindingOf: #PrimCallOnSmalltalkStack.  The
idea here is that if a primitive does not have a deep call chain (only
contains local loops and simple argument checking, does not call allocation
or GC, etc) then
- the interpreter can implement it as a simple C function that takes its
arguments as parameters of the function and returns the result object, or 0
on failure (0 not being an oop), and
- the JIT can call it directly from machine code, avoiding the slow stack
switch
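
A minimal C sketch of that calling convention, for illustration only: the
primitive takes its receiver as a parameter and returns the result object,
or 0 on failure. The tagging scheme (low bit set for small integers), the
helper names, and the multiplier and mask are assumptions made for this
sketch; this is not the VM's actual mcprimHashMultiply: code.

```c
#include <stdint.h>

typedef intptr_t sqInt;

/* Hypothetical tagging: small integers carry a 1 in the low bit. The real
   encoding depends on the object memory in use. */
static int isIntegerObjectSketch(sqInt oop)
{
    return (oop & 1) != 0;
}

static sqInt integerValueOfSketch(sqInt oop)
{
    return oop >> 1;
}

static sqInt integerObjectOfSketch(sqInt value)
{
    return (sqInt)(((uintptr_t)value << 1) | 1);
}

/* The argument arrives as a C parameter rather than being fetched with
   stackValue(); the result is returned directly. Returning 0 signals
   failure, which is unambiguous because 0 is never a valid oop. */
static sqInt hashMultiplySketch(sqInt receiverOop)
{
    if (!isIntegerObjectSketch(receiverOop))
        return 0; /* fail: not a small integer */

    uint64_t value = (uint64_t)integerValueOfSketch(receiverOop);

    /* Assumed constants: multiplier 1664525, result kept to 28 bits. */
    return integerObjectOfSketch((sqInt)((value * 1664525u) & 0x0FFFFFFFu));
}
```

Because there is no stack access and no deep call chain, a function of this
shape is the kind the JIT could call straight from machine code.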

Currently we don't use this at all.  We used it only for hash multiply
(see mcprimHashMultiply:), and that has since been superseded by a JIT
primitive.
However, it could be used for several MiscPrimitivePlugin primitives.  And
of course, this is orthogonal to the improvements you're achieving through
inlining argument marshaling.

_,,,^..^,,,_
best, Eliot