[Vm-dev] Cog Primitive Performance

Eliot Miranda eliot.miranda at gmail.com
Mon Apr 17 23:45:47 UTC 2017


Hi All,

    Clément and I have been looking at primitive performance in the context
of string hash.  We'll say more about string hash in a later message (and
there is nothing to be alarmed about here).  But it is worth describing the
performance measurements of the various kinds of primitive dispatch in
Cog.  The following measurements are collected from a 64-bit Squeak trunk
image with a 64-bit Cog Spur VM with four different implementations of
hashMultiply, three of them primitive.  The times are collected on a 2.3
GHz Intel Core i7 MacMini.

The doit below compares the time taken to go round a loop containing 10
unrolled copies of an operation 10 million times, for a total of 100
million invocations of the operation.  The operations are
- doing nothing
- sending yourself
- invoking the SmallInteger>>hashMultiply method
- invoking a version of hashMultiply written as a normal interpreter
primitive (more on this below)
- invoking a version of hashMultiply written as a C function called from
machine code
- invoking a version of hashMultiply written as a machine code primitive

FYI, hash multiply is

hashMultiply
	| low |
	low := self bitAnd: 16383.
	^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low)
			bitAnd: 16383) * 16384))
		bitAnd: 16r0FFFFFFF

and when written as a primitive is

hashMultiply
	| low |
	low := self bitAnd: 16383.
	^(16r260D * low + (16r260D * (self bitShift: -14) + (16r0065 * low)
			* 16384))
		bitAnd: 16r0FFFFFFF

because we don't care about the multiply by 16384 overflowing; when written
as a primitive the intermediate result won't overflow into a
LargePositiveInteger.
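As an aside, both forms compute (self * 16r19660D) \\ 16r10000000, since
16r19660D = 16r0065 * 16384 + 16r260D.  Here is an illustrative C sketch of
the primitive form (not VM code; names are mine) that makes the identity
easy to check:

```c
#include <stdint.h>

/* Illustrative C version of the primitive form of hashMultiply.
   16r19660D = 16r0065 * 16384 + 16r260D, so splitting the receiver at
   bit 14 and recombining modulo 2^28 equals multiplying by 16r19660D
   modulo 2^28; the unmasked inner product merely overflows into bits
   that the final mask discards. */
uint64_t hashMultiply(uint64_t x)
{
    uint64_t low = x & 16383;
    return (0x260D * low
            + (0x260D * (x >> 14) + 0x0065 * low) * 16384)
           & 0x0FFFFFFF;
}
```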

Here is the doit:

| n them |
n := 10000000.
them := {
	[1 to: n do: [:v|
		v. v. v. v. v. v. v. v. v. v]].
	[1 to: n do: [:v|
		v yourself. v yourself. v yourself. v yourself. v yourself.
		v yourself. v yourself. v yourself. v yourself. v yourself]].
	[1 to: n do: [:v|
		v hashMultiply. v hashMultiply. v hashMultiply. v hashMultiply.
		v hashMultiply. v hashMultiply. v hashMultiply. v hashMultiply.
		v hashMultiply. v hashMultiply]].
	[1 to: n do: [:v|
		v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive.
		v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive.
		v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive.
		v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive.
		v hashMultiplyInterpreterPrimitive. v hashMultiplyInterpreterPrimitive]].
	[1 to: n do: [:v|
		v hashMultiplyPrimitive. v hashMultiplyPrimitive. v hashMultiplyPrimitive.
		v hashMultiplyPrimitive. v hashMultiplyPrimitive. v hashMultiplyPrimitive.
		v hashMultiplyPrimitive. v hashMultiplyPrimitive. v hashMultiplyPrimitive.
		v hashMultiplyPrimitive]].
	[1 to: n do: [:v|
		v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive.
		v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive.
		v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive.
		v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive.
		v hashMultiplyNativePrimitive. v hashMultiplyNativePrimitive]] }.
them := them, them.
them collect: [:b| b timeToRun]

and the results:

#(19 235 2321 2453 615 314
   19 233 2317 2473 614 313)

Writing this as a table we have

	  19/19		null
	 235/233	v yourself (remember to divide by 10 to compare a single
			send against the loop overhead)
	2321/2317	hashMultiply
	2453/2473	hashMultiplyInterpreterPrimitive
	 615/614	hashMultiplyPrimitive
	 314/313	hashMultiplyNativePrimitive
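To put these in absolute terms: each run makes 100 million invocations, so
subtracting the null loop and dividing gives rough per-invocation costs.  A
back-of-the-envelope sketch (my arithmetic, not the VM's):

```c
#include <assert.h>

/* Back-of-the-envelope: each timed run performs 100 million invocations,
   and the null loop (19 ms) approximates the loop overhead, so the cost
   per invocation in nanoseconds is (total ms - loop ms) * 1e6 / 1e8. */
double nsPerOp(double totalMs, double loopMs)
{
    return (totalMs - loopMs) * 1.0e6 / 100.0e6;
}
```

By this reckoning a plain jitted hashMultiply costs roughly 23 ns per
invocation, the interpreter primitive roughly 24.3 ns, the C-on-Smalltalk-stack
primitive roughly 6 ns, and the native primitive roughly 3 ns on this machine.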

The time for null is the cost of performing the loop arithmetic and the
back jump.  The JIT eliminates the push and pop of v, so the block is
effectively empty.

v yourself is the overhead of sending a message to a SmallInteger and
returning self, plus the loop overhead.  So the loop overhead is comparable
to the send overhead, which is unsurprising: the loop contains an inlined
add and an inlined compare.

hashMultiply is the cost of evaluating hashMultiply in Cog jitted code,
i.e. as it exists today in standard Squeak and Pharo images.

hashMultiplyInterpreterPrimitive is the cost of evaluating a version of
hashMultiply written as a "normal interpreter primitive".  Such primitives
take their receiver and arguments from the Smalltalk stack, accessing the
stack through the stackPointer global variable, and answer their result by
popping the arguments and replacing the top of stack with the result.  When
called from a jitted method, invoking an interpreter primitive involves
- writing the native stack and frame pointers to the stackPointer and
framePointer variables
- setting the native stack and frame pointers to the C stack (the C stack
at the point the interpreter was entered, either at start-up or after a
callback)
- calling the C function
- undoing the stack switch
- testing primFailCode, the variable that holds the primitive error code;
a value of 0 indicates the primitive succeeded
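In outline, such a primitive looks like the following sketch (a simplified
model, not the VM source; the names merely mirror the VM's style, and
tagging is omitted for brevity):

```c
#include <stdint.h>

typedef intptr_t sqInt;

/* Simplified model of the interpreter-primitive convention: the receiver
   is fetched from the Smalltalk stack via stackPointer, and the result
   replaces it.  The real VM works on tagged oops; plain integers here. */
sqInt *stackPointer;   /* top of the (downward-growing) Smalltalk stack */
sqInt  primFailCode;   /* 0 means the primitive succeeded */

sqInt stackValue(sqInt n) { return stackPointer[n]; }

void popThenPush(sqInt nItems, sqInt oop)
{
    stackPointer += nItems - 1;   /* pop nItems, push one result */
    stackPointer[0] = oop;
}

void primitiveHashMultiply(void)
{
    sqInt receiver = stackValue(0);
    if (receiver < 0) { primFailCode = 1; return; }  /* fail; stack untouched */
    sqInt low = receiver & 16383;
    popThenPush(1, (0x260D * low
                    + ((0x260D * (receiver >> 14) + 0x0065 * low) & 16383) * 16384)
                   & 0x0FFFFFFF);
}
```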

hashMultiplyPrimitive is an experimental implementation of the primitive
in which it is written as a C function that takes the receiver as an
argument and answers its result by returning it, setting primFailCode and
returning 0 on failure (0 is never a valid object, so it can be used to
flag failure).  The primitive runs on the Smalltalk stack, avoiding the
stack switch of a normal interpreter primitive invocation.  Because the
Smalltalk stack is divided into small pages, this form can only be used
for C functions that consume little stack (no recursion, no deep call
chains, no stack allocation, no raising exceptions, etc).
When called from a jitted method, the invocation of the C primitive involves
- saving any live caller-saved registers (i.e. the register containing the
receiver)
- marshaling the live registers (in this case only the register containing
the receiver) as argument(s) according to the C ABI
- calling the C function
- restoring any caller-saved registers
- testing the C result register and either copying it to the
receiver/result register and returning, or falling through to primitive
failure code
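A sketch of what such a primitive might look like (illustrative names, not
the experimental code itself, and assuming 64-bit Spur's SmallInteger
tagging of (value << 3) | 1):

```c
#include <stdint.h>

typedef intptr_t sqInt;

sqInt primFailCode;   /* 0 means success */

/* 64-bit Spur tags SmallIntegers with a 3-bit tag of 1. */
sqInt tagSmallInt(sqInt v)   { return (v << 3) | 1; }
sqInt untagSmallInt(sqInt o) { return o >> 3; }
int   isSmallInt(sqInt o)    { return (o & 7) == 1; }

/* The receiver arrives as a C argument and the result is the return
   value; 0 (never a valid oop) signals failure alongside primFailCode. */
sqInt hashMultiplyPrimitive(sqInt receiverOop)
{
    if (!isSmallInt(receiverOop)) { primFailCode = 1; return 0; }
    sqInt v = untagSmallInt(receiverOop);
    if (v < 0) { primFailCode = 1; return 0; }
    sqInt low = v & 16383;
    return tagSmallInt((0x260D * low
                        + ((0x260D * (v >> 14) + 0x0065 * low) & 16383) * 16384)
                       & 0x0FFFFFFF);
}
```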

hashMultiplyNativePrimitive is the standard fast implementation of a
primitive, using the JIT's built-in assembler; i.e. the primitive is
written in the abstract assembly from which the JIT generates machine
code.  Because this is effectively assembly programming we restrict its
use to relatively simple performance-critical primitives (SmallInteger,
BoxedFloat64 and SmallFloat64 arithmetic and comparison, basicNew,
basicNew:, shallowCopy, closure value, perform:[with:with:], #==, #~~,
#class, basicAt:, basicAt:put:, basicSize).

As you can see, normal interpreter primitives are woefully slow.  The cost
is so high that jitted Smalltalk code just beats the primitive, even though
the normal Smalltalk code does one more operation and involves five sends
(the JIT is clever enough to inline the bitShift: and bitAnd: operations).
Alas, most primitives in the VM are written in this style.  The style has
two advantages: first, a large number of primitives already exist in this
form; second, it makes it easy to write primitives that take a variable
number of arguments (so, for example, getVMParameters, vmParameterAt: and
vmParameterAt:put: are written as one large primitive in the VM code).

Writing the primitive as a C function that takes its parameters as
arguments and answers its result as the function's return value is much
faster.  We're therefore interested in using this scheme to reimplement
primitives such as replaceFrom:to:with:startingAt: that are
performance-critical but too complex to write as native machine code
primitives without developing ulcers, losing whatever hair color one has
left, getting psoriasis, etc.

What's perhaps surprising at first, but on reflection should not be, is
that there is still overhead in calling even a very simple C function
compared to embedding the code for the primitive directly in the jitted
machine code.  It is unsurprising because it shows that the overhead of a
send is very similar to the overhead of calling a one-argument C function.

What's inconvenient about this "C function called on the Smalltalk stack"
style of primitive is that our C call-marshalling machinery currently
handles only four arguments.  That limit means we never have to pass
arguments on the stack on ARM and WIN64, both of which have a
four-integer-register-argument calling convention, but writing
replaceFrom:to:with:startingAt: as a C function requires five arguments,
because we must supply the receiver as well.  So the C marshalling
machinery will have to be extended to pass arguments both in registers and
on the stack.  Volunteers welcome ;-)


You may guess where this conversation is going in terms of string hash.
But for the moment we can conclude that
- normal interpreter primitives should not be relied upon for
performance-critical operations
- the experimental "C function called on the Smalltalk stack" style looks
like a good way to write more complex primitives (provided we can invoke
them).  While such primitives are a little slower to invoke than machine
code primitives, the difference is soon amortized over complex operations.

_,,,^..^,,,_
best, Eliot

