[Vm-dev] [Pharo-dev] Fed up

Eliot Miranda eliot.miranda at gmail.com
Fri Jan 24 07:59:35 UTC 2020


Hi Jan,

   well, how about these? Scroll down past the definitions to see the
benchmarker. The point about benchFib is that 1 is added for every
activation, so the result is the number of calls required to evaluate it.
Hence dividing the result by the time gives activations per second. Very
convenient. The variants are a method using block recursion, a method on
Integer where the value is accessed as self, a method using perform:, and
two methods that access the value as an argument, one with a SmallInteger
receiver and the other with nil as the receiver.
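
(As a quick sanity check of that counting property, assuming the Integer
methods below are loaded: benchFib obeys b(n) = b(n-1) + b(n-2) + 1 with
b(0) = b(1) = 1, i.e. b(n) = 2 * fib(n+1) - 1, so b(42) is the 866988873
you'll see in every row of the results.)

(0 to: 6) collect: [:i| i benchFib] "=> #(1 1 3 5 9 15 25)"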


!BlockClosure methodsFor: 'benchmarks' stamp: 'eem 1/23/2020 22:55'!
benchFib: arg
    | benchFib |
    benchFib := [:n| n < 2
                    ifTrue: [1]
                    ifFalse: [(benchFib value: n - 1) + (benchFib value: n - 2) + 1]].
    ^benchFib value: arg! !

!Integer methodsFor: 'benchmarks' stamp: 'eem 1/23/2020 23:10'!
benchFib: n
    ^n < 2
        ifTrue: [1]
        ifFalse: [(self benchFib: n - 1) + (self benchFib: n - 2) + 1]! !

!Integer methodsFor: 'benchmarks' stamp: 'jm 11/20/1998 07:06'!
benchFib
    ^self < 2
        ifTrue: [1]
        ifFalse: [(self - 1) benchFib + (self - 2) benchFib + 1]! !

!Symbol methodsFor: 'benchmarks' stamp: 'eem 1/23/2020 22:57'!
benchFib: n
    ^n < 2
        ifTrue: [1]
        ifFalse: [(self perform: #benchFib: with: n - 1)
                + (self perform: #benchFib: with: n - 2) + 1]! !

!UndefinedObject methodsFor: 'benchmarks' stamp: 'eem 1/23/2020 23:09'!
benchFib: n
    ^n < 2
        ifTrue: [1]
        ifFalse: [(self benchFib: n - 1) + (self benchFib: n - 2) + 1]! !


Collect result / seconds; bigger is faster (more calls per second). Each
row of the results below is { decompiled block. milliseconds. result (the
call count). calls per second. percent of the first row's calls per
second }. Using Integer receivers involves a branch in the inline cache
check, whereas all the others have no such jump. This is a 64-bit Squeak
5.2 image on my 2.9 GHz Intel Core i9 15" 2018 MacBook Pro (thanks Doru!).
And I'm using the SistaV1 bytecode set with full blocks (no block dispatch
to reach the code for a particular block; each block is its own method).

| n collector blocks times |
n := 42.
collector := [:block| | t r |
    t := [r := block value] timeToRun.
    { t. r. (r * 1000.0 / t) rounded }].
blocks := { [n benchFib].
            [n benchFib: n].
            [nil benchFib: n].
            [#benchFib: benchFib: n].
            [[] benchFib: n] }.
times := blocks collect: collector; collect: collector. "twice to ensure measuring hot code"
(1 to: blocks size) collect:
    [:i| { (blocks at: i) decompile }, (times at: i),
        { ((times at: i) last / times first last * 100) rounded }]

{{{ [n benchFib] } .             3734 . 866988873 . 232187700 . 100 } .
 {{ [n benchFib: n] } .          3675 . 866988873 . 235915340 . 102 } .
 {{ [nil benchFib: n] } .        3450 . 866988873 . 251301123 . 108 } .
 {{ [#benchFib: benchFib: n] } . 5573 . 866988873 . 155569509 .  67 } .
 {{ [[] benchFib: n] } .         4930 . 866988873 . 175859812 .  76 } }

So... the clock is very granular (you see this at low n).
Blocks are 76% as fast as straight integers.
perform: is 67% as fast as straight integers (not too shabby; but then
integers are crawling).
Fastest is sending to a non-immediate receiver and accessing the value as
an argument.
The rest indicate that frame building is really expensive and dominates the
differences between accessing the value as the receiver or accessing it as
an argument, whether there's a jump in the inline cache check, etc.  This
confirms what we found many years ago: if the ifTrue: [^1] branch can be
done frameless, or if significant inlining can occur (as an adaptive
optimizer can achieve), then things go a lot faster.  But on the Cog
execution architecture blocks and perform: are p.d.q. relative to vanilla
sends.
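
(To make the frameless point concrete, here's the shape I mean. This is
just an illustrative sketch, not one of the benchmarked methods above; the
selector benchFibQuick: is mine, and whether the JIT actually manages to
run the quick path without building a frame is of course up to the VM.)

!Integer methodsFor: 'benchmarks'!
benchFibQuick: n
    "Quick-return base case first, so the n < 2 path can in principle
     complete before a frame is built; the recursive path is unchanged."
    n < 2 ifTrue: [^1].
    ^(self benchFibQuick: n - 1) + (self benchFibQuick: n - 2) + 1! !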

On Thu, Jan 23, 2020 at 2:35 AM Jan Vrany <jan.vrany at fit.cvut.cz> wrote:

>
> Eliot,
>
> > > 2) the lack of inline caches for #perform: (again, I am just guessing
> > > in this case).
> > >
> >
> > Right.  There is only the first-level method lookup cache, so it has
> > interpreter-like performance.  The selector and class of the receiver
> > have to be hashed and the first-level method lookup cache probed.  Way
> > slower than block activation.  I will claim though that Cog/Spur
> > OpenSmalltalk's JIT perform implementation is as good as or better than
> > any other Smalltalk VM's.  IIRC VW only machine coded/codes perform:
> > and perform:with:
>
> Do you have a benchmark for perform: et al.?  I'd be quite interested.
> Last time I was on this topic, I struggled to come up with anything
> resembling a real-workload benchmark (and whose results I could
> interpret :-)
>
> Jan
>
>

-- 
_,,,^..^,,,_
best, Eliot