Thinking about Exupery 0.14

bryce at kampjes.demon.co.uk
Fri Dec 21 21:00:46 UTC 2007


Igor Stasenko writes:
 > 
 > I suspect that main bottleneck in largeExplorers is not
 > compiled/bytecode code, but memory allocations and GC.
 > So, i doubt that you can gain any performance increase here.
 > 

Below are the raw numbers. They are from largeExplorers, but with the
profiling compiler turned up to compile a bit more code. About 60% of
the time is going into the interpreter, compiled code, and primitives
that should be natively compiled. That's enough time to provide a
decent speed improvement. Normally about 70% is spent in the
interpreter. The GC is probably only consuming about 5% of the time.

CPU: AMD64 processors, speed 1000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 1000000
Counted LS_BUFFER_FULL events (LS Buffer 2 Full) with a unit mask of 0x00 (No unit mask) count 100000
Counted RETIRED_BRANCHES_MISPREDICTED events (Retired branches mispredicted) with a unit mask of 0x00 (No unit mask) count 100000
Counted RETIRED_INSNS events (Retired instructions (includes exceptions, interrupts, re-syncs)) with a unit mask of 0x00 (No unit mask) count 1000000
samples  %        samples  %        samples  %        samples  %        image name               app name                 symbol name
1792637  57.4476  65779    12.9048  391686   84.8932  1970781  58.4824  squeak                   squeak                   interpret
223739    7.1700  150       0.0294  211       0.0457  376848   11.1829  BitBltPlugin             BitBltPlugin             alphaBlendwith
110588    3.5439  48505     9.5159  1302      0.2822  104008    3.0864  BitBltPlugin             BitBltPlugin             copyBits
76361     2.4471  124873   24.4982  3679      0.7974  64529     1.9149  libc-2.4.so              libc-2.4.so              (no symbols)
65089     2.0859  91160    17.8842  2155      0.4671  12648     0.3753  no-vmlinux               no-vmlinux               (no symbols)
60351     1.9340  51072    10.0195  2297      0.4978  31543     0.9360  anon (tgid:6681 range:0xb1c0d000-0xb7b6c000) squeak                   (no symbols)
52940     1.6965  5        9.8e-04  1632      0.3537  82896     2.4599  B2DPlugin                B2DPlugin                fillSpanfromto
45634     1.4624  12633     2.4784  2849      0.6175  59950     1.7790  BitBltPlugin             BitBltPlugin             copyLoopPixMap
39297     1.2593  1        2.0e-04  161       0.0349  12604     0.3740  squeak                   squeak                   sweepPhase
34504     1.1057  31        0.0061  3079      0.6673  10196     0.3026  squeak                   squeak                   lookupMethodInClass
31797     1.0190  19        0.0037  3408      0.7386  39560     1.1739  squeak                   squeak                   markAndTrace
28467     0.9123  11        0.0022  1757      0.3808  15187     0.4507  squeak                   squeak                   updatePointersInRangeFromto
28316     0.9074  197       0.0386  2484      0.5384  22647     0.6720  BitBltPlugin             BitBltPlugin             loadBitBltFromwarping
26469     0.8482  7         0.0014  739       0.1602  39972     1.1862  squeak                   squeak                   finalizeReference
26380     0.8454  15        0.0029  342       0.0741  38994     1.1571  squeak                   squeak                   updatePointersInRootObjectsFromto
24235     0.7766  727       0.1426  522       0.1131  29049     0.8620  squeak                   squeak                   exuperyIsNativeContext
17343     0.5558  7         0.0014  790       0.1712  9012      0.2674  squeak                   squeak                   positive32BitValueOf
15935     0.5107  5546      1.0880  336       0.0728  22525     0.6684  squeak                   squeak                   allocateheaderSizeh1h2h3doFillwith
15559     0.4986  2712      0.5321  1191      0.2581  25552     0.7582  BitBltPlugin             BitBltPlugin             pixPaintwith
14371     0.4605  11        0.0022  3301      0.7155  19116     0.5673  squeak                   squeak                   commonAt
13348     0.4278  466       0.0914  383       0.0830  6048      0.1795  squeak                   squeak                   lookupSelectorclass
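
As a rough cross-check of the GC estimate, the cycle percentages of
the rows that look like garbage collector routines can be added up.
Which symbols count as GC is an assumption made here from their names;
the profile itself does not label them. A minimal sketch:

```python
# Cycle-sample percentages (first % column) copied from the table above.
# Treating these five symbols as the garbage collector is an assumption
# based on their names, not something the profile states.
gc_cycle_percent = {
    "sweepPhase": 1.2593,
    "markAndTrace": 1.0190,
    "updatePointersInRangeFromto": 0.9123,
    "finalizeReference": 0.8482,
    "updatePointersInRootObjectsFromto": 0.8454,
}

gc_total = sum(gc_cycle_percent.values())
print(f"GC share of cycles: {gc_total:.2f}%")  # about 4.88%
```

Allocation (allocateheaderSizeh1h2h3doFillwith) is left out of the
sum; including it would add roughly another half a percent.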

The anon block is the native code. What's interesting is that its
instructions per clock is about 0.5, while the interpreter's
instructions per clock is a little over one. The native code has fewer
branch mispredicts but much more memory traffic. About 8% of the time
the native code has the load-store unit's buffer full and is probably
stalled waiting for a memory request to finish.
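
The figures above can be reproduced from the sample counts in the
table. The sketch below scales each sample count by the "count" value
on its "Counted ..." line and then takes ratios. The function and
constant names are made up for illustration, and reading one
LS_BUFFER_FULL event as one cycle with the buffer full is an
assumption, not something the profile guarantees.

```python
# Sample counts copied from the table above, in column order:
# (cycles, ls_buffer_full, mispredicted_branches, retired_instructions)
SAMPLES = {
    "interpret": (1792637, 65779, 391686, 1970781),
    "native (anon)": (60351, 51072, 2297, 31543),
}

# Events represented by one sample, from the "Counted ..." header lines.
CYCLES_PER_SAMPLE = 1_000_000
LS_FULL_PER_SAMPLE = 100_000
MISPREDICTS_PER_SAMPLE = 100_000
INSNS_PER_SAMPLE = 1_000_000


def derived_metrics(cycles, ls_full, mispredicts, insns):
    """Turn raw sample counts into per-cycle and per-instruction ratios."""
    total_cycles = cycles * CYCLES_PER_SAMPLE
    total_insns = insns * INSNS_PER_SAMPLE
    return {
        # Instructions retired per unhalted clock cycle.
        "ipc": total_insns / total_cycles,
        # Assumption: each LS_BUFFER_FULL event is one cycle with the buffer full.
        "ls_full_per_cycle": ls_full * LS_FULL_PER_SAMPLE / total_cycles,
        "mispredicts_per_1000_insns":
            1000 * mispredicts * MISPREDICTS_PER_SAMPLE / total_insns,
    }


for name, counts in SAMPLES.items():
    m = derived_metrics(*counts)
    print(f"{name:14s} IPC {m['ipc']:.2f}  "
          f"LS buffer full {m['ls_full_per_cycle']:.1%}  "
          f"mispredicts/1000 insns {m['mispredicts_per_1000_insns']:.1f}")
```

For the native code this gives an IPC of about 0.52 and the LS buffer
full on about 8.5% of cycles; for the interpreter, an IPC of about 1.10
and the LS buffer full on well under 1% of cycles.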

Based on the profiling I've done, I'm fairly confident that one of the
reasons the macro benchmarks often fail to show a performance
improvement on an Athlon 64 is excess spill code causing too much
memory traffic. The register allocator is not handling heavy register
pressure well, and I doubt the spill heuristics are ideal for larger
methods.

Bryce

