Performance profiling results...

Scott A Crosby crosby at qwes.math.cmu.edu
Sat Sep 22 16:36:21 UTC 2001


Hello. I don't know if this has been done recently; given some of the
indications in the source code, I don't think it has.


Anyway, I have been running Squeak under a profiler that supports
line-level profiling of optimized code. I've run about an hour of Squeak
through it (about 10 seconds of CPU time), doing various normal
window-dragging, browsing -- basically development work. And I've
found a few surprises.

I'm not sure if anyone else is interested in the raw results. I can make
them available upon request.

But a few things stick out. (All of this was done under
3.1alpha-latest.) Together, the items below suggest a possible 4-10%
speedup.

--

Over 99% of the invocations of the function 'fetchClassOf' come from
'commonSend' in the interpreter (approximately 350 million calls).

The function call overhead alone is the third most expensive line in the
profile: about 2.5% of the total runtime is spent on the function
call and return.

Here, the time-space tradeoff says inline.

-- 
ioLowResMSecs and ioMSecs (sqXWindow.c:1827) were invoked 150 million
times, or about 40,000/sec. Syscalls are cheap, but still. :)

Approximately 99.5% of these calls were made by the interpreter in
primitiveResponse (gnu-interp.c:15289 and gnu-interp.c:16355), and about
4% of the total CPU time was spent here.

Eliminating this call would probably help things a lot.

--

Within primitiveResponse, all of the primitiveFOO functions it invokes
(excluding primitiveFloatDividebyArg, primitiveMakePoint, and
primitiveBitShift) are invoked only by it. That's about 76 million
outgoing function calls.

Also, about 90% of the invocations of fetchClassOfNonInt
are from loadFloatOrIntFrom, and 99.5% of the invocations of floatValueOf
are from fetchClassOfNonInt -- about 100 million function calls total.

Between these, about 1% of the total CPU time is spent on function
calling.

--

Overall, the profile looked like:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 59.69    383.93   383.93        1 383930.00 642977.57  interpret
  3.49    406.35    22.42 351799030     0.00     0.00  fetchClassOf
  3.41    428.30    21.95 15602918     0.00     0.00  lookupMethodInClass
  2.78    446.15    17.85 154761143     0.00     0.00  ioMSecs
  2.37    461.41    15.26      501    30.46    30.72  sweepPhase
  2.06    474.64    13.23 76766482     0.00     0.00  primitiveResponse
  2.04    487.77    13.13    10774     1.22     1.22  copyImage16To32

--

Also, I was examining the method cache code and thought it was a little
strange. I can see why, for example, you haven't used any clever tricks
to exploit CPU cache effects: you expect the CPU cache to
already contain the method cache.

But I was surprised to find that the cache entries start at 1
instead of 0 (introducing a lot of extra instructions), that the cache has
only 256 entries, and that it wastes 37% of its space.

I would have thought you'd have something along the lines of two tables:

lookupTable[probe + 0] = messageSelector;
lookupTable[probe + 1] = lkupClass;

and:

methodData[probe + 0] = newMethod;
methodData[probe + 1] = primitiveIndex;
methodData[probe + 2] = newNativeMethod;
methodData[probe + 3] = UNUSED;

(Note: the two tables should be allocated contiguously in memory.)

As you fit the same amount of information into less space, you waste fewer
CPU cache lines. Given that the single line:

			if (((methodCache[t5 + 1]) == messageSelector) &&
((methodCache[t5 + 2]) == lkupClass)) /* in commonLookup */

is consuming 4.5% of the CPU time, I suspect that the CPU is taking L1
cache misses, or that the branch prediction unit is mispredicting. I cannot
easily tell which from the information given.

If the branch prediction unit is responsible, the method cache should be
made larger, so that it has a higher hit rate. (It is really,
really small.) With my table design, a 1k-entry cache would require 8kb
of space for the lookup table and 16kb for the method data. (But as
methodData entries are exactly one cache line, if an entry isn't used,
the CPU will flush it out of L1 cache.)

The hit rate of the first two probes in the method cache appears to be
about 90%: judging from the profile, it spends 4.4% of all
CPU time doing the first probe, 1.17% doing the second probe, and 0.4% on
the third probe.

Taking a guess that it is branch misprediction, how does this sound?


Have a giant method cache, 32k entries, so collisions are rare.
Furthermore, make each entry a duplet: two 8-byte slots in the same CPU
cache line.

Upon any lookup, check the first slot for a match and take it. If it
isn't in the first slot, try the second slot; if that matches, swap the
two slots. If both slots fail, fall back to the slow method lookup code.

The idea is that having your first probe fail means you're taking a
branch misprediction. Secondly, probing interacts with the L1
cache in subtle ways: making a bigger method cache with lower
contention and trusting L1 to load only the most important parts seems
like it could be better than having a small cache entirely in L1 and doing
several probes. With this scheme you can have a huge cache, yet trust L1
to keep only the parts that are being used. That is, the method cache
effectively becomes a sort of L2 method cache -- the CPU's L1 cache
chooses an optimal subset of it -- and the method cache turns into a way
to avoid re-looking-up methods.

Morphic has an incredible variety of methods and many of these methods are
called frequently, but not extremely frequently.


The penalty of having a large method cache is that there is more to
flush on GCs. Which reminds me: why is 'allocations between GCs' still
set at 4000? If I set it to 400,000, I still get incremental GC pauses of
50ms on my P2-450, but I only get them every 10 seconds or so. A larger
value also means the system must flush the method cache far less often.

Anyone up for changing the default to 40,000? (Still under 10ms of pause.)



--

This is the stuff that seemed really out of place in an hour's scan. I
can supply the full profiling output if anyone is interested (about 150kb
compressed).

Scott.



--
No DVD movie will ever enter the public domain, nor will any CD. The last CD
and the last DVD will have moldered away decades before they leave copyright.
This is not encouraging the creation of knowledge in the public domain.
