More Profiling

Scott A Crosby crosby at qwes.math.cmu.edu
Wed Sep 26 10:20:58 UTC 2001


On Tue, 25 Sep 2001, Dan Ingalls wrote:

> Scott A Crosby <crosby at qwes.math.cmu.edu> wrote...
>
>
> Scott, I'm really enjoying this inquiry.
>

Heh, actually, this is really more of a distraction; I'm working on
implementing a multiuser text-based virtual world server, and I've been
profiling things to get an idea of where the CPU time is going. I put up
some testing/sample code a while ago, but I'm redoing my networking code
and writing a parser.

> Your study is interesting, because I never considered the effects of
> branch prediction in the current design.  The current management of
> the method cache was designed to minimize thrashing which is done by
> intentionally spreading entries in their reprobe chains.  The idea is

A better idea is just to use a larger cache, so that collisions are rare.

> that if even if two items collide once, it is extremely unlikely that
> they will collide again if placed elsewhere in their reprobe chains.
> So its figure of merit is freedom from contention among multiple
> entries of frequent access.
>

Unless you're invoking more than 330 different class & selector combos for
lookup, in which case it'll flush out all 3 entries about 1/3 of the time.
I.e., I'd say the flaw is that it's always colliding, because the cache is
saturated instantly. (In fact, it saturates in 1.25 ms of CPU time; see
below.)

> Another thing to note is that 4k cache loads per sec may not be a huge
> value.  My machine does 1.2M sends per second, so 4k would be
> excellent performance, right?
>

No, that's the number of full method lookups through the hierarchy per
second, not message sends per second. Here, look at the profile of the top
functions (from about an hour of using Morphic and developing under Squeak;
634 seconds of CPU time):

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 59.69    383.93   383.93        1 383930.00 642977.57  interpret
  3.49    406.35    22.42 351799030     0.00     0.00  fetchClassOf
  3.41    428.30    21.95 15602918     0.00     0.00  lookupMethodInClass
  2.78    446.15    17.85 154761143     0.00     0.00  ioMSecs
  2.37    461.41    15.26      501    30.46    30.72  sweepPhase
  2.06    474.64    13.23 76766482     0.00     0.00  primitiveResponse
  2.04    487.77    13.13    10774     1.22     1.22  copyImage16To32
  1.53    497.61     9.84 38614401     0.00     0.00  instantiateContextsizeInBytes
  1.30    505.99     8.38   166590     0.05     0.05  copyLoopNoSource
  1.29    514.28     8.29  1071307     0.01     0.01  copyLoopPixMap
  1.12    521.47     7.19 151573427     0.00     0.00  ioLowResMSecs


Two things I've referred to already are the incredible numbers of calls
to ioLowResMSecs and ioMSecs (99.9% of them from primitiveResponse). Also,
fetchClassOf was not inlined, which is fortunate for us, because its call
count is (to within 0.1%) the total number of method calls done during the
hour.


Here are the line-by-line profiling statistics:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
  4.43     28.50    28.50                             interpret (gnu-interp.c:6401)
  1.83     40.27    11.77                             interpret (gnu-interp.c:8345)
  1.59     50.49    10.22                             interpret (gnu-interp.c:6396)
  1.44     59.73     9.24                             fetchClassOf (gnu-interp.c:3198)
  1.32     68.24     8.51                             interpret (gnu-interp.c:6505)
  1.29     76.52     8.28                             interpret (gnu-interp.c:6409)
  1.14     83.86     7.35                             interpret (gnu-interp.c:6508)
  1.14     91.19     7.33                             interpret (gnu-interp.c:8346)
  1.12     98.38     7.19 151573427    47.44    77.45  ioLowResMSecs (sqXWindow.c:1819)
  1.07    105.28     6.90                             interpret (gnu-interp.c:6400)
  0.99    111.64     6.36                             ioMSecs (sqXWindow.c:1828)
  0.96    117.78     6.14                             sweepPhase (gnu-interp.c:20279)
  0.94    123.81     6.03                             interpret (gnu-interp.c:6522)
  0.93    129.78     5.97                             sweepPhase (gnu-interp.c:20281)
  0.90    135.54     5.76                             interpret (gnu-interp.c:6487)
  0.89    141.25     5.71                             interpret (gnu-interp.c:6395)
  0.85    146.74     5.48                             interpret (gnu-interp.c:6261)

(BTW, line 6496 is the function call to: fetchClassOf)


A few things can be noted: we did 15602918 full lookups in about 600
seconds of CPU time, so we do about 25000 full method lookups per second;
by this I mean that the cache misses almost 25000 times per CPU second.
Since the effective size of the cache is 330 slots, that means we
essentially refill the entire cache about 80 times per CPU second!  You
have more experience than I, but this seems absolutely insane. :)  Is it?


When I finish my new IO streaming stuff, I'll play with VMMaker and try
just a larger cache. (That won't fix the branch misprediction, or the
problem of 30% of the entries being empty, but it will help indicate what
cache size would minimize such excessive flushing.)


> >...
> >But, if just increasing the size helps, I'd suggest leaving it at that.
>
> I think this probably makes sense, especially for VMs that are not
> minimized to running on PDAs and the like.  Before doing it, it would
> be worth getting actual stats, though, because I think we did a fair
> amount of measurement when we left the value that low.  The new
> macroBenchmarks should help in this regard.

I'd guess a scheme along the lines I propose (which can fit 30% more
entries in 30% less RAM) would help somewhat even there. Fortunately,
those CPUs aren't quite as subject to the effects I refer to.

If someone can explain how to fit the data portion of an entry in 2 slots
(instead of the 3 there are now), you can stuff three times the number of
entries into the same space.

>
> Given your points about branch prediction, it is probably also worth
> revisiting the issue of whether to go with a more conventional
> approach with a larger cache.  Andreas did some experiments with this
> a couple of years ago, and it was certainly competitive.
>


Maybe. The benefit of a larger cache is twofold: first, it tries to avoid
the branch misprediction penalty by having fewer collisions and making sure
the first entry is *good*.  Second, even a misprediction costs only about
1/10 the cost of a full lookup, so probing 1-3 entries is still cheaper
than a full lookup, though I am not sure it is worth the complexity; a
method cache miss isn't all *that* expensive.

(How's this for an idea: we have a parameterized hash function. We check
our miss rate, and if it stays over some threshold for too long, meaning
that we have two entries that are colliding extremely frequently, we alter
the parameter in the hash function, hoping that the new hash table will no
longer have them colliding.)


QUERY: Can I write my caching code in a separate class, yet have the
C translator still inline all of the methods? It'd help a lot; Interpreter
is pretty unwieldy to browse and alter.


> >Though, also, I should point out that even though the actual cache is 512
> >entries, the effective cache is 2/3 that, from the basic block profiling
> >statistics of the addnewmethodtocache.  This is because we clean out all 3
> >probe points upon a miss.
> >
> >So, thus, we're now about 30% memory utilization. (we lost 30% of wasted
> >space in the records, and another 30% of the entries will be empty).
>
> ... so this may mean that the simpler scheme, even if it has worse
> collision stats, might win overall.
>

Factor of 3, if you can figure out how to fit entries into 4 slots in the
table, and you don't clean chains like you do now. (Then a 4K-slot table
holds 1024 entries, three times the effective size of 330 you have now.)


> >Anyone know how to fiddle with the C-generator to get it to generate C
> >code with 0-starting arrays that are aligned on a 16-byte boundry. If
> >someone can do that, and help me understand how to test VM stuff, I'll
> >write a the new cache.
>
> I can't read the last sentence, but I'm hoping you are going to try a

OK. Right now, the cache code uses offsets of 1,2,3,4,5 to access
fields in the array. If we align the start of the cache to a 16-byte
boundary, we'll want it to access fields at offsets 0,1,2,3,4, so as to
exploit the alignment.
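A sketch of the allocation side, assuming we can't rely on malloc handing
back a 16-byte-aligned block (the helper names are made up; the real fix
would have to land in the C code the translator emits):

```c
#include <stdint.h>
#include <stdlib.h>

#define ENTRY_WORDS 4   /* a power of two, so entries stay aligned */

/* Over-allocate and round the base up to a 16-byte boundary.  The
 * caller keeps the raw pointer around for free(). */
static uint32_t *alignedCacheBase(size_t entries, void **rawOut)
{
    void *raw = malloc(entries * ENTRY_WORDS * sizeof(uint32_t) + 15);
    *rawOut = raw;
    return (uint32_t *)(((uintptr_t)raw + 15) & ~(uintptr_t)15);
}

/* 0-based field access: with the base aligned and 16-byte entries,
 * fields live at offsets 0..3 and every entry starts on a 16-byte
 * boundary -- no +1 bias needed anywhere. */
static uint32_t *entryAt(uint32_t *cache, unsigned entry)
{
    return cache + entry * ENTRY_WORDS;
}
```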

> simple cache with good alignment.  I think it's worth a try.  If you
> do comparisons, it would be nice to align the current cache as well.
>
> >Note that the design should be superior to what is there right now.
> >(roughly, twice the number of useful entries in the same memory space).
> >But it is intended to be used as a large cache, perhaps, 4k entries
> >(96kb), so flushing/invalidating it will be somewhat more expensive,
> >meaning that allocations between gc's should be increased.
>
> One nice thing about the current cache is that it can be (and is)
> selectively flushed, so we don't lose that much context.  If you do do

Selectively invalidated when the chains get too long, or selectively
invalidated when methods change?

Please distinguish invalidated entries from flushed entries. They are
distinct concepts.

Flushing is the removal of entries from the cache entirely, overwriting
them with zeros.  Invalidating is the marking of entries as invalid, which
isn't necessarily the same as overwriting them with zeros. I'm not sure
whether Squeak depends on the different semantics of flushing versus
invalidation. This could matter if (say) the GC scanned the method cache.
Does it?

> >How does this sound? Acceptable? Or is Jitter expected soon enough that
> >this isn't important.
>
> Jitter already exists, and it makes these stats look silly.  However

Heh, I care about Linux; if it does get Jitter, I'd just want to use that.

> it's probably fair to say that the interpreter will be around for a
> long time, and if its cache performance can be improved without too
> much trouble (especially if it ends up simpler), then it's probably
> worth doing so.

That which is already done, working, and hasn't been changed in years is
ALWAYS simpler than a newfangled scheme that only has marginal gain. (I'd
guess we might get 10% with a new cache, maybe 4% by just increasing the
size of the existing one.  Is it worth it, to me personally, with Jitter
on the way?)

Besides, under Linux, I can get a 5-10% improvement by just picking the
low-hanging fruit: the uninlined fetchClassOf, and the excessive calls to
ioMSecs.

>
> Keep up the good work!
>

Good profiling can find the most interesting stupidities in any codebase.
Apparently this rule also applies to code I've not written. :) There are
quite a few parts of Squeak that I haven't touched, and may not for a long
time, so someone else will have to profile them (Balloon, Alice, ...).



Scott

BTW, for those who are interested in how I did it. Install Linux.

For full profiling, including basic-block level, compile with options '-g
-pg -a'. (Basic blocks are the stretches of code between branches, so
basic-block-level profiling will tell you each time any conditional is
taken, each time a case in a switch is taken, etc.) Basic block profiling
slows the virtual machine down by about 15%.

If you just want line-level and function-level profiling, compile with
flags '-g -pg'.

Run it a while, then exit normally. You get two files: gmon.out and bb.out

If you're doing basic-block level profiling, you need to convert bb.out so
that gprof can use it:

# perl /usr/share/doc/binutils/gprof/bbconv.pl < bb.out >bbFOO.out

Then,
# gprof -l -A -x squeak gmon.out bbFOO.out | less

will annotate your source code with the number of invocations of each
loop, conditional, etc.

And you can always run:

# gprof -l squeak gmon.out | less

which will give you line-level profiling (the second table above), and

# gprof squeak gmon.out | less

which will give you function-level profiling.


For more info, read the 'gcc' and 'gprof' texinfo documentation. Do *NOT*
read the man pages; they are very out of date.





