Performance profiling results...

Scott A Crosby crosby at qwes.math.cmu.edu
Sat Sep 22 21:38:12 UTC 2001


On Sat, 22 Sep 2001, Andreas Raab wrote:

>
> For speed, yes. But unfortunately, various primitives do take some time
> (just think about BitBlt, sound generation, etc.) and if they do, it is at
> times critical to capture it right after the primitive invocation. So
> there's not a lot we can do except try to make ioLowResMSecs as
> cheap as possible (its only intent is to figure out whether we have spent
> a "significant" amount of time or not).
>

Ok.. How about having a SIGALRM handler that increments a counter 100
times a second? The interpreter checks the counter before and after the
primitive to see if it changed by more than 3 steps (30 ms).

Trust me, getting rid of this would eliminate 40,000 system calls A
SECOND, and 4% of the runtime.

Here is the beginning of the raw results:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 59.69    383.93   383.93        1 383930.00 642977.57  interpret
  3.49    406.35    22.42 351799030     0.00     0.00  fetchClassOf
  3.41    428.30    21.95 15602918     0.00     0.00  lookupMethodInClass
  2.78    446.15    17.85 154761143     0.00     0.00  ioMSecs
  2.37    461.41    15.26      501    30.46    30.72  sweepPhase
  2.06    474.64    13.23 76766482     0.00     0.00  primitiveResponse
  2.04    487.77    13.13    10774     1.22     1.22  copyImage16To32
  1.53    497.61     9.84 38614401     0.00     0.00  instantiateContextsizeInBytes
  1.30    505.99     8.38   166590     0.05     0.05  copyLoopNoSource
  1.29    514.28     8.29  1071307     0.01     0.01  copyLoopPixMap
  1.12    521.47     7.19 151573427     0.00     0.00  ioLowResMSecs

You really want to get rid of the 150 million calls to those two
io functions.


>
> Hm ... interesting point. You mean we should try to inline them like the
> bytecodes?! Could make sense for various of the "quick" primitives.
>

Inlining at least some of the functions here makes sense; primitiveSize,
for example. A third of the invocations of primitiveResponse are for
primitiveSize, so inlining it might gain 0.25% or so.


Inlining the floating point code loadFloatOrIntFrom and its children:

                0.06    0.07  865306/74964162     primitiveFloatDividebyArg [106]
                0.12    0.14 1671174/74964162     primitiveFloatMultiplybyArg [82]
                0.13    0.14 1720188/74964162     primitiveFloatSubtractfromArg [105]
                1.16    1.33 15755316/74964162     primitiveFloatAddtoArg
                1.19    1.36 16186178/74964162     primitiveFloatLessthanArg [49]
                1.23    1.41 16770078/74964162     primitiveFloatEqualtoArg [47]
                1.62    1.85 21995922/74964162     primitiveFloatGreaterthanArg [39]
[20]     1.8    5.51    6.30 74964162         loadFloatOrIntFrom [20]
                3.54    0.00 65772653/71609532     fetchClassOfNonInt [43]
                2.77    0.00 25798728/25811885     floatValueOf [54]

will probably gain another percent or so in general use, and probably a lot
more if the code is heavy on numerics.

(An X/Y count means that X of the Y total invocations of the function came
from the caller listed on that line.) Per the [20] entry, about 1.8% of the
CPU time is spent in this function and the functions it calls.

Not all of them warrant it, but at least in normal editing use, quite a
few could gain a little by being inlined.

>
> [Re: MCache]
> > But I was surprised to find that the cache entries started at 1,
> > instead of zero (introducing a lot of extra instructions), that it had
> > only 256 entries, and that it wasted 37% of its space...
>
> How did you find this (btw, the mcache has 512 entries not 256)?!
>

My error. I had misremembered the array declaration; at 8 slots per
entry it is indeed 512 entries.

You can tell by reading the generated C code that accesses it. For example:

int addNewMethodToCache(void) {
    int hash;
    int probe;
    int p;

        compilerInitialized && (compilerTranslateMethod());

        /* drop low-order zeros from addresses */

        hash = messageSelector ^ lkupClass;
        for (p = 0; p <= (3 - 1); p += 1) {
                probe = (((unsigned) hash) >> p) & 4088;
                if ((methodCache[probe + 1]) == 0) {
                        methodCache[probe + 1] = messageSelector;
                        methodCache[probe + 2] = lkupClass;
                        methodCache[probe + 3] = newMethod;
                        methodCache[probe + 4] = primitiveIndex;
                        methodCache[probe + 5] = newNativeMethod;
                        return null;
                }
        }
        /* first probe */

        probe = hash & 4088;
        methodCache[probe + 1] = messageSelector;
        methodCache[probe + 2] = lkupClass;
        methodCache[probe + 3] = newMethod;
        methodCache[probe + 4] = primitiveIndex;
        methodCache[probe + 5] = newNativeMethod;
        for (p = 1; p <= (3 - 1); p += 1) {
                probe = (((unsigned) hash) >> p) & 4088;
                methodCache[probe + 1] = 0;
        }
}

Note first that offsets 'probe + 0', 'probe + 6', and 'probe + 7' are never
used, meaning that 3 of the 8 slots per entry (37.5%) are wasted. Secondly,
note that 'probe + 0' is never touched at all: every access pays for an
offset add.

Which also means that the cache lines holding it carry a lot of
unused/worthless data. This is why I suggested split tables, with say 4
probings:

 probe1 probe1+1
 probe2 probe2+1

That gets better than a 90% hit rate while touching only 2 cache lines,
and I can make the cache as big as I want (probably useful with something
like Morphic, where a single screen redraw executes dozens of methods on
dozens of classes, thrashing the current cache).

I'll have to experiment. Then again, won't Jitter be coming online soon,
making the current interpreter mostly a curiosity?

>
> Dunno. Try it - change MethodCacheEntries in Interpreter, compile a new VM
> and profile.
>

How do I compile a new VM?


>
> Hm ... not really. At least not as long as you can't show a serious
> performance improvement. Which I find hard to believe, mostly because John
> Mc recently changed the IGC mcache flush to be selective (e.g., flush only
> entries that are actually in GC range). So apart from enumerating the
> cache there's not a lot going on wrt GC.
>

With 10 GCs a second (which seems typical), 16 KB for the cache, and stock
memory bandwidth (about 40 MB/sec), that's roughly a quarter of a percent
of the CPU just on the scanning. :) Not too bad, but I'm thinking of using
a 24-byte * 4K-entry cache: six times larger, so the scanning overhead
becomes unpleasant. That's why I was thinking of the split tables.

>
> Thanks a lot! You've got a couple of good catches there.
>

Profiling can find stupidities in the strangest of places. Note that I
profiled a simple development session: me writing some code and browsing
code. Other types of workloads may look different.


Scott




--
No DVD movie will ever enter the public domain, nor will any CD. The last CD
and the last DVD will have moldered away decades before they leave copyright.
This is not encouraging the creation of knowledge in the public domain.





More information about the Squeak-dev mailing list