Performance profiling results...

Andreas Raab <Andreas.Raab at gmx.de>
Sat Sep 22 19:40:00 UTC 2001


Scott,

Interesting results!

> Over 99% of the invocations of the function 'fetchClassOf' are when
> invoked from 'commonSend' in the interpreter. (approximately
> 350 million).

Yup. I see the culprit; it's

	receiverClass := lkupClass := self fetchClassOf: rcvr.

which doesn't get inlined properly. It should be rewritten as

	lkupClass := self fetchClassOf: rcvr.
	receiverClass := lkupClass.

to eliminate the call.
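
To see why that single call matters: fetchClassOf: itself is tiny -
roughly the following (simplified and from memory, so names and details
may be off):

	fetchClassOf: oop
		"Answer the class of oop: the SmallInteger class for tagged
		 integers, otherwise whatever fetchClassOfNonInt: reads out
		 of the object header / compact class table."
		(self isIntegerObject: oop)
			ifTrue: [^ self splObj: ClassInteger]
			ifFalse: [^ self fetchClassOfNonInt: oop]

so paying a real C function call 350 million times just to run a tag
test is pure overhead once the assignment is split as above.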

> ioLowResMSecs and ioMSecs (sqXWindow.c:1827) were invoked 150 million
> times, or about 40,000/sec. Syscalls are cheap, but still. :)
[...]
> Eliminating this call would probably help things a lot.

For speed, yes. But unfortunately, various primitives do take some time
(just think about BitBlt, sound generation etc.) and if they do, it is at
times critical to capture that right after the primitive invocation. So
there's not a lot we can do other than trying to make ioLowResMSecs as
cheap as possible (its only intent is to figure out whether we have spent
a "significant" amount of time or not).
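
To make that concrete, the pattern is roughly this (a sketch from
memory, not the exact method):

	primitiveResponse
		| timeStart |
		timeStart := self ioLowResMSecs.
		successFlag := true.
		self dispatchOn: primitiveIndex in: PrimitiveTable.
		(self ioLowResMSecs - timeStart) > 0 ifTrue:
			["the primitive ran for a while - make the next
			  interrupt check happen right away"
			 self forceInterruptCheck].
		^ successFlag

The two ioLowResMSecs calls per primitive *are* the 'did we spend
significant time' test, so all we can shave off is their cost, not the
calls themselves.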

> Within primitiveResponse, all of the primitiveFOO functions it invokes
> (excluding primitiveFloatDividebyArg primitiveMakePoint
> primitiveBitShift) are only invoked by it itself. That's about 76 million
> outgoing function calls.

Hm ... interesting point. You mean we should try to inline them like the
bytecodes?! That could make sense for a number of the "quick" primitives.
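
For the simplest ones it might be as easy as marking them for the
inliner - a hypothetical sketch (the real primitive bodies differ, and
the overflow check is omitted):

	primitiveQuickAdd
		"Hypothetical: a SmallInteger add marked inline: true so
		 the translator folds it into the primitive dispatch switch,
		 the way the bytecode routines get folded into interpret()."
		| arg rcvr |
		self inline: true.
		arg := self popInteger.
		rcvr := self popInteger.
		successFlag
			ifTrue: [self pushInteger: rcvr + arg]
			ifFalse: [self unPop: 2]

Whether the inliner actually accepts that for the entries reached via
the primitive table is something we'd have to try.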

> Also, about 90% of the invocations of fetchClassOfNonInt
> are from loadFloatOrIntFrom. And 99.5% of the invocations of
> floatValueOf are from fetchClassOfNonInt. About 100 million function
> calls total.

Yeah, it would be nice to get rid of 'em, but the inliner currently refuses
to inline methods that have an explicit return type - and #loadFloatOrIntFrom:
is declared #returnTypeC: 'double'.
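
For illustration, this is the kind of declaration that stops it (a
hypothetical method, not the actual one):

	exampleLoadFloatFrom: oop
		"Hypothetical: a double-returning Slang method. The
		 returnTypeC: declaration gives it a non-oop C return type,
		 and the inliner currently skips any method declared that way."
		self returnTypeC: 'double'.
		(self isIntegerObject: oop)
			ifTrue: [^ (self integerValueOf: oop) asFloat].
		^ self floatValueOf: oop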

[Re: MCache]
> But I was surprised to find that the cache entries started at 1,
> instead of zero (introducing a lot of extra instructions), that it had
> only 256 entries, and that it wasted 37% of its space...

How did you find this (btw, the mcache has 512 entries, not 256)?!
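
For anyone following along, the lookup is roughly this (simplified and
from memory; the reprobe shifts and constant names may be slightly off):

	lookupInMethodCacheSel: selector class: class
		"Probe the method cache. Entries are several words each and
		 the field offsets (MethodCacheSelector etc.) start at 1,
		 hence the extra adds Scott noticed."
		| hash probe |
		hash := selector bitXor: class.
		probe := hash bitAnd: MethodCacheMask.
		(((methodCache at: probe + MethodCacheSelector) = selector) and:
		 [(methodCache at: probe + MethodCacheClass) = class]) ifTrue:
			[newMethod := methodCache at: probe + MethodCacheMethod.
			 primitiveIndex := methodCache at: probe + MethodCachePrim.
			 ^ true].
		"two more probes with shifted hashes, then report a miss"
		^ false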

> I would have thought you'd have something along the lines of
> two tables:

I don't know enough about the various processor caches but I'd like to see
performance measures with your layout. Shouldn't be hard to do.

> With my table design, a 1k entry cache would require 8kb
> of space for the lookuptable, and 16kb for the methoddata.

But how fast is it?! ;-)

> Taking a guess that it is branch misprediction how does this sound?

Dunno. Try it - change MethodCacheEntries in Interpreter, compile a new VM
and profile.
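
Concretely, that's a one-constant change in the class-side cache
initialization (plus a translate-and-recompile), e.g.

	"currently 512; keep it a power of two so the hash mask still works"
	MethodCacheEntries := 1024.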

[Re: GC]
> The penalties of having a large methodcache are that there is more to
> flush on GC's... Which reminds me, why is 'allocations
> between gc's' still set at 4000?

It works, and nicely at that, even on small machines. Heck, it even worked
nicely on my 486DX2/66 (if anyone remembers what that is ;-) which I used
for porting Squeak.

> Anyone up for changing the default to 40000?  (still under 10ms)

Hm ... not really. At least not as long as you can't show a serious
performance improvement, which I find hard to believe, mostly because John Mc
recently changed the IGC mcache flush to be selective (i.e., flush only
entries that are actually in GC range). So apart from enumerating the cache
there's not a lot going on wrt GC.
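
If you want to test it without even recompiling: if I remember the
parameter index right, the VM parameter primitive lets you poke the
value from the image, something like

	"hypothetical snippet - parameter 5 should be the allocation
	 count between incremental GCs"
	Smalltalk vmParameterAt: 5 put: 40000.

Run your favorite macro benchmarks before and after and post the numbers.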

> This is the stuff that seems really out of place in an hour
> scan of it. I can feed the full profiling output if anyone is
> interested. (about 150kb compressed)

Thanks a lot! You've got a couple of good catches there.

Cheers,
  - Andreas




