Thue-Morse and performance: Squeak vs. Strongtalk vs. VisualWorks

Tue Dec 19 17:07:14 UTC 2006

On 12/18/06, bryce at kampjes.demon.co.uk <bryce at kampjes.demon.co.uk> wrote:
>
> David Griswold writes:
> > Sends that have more than 4 receiver types, such as your micro-benchmark,
> > can't even use PICs or any kind of inline cache, so these are full
> > megamorphic sends in Strongtalk, implemented as an actual hashed
> > lookup, which is the slowest case of all.  You might say that is what
> > Smalltalk is all about, but in reality megamorphic sends are relatively rare
> > as a percentage of sends.  Compilers aren't magic: no one can eliminate the
> > fundamental computation that a truly megamorphic send has to do.  It *has* to
> > do some kind of real lookup, and a call, so the performance will naturally
> > be similar across all Smalltalks.
>
> I'm fairly sure that VisualWorks has a hash PIC that it uses for
> megamorphic sends. Eliot talked about this at Smalltalk Solutions.
> I also doubt that VW does any advanced optimizations such as global
> code motion (moving type checks out of loops) or loop unrolling. If it
> did, it would be faster than Exupery for the bytecode benchmark.

I don't know the exact details of how VW's hash PICs work, but I think my
original comment holds: since both Strongtalk and VW do hashing for
megamorphic sends, and type feedback doesn't help Strongtalk in this case,
I would expect them to be fairly similar in performance, modulo standard
code-quality issues that reflect the level of tuning in each compiler.
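To make the PIC-versus-megamorphic distinction concrete, here is a toy Python model of a single send site. The four-entry capacity follows the "more than 4 receiver types" figure quoted above; the class names, the single shared hash table, and the statistics counters are assumptions for illustration only, not how VW's hash PICs or Strongtalk's lookup are actually implemented.

```python
# Toy model of send dispatch (hypothetical names; not VW's or Strongtalk's code).

class Klass:
    def __init__(self, name, methods=None, superclass=None):
        self.name = name
        self.methods = methods or {}   # selector -> callable(receiver)
        self.superclass = superclass

class Obj:
    def __init__(self, klass):
        self.klass = klass

def full_lookup(klass, selector):
    """Slow path: walk the superclass chain."""
    k = klass
    while k is not None:
        if selector in k.methods:
            return k.methods[selector]
        k = k.superclass
    raise LookupError(f"{klass.name} does not understand #{selector}")

# Hashed cache shared by every megamorphic send: (class, selector) -> method.
GLOBAL_CACHE = {}

class SendSite:
    """One send site in compiled code, with its own inline cache."""
    PIC_CAPACITY = 4  # more receiver types than this => megamorphic

    def __init__(self, selector):
        self.selector = selector
        self.entries = []          # [(klass, method)]; doubles as type feedback
        self.megamorphic = False
        self.stats = {"pic_hit": 0, "hashed": 0, "full": 0}

    def send(self, receiver):
        klass = receiver.klass
        if not self.megamorphic:
            # PIC: a short chain of class tests, no hashing.
            for cached_class, method in self.entries:
                if cached_class is klass:
                    self.stats["pic_hit"] += 1
                    return method(receiver)
            self.stats["full"] += 1
            method = full_lookup(klass, self.selector)
            if len(self.entries) < self.PIC_CAPACITY:
                self.entries.append((klass, method))
            else:
                # Too many receiver types: stop inline caching at this site.
                self.megamorphic = True
                self.entries = []
            return method(receiver)
        # Megamorphic: hash (class, selector) on every send, then call.
        key = (klass, self.selector)
        method = GLOBAL_CACHE.get(key)
        if method is None:
            self.stats["full"] += 1
            method = full_lookup(klass, self.selector)
            GLOBAL_CACHE[key] = method
        else:
            self.stats["hashed"] += 1
        return method(receiver)
```

Once a site goes megamorphic, every send pays for a hash probe plus a call, and the site no longer records a short list of receiver classes that a type-feedback compiler could inline, which is the situation described above.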

It is quite possible that VW doesn't do loop unrolling (which is why my
original post expressed less confidence on that point), but I am pretty sure
it does array bounds-check removal, which, as I said, I would expect to
account for a good chunk of any performance difference (although at the
moment we don't actually have comparative numbers, since no one has run both
VW and compiled Strongtalk on the same machine on this benchmark).

Strongtalk should be able to move the Array access type-test out of the
loop; I had assumed that VW could do that too, since it seems like a
relatively easy thing to do.
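Both optimizations discussed above, array bounds-check removal and moving the Array type test out of the loop, amount to replacing a test performed on every iteration with one test performed before the loop. The toy Python model below counts the tests in each version. The function names, the use of a Python list as a stand-in for a Smalltalk Array, and the fallback strategy (often called loop versioning) are illustrative assumptions, not a description of the code VW or Strongtalk actually generates.

```python
# Toy model of two loop optimizations (hypothetical; not VW or Strongtalk code).
# Smalltalk's `anArray at: i` is 1-based; a Python list stands in for Array.

def at(receiver, index, checks):
    """Generic `at:` with the tests an unoptimized send performs every time."""
    checks["type"] += 1
    if type(receiver) is not list:
        raise TypeError("receiver is not an Array")
    checks["bounds"] += 1
    if not 1 <= index <= len(receiver):
        raise IndexError(index)
    return receiver[index - 1]

def sum_naive(array, lo, hi, checks):
    total = 0
    for i in range(lo, hi + 1):
        total += at(array, i, checks)   # type and bounds test on every iteration
    return total

def sum_hoisted(array, lo, hi, checks):
    if lo > hi:
        return 0                        # empty loop: no accesses, no tests
    # `array` does not change inside the loop, so one type test covers it all.
    checks["type"] += 1
    # Indices run from lo to hi, so checking both ends covers every access.
    checks["bounds"] += 1
    if type(array) is list and 1 <= lo and hi <= len(array):
        total = 0
        for i in range(lo, hi + 1):
            total += array[i - 1]       # no per-iteration tests
        return total
    # Otherwise run the checked loop, which raises at the same point the
    # unoptimized code would.
    return sum_naive(array, lo, hi, checks)
```

Summing a 100-element array over its full range performs 100 type tests and 100 bounds tests in `sum_naive`, but one of each in `sum_hoisted`.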

> However, in this case, if you're actually compiling your benchmark in
> Strongtalk, it's possible that the performance difference between VW and
> Strongtalk is due to the method specialization done by Strongtalk.
>
> Strongtalk, AFAIK, compiles a separate version of a method for each
> receiver class. This is an optimization because it allows more precise
> type information to be gathered, as it's not polluted by other classes'
> use of an inherited method. Specializing methods by receiver should also
> allow faster inlining of self sends, as they can be fully resolved at
> compile time. (1)
>
> Having a separate compiled method for every receiver may be doing bad
> things to your CPU's instruction cache. That could be where
> Strongtalk's lack of performance here is coming from. First-level
> instruction caches are small; the largest on a desktop CPU is only
> 64 KB. If you want to find out, it is possible to measure cache
> misses; unfortunately, I only know how to do this under Linux.

I doubt the instruction cache is the issue here, since the only customized
methods involved are a few different versions of #yourself, which does
nothing but return self, so the methods should only be a few instructions
long.  It should take a lot more than that to thrash the instruction
cache.

And in general in Strongtalk, the code duplication caused by customization
is counteracted by the fact that only hotspot code is compiled in the first
place, unlike VW.  The entire compiled code cache in Strongtalk for all code
in the image is rarely bigger than 2-4 megabytes total, which is probably
smaller than VW's code cache.  There is probably a bit more instruction
cache pressure in Strongtalk, but we've never seen anything that looked like
a performance hit because of it, since all that really matters is whether
the current inner-loop working set thrashes, not the size of the whole
code cache.
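The trade-off described above, more compiled copies per method but copies only for hot code, can be sketched as a toy Python model. The `ToyVM` class, its invocation threshold, and keying the code cache on (receiver class, selector) are assumptions chosen to mirror customization as described in this thread; Strongtalk's real counters, thresholds, and compiler are not modeled.

```python
# Toy model of customized, hotspot-only compilation (hypothetical; not Strongtalk's code).

class ToyVM:
    def __init__(self, hot_threshold=100):
        self.hot_threshold = hot_threshold  # hypothetical invocation threshold
        self.counts = {}       # (receiver class, selector) -> cold invocations so far
        self.code_cache = {}   # (receiver class, selector) -> customized copy

    def send(self, receiver, selector):
        key = (type(receiver), selector)
        code = self.code_cache.get(key)
        if code is not None:
            return code(receiver)            # hot: run the customized copy
        self.counts[key] = self.counts.get(key, 0) + 1
        method = getattr(type(receiver), selector)   # dynamic lookup (Python's MRO)
        if self.counts[key] >= self.hot_threshold:
            self.code_cache[key] = self.customize(type(receiver), method)
        return method(receiver)              # cold: "interpret" the shared method

    @staticmethod
    def customize(cls, method):
        # A real compiler would emit new machine code specialized to cls. Here the
        # "compiled" copy is a closure over the already-resolved method, guarded by
        # a class check because a customized copy is only valid for that class.
        def compiled(receiver):
            if type(receiver) is not cls:
                raise TypeError(f"code customized for {cls.__name__}")
            return method(receiver)
        return compiled

    def code_cache_size(self):
        return len(self.code_cache)   # one entry per hot (class, selector) pair
```

For example, with `hot_threshold=10`, sending `yourself` 50 times each to instances of two subclasses and once to a third leaves exactly two customized copies in the cache; the cold send is never compiled, so duplication only grows with the hot working set.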

> Microbenchmarks are getting less reliable as compilers and hardware
> become smarter.

Absolutely!

Cheers,
Dave