Where to next with Exupery?

Bryce Kampjes bryce at kampjes.demon.co.uk
Sun May 15 10:32:30 UTC 2005

Tim Rowledge writes:
 > > It's a source for benchmarks, I could also borrow them from other
 > > places. But what I'm interested in is benchmarks with a practical
 > > benefit to the Squeak community.
 > >
 > There is a fairly sizable set of benchmarks on SM. Like any BMs they are of
 > limited use but at least its a common set that exercises a very large part of
 > the system. The nice thing about big benchmarks is that they rapidly deflate
 > ones fantasies about how a tiny tweak will be a huge win. :-(  They do however
 > nicely justify careful work to make gradual improvements.

Hmm, those don't work for me. The compilation and the VMMaker macro
tests do work; I'll play with those.

Currently, methods need to be explicitly compiled for each receiver
class, which limits Exupery to a few inner-loop methods. More could
be compiled if there were a good way to find them. I'm planning on
driving compilation by some form of background profiling. Hopefully,
we can find somewhere that Exupery is useful without needing automatic
dynamic compilation. If not, then automatic compilation is the next
major goal.

But the real question is what is the minimal amount I need to do to
make Exupery generally useful (excluding ports)?

 > If you have made serious gains in send/return performance without compromising
 > performance of many processes (think Tweak) then that would be very interesting
 > to find a way to get into the mainline VM. Likewise a faster prim activation
 > (I've already made the prim dispatch about as fast as a not-inlined PMC can be,
 > I think) sequence would be a general win.

I haven't yet made serious gains in send/return performance, I've only
doubled the performance. That's going from compiled code to compiled
code. Dynamic inlining should lead to serious gains. ;-)

There is nothing in my optimisations that would affect process
switching. The only issue is that compiled code doesn't yet do an
interrupt check, but that could be added fairly easily.

 > I suspect many prims could be made faster with some effort in better control
 > flow and less (ab)use of success: etc. Some important prims could be written in
 > pseudocode suitable for your translator so that they could be inlined and avoid
 > the primcalling glue altogether. VW uses this to good effect.

One problem with prims is that they have to deal with a large number
of receiver classes. I've changed my #at: implementation to use
specialised primitives: you compile a version of the primitive for
each receiver class that you use, so all specialisation for the
receiver class can be done at compile time. That should be a good
strategy for primitives like #at: and #new.

I'll need to inline the specialised primitives to gain from the new
#at: implementation (currently it's only 1.5 times faster than the
interpreter). But that's a general framework that can be easily
applied to other primitive operations.

It would be possible to inline the call to the prim leaving just a
type check followed by a C function call. But I'd need a decent
benchmark demonstrating why it's causing real performance problems to
drive development.

 > And of course you have to make it all work nicely on ARM cpus or a substantial
 > fraction of the world population of cpus will be unable to make use of your
 > hard work. There may well be several hundred million x86 based desktop machines
 > around the world but I understand that over a billion ARMs were sold just last
 > year ;-)

Are you volunteering? ;-) An ARM port is planned, I even have an ARM
device (an iPAQ) that runs Squeak occasionally.

