"Just Curious" VM question

Tim Rowledge tim at sumeru.stanford.edu
Tue Sep 16 06:12:02 UTC 2003


"Andreas Raab" <andreas.raab at gmx.de> wrote:

> 
> > Not sure I agree with this; if the quick prims are converted 
> > to a bunch of teensy functions and the pointers are cached as usual,
> > what would be the problem?
> 
> No problem really. Just a few cycles spent doing the externalizeIPandSP, a
> few cycles for the function call itself, a few cycles for the branch
> misprediction. Could still be (even significantly) better than the current
> scheme of nested if's though - hard to say.
So don't do the externalizeIPandSP. Pass them as args; we've already
gone to some trouble to get them into registers whenever possible so
calling primAt(ip, sp) or whatever isn't likely to be a problem. Except
perhaps on x86 and who the hell cares about _that_ particular
excrescence of a so-called processor, that's what I want to know.
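
To make the "pass them as args" idea concrete, here is a minimal sketch in C. It is not the actual Squeak VM code: the names QuickPrim, primAddSketch and runPassingDemo, the oop typedef, and the upward-growing toy stack are all assumptions made for illustration. The point is only that the primitive receives ip and sp as parameters and hands the updated sp back as its return value, so nothing has to be written out to globals before the call.

```c
#include <assert.h>
#include <stdint.h>

typedef intptr_t oop;   /* hypothetical object pointer type */

/* Hypothetical quick-primitive signature: ip and sp arrive as arguments
   and the (possibly updated) sp is the return value, so the caller never
   has to externalize them into global variables. */
typedef oop *(*QuickPrim)(const uint8_t *ip, oop *sp);

/* Toy stand-in for a quick primitive: replace the top two stack slots
   with their sum. The stack grows upward in this sketch. */
static oop *primAddSketch(const uint8_t *ip, oop *sp)
{
    (void)ip;                     /* unused by this toy primitive */
    sp[-1] = sp[-1] + sp[0];
    return sp - 1;                /* one slot popped */
}

/* Drive the primitive through a cached function pointer, as a send
   would after a successful cache lookup. Returns the new top of stack. */
static oop runPassingDemo(void)
{
    oop stack[4] = {0, 3, 4, 0};
    const uint8_t bytecodes[1] = {0};
    oop *sp = &stack[2];          /* top of stack holds 4, below it 3 */
    QuickPrim prim = primAddSketch;

    sp = prim(bytecodes, sp);     /* ip and sp stay in locals throughout */
    assert(sp == &stack[1]);
    return sp[0];
}
```

Calling runPassingDemo() from a test harness should yield 7. Whether the arguments really travel in registers depends on the platform's calling convention and the compiler, which is the caveat about x86 above.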

> Well, okay, but my feeling is still that we should be careful about touching
> these areas which we have crafted for years just because it's fun.
Doing this because it's fun is just about the only reason to do it.
It's certainly time some actual performance improvements got done in
send/return.  There's been far too much pontificating and posturing and
far too little actual bellying up to the bar.


> 
> > With some (usually trivial) code generation one could create a lot of
> > the tiny prims as translated functionelles during startup and save a
> > lot of ip/sp back and forth in some crucial places. Having a system
> > that uses function ptrs in the lookup cache(s) makes it easier to use
> > something like that.
> 
> Hm .... good point. Indeed, a VERY good point. In fact that point is so
> incredibly good that one may claim that _IF_ a VM has that overhead for
> certain kinds of function calls that VM itself should implement appropriate
> means. For example, the Mac VM could use a teenie weenie bit of assembly to
> call those prims _without_ the stupid Mac-doohickey for cross-fragment calls
> and rather have that bit of glue be generated from, e.g., the lookup
> mechanism in #ioFindFooBarIn. This would make sure that all prims where the
> lookup can identify it as "local" we get the fastest version theoretically
> possible while still ensuring the proper means for the ABI in general.
The key point is to cut out all extraneous crap. VW does it by
translating pseudocode sequences to native code during startup and
having no need for 'gluecode' at runtime. The translator of course knows
where all the bodies are buried and makes use of that to allow primAt
(etc) to be very optimised. We can do something vaguely similar. As a
very trivial example, many years ago (shortly after ones and zeros were
invented), I did an assembler hack to the BHH reference counting code
that used 'private knowledge' of the system to get reference
decrementing down from a full function call and over 100 cycles to less
than 10 total. Reduced VM size 10% and increased _macro_ benchmark
performance by nearly 10%.
We don't have a suitable translator right now but for a small number of
important routines it is not a huge job to do by hand.
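
As a rough illustration of function pointers living in the lookup cache, the C sketch below keeps a pointer to a tiny hand-written primitive in each cache entry. Everything here is hypothetical: CacheEntry, cacheInstall, cacheLookup, primIncrementSketch, the cache size and the hash are invented for the example and are not taken from the Squeak or VW sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef intptr_t oop;               /* hypothetical object pointer type */
typedef oop *(*PrimFn)(oop *sp);    /* tiny primitive: takes sp, returns new sp */

/* Hypothetical cache entry: the primitive's address is stored next to
   the (selector, class) key, so a hit can call it with no glue code. */
typedef struct {
    oop selector;
    oop klass;
    PrimFn prim;                    /* NULL: no primitive cached here */
} CacheEntry;

#define CACHE_SIZE 64               /* power of two so the mask below works */
static CacheEntry cache[CACHE_SIZE];

static size_t cacheIndex(oop selector, oop klass)
{
    return (size_t)((uintptr_t)(selector ^ klass)) & (CACHE_SIZE - 1);
}

static void cacheInstall(oop selector, oop klass, PrimFn prim)
{
    CacheEntry *e = &cache[cacheIndex(selector, klass)];
    e->selector = selector;
    e->klass = klass;
    e->prim = prim;
}

/* Returns the cached primitive, or NULL on a miss. */
static PrimFn cacheLookup(oop selector, oop klass)
{
    CacheEntry *e = &cache[cacheIndex(selector, klass)];
    if (e->selector == selector && e->klass == klass)
        return e->prim;
    return NULL;
}

/* Toy hand-written primitive: add one to the top of stack. */
static oop *primIncrementSketch(oop *sp)
{
    sp[0] += 1;
    return sp;
}

/* Install one primitive, then dispatch through the cache. */
static oop runCacheDemo(void)
{
    oop stack[1] = {41};
    oop *sp = &stack[0];
    PrimFn prim;

    cacheInstall(17, 42, primIncrementSketch);
    assert(cacheLookup(18, 42) == NULL);   /* different selector: miss */

    prim = cacheLookup(17, 42);
    assert(prim != NULL);
    sp = prim(sp);                         /* direct call through the pointer */
    return sp[0];
}
```

In this sketch runCacheDemo() returns 42. A real VM would fall back to a full method lookup on a miss; that path is left out here.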

tim
--
Tim Rowledge, tim at sumeru.stanford.edu, http://sumeru.stanford.edu/tim
May the bugs of many programs nest on your hard drive.



More information about the Squeak-dev mailing list