"Just Curious" VM question

Tim Rowledge tim at sumeru.stanford.edu
Tue Sep 16 00:43:26 UTC 2003


"Andreas Raab" <andreas.raab at gmx.de> wrote:

> Been there done that ;-) Turns out that some ABIs (==MacOS) have a really
> stupid overhead when a function call goes "cross-fragment" and the burden to
> determine this is with the caller and not the callee. In other words, if you
> call a "function through pointer" you have a measurable overhead when you do
> it often enough.
Well, we could sneak a peek at how VW does it. I imagine (accent on
imagine here, I haven't bothered to check anything) that internal,
numbered prims would be in the same segment doohickey. If so, that would
avoid that cost. If we pay a few cycles for jumping to functions in
plugins, well aren't we already doing so? They're already called by
pointer. Anyway, so what if it slows down Macs a couple of percent.
Nobody uses them. They don't matter. To hell with anything not intel.
:-J
Oh, and doesn't 'gnuify' sorta demonstrate that this isn't a problem?
It converts the bytecode switch statement to use sort-of addresses,
doesn't it?
And thirdly of my two points, so if it's a problem for PPC, fudge the
CCG so it doesn't get used for PPC.



> Given that all of
> the numbered prims are the "truly time-critical ones" it didn't seem worth
> the effort (and also, named prims were new and we hadn't thought about how
> to make them really fast so the "default" at this time was still numbered).
> Also, we use some of the primitive numbers right in the interpreter loop
> (like the quick prims) which would be hard to do based on primitive address.
Not sure I agree with this; if the quick prims are converted to a bunch
of teensy functions and the pointers are cached as usual, what would be
the problem? Aside from having a gazillion teeny functions, of course.
Oh, trivial point: not all the numbered prims are time-critical right
now, but we could certainly (and I would argue _should_) have it that
way.
> 
> Of course, if we wanted to get really fancy we could declare that no named
> primitive must ever use the 1k lower range (which I think is reasonable) and
> just store the address of the named function anyway. But then, unless
> there's a _need_ for it I don't see any point wasting effort to make things
> infinitely fast in places where it doesn't count ;-)
Depends on what the cost is really, doesn't it? If, for example, I were
to just do it there really wouldn't be any meaningful cost. Nobody is
paying for _any_ of this stuff right now. At least, not to my bank a/c
and I couldn't care less about anyone else's at this stage of the game.

> 
> > Further advantages; could cache more specialized
> > versions of byteAt/put/wordat/put/etc thus
> > speeding up the frequent scanning of homogeneous lists.
> 
> Not sure I understand this. Can you elaborate?
Sure; consider sending some version of #at:. We do the lookup, find a
ref to prim 60 (or whatever) and cache 60. When we execute the prim we
have to check for the format and do the right thing. Every damn time.
If we looked as part of the caching operation and cached
prim60versionByte then we could assume the byteness next time. A little
here, a little there, and pretty soon we're talking serious gravy.
Seemed worth it in BrouHaHa.

> 
> Completely OT point here but one of the things where I _would_ dramatically
> like the ability to "store a function address" in the mcache is for the FFI
> - with a bit of help by some native code generation this could make
> marshalling incredibly fast with only very little help by something like
> Ian's ccg ;-)
Which is basically a much-extended version of my point above. Not at all OT.

With some (usually trivial) code generation one could create a lot of
the tiny prims as translated functionelles during startup and save a
lot of ip/sp back and forth in some crucial places. Having a system
that uses function ptrs in the lookup cache(s) makes it easier to use
something like that. 

The biggest thing we could do right now for performance is get
something sorted out about a better send/return mechanism. It's not
like we're short of examples. BHH had an excellent system and Ian did a
faintly similar one for Squeak several years ago before even J3. AJH
did an interesting if slightly convoluted one. VW, J5 etc all have
interesting approaches. My guess is that we could get a major
macrobenchmark improvement with something quite simple. 

tim
--
Tim Rowledge, tim at sumeru.stanford.edu, http://sumeru.stanford.edu/tim
Oxymorons: Childproof


