"Just Curious" VM question

Andreas Raab andreas.raab at gmx.de
Tue Sep 16 01:29:15 UTC 2003


Hi Tim,

> Well we could sneak a peek at how VW does it.

Well, I would expect that in the presence of a JIT even the overhead of a
cross-fragment call is less than an indirect dispatch through a table
lookup. In addition, VW doesn't use named prims excessively so I'm sure the
native code can make all sorts of assumptions about primitives in general. 
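
To be clear about what I'm comparing against: by "indirect dispatch through
a table lookup" I mean roughly the following - a minimal toy sketch, all
names made up:

  #include <stdio.h>

  typedef void (*PrimitiveFn)(void);

  static void primAt(void)   { puts("prim 60: at:"); }
  static void primFail(void) { puts("primitive failed"); }

  /* Table indexed by primitive number; unused slots point at primFail. */
  static PrimitiveFn primitiveTable[256];

  int main(void) {
      int i;
      int primIndex = 60;               /* as found during lookup */

      for (i = 0; i < 256; i++) primitiveTable[i] = primFail;
      primitiveTable[60] = primAt;

      primitiveTable[primIndex]();      /* one indirect, hard-to-predict call */
      return 0;
  }

A JIT can instead emit a direct call once the target is known, which is why
even a cross-fragment call may come out ahead.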

Oh, and I am not at all certain whether the ABI of OS X is still exactly the
same (I guess I need to have a look at the FFI plugin to see if that "reload
base pointer" stuff is still in there).

> I imagine (accent on imagine here, I haven't bothered to
> check anything) that internal, numbered prims would be in
> the same segment doohickey. If so that would avoid that cost.

Only if you have a way to tell the compiler that "you stupid compiler I know
exactly what I am doing here stop slowing me down" - which I tried in a
variety of (portable) ways, none of which showed any effect. Oh well...

> If we pay a few cycles for jumping to functions in
> plugins, well aren't we already doing so? They're already called by
> pointer. Anyway, so what if it slows down Macs a couple of percent.
> Nobody uses them. They don't matter. To hell with anything not intel.
> :-J

Yay! I'm all for it ;-J

> Oh, and doesn't 'gnuify' sorta demonstrate that this isn't a problem?
> It converts the bytecode switch statement to use sort-of addresses
> doesn't it?
> And thirdly of my two points, so if it's a problem for PPC, fudge the
> CCG so it doesn't get used for PPC.

Well, I'd guess that keeping all of the code paths duplicated is more
complication than we'd really want to manage - my feeling here would be to
choose one way or the other so that we still have a reasonable way of
figuring out what the hell the VM is doing.
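
For reference, the rewrite gnuify performs amounts to something like the
following toy sketch; it relies on GCC's labels-as-values extension (so it
compiles only with gcc), and the bytecode names are made up:

  #include <stdio.h>

  int main(void) {
      /* One label per bytecode; the real table has 256 entries. */
      static void *jumpTable[] = { &&pushSelf, &&returnTop };
      unsigned char bytecodes[] = { 0, 1 };
      unsigned char *ip = bytecodes;

  #define FETCH_NEXT goto *jumpTable[*ip++]

      FETCH_NEXT;

  pushSelf:
      puts("pushSelf");
      FETCH_NEXT;

  returnTop:
      puts("returnTop");
      return 0;
  }

Each bytecode then ends in its own indirect jump instead of looping back to
a single switch, which gives the branch predictor a fighting chance.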

> Not sure I agree with this; if the quick prims are converted 
> to a bunch of teensy functions and the pointers are cached as usual,
> what would be the problem?

No problem really. Just a few cycles spent doing the externalizeIPandSP, a
few cycles for the function call itself, a few cycles for the branch
misprediction. Could still be (even significantly) better than the current
scheme of nested if's though - hard to say.
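
Spelled out, the bookkeeping being priced here looks about like this -
illustrative names only, not the real interp.c ones, and I'm only showing
the sp half since ip is analogous:

  #include <stdio.h>

  typedef long oop;
  typedef void (*PrimitiveFn)(void);

  /* Interpreter state shared with out-of-line primitives. */
  static oop stack[16];
  static oop *globalSP = stack;

  static void primIncrementTop(void) { globalSP[0] += 1; }

  int main(void) {
      oop *sp = stack;              /* sp cached in a local/register */
      PrimitiveFn prim = primIncrementTop;

      *sp = 41;
      globalSP = sp;                /* externalize: store the cached sp */
      prim();                       /* the indirect call itself */
      sp = globalSP;                /* internalize: reload the cached sp */

      printf("%ld\n", *sp);         /* prints 42 */
      return 0;
  }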

> > But then, unless there's a _need_ for it I don't see any
> > point wasting effort to make things
> > infinitely fast in places where it doesn't count ;-)
> Depends on what the cost is really, doesn't it. If, for example I were
> to just do it there really wouldn't be any meaningful cost. Nobody is
> paying for _any_ of this stuff right now. At least, not to my bank a/c
> and I couldn't care less about anyone else's at this stage of 
> the game.

Well, okay, but my feeling is still that we should be careful about touching
areas we have crafted for years just because it's fun. Not that I mind if
someone's really into it; it's just that if you do it you need to be aware
that it may turn out to be a wrong path and might not get adopted (see the
mcache changes, which were discussed and rejected).

> > Not sure I understand this. Can you elaborate?
> Sure; consider sending some version of #at:. We do the lookup, find a
> ref to prim 60 (or whatever) and cache 60. When we execute the prim we
> have to check for the format and do the right thing. Every damn time.

Ah, I see (btw, Ian dropped me an off-list note explaining exactly that).
Yes, this would most definitely be useful. And given the typical number of
times these prims get invoked, the savings could potentially even outweigh
any extra overhead introduced by the indirect function calls.
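
In other words, instead of caching the prim number and re-checking the
receiver's format on every invocation, the lookup could cache a function
pointer specialized for the format it found. A toy sketch - the names and
the format test are entirely made up:

  #include <stdio.h>

  typedef void (*PrimitiveFn)(void);

  static void primAtBytes(void)    { puts("at: on a byte object"); }
  static void primAtPointers(void) { puts("at: on a pointer object"); }

  enum Format { FormatBytes, FormatPointers };

  /* Done once, when the cache entry is filled - not on every send. */
  static PrimitiveFn specializeAt(enum Format receiverFormat) {
      return receiverFormat == FormatBytes ? primAtBytes : primAtPointers;
  }

  int main(void) {
      PrimitiveFn cachedPrim = specializeAt(FormatBytes);

      cachedPrim();   /* later sends skip the format checks entirely */
      return 0;
  }

And since the format is a property of the receiver's class, the usual method
cache invalidation should cover the specialized pointer as well.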

> With some (usually trivial) code generation one could create a lot of
> the tiny prims as translated functionelles during startup and save a
> lot of ip/sp back and forth in some crucial places. Having a system
> that uses function ptrs in the lookup cache(s) makes it easier to use
> something like that.

Hm .... good point. Indeed, a VERY good point. In fact, that point is so
incredibly good that one may claim that _IF_ a VM has that overhead for
certain kinds of function calls, then that VM itself should implement
appropriate means to avoid it. For example, the Mac VM could use a teeny
weeny bit of assembly to call those prims _without_ the stupid Mac-doohickey
for cross-fragment calls, and rather have that bit of glue generated by,
e.g., the lookup mechanism in #ioFindFooBarIn. This would make sure that for
all prims which the lookup can identify as "local" we get the fastest
version theoretically possible, while still honoring the ABI in general.
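
Reduced to C, with the fragment test and the glue generation stubbed out
(the real thing would be CFM-specific assembly, so everything here is
hypothetical):

  #include <stdio.h>

  typedef void (*PrimitiveFn)(void);

  static void someLocalPrim(void) { puts("called directly"); }

  /* Placeholder: the real test would check whether fn lies inside this
     fragment's code section; here we simply pretend everything is local. */
  static int isInSameFragment(PrimitiveFn fn) { (void)fn; return 1; }

  /* Placeholder for the generated glue that would reload the base pointer
     and call through the transition vector for external fragments. */
  static PrimitiveFn makeCrossFragmentThunk(PrimitiveFn fn) { return fn; }

  /* At lookup time, bind each prim to the cheapest correct call once. */
  static PrimitiveFn bindPrimitive(PrimitiveFn fn) {
      return isInSameFragment(fn) ? fn : makeCrossFragmentThunk(fn);
  }

  int main(void) {
      PrimitiveFn prim = bindPrimitive(someLocalPrim);

      prim();   /* a local prim pays no cross-fragment overhead */
      return 0;
  }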

> The biggest thing we could do right now for performance is get
> something sorted out about a better send/return mechanism. It's not
> like we're short of examples. BHH had an excellent system and Ian did a
> faintly similar one for Squeak several years ago before even J3. AJH
> did an interesting if slightly convoluted one. VW, J5 etc all have
> interesting approaches. My guess is that we could get a major
> macrobenchmark improvement with something quite simple. 

Yes, I have been suspecting this for a long time - the benchmarks just don't
look right to me when I compare them against the actual code being executed.
I can see (at least on Intel) where the bytecode speed goes (mostly into
branch misprediction + fetchNextBytecode), but the time spent in send/return
just doesn't match what I have in front of my eyes. This was partly what
initiated the message I wrote to AJH.

Cheers,
  - Andreas


