YASJITCI -- Yet Another Squeak Just In Time Compiler Idea
Andrew Berg
andrew_c_berg at yahoo.com
Fri Mar 28 22:54:44 UTC 2003
I'm writing this for two reasons: (1) To get some expert opinion on the
feasibility of my idea, and (2) to distract this otherwise quite enjoyable
list from its semiannual not-quite-flamewar-over-licensing. (That's a
little joke, but it's probably only funny because it hurts so much.)
Anyhow, this idea occurred to me about a month ago, and so far I have not
figured out why it would not work, so I'm posting it up here for you all to
shoot holes in so I can get back to my day job.
Write the JIT compiler in Squeak itself. This is similar to what Bryce has
been working on with Exupery, but I propose that simplicity is the order of
the day. FORTH, where I spent a reasonable amount of my misspent youth, is
pretty darn fast at runtime because all of the subroutine calls are
lightweight, and the compiled "methods" are natively executable. I don't
think we can make the method calls as lightweight as FORTH, nor do we
really want to. But we can write compiled methods that are natively
executable. The Exupery way is to write a real compiler, and there is my
way (which I make no bones about: It will not perform as well) which is to
have a table of bytecode implementations, which just get chained together
between some somewhat generic prefix/postfix blocks, and then get the
various jump addresses fixed up. For message sends, it seems obvious that
we want to inline when the receiver's implementation is short enough and we
have high enough confidence that it is the right one, to cache the
receiver's implementation address when it is not, and to emit fallback code
for the rest.
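The cache-the-receiver's-address idea above can be sketched as a monomorphic
inline cache. This is a simulation with C function pointers rather than real
patched machine code, and all the names (SendSite, lookup_method, and so on)
are mine, not Squeak's:

```c
#include <stddef.h>

/* Hypothetical class ids; a real VM would use class pointers. */
typedef int ClassId;
typedef struct { ClassId cls; int value; } Obj;
typedef int (*Method)(Obj *receiver);

/* The slow path: a full method lookup (toy table defined below). */
static Method lookup_method(ClassId cls);

/* One send site remembers the last receiver class it saw and the
   implementation address that lookup resolved to. */
typedef struct {
    ClassId cached_cls;   /* class the cache was filled for */
    Method  cached_impl;  /* its implementation address     */
} SendSite;

static int send(SendSite *site, Obj *receiver) {
    if (site->cached_impl && receiver->cls == site->cached_cls)
        return site->cached_impl(receiver);   /* fast path: guard + call */
    Method m = lookup_method(receiver->cls);  /* miss: full lookup       */
    site->cached_cls  = receiver->cls;
    site->cached_impl = m;                    /* refill the cache        */
    return m(receiver);
}

/* Toy method table for two classes. */
static int double_it(Obj *r) { return r->value * 2; }
static int negate_it(Obj *r) { return -r->value; }
static Method lookup_method(ClassId cls) {
    return cls == 0 ? double_it : negate_it;
}
```

In real generated code the guard and the cached address would be patched into
the instruction stream at the send site; the struct here just stands in for
those two patched words.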
Proper register allocation is an important thing that I am almost
completely throwing away. Years ago, when I wrote a FORTH compiler, it
worked by emitting little chunks of code which were hand-written in
assembly. My initial implementation assumed that at each chunk boundary
there would be the top three stack elements in three specific registers,
and that the pointer to the rest of the data stack lived in another
register. What I quickly found was that it was often convenient to also be
able to have one, two or four top values in registers, so I tried again
with a slightly more complicated table where I kept track of how many stack
elements were in registers and then provided multiple implementations of
each basic word (think: bytecode) with various numbers of values in
registers as preconditions, and with the most efficient number for the word
as the postcondition. Along with a little table of nop instructions for
filling holes in my hand-written table, it worked pretty well. My gut
feeling is that this approach will give us XX% of what a full blown
register allocation scheme will, for xx% the cost, where XX and xx are
probably something around 85 and 15 respectively, just like everything
else.
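The stack-caching scheme described above can be sketched like this. To keep
it short, instead of providing one hand-written variant of each word per
cached count, this simulation adjusts the cache with explicit spill/fill
steps until the word's precondition holds; the effect on the stack is the
same, and everything here (Vm, spill_one, the opcodes) is my invention:

```c
#include <stddef.h>

enum { MAXCACHE = 2 };

/* Interpreter state: a memory stack plus up to MAXCACHE top-of-stack
   values held in "registers" (an array standing in for real registers). */
typedef struct {
    int mem[32]; int sp;             /* spilled part of the stack */
    int reg[MAXCACHE]; int cached;   /* top-of-stack cache        */
} Vm;

static void spill_one(Vm *v) {  /* deepest cached value goes to memory */
    v->mem[v->sp++] = v->reg[0];
    for (int i = 1; i < v->cached; i++) v->reg[i - 1] = v->reg[i];
    v->cached--;
}

static void fill_one(Vm *v) {   /* reload one value from memory */
    for (int i = v->cached; i > 0; i--) v->reg[i] = v->reg[i - 1];
    v->reg[0] = v->mem[--v->sp];
    v->cached++;
}

enum { OP_PUSH, OP_ADD };

/* Each word has a precondition on the number of cached values:
   PUSH needs a free register, ADD needs its two operands cached. */
static void exec(Vm *v, int op, int arg) {
    if (op == OP_PUSH) {
        if (v->cached == MAXCACHE) spill_one(v);  /* free a register */
        v->reg[v->cached++] = arg;                /* post: cached + 1 */
    } else { /* OP_ADD */
        while (v->cached < 2) fill_one(v);        /* pre: 2 cached    */
        v->reg[v->cached - 2] += v->reg[v->cached - 1];
        v->cached--;                              /* post: 1 result   */
    }
}
```

A compiler doing this statically, as described above, would know the cached
count at every chunk boundary and so could pick the matching variant at
compile time, emitting no spill/fill code at all on the common path.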
What will it do for us performancewise?
Well, on deeply pipelined superscalar CPUs, the worst thing you can do is
to introduce a bubble in the pipe, and the easiest way to introduce a big
bubble is to miss a branch prediction. Strictly speaking, a memory access
that misses all the way out to the swapfile is probably bigger, but let's
assume we can't do much about that case. Right now, we are probably
suffering a branch mispredict on every bytecode, since the case statement
implementing the bytecode interpreter is branching to all of the bytecodes
from a single branch instruction. By naively chaining together sequences
of machine code, without a doubt we will be generating more instructions
than necessary, but with multiple instructions being dispatched per clock,
the cost of these is quite low in comparison.
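The single-branch problem described above is why threaded dispatch helps even
before full compilation: give each bytecode handler its own indirect branch,
so the predictor gets one entry per handler instead of one shared entry for
the whole switch. A minimal sketch using the GNU C labels-as-values
extension (so this needs GCC or Clang; the toy opcodes are mine):

```c
/* Direct-threaded dispatch: every handler ends with its own
   "goto *labels[*pc++]", so each bytecode contributes its own
   predictable indirect branch, instead of every dispatch funnelling
   through the one indirect branch of a switch statement. */
enum { OP_INC, OP_DEC, OP_HALT };

static int run(const unsigned char *code) {
    static void *labels[] = { &&op_inc, &&op_dec, &&op_halt };
    int acc = 0;
    const unsigned char *pc = code;
    goto *labels[*pc++];                 /* initial dispatch     */
op_inc:  acc++; goto *labels[*pc++];     /* per-handler dispatch */
op_dec:  acc--; goto *labels[*pc++];
op_halt: return acc;
}
```

Chunk-chaining goes one step further still: the dispatch branches disappear
entirely, because the handler bodies are simply laid out one after another.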
What will it require for implementation?
Well, as I see it, code in this implementation has three levels, and maybe
more. Source code, compiled to bytecode and compiled to machine code are
the obvious ones. In order to produce receiver class metrics for message
send optimizing and inlining, it will probably be handy to add a fourth
one, between bytecodes and machine code. Whether this layer should be
bytecode or machine code, I don't really know. If message recipients'
implementation addresses are to be placed directly into the machine code as
jump addresses, then they need special handling from the VM to prevent
movement during GC. I thought maybe if the jump addresses were stored in
the constants table, then the VM would know to update them and everything
would be fine. Yeah, there is an extra lookup for the jump indirect, but
as long as branch prediction is usually correct it is probably OK.
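The constants-table indirection suggested above can be sketched like this:
the compiled caller holds only an index into its literals table, the actual
call loads the target address through that slot, and when the GC moves a
method it rewrites the one slot while the caller's code stays untouched.
All the names here are hypothetical:

```c
#include <stddef.h>

typedef int (*Code)(int);

enum { NLITS = 8 };
static Code literals[NLITS];   /* the method's literal/constants table */

/* What the compiled send would do: one extra load, then jump indirect. */
static int call_through(int lit_index, int arg) {
    return literals[lit_index](arg);
}

/* Toy "methods" standing in for jitted code at two different addresses. */
static int impl_at_old_address(int x) { return x + 1; }
static int impl_at_new_address(int x) { return x + 1; }

/* What the GC would do after moving the method's code: update the slot. */
static void gc_relocate(int lit_index, Code new_addr) {
    literals[lit_index] = new_addr;
}
```

As the text says, the extra load costs something, but since the slot rarely
changes, the indirect branch itself should stay well predicted.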
As I see it, the thing to do is to add a second compiled-to-bytecode layer
to gather receiver classes. Could we dedicate a separate set of send
bytecodes for this? Once the receivers are known, generate new
bytecode which uses a new version of send something like 'sendMessage w to
x ifItsClassMatches y returningTo z' and leaves a new return address which
skips over the code for a regular, unknown send. Then this bytecode is
what we use chunk-chaining to create executable machine code for.
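The two-tier plan above can be sketched as follows: a profiling send counts
receiver classes at each site, and once one class has been seen often enough
the site is rewritten into a guarded send that jumps straight to the known
implementation, falling back to a full lookup when the class check fails.
This is a simulation under assumed names (ProfiledSite, THRESHOLD, and the
toy lookup are all mine):

```c
typedef int ClassId;
typedef struct { ClassId cls; int value; } Obj;
typedef int (*Method)(Obj *);

enum { NCLASSES = 4, THRESHOLD = 8 };

typedef struct {
    int counts[NCLASSES];  /* tier 1: receiver-class histogram        */
    int specialized;       /* has this site been rewritten yet?       */
    ClassId guard_cls;     /* tier 2: the expected class ...          */
    Method  guard_impl;    /* ... and its cached implementation       */
} ProfiledSite;

static Method full_lookup(ClassId cls);

static int profiled_send(ProfiledSite *s, Obj *r) {
    if (s->specialized) {
        if (r->cls == s->guard_cls)
            return s->guard_impl(r);          /* guarded fast send */
        return full_lookup(r->cls)(r);        /* fallback code     */
    }
    if (++s->counts[r->cls] >= THRESHOLD) {   /* enough evidence: rewrite */
        s->specialized = 1;
        s->guard_cls  = r->cls;
        s->guard_impl = full_lookup(r->cls);
    }
    return full_lookup(r->cls)(r);
}

/* Toy method table for two classes. */
static int twice(Obj *r)  { return r->value * 2; }
static int thrice(Obj *r) { return r->value * 3; }
static Method full_lookup(ClassId cls) { return cls == 0 ? twice : thrice; }
```

In the scheme proposed above, the rewrite step would emit the new
'sendMessage ... ifItsClassMatches ...' bytecode rather than flip a flag,
and that specialized bytecode is what the chunk-chainer would then turn into
machine code.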
Ok, there's the idea. Does anyone out there understand where I am trying to
go with this? Can anyone out there tell me why it is a dumb idea?
By the way, the last project I spent about 2 months worth of evenings
working on was embedding Mozilla into Squeak. I gave up because I realized
that I really don't want to fiddle with Mozilla. It is a big,
ugly, complicated beast, and it is written in C++. I use C++ at work
because they pay me to. The last thing I want to do is to use it at home.
:)
-andrew
--
andrew_c_berg at yahoo.com