YASJITCI -- Yet Another Squek Just In Time Compiler Idea - Squeak-dev

28 Mar 2003


      I'm writing this for two reasons:  (1) To get some expert opinion on the 
feasibility of my idea, and (2) to distract this otherwise quite enjoyable 
list from its semiannual not-quite-flamewar-over-licensing.  (That's a 
little joke, but it's probably only funny because it hurts so much.)
Anyhow, this idea occurred to me about a month ago, and so far I have not 
figured out why it would not work, so I'm posting it up here for you all to 
shoot holes in so I can get back to my day job.
Write the JIT compiler in Squeak itself.  This is similar to what Bryce has 
been working on with Exupery, but I propose that simplicity is the order of 
the day.  FORTH, where I spent a reasonable amount of my misspent youth, is 
pretty darn fast at runtime because all of the subroutine calls are 
lightweight, and the compiled "methods" are natively executable.  I don't 
think we can make the method calls as lightweight as FORTH, nor do we 
really want to.  But we can write compiled methods that are natively 
executable.  The Exupery way is to write a real compiler, and there is my 
way (which I make no bones about:  It will not perform as well) which is to 
have a table of bytecode implementations, which just get chained together 
between some somewhat generic prefix/postfix blocks, and then get the 
various jump addresses fixed up.  For message sends, it seems obvious that 
we want to inline when the receiver's implementation is short enough, and 
we have a high enough confidince that it is the right one, to cache the 
receiver's implementation address when it is not, and to emit fallback-code 
for the rest.
Proper register allocation is an important thing that I am almost 
completely throwing away.  Years ago, when I wrote a FORTH compiler, it 
worked by emitting little chunks of code which were hand-written in 
assembly.  My initial implementation assumed that at each chunk boundary 
there would be the top three stack elements in three specific registers, 
and that the pointer to the rest of the data stack lived in another 
register.  What I quickly found was that it was often convenient to also be 
able to have one, two or four top values in registers, so I tried again 
with a slightly more complicated table where I kept track of how many stack 
elements were in registers and then provided multiple implementations of 
each basic word (think: bytecode) with various numbers of values in 
registers as preconditions, and with the most efficient number for the word 
as the postcondition.  Along with a little table of nop instructions for 
filling holes in my hand-written table, it worked pretty good.  My gut 
feeling is that this approach will give us XX% of what a full blown 
register allocation scheme will, for xx% the cost, where XX and xx are 
probably something around 85 and 15 respectively, just like everything 
else.
What will it do for us performancewise?
Well, on deeply pipelined superscalar cpu's, the worst thing you can do is 
to introduce a bubble in the pipe, and the easiest way to introduce a big 
bubble is to miss a branch prediction.  Strictly speaking, a memory access 
that misses all the way out to the swapfile is probably bigger, but let's 
assume we can't do much about that case.  Right now, we are probably 
suffering a branch mispredict on every bytecode, since the case statement 
implementing the bytecode interpreter is branching to all of the bytecodes 
from a single branch instruction.  By naively chaining together sequences 
of machine code, without a doubt we will be generating more instructions 
than necessary, but with multiple instructions being dispatched per clock, 
the cost of these is quite low in comparison.
What will it require for implementation?
Well, as I see it, code in this implementation has three levels, and maybe 
more.  Source code, compiled to bytecode and compiled to machine code are 
the obvious ones.  In order to produce receiver class metrics for message 
send optimizing and inlining, it will probably be handy to add a fourth 
one, between bytecodes and machine code.  Whether this layer should be 
bytecode or machine code, I don't really know.  If message recipient's 
implementation addresses are to be placed directly into the machine code as 
jump addresses, then they need special handling from the VM to prevent 
movement during GC.  I thought maybe if the jump addresses were stored in 
the constants table, then the VM would know to update them and everything 
would be fine.  Yeah, there is an extra lookup for the jump indirect, but 
as long as branch prediction is usually correct it is probably OK.
As I see it, the thing to do is to put a second layer of compiled to 
bytecode to gather receiver classes.  Could we dedicate a seperate set of 
send bytecodes for this?  Once the receivers are known, generate new 
bytecode which uses a new version of send something like 'sendMessage w to 
x ifItsClassMatches y returningTo z' and leaves a new return address which 
skips over the code for a regular, unknown send.  Then this bytecode is 
what we use chunk-chaining to create executable machine code for.
Ok, there's the idea.  Anyone out there understand where I am trying to go 
with this?  Anyone out there can tell me why it is a dumb idea?
By the way, the last project I spent about 2 months worth of evenings 
working on was embedding Mozilla into Squeak.  I gave up because:  I 
realized that I really don't want to fiddle with Mozilla.  It is a big, 
ugly, complicated beast, and it is written in C++.  I use C++ at work 
because they pay me to.  The last thing I want to do is to use it at home.  
:)
-andrew
-- 
andrew_c_berg@yahoo.com