I'm writing this for two reasons: (1) To get some expert opinion on the feasibility of my idea, and (2) to distract this otherwise quite enjoyable list from its semiannual not-quite-flamewar-over-licensing. (That's a little joke, but it's probably only funny because it hurts so much.)
Anyhow, this idea occurred to me about a month ago, and so far I have not figured out why it would not work, so I'm posting it up here for you all to shoot holes in so I can get back to my day job.
Write the JIT compiler in Squeak itself. This is similar to what Bryce has been working on with Exupery, but I propose that simplicity is the order of the day. FORTH, where I spent a reasonable amount of my misspent youth, is pretty darn fast at runtime because all of the subroutine calls are lightweight, and the compiled "methods" are natively executable. I don't think we can make the method calls as lightweight as FORTH, nor do we really want to. But we can write compiled methods that are natively executable. The Exupery way is to write a real compiler, and there is my way (which I make no bones about: It will not perform as well) which is to have a table of bytecode implementations, which just get chained together between some somewhat generic prefix/postfix blocks, and then get the various jump addresses fixed up. For message sends, it seems obvious that we want to inline when the receiver's implementation is short enough, and we have a high enough confidince that it is the right one, to cache the receiver's implementation address when it is not, and to emit fallback-code for the rest.
Proper register allocation is an important thing that I am almost completely throwing away. Years ago, when I wrote a FORTH compiler, it worked by emitting little chunks of code which were hand-written in assembly. My initial implementation assumed that at each chunk boundary there would be the top three stack elements in three specific registers, and that the pointer to the rest of the data stack lived in another register. What I quickly found was that it was often convenient to also be able to have one, two or four top values in registers, so I tried again with a slightly more complicated table where I kept track of how many stack elements were in registers and then provided multiple implementations of each basic word (think: bytecode) with various numbers of values in registers as preconditions, and with the most efficient number for the word as the postcondition. Along with a little table of nop instructions for filling holes in my hand-written table, it worked pretty good. My gut feeling is that this approach will give us XX% of what a full blown register allocation scheme will, for xx% the cost, where XX and xx are probably something around 85 and 15 respectively, just like everything else.
What will it do for us performancewise?
Well, on deeply pipelined superscalar cpu's, the worst thing you can do is to introduce a bubble in the pipe, and the easiest way to introduce a big bubble is to miss a branch prediction. Strictly speaking, a memory access that misses all the way out to the swapfile is probably bigger, but let's assume we can't do much about that case. Right now, we are probably suffering a branch mispredict on every bytecode, since the case statement implementing the bytecode interpreter is branching to all of the bytecodes from a single branch instruction. By naively chaining together sequences of machine code, without a doubt we will be generating more instructions than necessary, but with multiple instructions being dispatched per clock, the cost of these is quite low in comparison.
What will it require for implementation?
Well, as I see it, code in this implementation has three levels, and maybe more. Source code, compiled to bytecode and compiled to machine code are the obvious ones. In order to produce receiver class metrics for message send optimizing and inlining, it will probably be handy to add a fourth one, between bytecodes and machine code. Whether this layer should be bytecode or machine code, I don't really know. If message recipient's implementation addresses are to be placed directly into the machine code as jump addresses, then they need special handling from the VM to prevent movement during GC. I thought maybe if the jump addresses were stored in the constants table, then the VM would know to update them and everything would be fine. Yeah, there is an extra lookup for the jump indirect, but as long as branch prediction is usually correct it is probably OK.
As I see it, the thing to do is to put a second layer of compiled to bytecode to gather receiver classes. Could we dedicate a seperate set of send bytecodes for this? Once the receivers are known, generate new bytecode which uses a new version of send something like 'sendMessage w to x ifItsClassMatches y returningTo z' and leaves a new return address which skips over the code for a regular, unknown send. Then this bytecode is what we use chunk-chaining to create executable machine code for.
Ok, there's the idea. Anyone out there understand where I am trying to go with this? Anyone out there can tell me why it is a dumb idea?
By the way, the last project I spent about 2 months worth of evenings working on was embedding Mozilla into Squeak. I gave up because: I realized that I really don't want to fiddle with Mozilla. It is a big, ugly, complicated beast, and it is written in C++. I use C++ at work because they pay me to. The last thing I want to do is to use it at home. :)
-andrew
Andrew Berg andrew_c_berg@yahoo.com wrote: [SNIP of JIT-ideas that others like Markus Denker, Ian Piumarta etc knows much more about]
By the way, the last project I spent about 2 months worth of evenings working on was embedding Mozilla into Squeak. I gave up because: I
Interesting. Wasn't that what Exobox did too? I think so. Anyway, even if it didn't work out it would be interesting to hear the story. At least we could avoid others trying to do the same! :-)
regards, Göran
I tried to send this earlier, but I got a mailer error, so let's try again:
On Sat, 29 Mar 2003 21:33:32 +0100, goran.hultgren@bluefish.se wrote:
Andrew Berg andrew_c_berg@yahoo.com wrote: [SNIP of JIT-ideas that others like Markus Denker, Ian Piumarta etc knows much more about]
By the way, the last project I spent about 2 months worth of evenings working on was embedding Mozilla into Squeak. I gave up because: I
Interesting. Wasn't that what Exobox did too? I think so. Anyway, even if it didn't work out it would be interesting to hear the story. At least we could avoid others trying to do the same! :-)
regards, Göran
There's really not much of a story. I started into it, made a primitive for calling the XPCOM's version of IDispatch (if that does not make sense to you, don't worry about it) and started writing the code to parse out their type library. About then I realized that there was a fundamental flaw in what I was planning to do. I was planning to embed an X window into Squeak, so that Mozilla's output could show up inside of Squeak, and then use XPCOM to control it from Squeak. So far, so good, but it would not really look much like it was running inside of Squeak. It would really be running on top of Squeak, so Squeak windows would not be able to overlap the Mozilla windows, etc. Also, the display managment aspect would only work for X. I could probably cobble together something for Windows as well, but I would have ended up writing a bunch of platform-specific code in order to use a "platform independant" browser along with a platform independant language. And it would not have ended up even working that well together. Plus, in retrospect, I probably should have written a C wrapper for the XPCOM version of IDispatch so that I could call it with plain old FFI. Also, how to write the part to allow XPCOM components to send messages to Smalltalk objects was unclear in my head, which meant that it was probably going to end up being a bad design, and I would have gotten to rewrite it a couple of times in order to figure out how not to do it. That's fine for stuff I'm getting paid to do, but for freebie/volunteer stuff I'd prefer to be able to finish it. Right now I am (absurdly enough) contemplating: 1) Implementing (at least a working subset of) SSL in PureSqueak. 2) Fiddling around with the VM and trying to write a JIT (almost) completely in Squeak. Well, it would be completely in Squeak, I guess, but a little bit of it would be in slang and influencing the VM, but most of it would just be Plain Old Smalltalk. 3) Embedding Perl into Squeak using FFI and libperl.so (my Linux box only has libperl.a, but I assume that with a little massaging I can make a libperl.so). Why would I want to do that? Well, DBI is an example of something that Perl has (and has a lot of) that IMHO Squeak could use a bit more of. It would be real whizzy if I could make Perl proxy objects which somehow tie back to Squeak objects, but that does not look quite as easy. Maybe a real Perl hacker could show me how easy it really is.
The trouble is that any one of these is a large enough project that I'd be working on it for months or years before it was usable, by which time Squeak would probably be up to version 5 or 6 at the rate things go! -andrew
squeak-dev@lists.squeakfoundation.org