2009/5/14 Jecel Assumpcao Jr jecel@merlintec.com:
Thanks Eliot and Igor for your comments!
Eliot is right that what Igor wrote is a good way to do a JIT, but the problem is that my hardware is essentially an interpreter and must deal at runtime with expressions that the JIT would optimize away.
Yes, I proposed it for the JIT only. For a bytecode interpreter, such a separation could really be much less effective.
The bad part is that testing the idea (how much more, or less, effective it is compared to a single stack) requires a huge implementation effort. So it is a bit risky to spend many hours changing the code generator and data formats only to discover that, in practice, it gives much less significant benefits than expected :)
My design is described in:
http://www.squeakphone.com:8203/seaside/pier/siliconsqueak/details
What is missing from that description is what the control registers (x0 to x31) do, and that depends on the stack organization. Certainly it is possible to have three registers as Eliot suggested. A separate control stack is not only used on Forth processors but was also popular in Lisp Machine designs.
Registers t0 to t29 are really implemented through a small amount of hardware as a range of words in the stack cache. Having to split this set differently for each method would make the hardware larger and slower. With a separate control stack the mapping is really simple.
About the JIT having to use call which pushes the PC, that is not true on RISC processors. But I suppose the focus for Cog is making Squeak fast on the x86.
There's a tension between implementing what the current compiler produces and implementing what the instruction set defines. For example should one assume arguments are never written to? I lean on the side of implementing the instruction set.
That is an especially good idea if the same VM ends up being used for other languages like Newspeak. Certainly the Java VM runs many languages that the original designers never expected.
Yes. In the JIT an interpreted frame needs an extra field to hold the saved bytecode instruction pointer when an interpreted frame calls a machine code frame because the return address is the "return to interpreter trampoline" pc. There is no flag word in a machine code frame. So machine code frames save one word w.r.t. the stack vm and interpreted frames gain a word. But most frames are machine code ones so most of the time one is saving space.
Ok, so the JIT VM will still have an interpreter. Self originally didn't have one but most of the effort that David Ungar put into the project when it was restarted was making it more interpreter friendly. The bytecodes became very similar to the ones in Little Smalltalk, for example.
Will images be compatible between the JIT VM and the Stack VM? Or do you expect the latter to not be used anymore once the JIT is available? I had originally understood that the Stack VM would be compatible with older images (since you divorce all frames on save and remarry them on reload) but I had missed the detail of the different bytecodes for instance variable access in the case of Context objects.
I guess that in hardware you can create an instruction that will load a descriptor register as part of the return sequence in parallel with restoring the frame pointer and method so one would never indirect through the frame pointer to fetch the flags word; instead it would be part of the register state. But that's an extremely uneducated guess :)
Well, I am trying to avoid having a flags word even though in hardware it is so easy to have any size fields that you might want. I can check if context == nil very efficiently. For methods, t0 is the same value as the "self" register (x4, for example) while for blocks it is different. And with three pointers (fp, sp and control pointer) I shouldn't need to keep track of the number of arguments.
Jecel can also design the machine to avoid taking interrupts on the operand stack and provide a separate interrupt stack.
Hmmm... it has been a while since I designed hardware with interrupts, but I have normally used Alto style coroutines instead. The stack cache is divided up into 32 word blocks and can hold parts of stacks from several threads at once. Checking for overflow/underflow only needs to happen when the stack pointer moves from one block to a different one (and even then, only in certain cases which aren't too common). An interesting feature of this scheme is that only 5 bit adders are needed. These are much faster than 16 or 32 bit adders, for example; wide adders could reduce the clock speed or make the operation take an extra clock. Another detail is that having operand or control frames split between two stack pages is no problem at all.
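To make the block-crossing check concrete, here is a minimal software sketch of it. It assumes the pointer is kept as a 5 bit offset within a 32 word block and that a single step moves by fewer than 32 words; the names step_pointer, BLOCK_WORDS and OFFSET_MASK are made up for illustration and are not part of the design.

```python
BLOCK_WORDS = 32               # 32-word blocks, as described in the text
OFFSET_MASK = BLOCK_WORDS - 1  # low 5 bits select a word inside a block


def step_pointer(offset, delta):
    """Move a 5-bit in-block offset by delta words (hypothetical helper).

    Returns (new_offset, crossed). 'crossed' plays the role of the carry or
    borrow out of a 5-bit adder: it is True only when the pointer leaves its
    current block, which is the only time an overflow/underflow check would
    be needed. Assumes abs(delta) < BLOCK_WORDS.
    """
    if not 0 <= offset < BLOCK_WORDS:
        raise ValueError("offset must fit in 5 bits")
    if abs(delta) >= BLOCK_WORDS:
        raise ValueError("sketch only handles steps smaller than one block")
    total = offset + delta
    new_offset = total & OFFSET_MASK   # what the 5-bit adder would produce
    crossed = (total >> 5) != 0        # carry (going up) or borrow (going down)
    return new_offset, crossed
```

For example, step_pointer(5, 3) stays inside the block, while step_pointer(30, 3) and step_pointer(1, -2) both report a crossing.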
address of tN in the stack cache:

    raddr := fp[4:0] + N.
    scaddr := (raddr[5] ? tHigh : tLow) , raddr[4:0].
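The same address computation can be written as a small software model. This is only a sketch of the two assignments above: temp_address is a hypothetical name, fp is taken as a plain integer whose low 5 bits are the in-block offset, and t_low / t_high are assumed to be 5 bit block numbers.

```python
def temp_address(fp, n, t_low, t_high):
    """Stack-cache word address of temporary tN (illustrative model only).

    fp      frame pointer; only bits 4:0 are used here
    n       temporary index, 0..29 for t0..t29
    t_low   block number holding the lower part of the frame
    t_high  block number holding the part that spills into the next block
    """
    if not 0 <= n <= 29:
        raise ValueError("only t0..t29 exist")
    raddr = (fp & 0x1F) + n                 # 6-bit sum: fp[4:0] + N
    spilled = (raddr >> 5) & 1              # raddr[5]: did we run past the block?
    block = t_high if spilled else t_low    # raddr[5] ? tHigh : tLow
    return (block << 5) | (raddr & 0x1F)    # concatenate block number and raddr[4:0]
```

With fp offset 30, tLow = 3 and tHigh = 7, t0 lands at word 30 of block 3 and t5 lands at word 3 of block 7, so a frame can straddle two blocks that are not adjacent in the cache.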
When fp[5] changes value, then tLow := tHigh and tHigh := head of free block list (if fp was going up). If there are no free blocks, then some have to be flushed to their stack pages in main memory. When going down, tLow is loaded from a linked list, which might have to be extended by loading blocks from stack pages. With a 4KB stack cache, for example, there are 32 blocks with 32 words each and so block 0 can dedicate a word for each of the other 31 blocks. The bottom 7 bits of that word (only 5 actually needed, but it is nice to have a little room to grow) can form the "previous" linked list (tHigh and tLow would also be 5 bits wide) while the remaining bits can hold the block's address in main memory.
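A rough sketch of how one of those per-block words in block 0 might be packed follows. The text only says that the bottom 7 bits hold the "previous" link and the remaining bits hold the block's main memory address; the exact encoding here is an assumption. In particular, it assumes 32 bit words and blocks aligned to 128 bytes in main memory, so that the low 7 bits of the address are zero and can be reused for the link. The names pack_entry and unpack_entry are hypothetical.

```python
WORD_BYTES = 4                      # assumption: 32-bit words
BLOCK_WORDS = 32                    # from the text
CACHE_BYTES = 4 * 1024              # the 4KB example from the text
NUM_BLOCKS = CACHE_BYTES // (WORD_BYTES * BLOCK_WORDS)

PREV_BITS = 7                       # bottom 7 bits of the word, per the text
PREV_MASK = (1 << PREV_BITS) - 1


def pack_entry(mem_addr, prev_block):
    """Combine a block's main-memory address with its 'previous' link.

    Assumes mem_addr is 128-byte aligned (an assumption of this sketch, not
    something the original message states).
    """
    if mem_addr & PREV_MASK:
        raise ValueError("address must be aligned to 128 bytes in this sketch")
    if not 0 <= prev_block <= PREV_MASK:
        raise ValueError("previous-block link must fit in 7 bits")
    return mem_addr | prev_block


def unpack_entry(word):
    """Split a table word back into (mem_addr, prev_block)."""
    return word & ~PREV_MASK, word & PREV_MASK
```

With these constants NUM_BLOCKS works out to 32, matching the 32 blocks of 32 words each mentioned above.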
This might seem far more complicated than a split arg/temp frame and it certainly would be if implemented in software. In hardware, it is mostly wires, a multiplexer and a small adder.
-- Jecel
vm-dev@lists.squeakfoundation.org