[Vm-dev] stack vm questions

Thu May 14 11:36:58 UTC 2009

Thanks Eliot and Igor for your comments!

Eliot is right that what Igor wrote is a good way to do a JIT, but the
problem is that my hardware is essentially an interpreter and must deal
at runtime with expressions that a JIT would optimize away.

My design is described in:

http://www.squeakphone.com:8203/seaside/pier/siliconsqueak/details

What is missing from that description is what the control registers (x0
to x31) do, and that depends on the stack organization. Certainly it is
possible to have three registers as Eliot suggested. A separate control
stack is not only used on Forth processors but was also popular in Lisp
Machine designs.

Registers t0 to t29 are really implemented through a small amount of
hardware as a range of words in the stack cache. Having to split this
set differently for each method would make the hardware larger and
slower. With a separate control stack the mapping is really simple.

As for the JIT having to use a call instruction that pushes the PC, that
is not true on RISC processors, which typically save the return address
in a link register instead. But I suppose the focus for Cog is making
Squeak fast on the x86.

> There's a tension between implementing what the current compiler
> produces and implementing what the instruction set defines.  For
> example should one assume arguments are never written to?  I lean
> on the side of implementing the instruction set.

That is especially a good idea if the same VM ends up being used for
other languages like Newspeak. Certainly the Java VM runs many languages
that the original designers never expected.

> Yes.  In the JIT an interpreted frame needs an extra field to hold
> the saved bytecode instruction pointer when an interpreted frame
> calls a machine code frame because the return address is the "return
> to interpreter trampoline" pc.  There is no flag word in a machine
> code frame.  So machine code frames save one word w.r.t. the
> stack vm and interpreted frames gain a word.  But most frames
> are machine code ones so most of the time one is saving space.

Ok, so the JIT VM will still have an interpreter. Self originally didn't
have one but most of the effort that David Ungar put into the project
when it was restarted was making it more interpreter friendly. The
bytecodes became very similar to the ones in Little Smalltalk, for
example.

Will images be compatible between the JIT VM and the Stack VM? Or do you
expect the latter to not be used anymore once the JIT is available? I
had originally understood that the Stack VM would be compatible with
older images (since you divorce all frames on save and remarry them on
reload) but I had missed the detail of the different bytecodes for
instance variable access in the case of Context objects.

> I guess that in hardware you can create an instruction that will
> load a descriptor register as part of the return sequence in parallel
> with restoring the frame pointer and method so one would never
> indirect through the frame pointer to fetch the flags word; instead
> it would be part of the register state.  But that's an extremely
> uneducated guess :)

Well, I am trying to avoid having a flags word even though in hardware
it is so easy to have any size fields that you might want. I can check
if context == nil very efficiently. For methods, t0 is the same value as
the "self" register (x4, for example) while for blocks it is different.
And with three pointers (fp, sp and control pointer) I shouldn't need to
keep track of the number of arguments.

> Jecel can also design the machine to avoid taking interrupts on
> the operand stack and provide a separate interrupt stack.

Hmmm... it has been a while since I designed hardware with interrupts,
and I have normally used Alto-style coroutines instead. The stack cache
is divided up into 32-word blocks and can hold parts of stacks from
several threads at once. Checking for overflow/underflow only needs to
happen when the stack pointer moves from one block to a different one
(and even then, only in certain cases which aren't too common). An
interesting feature of this scheme is that only 5-bit adders are needed,
which are much faster than 16- or 32-bit adders, for example. Wide
adders could reduce the clock speed or make the operation take an extra
clock. Another detail is that having operand or control frames split
between two stack pages is no problem at all.
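
A rough software model of the 5-bit adder point, in Python purely for
illustration. The function name is hypothetical, and treating the
adder's carry or borrow as the "moved to a different block" signal is an
assumption, not something stated in the design:

```python
BLOCK_WORDS = 32            # words per stack cache block
OFFSET_MASK = BLOCK_WORDS - 1   # low 5 bits select a word inside a block


def step_pointer(offset, delta):
    """Model a 5-bit add on the in-block offset of a stack pointer.

    Returns (new_offset, crossed). `crossed` is True when the sum left
    the 0..31 range, i.e. the pointer moved into a neighbouring block.
    Assumption: only then would an overflow/underflow check be needed.
    """
    total = offset + delta
    crossed = total < 0 or total > OFFSET_MASK
    return total & OFFSET_MASK, crossed
```

Pushes and pops that stay inside a block never set `crossed`, so in this
model the common case involves nothing wider than the 5-bit sum.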

Address of tN in the stack cache:

  raddr := fp[4:0] + N.
  scaddr := (raddr[5] ? tHigh : tLow) , raddr[4:0].
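
The same two lines as a small Python sketch, for readers who prefer to
see the bit manipulation spelled out. The function and parameter names
are made up for illustration; the only inputs are the ones used above
(fp, N, tLow, tHigh), with tLow and tHigh taken to be 5-bit block
numbers:

```python
def tn_cache_address(fp, n, t_low, t_high):
    """Return the stack cache word address of temporary tN.

    fp      frame pointer (only bits 4:0 are used here)
    n       index of the temporary, 0..29 for t0 to t29
    t_low   block number holding the lower part of the frame
    t_high  block number holding the part that spills past word 31
    """
    raddr = (fp & 0x1F) + n              # 6-bit result of a 5-bit add
    block = t_high if (raddr >> 5) & 1 else t_low   # raddr[5] picks the block
    return (block << 5) | (raddr & 0x1F)  # block number , raddr[4:0]
```

For example, with fp[4:0] = 30, tLow = 3 and tHigh = 7, t0 lands at word
30 of block 3 while t5 wraps around to word 3 of block 7.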

When fp[5] changes value, then tLow := tHigh and tHigh := head of free
block list (if fp was going up). If there are no free blocks, then some
have to be flushed to their stack pages in main memory. When going down,
tLow is loaded from a linked list, which might have to be extended by
loading blocks from stack pages. With a 4KB stack cache, for example,
there are 32 blocks with 32 words each and so block 0 can dedicate a
word for each of the other 31 blocks. The bottom 7 bits of that word
(only 5 actually needed, but it is nice to have a little room to grow)
can form the "previous" linked list (tHigh and tLow would also be 5 bits
wide) while the remaining bits can hold the block's address in main
memory.
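
A sketch of how such a per-block word could be packed, again in Python
and only as an illustration. It assumes 32-bit words and that blocks are
128-byte aligned in main memory, so the low 7 bits of a block's byte
address are always zero and can be reused for the "previous" link. That
alignment is an assumption made here, not something the description
above states, and all names are hypothetical:

```python
WORD_BYTES = 4                      # assumption: 32-bit words
BLOCK_WORDS = 32
CACHE_BYTES = 4 * 1024
NUM_BLOCKS = CACHE_BYTES // (WORD_BYTES * BLOCK_WORDS)   # 32 blocks

LINK_BITS = 7                       # 5 needed, 7 leaves room to grow
LINK_MASK = (1 << LINK_BITS) - 1


def pack_block_word(prev_block, mem_addr):
    """Combine the 'previous' link and the main memory address."""
    if mem_addr & LINK_MASK:
        raise ValueError("block address must be 128-byte aligned")
    if not 0 <= prev_block <= LINK_MASK:
        raise ValueError("link does not fit in 7 bits")
    return mem_addr | prev_block


def unpack_block_word(word):
    """Split a block word back into (prev_block, mem_addr)."""
    return word & LINK_MASK, word & ~LINK_MASK
```

With these assumptions a block whose previous block is number 5 and
whose home in main memory is 0x1000 would be described by the word
0x1005.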

This might seem far more complicated than a split arg/temp frame and it
certainly would be if implemented in software. In hardware, it is mostly
wires, a multiplexer and a small adder.

-- Jecel