RISC42 history (was: Assembly Language) - Hardware

9 Oct 2007


      On Mon, 17 Sep 2007 21:10:44 -0300 I wrote:
...
The idea is to make this development as open as possible, with a public
version control system, a bug tracker and a blog. This would be a good
thing to implement in Seaside but perhaps I should start out with
existing solutions to get results faster?
I am thinking that it would be a good idea to create a "Plurion Maker"
application in Squeak rather than implementing this system directly in
Verilog (or VHDL, which I don't like as much but is popular in Europe
and in schools). You would set a few options with a nice interface and
then press a button to get one big HDL file plus one or two little
configuration ones.
This is much more work to implement initially, but much nicer to use in
the long run. Other processor cores have similar systems, sometimes
implemented as a bunch of Perl and Tcl scripts. On the Smalltalk side,
we have Idass as an inspiration (though this would be very specialized
compared to that):
http://www.xs4all.nl/~averschu/idass/
...
It would be interesting if people could look at the (extremely bare,
sorry) description of the instruction set (Matthew gave the link in his
email, but here it is again - http://www.merlintec.com:8080/hardware/32)
and give their opinions.
The processor design has been changing, so people who looked at it a
while back might not have seen the recent ideas. And it is hard to have
an opinion without knowing its history:
I started out (1998) trying to build the ideal hardware target for a
Self-style adaptive compilation system. The result was Tachyon which
could execute four MOVE instructions per clock as long as the compiler
could schedule them. It had interesting support for PICs with up to
three entries. There was dedicated hardware for the translation cache,
so this was very similar to what Transmeta later released.
By 2001 it was becoming obvious that the software side of the project
was very large, and with new FPGAs coming out with more internal memory
it became practical to implement several simpler processors (Plurion)
and get roughly the same performance as one big one. With bytecodes as
the native instruction set of these processors, the software side would
be more comparable to an interpreter. This would not rule out
sophisticated optimization later on, though targeting bytecodes would
make the adaptive compiler more like AOStA
(http://www.esug.org/data/ESUG2003/aosta.pdf) than Self, which is
actually a good thing.
Even though the bytecode processors were very similar to Forth
processors, which are very good for low level computing, I felt that a more
RISC-like processor would do a better job of handling i/o. So I tried
out various designs and the idea was that a Plurion system would be a
mix of stack processors for executing user code and RISC-like processors
for doing i/o. Having implemented things like serial ports in hardware I
had been impressed by how quickly such things grow as you add more and
more counters and registers. A processor implementing the same thing in
software could be smaller, since on board memory is so much denser than
random logic in modern FPGAs and a single ALU would be time multiplexed
to handle as many counters as you need.
In parallel with these two design efforts (bytecode processor and i/o
processor) I also looked at the possibility of doing a microcoded
processor for running Squeak. My own bytecodes were so much simpler than
Squeak's that they could be directly executed, but it was obvious that a
proper Squeak processor would look more like JOP (a Java processor
implemented in FPGAs - http://www.jopdesign.com/). The old Xerox D
machines and the Lisp Machines were also a good inspiration for this, of
course, but the pipelined design used in JOP is more interesting. I did
think that JOP was trying to save too much memory: the first pipeline
stage looks up the fetched bytecode in a 256 entry memory to get the
start address of the first microinstruction, just like the Dorado did.
Why not just waste a few words for each instruction and avoid an
indirection? Just jump directly to microinstruction BC*4 after fetching
bytecode BC and if you need more than four microinstructions you include
a jump (this will be a slow instruction anyway, so the jump won't hurt
much). Most bytecodes execute in one or two microinstructions and you
could use the leftover words for those entries that are longer. You can
also make the microcode memory larger than 1024 words and put the rest
of the large instructions in this overflow area.
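As a rough illustration of the two dispatch schemes just described, here is a minimal Python sketch. It assumes 256 bytecodes and four microcode words per slot, as in the text; the function names and the table contents are made up for the example and are not taken from JOP or from RISC42.

```python
SLOT_WORDS = 4                               # microcode words reserved per bytecode
NUM_BYTECODES = 256
OVERFLOW_BASE = SLOT_WORDS * NUM_BYTECODES   # first word past the fixed slots

def table_dispatch(start_table, bc):
    """Lookup style: an extra memory access yields the start address."""
    return start_table[bc]

def direct_dispatch(bc):
    """Direct style: the start address is simply the bytecode times four."""
    return bc * SLOT_WORDS   # in hardware this is just two appended zero bits

def needs_jump(length):
    """Routines longer than one slot must end the slot with a jump elsewhere."""
    return length > SLOT_WORDS
```

With these numbers the fixed slots fill exactly the first 1024 words, so any overflow area starts at address 1024.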
When RISCs first came out I used to joke that they were just CISCs that
used main memory as their microcode RAM. Adding instruction caches made
the two schemes even more similar. As the three designs (hardwired
bytecode stack processor, RISC-like i/o processor, microcoded Squeak
processor) evolved, I started thinking about taking this joke seriously
and creating a single merged design. A special small "bytecode mode" RAM
would dispatch efficiently to small RISC routines and the instruction
cache would hold code for longer bytecodes and native code for stuff
written in C and other languages. The instructions were kept short to
make compiling from bytecodes to native code less costly in terms of
memory and among the many inspirations were Jan Gray's designs and the
Data General Nova. The cascade instructions were introduced to make use
of the bypass hardware that all pipelined RISCs have anyway. They help
out when two address instructions would be awkward (they are often, but
not always, good enough). The skip instructions were inspired by the old
IBM Stretch (early 1960s supercomputer) design though they are rather
different from what had been done there.
The stack processors had inherited the PIC hardware from Tachyon. RISC42
got a more general PICMode extension to the instruction cache which is
not limited to three entries (it has no limit at all) as in previous
designs. All of my designs have had a Mushroom-style virtually addressed
data cache so that object tables could be used without the usual
overhead (tables are accessed on cache misses, not on every object field
access).
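The point about object tables can be shown with a toy model in Python. This is only a sketch under simplifying assumptions: the cache holds one word per entry, is keyed by (object id, field offset) and never evicts; the class and attribute names are invented for the example.

```python
class ObjectCache:
    """Toy virtually addressed data cache keyed by (object id, field offset)."""

    def __init__(self, object_table, memory):
        self.object_table = object_table   # object id -> physical base address
        self.memory = memory               # physical address -> word
        self.lines = {}                    # (oid, offset) -> cached word
        self.table_lookups = 0             # counts object table accesses

    def load(self, oid, offset):
        key = (oid, offset)
        if key not in self.lines:          # miss: only now consult the table
            self.table_lookups += 1
            base = self.object_table[oid]
            self.lines[key] = self.memory[base + offset]
        return self.lines[key]             # hit: no table access at all
```

Repeated loads of the same field leave `table_lookups` unchanged after the first access, which is the saving described above.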
Originally RISC42 only had 16 registers, but it was obvious that
register windows would make non leaf methods much more efficient and
would be a better use of the available FPGA resources. Many people got a
bad impression of register windows due to design problems on the Sparc
(fixed only in the UltraSparc), but the Altera people said of removing
register windows from their Nios II softcore (the original Nios had
them) that "it didn't hurt performance too much" so they felt the
simplification was worth it. Unlike the Sparc, the register windows in
RISC42 are allocated from a list so you can have several threads mixed
in the physical registers. With register windows RISC42 became
multithreaded like the i/o processors and the Alto since that has worked
well for me so far.
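A minimal Python sketch of windows allocated from a list rather than from a circular buffer follows. The free-list policy, the method names and the use of `None` to signal that no window is available are assumptions for illustration; the window size of six is the figure given further on in this message.

```python
WINDOW_SIZE = 6   # registers per window

class WindowAllocator:
    def __init__(self, num_windows):
        self.free = list(range(num_windows))   # unused physical windows
        self.owner = {}                        # window index -> thread id

    def call(self, thread):
        """Take a window for a new frame; None means none is free."""
        if not self.free:
            return None
        window = self.free.pop()
        self.owner[window] = thread
        return window

    def ret(self, window):
        """Give a frame's window back to the free list."""
        del self.owner[window]
        self.free.append(window)

    def physical(self, window, reg):
        """Map (window, register) to a physical register number."""
        return window * WINDOW_SIZE + reg
```

Because any free window can serve any thread, frames of different threads end up interleaved in the physical register file.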
The next significant change was moving the size of the operand out of
the store/load instructions and into the pointer itself in the form of
tag bits. Though the definition of the C language was created to handle
this (that is why you have to cast from a char pointer to a long
pointer) most C compilers I have looked at (lcc, tcc and others) have no
way of dealing with this kind of thing since the PDP-11 didn't need it
and all modern processors have evolved from it. Looking at gcc it seems
that this feature can be added, but until I have actually done it I
can't be sure. Handling cascade instructions and skips won't be easy
either, but these features can be ignored and it is still possible to
generate code for any C application using the rest of the instruction
set. The tagged pointers are more fundamental, so the compiler will have
to deal with them. But they are important for making Smalltalk and C
play nice with each other, so I will put in the effort to make this
happen.
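To make the idea of size-carrying pointers concrete, here is a small Python sketch. The encoding is purely hypothetical (two low tag bits, with the address in the remaining bits, and little-endian memory); the actual RISC42 tag layout is not given here.

```python
SIZES = {0: 1, 1: 2, 2: 4}   # tag -> operand size in bytes (hypothetical)

def make_ptr(address, tag):
    """Build a tagged pointer: address in the high bits, size tag in the low two."""
    return (address << 2) | tag

def cast(ptr, tag):
    """A C pointer cast becomes a real operation: rewrite the tag bits."""
    return (ptr & ~0b11) | tag

def load(memory, ptr):
    """A single load; the pointer, not the opcode, selects the operand size."""
    address = ptr >> 2
    size = SIZES[ptr & 0b11]
    return int.from_bytes(memory[address:address + size], "little")
```

In this model a compiler that treats a cast from a char pointer to a long pointer as a no-op would keep loading single bytes, which is why the compiler has to know about the tags.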
One nice thing about the old Plurion stack processor is that returns
would put the data in the right place for an argument to a send
instruction later. In normal Smalltalk implementations you have to keep
copying things from the caller stack to the callee stack and then back
(the result) again. In addition, it was not easy to implement the most
common Squeak bytecodes as just one or two RISC42 instructions. So
register 15 was redefined to operate as the top of a small hardware
stack. This allows efficient implementation of common bytecodes, access
to three register windows at once instead of just two and a lot less
copying by compiling returns to put their data directly in their final
destinations. Since there is now a special register, there might as well
be two: register 14 was redefined as always returning zero. Several
instruction sequences were awkward without this, and while it could have
been just a software convention to keep R14 always zero there would not
be any place to send results that aren't needed. This cut the global
registers from four to just two, but that should be enough.
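The two special registers can be modelled in a few lines of Python. This sketch assumes that reading R15 pops and writing R15 pushes, which is one plausible reading of "top of a small hardware stack"; the class name and the unbounded stack are simplifications for the example.

```python
class Registers:
    def __init__(self):
        self.regs = [0] * 14   # R0..R13 behave as ordinary registers
        self.stack = []        # hardware stack behind R15 (unbounded here)

    def read(self, n):
        if n == 14:
            return 0                   # R14 always reads as zero
        if n == 15:
            return self.stack.pop()    # assumed: reading R15 pops
        return self.regs[n]

    def write(self, n, value):
        if n == 14:
            return                     # results sent to R14 are discarded
        if n == 15:
            self.stack.append(value)   # assumed: writing R15 pushes
        else:
            self.regs[n] = value
```

A return that writes its result to R15 leaves the value exactly where a later send can pick it up as an argument, with no copy in between.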
Even though the register windows only have six registers each, that is
enough for most non leaf methods. Leaf methods can use two register
windows at once, and with the new stack they have some extra room to
keep their data in. With aggressive inlining, however, highly optimized
methods will need to access more registers. Since the prefix
instructions were only defined for immediate operands, their use with
non immediate instructions has been defined to extend the destination
and source fields. This allows up to 256 extra local registers to be
used. The prolog and epilog code for such methods will not be trivial as
they will have to deal with allocating a contiguous chunk of physical
registers among the various threaded lists of frames. But this shouldn't
be a problem as such methods will execute for quite a while (or they
wouldn't have been compiled with so much inlining in the first place)
making this overhead worth it.
And this is what I have today and am starting to implement. It is a bit
more complex than I would like, but I feel the extra features are
important for it to do well interpreting bytecodes, running highly
factored native code and running deeply inlined native code.
-- Jecel