On Mon, 17 Sep 2007 21:10:44 -0300 I wrote:
The idea is to make this development as open as possible, with a public version control system, a bug tracker and a blog. This would be a good thing to implement in Seaside but perhaps I should start out with existing solutions to get results faster?
I am thinking that it would be a good idea to create a "Plurion Maker" application in Squeak rather than implementing this system directly in Verilog (or VHDL, which I don't like as much but is popular in Europe and in schools). You would set a few options with a nice interface and then press a button to get one big HDL file plus one or two little configuration ones.
This is much more work to implement initially, but much nicer to use in the long run. Other processor cores have similar systems, sometimes implemented as a bunch of Perl and Tcl scripts. On the Smalltalk side, we have Idass as an inspiration (though this would be very specialized compared to that):
http://www.xs4all.nl/~averschu/idass/
It would be interesting if people could look at the (extremely bare, sorry) description of the instruction set (Matthew gave the link in his email, but here it is again - http://www.merlintec.com:8080/hardware/32) and give their opinions.
The processor design has been changing, so people who looked at it a while back might not have seen the recent ideas. And it is hard to have an opinion without knowing its history:
I started out (1998) trying to build the ideal hardware target for a Self-style adaptive compilation system. The result was Tachyon, which could execute four MOVE instructions per clock as long as the compiler could schedule them. It had interesting support for PICs with up to three entries. There was dedicated hardware for the translation cache, so this was very similar to what Transmeta later released.
By 2001 it was becoming obvious that the software side of the project was very large, and with new FPGAs coming out with more internal memory it became practical to implement several simpler processors (Plurion) and get roughly the same performance as one big one. And with bytecodes as the native instruction set of these processors, the software side would be more comparable to an interpreter. This would not rule out sophisticated optimization later on, though targeting bytecodes would make the adaptive compiler more like AOStA (http://www.esug.org/data/ESUG2003/aosta.pdf) than Self, which is actually a good thing.
Even though the bytecode processors were very similar to Forth processors, which are very good for low level computing, I felt that a more RISC-like processor would do a better job of handling i/o. So I tried out various designs and the idea was that a Plurion system would be a mix of stack processors for executing user code and RISC-like processors for doing i/o. Having implemented things like serial ports in hardware I had been impressed by how quickly such things grow as you add more and more counters and registers. A processor implementing the same thing in software could be smaller since on board memory is so much denser than random logic in modern FPGAs and a single ALU would be time multiplexed to handle as many counters as you need.
In parallel with these two design efforts (bytecode processor and i/o processor) I also looked at the possibility of doing a microcoded processor for running Squeak. My own bytecodes were so much simpler than Squeak's that they could be directly executed, but it was obvious that a proper Squeak processor would look more like JOP (a Java processor implemented in FPGAs - http://www.jopdesign.com/). The old Xerox D machines and the Lisp Machines were also a good inspiration for this, of course, but the pipelined design used in JOP is more interesting. I did think that JOP was trying to save too much memory: the first pipeline stage looks up the fetched bytecode in a 256 entry memory to get the start address of the first microinstruction, just like the Dorado did. Why not just waste a few words for each instruction and avoid an indirection? Just jump directly to microinstruction BC*4 after fetching bytecode BC and if you need more than four microinstructions you include a jump (this will be a slow instruction anyway, so the jump won't hurt much). Most bytecodes execute in one or two microinstructions and you could use the leftover words for those entries that are longer. You can also make the microcode memory larger than 1024 words and put the rest of the large instructions in this overflow area.
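The direct dispatch described above can be sketched in a few lines of Python. This is only an illustration, not actual microcode tooling: the names entry_address and lay_out are made up, microinstructions are represented as plain strings, and long routines are always continued in the overflow area past word 1024 (reusing leftover words from short entries is not modeled).

```python
SLOT = 4                                # microinstruction words reserved per bytecode
NUM_BYTECODES = 256
OVERFLOW_BASE = SLOT * NUM_BYTECODES    # 1024: first word past the direct slots


def entry_address(bc):
    """Start address for bytecode bc: a multiply by 4, no lookup table."""
    return bc * SLOT


def lay_out(routines):
    """Place {bytecode: [microinstructions]} into a microcode memory dict.

    Routines of up to SLOT words sit entirely in their own slot. Longer
    ones keep SLOT - 1 words in the slot, then a jump to the overflow
    area (an assumption of this sketch) where the rest is stored.
    """
    memory = {}
    next_free = OVERFLOW_BASE
    for bc, ops in sorted(routines.items()):
        base = entry_address(bc)
        if len(ops) <= SLOT:
            head, tail = ops, []
        else:
            head, tail = ops[:SLOT - 1], ops[SLOT - 1:]
        for i, op in enumerate(head):
            memory[base + i] = op
        if tail:
            memory[base + SLOT - 1] = ("jump", next_free)
            for i, op in enumerate(tail):
                memory[next_free + i] = op
            next_free += len(tail)
    return memory
```

A one word routine for bytecode 0 leaves words 1 to 3 unused, while a six word routine for bytecode 1 occupies words 4 to 6, a jump in word 7, and three words starting at 1024.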
When RISCs first came out I used to joke that they were just CISCs that used main memory as their microcode RAM. Adding instruction caches made the two schemes even more similar. As the three designs (hardwired bytecode stack processor, RISC-like i/o processor, microcoded Squeak processor) evolved, I started thinking about taking this joke seriously and creating a single merged design. A special small "bytecode mode" RAM would dispatch efficiently to small RISC routines and the instruction cache would hold code for longer bytecodes and native code for stuff written in C and other languages. The instructions were kept short to make compiling from bytecodes to native code less costly in terms of memory and among the many inspirations were Jan Gray's designs and the Data General Nova. The cascade instructions were introduced to make use of the bypass hardware that all pipelined RISCs have anyway. They help out when two address instructions would be awkward (they are often, but not always, good enough). The skip instructions were inspired by the old IBM Stretch (early 1960s supercomputer) design though they are rather different from what had been done there.
The stack processors had inherited the PIC hardware from Tachyon. RISC42 got a more general PICMode extension to the instruction cache which, unlike previous designs, is not limited to three entries (it has no limit at all). All of my designs have had a Mushroom-style virtually addressed data cache so that object tables could be used without the usual overhead (tables are accessed on cache misses, not on every object field access).
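To illustrate why a virtually addressed cache hides the object table cost, here is a rough Python model. It is a sketch under simplifying assumptions, not the Mushroom or RISC42 design: the class name ObjectCache is made up, the cache is an unbounded dictionary with no lines or eviction, and the object table simply maps an object id to a base address.

```python
class ObjectCache:
    """Data cache keyed by (object id, field index) instead of physical address."""

    def __init__(self, object_table, memory):
        self.object_table = object_table  # object id -> base physical address
        self.memory = memory              # physical address -> word
        self.lines = {}                   # (object id, field index) -> cached word
        self.table_lookups = 0            # how often the object table was consulted

    def read(self, oid, field):
        key = (oid, field)
        if key not in self.lines:
            # Miss: only now is the object table needed to find the data.
            self.table_lookups += 1
            base = self.object_table[oid]
            self.lines[key] = self.memory[base + field]
        # Hit path: no translation through the object table at all.
        return self.lines[key]
```

Reading the same field twice consults the table once, so the indirection is paid per miss rather than per access.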
Originally RISC42 only had 16 registers, but it was obvious that register windows would make non leaf methods much more efficient and would be a better use of the available FPGA resources. Many people got a bad impression of register windows due to design problems on the Sparc (fixed only in the UltraSparc), but the Altera people said of removing register windows from their Nios II softcore (the original Nios had them) that "it didn't hurt performance too much" so they felt the simplification was worth it. Unlike the Sparc, the register windows in RISC42 are allocated from a list so you can have several threads mixed in the physical registers. With register windows RISC42 became multithreaded like the i/o processors and the Alto since that has worked well for me so far.
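A small Python model of windows allocated from a list might look like the following. It is a sketch only: the class WindowFile and its method names are hypothetical, the six registers per window come from this description, and raising an error when the free list is empty stands in for whatever a real implementation does on overflow.

```python
WINDOW_SIZE = 6  # registers per window


class WindowFile:
    """Register windows handed out from a free list, so frames of
    different threads can be interleaved in the physical registers."""

    def __init__(self, num_windows):
        self.free = list(range(num_windows))  # free physical window indices
        self.caller = {}                      # window -> caller's window (None at thread bottom)

    def call(self, current):
        """Allocate a window for a new frame called from window `current`."""
        if not self.free:
            # Assumption of this sketch: overflow handling is not modeled.
            raise RuntimeError("no free register window")
        window = self.free.pop(0)
        self.caller[window] = current
        return window

    def ret(self, current):
        """Release the current window and return the caller's window."""
        previous = self.caller.pop(current)
        self.free.append(current)
        return previous

    def physical(self, window, reg):
        """Physical register number for register `reg` of `window`."""
        return window * WINDOW_SIZE + reg
```

Starting two threads and then making one call in each yields windows 0 and 2 for the first thread and 1 and 3 for the second, which a fixed circular arrangement as on the Sparc could not express.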
The next significant change was moving the size of the operand out of the store/load instructions and into the pointer itself in the form of tag bits. Though the definition of the C language was created to handle this (that is why you have to cast from a char pointer to a long pointer) most C compilers I have looked at (lcc, tcc and others) have no way of dealing with this kind of thing since the PDP-11 didn't need it and all modern processors have evolved from it. Looking at gcc it seems that this feature can be added, but until I have actually done it I can't be sure. Handling cascade instructions and skips won't be easy either, but these features can be ignored and it is still possible to generate code for any C application using the rest of the instruction set. The tagged pointers are more fundamental, so the compiler will have to deal with them. But they are important for making Smalltalk and C play nice with each other, so I will put in the effort to make this happen.
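The effect of size tags on loads can be shown with a toy Python model. Everything about the encoding here is an assumption made for illustration (a 32 bit pointer with the tag in the top two bits, three sizes, little-endian memory) and the function names are made up; the real tag layout is not specified in this message.

```python
TAG_SHIFT = 30                          # hypothetical: tag lives in the top 2 bits
ADDR_MASK = (1 << TAG_SHIFT) - 1
SIZE_OF_TAG = {0: 1, 1: 2, 2: 4}        # hypothetical tag -> operand size in bytes
TAG_OF_SIZE = {size: tag for tag, size in SIZE_OF_TAG.items()}


def make_pointer(address, size):
    """Build a pointer whose tag records the size of what it points to."""
    return (TAG_OF_SIZE[size] << TAG_SHIFT) | (address & ADDR_MASK)


def cast(pointer, size):
    """Like a C cast from char* to long*: same address, different tag."""
    return make_pointer(pointer & ADDR_MASK, size)


def load(memory, pointer):
    """A single load operation; the width comes from the pointer's tag."""
    address = pointer & ADDR_MASK
    size = SIZE_OF_TAG[pointer >> TAG_SHIFT]
    return int.from_bytes(memory[address:address + size], "little")
```

With memory holding the bytes 78 56 34 12 (hex), the same load function returns 0x78 through a byte pointer and 0x12345678 after the pointer is cast to a four byte one, which is why the compiler has to turn a C cast into an actual operation on the pointer.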
One nice thing about the old Plurion stack processor is that returns would put the data in the right place for an argument to a send instruction later. In normal Smalltalk implementations you have to keep copying things from the caller stack to the callee stack and then back (the result) again. In addition, it was not easy to implement the most common Squeak bytecodes as just one or two RISC42 instructions. So register 15 was redefined to operate as the top of a small hardware stack. This allows efficient implementation of common bytecodes, access to three register windows at once instead of just two and a lot less copying by compiling returns to put their data directly in their final destinations. Since there is now a special register, there might as well be two: register 14 was redefined as always returning zero. Several instruction sequences were awkward without this, and while it could have been just a software convention to keep R14 always zero there would not be any place to send results that aren't needed. This cut the global registers from four to just two, but that should be enough.
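The two special registers can be modeled in Python as follows. This is a sketch with assumptions stated plainly: the class name Registers is made up, register windows are ignored (R0 to R13 are a flat array), and treating a read of R15 as a pop and a write as a push is one plausible reading of "top of a small hardware stack" rather than a confirmed definition.

```python
class Registers:
    """Flat register model: R14 always reads zero, R15 is the stack top."""

    def __init__(self):
        self.regs = [0] * 14   # R0..R13, windows not modeled here
        self.stack = []        # the small hardware stack behind R15

    def read(self, n):
        if n == 14:
            return 0                  # constant zero source
        if n == 15:
            return self.stack.pop()   # assumption: reading R15 pops
        return self.regs[n]

    def write(self, n, value):
        if n == 14:
            return                    # sink for results that aren't needed
        if n == 15:
            self.stack.append(value)  # assumption: writing R15 pushes
            return
        self.regs[n] = value
```

Under these assumptions an add with R15 as both sources and as the destination behaves like a stack machine add bytecode, and a write to R14 discards a result without disturbing any real register.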
Even though the register windows only have six registers each, that is enough for most non leaf methods. Leaf methods can use two register windows at once, and with the new stack they have some extra room to keep their data in. With aggressive inlining, however, highly optimized methods will need to access more registers. Since the prefix instructions were only defined for immediate operands, their use with non immediate instructions has been defined to extend the destination and source fields. This allows up to 256 extra local registers to be used. The prolog and epilog code for such methods will not be trivial as they will have to deal with allocating a contiguous chunk of physical registers among the various threaded lists of frames. But this shouldn't be a problem as such methods will execute for quite a while (or they wouldn't have been compiled with so much inlining in the first place) making this overhead worth it.
And this is what I have today and am starting to implement. It is a bit more complex than I would like, but I feel the extra features are important for it to do well interpreting bytecodes, running highly factored native code and running deeply inlined native code.
-- Jecel
hardware@lists.squeakfoundation.org