[Vm-dev] SiliconSqueak and RISC-V J Extension

Wed Mar 28 21:25:29 UTC 2018

Some of you know that I have been working on the design of the
SiliconSqueak processor which is optimized for the OpenSmalltalk VM. I
am currently redesigning it to be a proper extension to the RISC-V
standard instruction set (https://riscv.org/). While the result will not
be as technically elegant as previous versions, it should be much more
interesting commercially.

RISC-V was started at Berkeley in 2010 and has a growing community which
I expect will make it the third most important instruction set after the
x86 and ARM very soon. It was created to be expandable, so you start
with one of the four integer instruction subsets (RV32I, RV64I, RV128I
or RV32E with only 16 registers for embedded systems) and then you can
optionally have some of the standard extensions:

http://linuxgizmos.com/files/riscv_ref_card1.jpg
http://linuxgizmos.com/files/riscv_ref_card2.jpg

Integer Multiplication and Division (M)
Atomics (A)
Single-Precision Floating-Point (F)
Double-Precision Floating-Point (D)

The IMAFD combination is considered to be popular enough that you can
use G (for General) instead. So RV32G is the same thing as RV32IMAFD.
Some more standard extensions are in the process of being defined:

Quad-Precision Floating-Point (Q)
Decimal Floating-Point (L)
16-bit Compressed Instructions (C)
Bit Manipulation (B)
Dynamic Languages (J)
Transactional Memory (T)
Packed-SIMD Extensions (P)
Vector Extensions (V)
User-Level Interrupts (N)

Non-standard extensions are also allowed. For example, Xhwacha means the
processor has the Hwacha vector extension, which is different from the V
vector extension that is being defined. So a processor named RV32GXsisq
would be a general 32-bit RISC-V with SIliconSQueak extensions (which I
will describe below).
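
As a rough illustration of the naming scheme described above, the sketch
below splits a name such as RV32GXsisq into its register width, its
standard extension letters (with G expanded to IMAFD) and one
non-standard extension name. The IsaName type and parse_isa_name
function are hypothetical names made up for this example, and the sketch
assumes at most one X extension appears at the end of the string.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for this sketch; not part of any RISC-V tool. */
typedef struct {
    int  xlen;        /* 32, 64 or 128 */
    char std[32];     /* standard extension letters, with G expanded */
    char custom[32];  /* non-standard extension name without the leading X */
} IsaName;

/* Parses names such as "RV32IMAFD" or "RV32GXsisq".
   Returns 0 on success and -1 if the string does not fit the pattern. */
int parse_isa_name(const char *s, IsaName *out)
{
    assert(s != NULL && out != NULL);
    memset(out, 0, sizeof *out);
    if (strncmp(s, "RV", 2) != 0) return -1;

    char *end;
    long xlen = strtol(s + 2, &end, 10);
    if (xlen != 32 && xlen != 64 && xlen != 128) return -1;
    out->xlen = (int)xlen;

    size_t n = 0;
    for (const char *p = end; *p != '\0'; p++) {
        if (*p == 'X') {   /* the rest is a non-standard extension name */
            strncpy(out->custom, p + 1, sizeof out->custom - 1);
            break;
        }
        if (!isupper((unsigned char)*p)) return -1;
        const char *add = (*p == 'G') ? "IMAFD" : NULL;
        if (add == NULL) {
            if (n + 1 < sizeof out->std) out->std[n++] = *p;
        } else {
            while (*add != '\0' && n + 1 < sizeof out->std) out->std[n++] = *add++;
        }
    }
    return 0;
}
```

With this sketch, "RV32GXsisq" yields a width of 32, the letters IMAFD
and the non-standard name "sisq".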

I am in the process of becoming a RISC-V Foundation member so that I can
join the J Extension work group, since I feel I can help even though the
"J" is meant to imply Java and Javascript. It would also be good to
avoid any needless incompatibility between Xsisq and J.

My goal is both to improve the performance and efficiency of the
bytecode interpreter and to make the processor a better target for
adaptive compilation. The ARM folks also did this in Jazelle 1 (later
renamed DBX - direct bytecode execution) and Jazelle 2 (renamed RCT -
runtime compilation target and later Thumb EE). I want to modify Cog to
work with both RV32G and my extensions, and to have the simulator
interface with a cycle-accurate processor simulator that lets us measure
the effects of the extensions on both simple implementations and fancier
out-of-order execution engines.

It would be nice to have data to guide the design of these extensions
before all this work, but the fact that critical operations are done via
macros in the OpenSmalltalk VM (as they should be!) instead of
subroutines makes it hard to know how much time is spent on PICs or
allocating new objects, for example.

== Xsisqbytecodes

This is a special execution mode which uses two extra registers: IP,
pointing to the next bytecode, and BCTableBase, which points to an
aligned 1024 byte table in memory. The instruction at BCTableBase +
((* (unsigned char *) IP++) << 4) is executed. If the instruction is
some kind of branch then bytecode mode is exited, otherwise the next
bytecode is fetched. Combined with the next extension, the most common
bytecodes can be interpreted by a single RISC-V instruction.

== Xsisqstack

Registers x16 to x31 are remapped to a moving window, not quite like
RISC I, RISC II, RISC III (SOAR - Smalltalk On A RISC) or RISC IV
(SPUR), but in the same spirit. In addition, register x15 becomes an
alias for one of the others so you can push and pop to the stack
implicitly instead of using extra instructions. When combined with the
next extension, each register and word on the stack gets a 33rd (or 65th
or 129th) bit that can't be changed by software and which distinguishes
between raw values and tagged values.

== Xsisqobjectmemory

Normal RISC-V load and store instructions generate a virtual address by
adding a 12-bit immediate value to a register value. This virtual
address is translated to a physical address by the MMU and that is used
to access the cache. In the object addressing mode the value in the
register is considered a virtual object ID and the immediate value is
an offset. They are not added but used separately by the cache hash to
access the desired word. On a cache miss an object table is used to find
the physical address.
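
To make the contrast concrete, here is a small sketch of the two
addressing modes. The hash function, the cache size and all names
(virtual_address, object_cache_set, CACHE_SETS) are assumptions chosen
for illustration only; the actual cache hash is not specified here.

```c
#include <assert.h>
#include <stdint.h>

enum { CACHE_SETS = 256 };  /* hypothetical number of cache sets */

/* Conventional mode: the 12-bit signed immediate is added to the
   register value to form one virtual address. */
uint32_t virtual_address(uint32_t reg, int32_t imm12)
{
    assert(imm12 >= -2048 && imm12 <= 2047);
    return reg + (uint32_t)imm12;
}

/* Object mode: the object ID and the offset are never added. They are
   only mixed to pick a cache set. The mixing below is an arbitrary
   illustrative hash, not the one used by the hardware. */
uint32_t object_cache_set(uint32_t object_id, uint32_t word_offset)
{
    uint32_t h = object_id * 2654435761u;  /* multiplicative mixing constant */
    h ^= word_offset;
    return ((h >> 16) ^ h) % CACHE_SETS;
}
```

Since no addition takes place, a cache line in this model would be
tagged with the (object ID, offset) pair rather than with an address,
and the object table would only be consulted on a miss.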

The object formats are known so loads and stores can tell raw values
from tagged words. Both V3 and Spur images use direct pointers and so
can't take advantage of this mode, though the RoarVM could probably be
adapted since it uses object tables.

== Xsisqpic

A special PIC execution mode is entered with an instruction that reads
an object's class. The instruction cache loads a line hashed by both the
class and the current PC and the instructions there are executed until
some branch happens. So it is like a combination of call and switch.
This means that PICs take the same time no matter how many entries they
have. If all levels of caches miss then this means that the compiler has
to be called to create a new PIC entry for this PC/class combination.

Besides these four extensions, there is a Xteam extension which allows a
group of cores to work together on a single piece of code. Using this
will require adding a new compiler to Cog, so I won't go into it here.

There are interesting possible extensions which I have not worked on:

- support for GC, such as read or write barriers (though
Xsisqobjectmemory does help)
- support for execution counters
- support for JIT, like the hardware accelerators in this RISC-V to VLIW
runtime compiler:
> https://riscv.org/wp-content/uploads/2017/05/Wed1545-HybridDBT-Rokicki.pdf

Any other ideas?

-- Jecel