[Vm-dev] SiliconSqueak and RISC-V J Extension

Wed Mar 28 21:50:53 UTC 2018

Hi Jecel,

On Wed, Mar 28, 2018 at 2:25 PM, Jecel Assumpcao Jr. <jecel at merlintec.com>
wrote:

>
> Some of you know that I have been working on the design of the
> SiliconSqueak processor which is optimized for the OpenSmalltalk VM. I
> am currently redesigning it to be a proper extension to the RISC-V
> standard instruction set (https://riscv.org/). While the result will not
> be as technically elegant as previous versions, it should be much more
> interesting commercially.
>
> RISC-V was started at Berkeley in 2010 and has a growing community which
> I expect will make it the third most important instruction set after the
> x86 and ARM very soon. It was created to be expandable, so you start
> with one of the four integer instruction subsets (RV32I, RV64I, RV128I
> or RV32E with only 16 registers for embedded systems) and then you can
> optionally have some of the standard extensions:
>
> http://linuxgizmos.com/files/riscv_ref_card1.jpg
> http://linuxgizmos.com/files/riscv_ref_card2.jpg
>
> Integer Multiplication and Division (M)
> Atomics (A)
> Single-Precision Floating-Point (F)
> Double-Precision Floating-Point (D)
>
> The IMAFD combination is considered to be popular enough that you can
> use G (for General) instead. So RV32G is the same thing as RV32IMAFD.
> Some more standard extensions are in the process of being defined:
>
> Quad-Precision Floating-Point (Q)
> Decimal Floating-Point (L)
> 16-bit Compressed Instructions (C)
> Bit Manipulation (B)
> Dynamic Languages (J)
> Transactional Memory (T)
> Packed-SIMD Extensions (P)
> Vector Extensions (V)
> User-Level Interrupts (N)
>
> Non-standard extensions are also allowed. For example: Xhwacha means the
> processor has the Hwacha vector extension that is different from the V
> vector extension that is being defined. So a processor named RV32GXsisq
> would be a general 32 bit RISC-V with SIliconSQueak extensions (which I
> will describe below).
>
> I am in the process of becoming a RISC-V Foundation member so I have
> joined the J Extension work group since I feel I can help though the "J"
> is meant to imply Java and Javascript. It would also be a good idea if
> any needless incompatibility between Xsisq and J can be avoided.
>
> My goal is to both improve the performance and efficiency of the
> bytecode interpreter and make the processor a better target for adaptive
> compilation. The ARM folks also did this in Jazelle 1 (later renamed DBX
> - direct bytecode execution) and Jazelle 2 (renamed RCT - runtime
> compilation target and later Thumb EE). I want to modify Cog to both
> work with RV32G and with my extensions and have the simulator interface
> with a cycle accurate processor simulator that can let us measure the
> effects of the extensions for both simple implementations and fancier
> out-of-order execution engines.
>
> It would be nice to have data to guide the design of these extensions
> before all this work, but the fact that critical operations are done via
> macros in the OpenSmalltalk VM (as they should!) instead of subroutines
> makes it hard to know how much time is spent on PICs or allocating new
> objects, for example.
>

One thing that should be straight-forward to do is to modify the simulator
to collect instructions and attribute them to specific tasks.  This
wouldn't give you cycle counts for each kind of operation but one could
augment the instruction counts with max and min cycle counts for a range
quite easily (but getting that cycle data is tedious without a convenient
online form).  I don't think that Bochs has any support for accurate cycle
counts, or page faults etc.  But I think instruction counts would be very
interesting.  Does this sound worthwhile to you?
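
A minimal sketch of what such attribution could look like, in Python for brevity. Everything named here is an assumption for illustration: the address ranges in TASK_RANGES, the instruction classes, and the (min, max) cycle bounds in CYCLE_BOUNDS are invented, and the trace is just a list of (pc, instruction class) pairs, not anything the existing simulator produces.

```python
from collections import Counter

# Hypothetical address ranges for VM tasks; a real simulator would derive
# these from the code generator's own bookkeeping.
TASK_RANGES = {
    "pic": range(0x1000, 0x1100),
    "allocation": range(0x2000, 0x2400),
}

# Hypothetical (min, max) cycle costs per instruction class.
CYCLE_BOUNDS = {"alu": (1, 1), "load": (1, 4), "branch": (1, 3)}


def task_for(pc):
    """Return the task whose address range contains pc, or 'other'."""
    for task, addresses in TASK_RANGES.items():
        if pc in addresses:
            return task
    return "other"


def profile(trace):
    """Count instructions per task and accumulate min/max cycle bounds.

    trace is an iterable of (pc, instruction_class) pairs.
    """
    counts = Counter()
    bounds = {}
    for pc, kind in trace:
        task = task_for(pc)
        counts[task] += 1
        lo, hi = CYCLE_BOUNDS[kind]
        old_lo, old_hi = bounds.get(task, (0, 0))
        bounds[task] = (old_lo + lo, old_hi + hi)
    return counts, bounds
```

The result is an instruction count per task plus a cycle range rather than a single cycle figure, which matches the "max and min" idea above.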

>
> == Xsisqbytecodes
>
> This is a special execution mode which uses two extra registers: IP
> pointing to the next bytecodes and BCTableBase which points to an
> aligned 1024 byte table in memory. The instruction at BCTableBase + (*
> (char *) IP++) << 4 is executed. If the instruction is some kind of
> branch then bytecode mode is exited, otherwise the next bytecode is
> fetched. Combined with the next extension, the most common bytecodes can
> be interpreted by a single RISC-V instruction.
>

Ah, that's nice.  So BCTableBase can be used to implement multiple bytecode
sets right?  What support is there for setting BCTableBase, e.g. on return?
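
To make the dispatch concrete, here is a small Python sketch of the address calculation only. It assumes 256 table entries of 4 bytes each, since that is what exactly fills a 1024-byte table; the quoted expression shifts by 4, so the entry size used here is this sketch's assumption and not a statement about the actual design. The function name and the use of different base values for different bytecode sets are likewise hypothetical.

```python
ENTRY_SIZE = 4      # assumption: one 32-bit instruction per table entry
TABLE_SIZE = 1024   # from the description: an aligned 1024-byte table


def handler_address(bc_table_base, bytecodes, ip):
    """Return (address of the instruction to execute, updated IP).

    bytecodes is a bytes object, so each element is an unsigned value
    in 0..255 and always indexes inside the table.
    """
    if bc_table_base % TABLE_SIZE != 0:
        raise ValueError("BCTableBase must be 1024-byte aligned")
    opcode = bytecodes[ip]
    return bc_table_base + opcode * ENTRY_SIZE, ip + 1
```

Under these assumptions, switching bytecode sets amounts to calling the same function with a different bc_table_base, for example 0x8000 for one set and 0x8400 for another.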

> == Xsisqstack
>
> Registers x16 to x31 are remapped to a moving window, not quite like
> RISC I, RISC II, RISC III (SOAR - Smalltalk On A RISC) or RISC IV
> (Spur) but in the same spirit. In addition, register x15 becomes an
> alias for one of the others so you can push and pop to the stack
> implicitly instead of using extra instructions. When combined with the
> next extension, each register and word on the stack gets a 33rd (or 65th
> or 129th) bit that can't be changed by the software and which
> distinguishes between raw values and tagged values.
>

Some examples would make this clearer.  I don't understand the extra tag
bit.  I find a single tag bit restrictive.  Is the scheme more general than
I'm presuming?

> == Xsisqobjectmemory
>
> Normal RISC-V load and store instructions generate a virtual address by
> adding a 12 bit immediate value to a register value. This virtual
> address is translated to a physical address by the MMU and that is used
> to access the cache. In the object addressing mode the value in the
> register is considered a virtual object ID and the immediate value is
> an offset. They are not added but used separately by the cache hash to
> access the desired word. On a cache miss an object table is used to find
> the physical address.
>
> The object formats are known so loads and stores can tell raw values
> from tagged words. Both V3 and Spur images use direct pointers and so
> can't take advantage of this mode, though the RoarVM could probably be
> adapted since it uses object tables.
>
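
The object addressing mode described above can be modelled roughly as follows. This is a toy Python sketch under stated assumptions: the cache is direct-mapped, the hash function is invented, and the object table and memory are plain dictionaries. None of these details come from the SiliconSqueak design; only the overall flow does (look up by object ID and offset, fall back to the object table on a miss).

```python
class ObjectCache:
    """Toy direct-mapped cache keyed by (object ID, offset)."""

    def __init__(self, object_table, memory, lines=64):
        self.object_table = object_table  # object ID -> physical base address
        self.memory = memory              # physical address -> word
        self.lines = [None] * lines       # each line holds (oid, offset, word)
        self.misses = 0

    def _index(self, oid, offset):
        # Hypothetical hash: ID and offset are combined, never added
        # together to form an address.
        return ((oid * 31) ^ offset) % len(self.lines)

    def load(self, oid, offset):
        i = self._index(oid, offset)
        line = self.lines[i]
        if line is not None and line[0] == oid and line[1] == offset:
            return line[2]                # hit: no address translation needed
        self.misses += 1
        base = self.object_table[oid]     # miss: consult the object table
        word = self.memory[base + offset]
        self.lines[i] = (oid, offset, word)
        return word
```

In this model a hit never touches the object table, so the physical address of an object only matters on a miss.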
> == Xsisqpic
>
> A special PIC execution mode is entered with an instruction that reads
> an object's class. The instruction cache loads a line hashed by both the
> class and the current PC and the instructions there are executed until
> some branch happens. So it is like a combination of call and switch.
> This means that PICs take the same time no matter how many entries they
> have. If all levels of caches miss then this means that the compiler has
> to be called to create a new PIC entry for this PC/class combination.
>
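
As a rough software analogy for the PIC mode, the following Python sketch keys a lookup on the (PC, class) pair and calls a compiler callback only when the lookup misses. The class name, the callback, and the use of a dictionary in place of the instruction cache hierarchy are all assumptions made for illustration.

```python
class PicCache:
    """Toy model: one lookup per send, keyed by (call-site PC, class)."""

    def __init__(self, compile_entry):
        self.entries = {}                   # (pc, class) -> target code
        self.compile_entry = compile_entry  # hypothetical compiler callback
        self.compiles = 0

    def dispatch(self, pc, cls):
        key = (pc, cls)
        target = self.entries.get(key)
        if target is None:
            # Stands in for "all levels of caches miss": the compiler
            # creates a new entry for this PC/class combination.
            target = self.compile_entry(pc, cls)
            self.entries[key] = target
            self.compiles += 1
        return target
```

The lookup cost in this model does not depend on how many classes have been seen at a given call site, which is the property claimed for the hardware mode.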
> Besides these four extensions, there is an Xteam extension which allows a
> group of cores to work together on a single piece of code. Using this
> will require adding a new compiler to Cog, so I won't go into it here.
>

I wonder if this will fit with Clément's new PhD student's work on
extending Cog for vector instructions and with Ronie's Lowcode work.  We
also want to extend the compiler, in this case to include vector support.

>
> There are interesting possible extensions which I have not worked on:
>
> - support for GC (though Xsisqobjectmemory does help) like read or write
> barriers
> - support for execution counters
> - support of JIT like the hardware accelerators in this RISC-V to VLIW
> runtime compiler:
> https://riscv.org/wp-content/uploads/2017/05/Wed1545-HybridDBT-Rokicki.pdf
>
> Any other ideas?
>

My ideas are probably too naive, and too specific to the current Cog VM.
But they're simple things like
- having a conditional move which doesn't raise an exception if it doesn't
move data would make the class test fast.  Unfortunately the x86/x86_64
conditional move raises an exception if given an invalid address when the
condition is false, so one can't use it to do the "if the value is
untagged, fetch the class" operation without a jump, because when the value
is tagged, even though no data is moved, an exception is raised.

- having a status register, or status bits, or bits in the tagged
arithmetic instruction itself, which define what is the valid tag pattern
for tagged arithmetic instructions would be nice.  VW on SPARC never used
the tagged arithmetic instructions because they mandated 00 as the tag
pattern for tagged integers.  Sad.  Having a non-trapping tagged arithmetic
instruction (add tagged and skip if untagged) would be nice (because traps
are .  Putting the immediate floating point encode/decode sequences in
hardware would be nice.
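
For the tagged arithmetic idea, a small Python model of "add tagged and skip if untagged" with a configurable tag pattern might look like this. The 32-bit word size, the function name, and the (result, ok) return convention are assumptions of the sketch; the tag is assumed to live in the low bits selected by tag_mask.

```python
WORD_BITS = 32  # assumption: 32-bit machine words


def tagged_add(a, b, tag_mask, tag_value):
    """Add two tagged integers without trapping.

    Returns (result, True) on success. Returns (None, False) when either
    operand does not carry the expected tag or the sum overflows the word,
    which models skipping to a slow path instead of raising a trap.
    """
    if (a & tag_mask) != tag_value or (b & tag_mask) != tag_value:
        return None, False
    # Both operands carry the tag once, so subtract one copy of it.
    result = a + b - tag_value
    if not -(1 << (WORD_BITS - 1)) <= result < (1 << (WORD_BITS - 1)):
        return None, False
    return result, True
```

With tag_mask=0b1 and tag_value=0b1 this handles integers tagged with a low 1 bit, while tag_mask=0b11 and tag_value=0b00 gives the fixed pattern mentioned for SPARC. Making both values parameters is the point of the suggestion.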

But I'm intrigued by your statistics gathering suggestion.  It would be
great to have finer grained stats.  Although I wonder how valid those stats
would be in a processor with extensive support for OO.

> -- Jecel
>

_,,,^..^,,,_
best, Eliot