[Vm-dev] SiliconSqueak and RISC-V J Extension

Jecel Assumpcao Jr. jecel at merlintec.com
Thu Mar 29 00:09:29 UTC 2018


Eliot,

thank you for your comments.

> > [macros and gathering data]
> 
> One thing that should be straight-forward to do is to modify the
> simulator to collect instructions and attribute them to specific tasks.

Do you mean individual instructions or short sequences of instructions?
Perhaps we are talking about different things. What I meant was that,
watching Clément's video about fixing a bug in the VM, I noticed that the
code for #new had been inlined into the method he was looking at. That
same short sequence probably shows up in a lot of places. I want to know
what percentage of the time is spent in all such sequences added together.
If it turns out to be 0.03% of the time, then special hardware to make
#new faster would be a bad idea.

Most of the data in the green book was easy to get because something
like #new would be a subroutine and normal profiling tools can tell you
how much time is spent in a given subroutine.
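
For what it's worth, here is roughly how I imagine the counting working
inside a simulator's fetch/execute loop. This is only a sketch of the
idea, not code from Bochs or from any of my simulators; the region
table, the task names and countInstruction are all made up, and a real
Cog-aware tool would fill the table in from the JIT's metadata about
where sequences like the inlined #new live.

#include <stdint.h>
#include <stdio.h>

enum task { TASK_OTHER, TASK_NEW, TASK_SEND, TASK_COUNT };

/* hypothetical map from generated-code addresses to tasks */
typedef struct { uint32_t start, end; enum task task; } Region;

static Region regions[] = {
    { 0x1000, 0x1040, TASK_NEW  },   /* an inlined #new sequence */
    { 0x2000, 0x2020, TASK_SEND },   /* a send/dispatch stub     */
};

static uint64_t counts[TASK_COUNT];

static enum task classify(uint32_t pc)
{
    for (unsigned i = 0; i < sizeof regions / sizeof regions[0]; i++)
        if (pc >= regions[i].start && pc < regions[i].end)
            return regions[i].task;
    return TASK_OTHER;
}

/* called once per simulated instruction from the fetch/execute loop */
void countInstruction(uint32_t pc)
{
    counts[classify(pc)]++;
}

void report(void)
{
    uint64_t total = 0;
    for (int t = 0; t < TASK_COUNT; t++)
        total += counts[t];
    if (total)
        printf("#new sequences: %.2f%% of instructions executed\n",
               100.0 * counts[TASK_NEW] / total);
}

Adding the per-instruction min/max cycle estimates you suggest would
just mean accumulating a couple more counters per task.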

> This wouldn't give you cycle counts for each kind of operation but
> one could augment the instruction counts with max and min cycle
> counts for a range quite easily (but getting that cycle data is tedious
> without a convenient online form).  I don't think that Bochs has any
> support for accurate cycle counts, or page faults etc.  But I think
> instruction counts would be very interesting.  Does this sound worthwhile
> to you?

The simulators I have written so far also assume infinite caches and so
on. I am going to fix this in my next one. The x86 is pretty complicated
(which is why you adopted Bochs in the first place, right?), but an ARM
or RISC-V simulator in Squeak that we can more easily instrument should
be doable. That would be at the user level only, though - page faults and
such require simulating the supervisor level and an OS.

Instruction counts are certainly very important, even if less helpful
than cycle counts.
 
> > [BCTableBase + (* (char *) IP++) << 4]
>
> Ah, that's nice.  So BCTableBase can be used to implement multiple
> bytecode sets right?  What support is there for setting BCTableBase,
> e.g. on return?

I have not yet looked at how multiple bytecode sets are handled in Cog.
What I proposed is enough to let different threads have different
bytecode sets, at least. Changing bytecode sets on send and return would
mean explicitly changing BCTableBase in the prologue/epilogue sequences
of each method, and that might cause some thrashing in the instruction
cache.
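
To make the idea concrete, here is a rough C model of that dispatch (my
own sketch, not SiliconSqueak microcode). In hardware each handler
occupies a 16-byte slot, which is where the << 4 comes from; here a
function-pointer table per bytecode set stands in for that, and the
handler names and tables are invented for the example.

#include <stdint.h>

typedef struct Thread Thread;
typedef void (*Handler)(Thread *);

struct Thread {
    const uint8_t *ip;        /* bytecode instruction pointer         */
    Handler *bcTableBase;     /* current bytecode set for this thread */
    int running;
};

static void doNop(Thread *t)  { (void)t; }
static void doHalt(Thread *t) { t->running = 0; }

/* one table per bytecode set; switching sets is one store to the base */
static Handler setA[256] = { doNop, doHalt };
static Handler setB[256] = { doHalt, doNop };

static void interpret(Thread *t)
{
    while (t->running) {
        uint8_t bc = *t->ip++;   /* fetch and advance                       */
        t->bcTableBase[bc](t);   /* hardware: jump to BCTableBase + (bc<<4) */
    }
}

int main(void)
{
    static const uint8_t code[] = { 0, 0, 1 };   /* nop, nop, halt */
    Thread t = { code, setA, 1 };
    interpret(&t);
    /* changing bytecode sets on send/return would be a single store,
       t.bcTableBase = setB; placed in each method prologue/epilogue,
       which is the possible cause of the I-cache thrashing mentioned
       above */
    return 0;
}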

> > [extra bit in registers and stack]
> 
> Some examples would make this clearer.  I don't understand the extra
> tag bit.  I find a single tag bit restrictive.  Is the scheme more general
> than I'm presuming?

This makes it a three level scheme:

<32 bits> 0 = raw values (like in a ByteArray or Bitmap)
<32 bits> 1 = see tag in low bits:
    <31 bits> 1 1 = SmallInteger
    <30 bits> 10 1 = Character
    <30 bits> 00 1 = oop - see object header for more details

The 33rd bit shown above doesn't exist in memory or cache, but is
created by load instructions (which must know which addresses have
tagged values and which don't, which is pretty complicated in the normal
Squeak image formats) and is checked by the store instructions. Each
integer register has an associated 33rd bit and that is saved and
restored as it is spilled to the stack.

When using this mode you can convert a SmallInteger to a 32 bit raw word
and the other way around (if it was actually a 31 bit value), but you
can't manipulate an oop in any way at the user level (you can trap to
the supervisor level and have it do that for you).
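
In C (just my own rendering of the model above, not actual hardware;
the helper names are made up), the register-plus-extra-bit view looks
something like this:

#include <stdint.h>

/* a register as the hardware sees it: the 32 bits that exist in memory
   plus the extra bit that only lives in registers and spill slots */
typedef struct {
    uint32_t bits;
    uint8_t  tagged;   /* the 33rd bit, produced by tagged loads */
} Reg;

static int isRaw(Reg r)          { return !r.tagged; }
static int isSmallInteger(Reg r) { return r.tagged && (r.bits & 1) == 1; }
static int isCharacter(Reg r)    { return r.tagged && (r.bits & 3) == 2; }
static int isOop(Reg r)          { return r.tagged && (r.bits & 3) == 0; }

/* allowed at user level: SmallInteger <-> raw, if it fits in 31 bits */
static int32_t smallIntegerValue(Reg r)
{
    return (int32_t)r.bits >> 1;            /* arithmetic shift drops the tag */
}

static Reg rawToSmallInteger(int32_t v)
{
    Reg r = { ((uint32_t)v << 1) | 1, 1 };  /* would trap if v needs > 31 bits */
    return r;
}

/* not allowed at user level: fabricating or editing an oop; a store
   checks the 33rd bit and traps to the supervisor level instead */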

Note that there is a tagged RISC-V implementation, and a second one that
is migrating from MIPS64 to RISC-V; both store the extra bits for each
word in memory:

http://www.lowrisc.org/docs/tagged-memory-v0.1/tags/
https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/

Both of these solutions are trying to improve the security of C code.
Things are simpler for us. Note that this option wouldn't be used for V3
or Spur - a secure VM needs this plus a few other things (like checking
that an image hasn't been changed by some external program) to make any
sense (so it would have to be signed or something like that).
 
> > Besides these four extensions, there is a Xteam extension which allows a
> > group of cores to work together on a single piece of code. Using this
> > will require adding a new compiler to Cog, so I won't go into it here.
> 
> I wonder if this will fit with Clément's new PhD student's work on extending
> Cog for vector instructions and with Ronie's Lowcode work.  We also want
> to extend the compiler, in this case to include vector support. 

It would probably work great! But the standard V extension is pretty
good too and might be a better option. Let me go into the history of
this:

One thing I wanted to do when SiliconSqueak started in 2009 was to take
advantage of the fact it was going to be implemented in an FPGA to add
an extra level of compilation so that the most critical loops would get
converted to hardware. In the 1990s I had helped a friend with his PhD
project (called SelfHDL), which was a step in this direction.
Unfortunately the secretive nature of FPGA tools made this very hard to
do. In 2010 I came up with the idea of an ALU Matrix coprocessor with,
for example, 8 by 8 ALUs of 8 bits each and a nice way for them to
exchange data with their neighbors. The extra compiler could then target
that instead of trying to generate hardware, and I could still take
advantage of the FPGA by changing from a few cores with coprocessors to
more cores without them, depending on the needs of the compiled code.

> https://www.researchgate.net/publication/312029084_2014_poster_reconfiguration

A big problem was that the coprocessor had a lot of internal state and
also a big program memory which made sharing it between different tasks
very costly (this was the worst problem of the Intel i860). In 2016
(after my PhD deadline had passed anyway) I worked on this by making the
ALU matrix an independent processor with an instruction cache instead of
manually loaded instruction memory (even though the Cell processors,
which were similar, turned out to be very hard to program). In 2017 I
worked on vector registers very much like the RISC-V V Extension. Some
of the code that the ALU Matrix could handle was not very regular and
the vectors don't help in those cases.

So at the end of 2017 I came up with a way to make several simple cores
work together when needed but go their own way otherwise. This would
work with fixed hardware instead of changing things in an FPGA. Not as
interesting for a PhD but better for a commercial product. The idea is
that you can map some registers to channels like in the old Transputer.
So you might say that register x12 in core 7 goes to register x10 in
core 11. Trying to read from a channel that hasn't been written to yet
will block, as will trying to write a second time to a channel whose
previous value hasn't been read yet.
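
As a software analogue (this is only an illustration of the blocking
rules, not the hardware design, and the names are made up), each mapped
register pair behaves like a one-slot channel:

#include <pthread.h>
#include <stdint.h>

/* one-slot channel: models a producer register wired to a consumer
   register, e.g. x12 of core 7 feeding x10 of core 11 */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    uint32_t        value;
    int             full;     /* 1 = written and not yet read */
} Channel;

#define CHANNEL_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

/* a write to the mapped register on the producing core */
void channelWrite(Channel *c, uint32_t v)
{
    pthread_mutex_lock(&c->lock);
    while (c->full)                       /* previous value not read yet */
        pthread_cond_wait(&c->changed, &c->lock);
    c->value = v;
    c->full = 1;
    pthread_cond_broadcast(&c->changed);
    pthread_mutex_unlock(&c->lock);
}

/* a read of the mapped register on the consuming core */
uint32_t channelRead(Channel *c)
{
    uint32_t v;
    pthread_mutex_lock(&c->lock);
    while (!c->full)                      /* nothing written yet */
        pthread_cond_wait(&c->changed, &c->lock);
    v = c->value;
    c->full = 0;
    pthread_cond_broadcast(&c->changed);
    pthread_mutex_unlock(&c->lock);
    return v;
}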

Running the same code on all cores will be a good approximation of a
vector system, but more costly since each core has its own control
logic, PC and so on. But you might compile a basic block to run on a
team of 6 cores, for example, and have each core do a different thing.
This is the software equivalent of an out-of-order single core (since
different cores can move forward at different rates except when blocked
by the channels). Here I am thinking not only of an advanced Cog but
also of what would be good for Qemu or MAME. This is similar to the old
RAW project from MIT.

> My ideas are probably too naive, and too specific to the current Cog VM.  But
> they're simple things like- having a conditional move which doesn't raise an
> exception if it doesn't move data make writing the class test fast.  Unfortunately
> the x86/x86_64 conditional move raises an exception if given an invalid address
> when the condition is false, so one can't use it to do the "if the value is untagged, 
> fetch the class" operation without a jump, because when the value is tagged, even
> though no data is moved, an exception is raised.

I had no idea. The description of the instruction says it loads the
source into a temporary register and then if the condition is true it
stores the temporary register in the destination. So the trap happens
before the condition is even looked at.

Note that your use of "untagged" and mine above are very different. But
I got what you meant (oop).
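
For reference, the pattern you are describing looks roughly like this in
C; the field offset and the smallIntegerClass global are placeholders,
not Spur's actual layout. Since the x86 conditional move still touches
its memory operand even when the condition is false, the load has to
stay behind an explicit branch:

#include <stdint.h>

typedef uintptr_t oop;

static oop smallIntegerClass;        /* placeholder for the real global */

static oop classOf(oop obj)
{
    if (obj & 1)                     /* immediate SmallInteger          */
        return smallIntegerClass;    /* no memory access at all         */
    return *(oop *)(obj + 4);        /* hypothetical class field offset */
}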

> - having a status register, or status bits, or bits in the tagged arithmetic instruction
> itself, which define what is the valid tag pattern for tagged arithmetic instructions
> would be nice.  VW on SPARC never used the tagged arithmetic instructions because
> they mandated 00 as the tag pattern for tagged integers.  Sad.  Having a non-trapping
> tagged arithmetic instruction (add tagged and skip if untagged) would be nice
> (because traps are . 

In the first few versions of SiliconSqueak I had configurable tag
hardware. This is the description of the tagConfig register:

# Tag configuration defines the operation of the two detagging units
# (associated with operands A and B) and the retagging unit (associated
# with the destination). The lowest 16 bits of the register indicate
# valid SmallInteger combinations of d31, d30, d1 and d0. The next
# higher 4 bits are ANDed to d31, d30, d1 and d0 when converting from
# tagged to untagged SmallIntegers, while 4 other bits are ORed to d31,
# d30, d1 and d0 when converting from untagged to tagged SmallIntegers.
# 2 bits indicate how much to shift right when converting from tagged
# to untagged SmallIntegers, and the same bits indicate how much to
# shift left for the reverse operation. The top 6 bits are undefined.
#
# For Squeak the bottom bit is set to 1 for SmallIntegers, so this
# register must be set to hex value 011EAAAA. The AAAA defines all odd
# values as valid SmallIntegers. The E will clear d0 when converting to
# raw bits and the bottom 1 will set it when retagging. The top 1 will
# divide the tagged value by 2 and multiply it back when retagging.
# For Self the bottom two bits are 0 for SmallIntegers, so this register
# must be set to hex value 020F1111. An option that works well in
# hardware but is complicated to deal with in software is when the top
# two bits must match in SmallIntegers. This can be handled by setting
# this register to hex value 000FF00F.
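
Reading that layout literally (valid-combination mask in bits 15:0, AND
mask in bits 19:16, OR mask in bits 23:20, shift amount in bits 25:24 -
these positions are my inference from the prose, not a spec), a C model
of one detagging unit and the retagging unit would be something like:

#include <stdint.h>

#define TAGCFG_SQUEAK 0x011EAAAAu   /* low bit 1 marks a SmallInteger */
#define TAGCFG_SELF   0x020F1111u   /* low two bits 00                */
#define TAGCFG_TOP2   0x000FF00Fu   /* top two bits must match        */

/* build the 4-bit index d31,d30,d1,d0 used to probe the valid mask */
static unsigned tagIndex(uint32_t w)
{
    return ((w >> 28) & 0xC) | (w & 0x3);
}

/* spread a d31,d30,d1,d0 nibble back onto those bit positions */
static uint32_t spread(uint32_t nib)
{
    return ((nib & 0xC) << 28) | (nib & 0x3);
}

/* 1 if the word carries a valid SmallInteger tag under this config */
static int isTaggedSmallInt(uint32_t cfg, uint32_t w)
{
    return (cfg >> tagIndex(w)) & 1;
}

/* detag: AND the 4 configured bits, then arithmetic shift right */
static int32_t detag(uint32_t cfg, uint32_t w)
{
    uint32_t andMask = spread((cfg >> 16) & 0xF) | ~spread(0xF);
    unsigned shift   = (cfg >> 24) & 0x3;
    return (int32_t)(w & andMask) >> shift;
}

/* retag: shift left, then OR in the 4 configured bits */
static uint32_t retag(uint32_t cfg, int32_t v)
{
    uint32_t orMask = spread((cfg >> 20) & 0xF);
    unsigned shift  = (cfg >> 24) & 0x3;
    return ((uint32_t)v << shift) | orMask;
}

With TAGCFG_SQUEAK, for example, isTaggedSmallInt(cfg, 7) answers 1,
detag gives 3 and retag turns 3 back into 7.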

For the 2016 version I decided this was a costly complication and just
made it always use the last configuration (where the top two bits being
00 or 11 indicates a SmallInteger). In software this is very complicated
to do, but in hardware it is just a single XOR gate. This meant
converting when loading or saving V3 or Spur images, of course.

> Putting the immediate floating point encode/decode sequences in hardware would be nice.

With the D and F extensions you already have to let 32 and 64 bit floats
share the same registers, so it wouldn't add much to let tagged floats
be an option. One of the proposals is the L extension, which would allow
decimal floats; that is a lot more complicated.

> But I'm intrigued by your statistics gathering suggestion.  It would be great to have
> finer grained stats.  Although I wonder how valid those stats would be in a processor
> with extensive support for OO.

Hardly my idea - one of the RISC-V creators wrote a book about using a
quantitative approach to designing processors, after all. In fact, the
various papers and theses published for RISC-III (SOAR) are a great
example of what we need to do. And they showed how following your
intuition can have bad results. For example: all their instructions were
tagged, but in the end only the tagged ADD made the slightest difference.
They also automatically filled registers with Nil, but it didn't help.

They didn't have the statistics gathering I want, however. What they did
was to turn features on and off in their compiler and look at total
execution time for a bunch of benchmarks.

In my case it would be like running the benchmarks taking advantage of
x15 to push/pop stuff, and then running them again with explicit stack
pointer manipulation sequences. That just gives you an overall number.

-- Jecel

