[Vm-dev] SiliconSqueak and RISC-V J Extension

Eliot Miranda eliot.miranda at gmail.com
Thu Mar 29 01:09:29 UTC 2018


Hi Jecel,

> On Mar 28, 2018, at 5:09 PM, Jecel Assumpcao Jr. <jecel at merlintec.com> wrote:
> 
> 
> Eliot,
> 
> thank you for your comments.
> 
>>> [macros and gathering data]
>> 
>> One thing that should be straight-forward to do is to modify the
>> simulator to collect instructions and attribute them to specific tasks.
> 
> You mean individual instructions or short sequences of instructions?
> Perhaps we are talking about different things. What I meant was that
> watching Clement's video about fixing a bug in the VM I noticed that the
> code for #new had been inlined into the method he was looking at. That
> same short sequence probably shows up in a lot of places. I want to know
> what percent of the time is spent in all such sequences added together.

Exactly.

> If it turns out to be 0.03% of the time then special hardware to make
> #new faster would be a bad idea.

Right, exactly.  We want to know what is cheap on current hardware and what is expensive, and what could be made cheaper.

> Most of the data in the green book was easy to get because something
> like #new would be a subroutine and normal profiling tools can tell you
> how much time is spent in a given subroutine.

Right.  The Cog JIT generates abstract instructions in an assembly-like style.  We can easily add state to the abstract instructions.  So if we modified some generation routines to set a "context" flag, such as "doing allocation", "doing dispatch", "doing store check", "doing marshalling", etc, we could label each instruction in the sequences we're interested in with that flag.  Then, when we generate, we could reduce that flag to a set of bit vectors, one bit per byte of instruction space, one bit vector per "interesting code type", and one bit vector that ors them together.  Then on simulating each instruction we can test its address against the bit vectors and find out what kind it is.  It will slow down simulation, but we're happy to pay the premium when we're gathering statistics.
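
A minimal sketch of the labelling idea, in Python rather than the VM's Slang (all names are illustrative, not Cog's, and a flag byte stands in for each bit): generation routines mark byte ranges of generated code with a category, the ranges are reduced to one vector per category plus one that ors them together, and the simulator classifies each instruction address against them.

```python
# Hypothetical sketch, not Cog code: label generated-code byte ranges by
# "context" category, then classify instruction addresses during simulation.

CATEGORIES = ("allocation", "dispatch", "store-check", "marshalling")

class CodeLabeller:
    def __init__(self, code_size):
        # One flag per byte of instruction space, one vector per category,
        # plus one vector that ors all categories together.
        self.vectors = {c: bytearray(code_size) for c in CATEGORIES}
        self.any = bytearray(code_size)

    def label(self, start, length, category):
        # Called by a generation routine while its "context" flag is set.
        for addr in range(start, start + length):
            self.vectors[category][addr] = 1
            self.any[addr] = 1

    def classify(self, addr):
        # Called per simulated instruction; slow, but an acceptable
        # premium while gathering statistics.
        if not self.any[addr]:
            return None
        return next(c for c in CATEGORIES if self.vectors[c][addr])

labeller = CodeLabeller(1024)
labeller.label(0x40, 12, "allocation")   # e.g. an inlined #new sequence
labeller.label(0x80, 6, "dispatch")
```

Counting hits per category over a benchmark run then answers Jecel's question directly: what fraction of executed instructions fall in allocation sequences, dispatch sequences, and so on.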

> 
>>  This wouldn't give you cycle counts for each kind of operation but
>> one could augment the instruction counts with max and min cycle
>> counts for a range quite easily (but getting that cycle data is tedious
>> without a convenient online form).  I don't think that Bochs has any
>> support for accurate cycle counts, or page faults etc.  But I think
>> instruction counts would be very interesting.  Does this sound worthwhile
>> to you?
> 
> The simulators I have written so far also assume infinite caches and so
> on. I am going to fix this on my next one. The x86 is pretty complicated
> (which is why you adopted Bochs in the first place, right?

Right.  I had considered trying to generate a simulator given a semantics (e.g. as in the intel manual) but Josh Gargus made me see sense and pointed me at Bochs.

> ) but an ARM
> or RISC-V simulator in Squeak that we can more easily instrument should
> be doable. But that is at the user level - page faults and such require
> simulating the supervisor level and an OS.
> 
> About instruction counts, they are certainly very important even if less
> helpful than cycle counts.

Useful enough data for not too much effort.  Even approximate cycle counts would be a lot more work and a much harder thing to prove correct.  In any case they depend on the processor implementation, and given that those implementations evolve quickly I have never thought it worthwhile to do micro-measurements to find out which instruction sequences are fast on specific versions.  People do do this and get great results.  I've simply never worked in a situation where I felt I could afford the effort.

>  
>>> [BCTableBase + (* (char *) IP++) << 4]
>> 
>> Ah, that's nice.  So BCTableBase can be used to implement multiple
>> bytecode sets right?  What support is there for setting BCTableBase,
>> e.g. on return?
> 
> I have not yet looked at how multiple bytecode sets are handled in Cog.

Claus's scheme is to maintain a single variable (at interpretation and JIT time) called bytecodeSetOffset, which has values 0, 256, 512, 768 etc, and this is added to the byte fetched.  bytecodeSetOffset must be set on activating a method and returning from one.  It is essentially the same idea as maintaining BCTableBase as a variable.
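
A sketch of that dispatch scheme (names assumed; the real interpreter is Slang, not Python): the dispatch table is the concatenation of 256-entry bytecode sets, and the offset variable selects which set a fetched byte indexes into.

```python
# Hypothetical sketch of bytecodeSetOffset-style dispatch: the offset
# (0, 256, 512, ...) is added to each fetched byte, and is updated on
# method activation and return.

def make_interpreter(tables):
    # tables: a list of 256-entry bytecode sets, concatenated into one table
    dispatch = [handler for table in tables for handler in table]
    state = {"bytecodeSetOffset": 0}

    def execute(byte):
        return dispatch[state["bytecodeSetOffset"] + byte]()

    return execute, state

# Two toy bytecode sets whose handlers just report which set handled the byte.
set0 = [lambda i=i: ("set0", i) for i in range(256)]
set1 = [lambda i=i: ("set1", i) for i in range(256)]
execute, state = make_interpreter([set0, set1])
```

Maintaining BCTableBase as a variable is the same idea expressed as a base address rather than an index offset.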

> What I proposed is enough to let different threads have different
> bytecodes sets at least. Changing bytecode sets on send and return would
> mean explicitly changing BCTableBase in the prolog/epilog sequences of
> each method and that might cause some thrashing in the instruction
> cache.

It might be worth the cost :-). And lookahead logic might reduce the cost.  But I have no real idea :-)

> 
>>> [extra bit in registers and stack]
>> 
>> Some examples would make this clearer.  I don't understand the extra
>> tag bit.  I find a single tag bit restrictive.  Is the scheme more general
>> than I'm presuming?
> 
> This makes it a three level scheme:
> 
> <32 bits> 0 = raw values (like in a ByteArray or Bitmap)
> <32 bits> 1 = see tag in low bits:
>    <31 bits> 1 1 = SmallInteger
>    <30 bits> 10 1 = Character
>    <30 bits> 00 1 = oop - see object header for more details
> 
> The 33rd bit shown above doesn't exist in memory or cache, but is
> created by load instructions (which must know which addresses have
> tagged values and which don't, which is pretty complicated in the normal
> Squeak image formats) and is checked by the store instructions. Each
> integer register has an associated 33rd bit and that is saved and
> restored as it is spilled to the stack.
> 
> When using this mode you can convert a SmallInteger to a 32 bit raw word
> and the other way around (if it was actually a 31 bit value) but you
> can't manipulate an oop in any way at the user level (you can trap to
> the supervisor level to have that do it for you).
> 
> Note that there is a tagged RISC-V implementation and a second one that
> is migrating from MIPS64 to RISC-V and both store the extra bits for
> each word in memory:
> 
> http://www.lowrisc.org/docs/tagged-memory-v0.1/tags/
> https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/
> 
> Both of these solutions are trying to improve the security of C code.
> Things are simpler for us. Note that this option wouldn't be used for V3
> or Spur - a secure VM needs this plus a few other things (like checking
> that an image hasn't been changed by some external program) to make any
> sense (so it would have to be signed or something like that).
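
The three-level scheme quoted above can be sketched as follows. The 33rd bit does not exist in memory, so it is modeled here as a separate flag alongside the 32-bit word; the decoding of the low tag bits follows the table above.

```python
# Sketch of the three-level tag scheme: a per-register "tagged" bit
# (created by loads, checked by stores) plus low-bit tags within the word.

def classify(tag_bit, word):
    if tag_bit == 0:
        return ("raw", word & 0xFFFFFFFF)   # e.g. ByteArray or Bitmap data
    low2 = word & 0b11
    if low2 & 1:                            # low bit 1: 31-bit SmallInteger
        return ("SmallInteger", word >> 1)
    if low2 == 0b10:                        # low bits 10: 30-bit Character
        return ("Character", word >> 2)
    return ("oop", word)                    # low bits 00: see object header
```

In hardware the classification is implicit in the instructions; the point of the sketch is only to make the bit layout concrete.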
> 
>>> Besides these four extensions, there is a Xteam extension which allows a
>>> group of cores to work together on a single piece of code. Using this
>>> will require adding a new compiler to Cog, so I won't go into it here.
>> 
>> I wonder if this will fit with Clément's new PhD student's work on extending
>> Cog for vector instructions and with Ronie's Lowcode work.  We also want
>> to extend the compiler, in this case to include vector support. 
> 
> It would probably work great! But the standard V extension is pretty
> good too and might be a better option. Let me go into the history of
> this:
> 
> One thing I wanted to do when SiliconSqueak started in 2009 was to take
> advantage of the fact it was going to be implemented in an FPGA to add
> an extra level of compilation so that the most critical loops would get
> converted to hardware. I had helped a friend with his PhD in the 1990s
> that was a step in this direction (his project was called SelfHDL).
> Unfortunately the secretive nature of FPGA tools made this very hard to
> do. In 2010 I came up with the idea of an ALU Matrix coprocessor with,
> for example, 8 by 8 ALUs of 8 bits each that had a nice way to exchange
> data with their neighbors. So the extra compiler could target that
> instead of trying to generate hardware and I could still take advantage
> of the FPGA by changing from few cores with coprocessors to more cores
> without depending on the needs of the compiled code.
> 
>> https://www.researchgate.net/publication/312029084_2014_poster_reconfiguration
> 
> A big problem was that the coprocessor had a lot of internal state and
> also a big program memory which made sharing it between different tasks
> very costly (this was the worst problem of the Intel i860). In 2016
> (after my PhD deadline had passed anyway) I worked on this by making the
> ALU matrix an independent processor with an instruction cache instead of
> manually loaded instruction memory (even though the Cell processors,
> which were similar, turned out to be very hard to program). In 2017 I
> worked on vector registers very much like the RISC-V V Extension. Some
> of the code that the ALU Matrix could handle was not very regular and
> the vectors don't help in those cases.
> 
> So at the end of 2017 I came up with a way to make several simple cores
> work together when needed but go their own way otherwise. This would
> work with fixed hardware instead of changing things in an FPGA. Not as
> interesting for a PhD but better for a commercial product. The idea is
> that you can map some registers to channels like in the old Transputer.
> So you might say that register x12 in core 7 goes to register x10 in
> core 11. Trying to read from a channel that hasn't been written to yet
> will block as will trying to write a second time to a channel which
> hasn't read the previous value.
> 
> Running the same code on all cores will be a good approximation of a
> vector system, but more costly as each core has its own control logic,
> PC and so on. But you might compile a basic block to run on a team of 6
> cores, for example, and have each core do different things. This is
> the software equivalent of an out-of-order single core (since different
> cores can move forward at different rates except when blocked by the
> channels). Here I am thinking not only of an advanced Cog but what would
> be good for Qemu or MAME. This is similar to the old RAW project from
> MIT.
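
The channel semantics described above amount to a single-slot rendezvous buffer per register mapping. A small sketch (assumed semantics; "blocking" is modeled by a would-block result instead of stalling a core):

```python
# Hypothetical sketch of a register-mapped channel, e.g. core 7's x12
# feeding core 11's x10: one value in flight, writer blocks until the
# previous value is read, reader blocks until a value is written.

class Channel:
    def __init__(self):
        self.full = False
        self.value = None

    def try_write(self, v):
        if self.full:
            return False       # previous value unread: the core would stall
        self.value, self.full = v, True
        return True

    def try_read(self):
        if not self.full:
            return None        # nothing written yet: the core would stall
        self.full = False
        return self.value
```

These stall conditions are what let cores in a team run ahead of each other out of order, synchronizing only at the channel accesses.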
> 
>> My ideas are probably too naive, and too specific to the current Cog VM.  But
>> they're simple things like - having a conditional move which doesn't raise an
>> exception if it doesn't move data, to make writing the class test fast.  Unfortunately
>> the x86/x86_64 conditional move raises an exception if given an invalid address
>> when the condition is false, so one can't use it to do the "if the value is untagged, 
>> fetch the class" operation without a jump, because when the value is tagged, even
>> though no data is moved, an exception is raised.
> 
> I had no idea. The description of the instruction says it loads the
> source into a temporary register and then if the condition is true it
> stores the temporary register in the destination. So the trap happens
> before the condition is even looked at.
> 
> Note that your use of "untagged" and mine above are very different. But
> I got what you meant (oop).
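
A sketch of the operation the conditional move cannot safely do on x86: fetch the class from the object header only when the value is an oop, which today needs an explicit branch. Names and heap layout are illustrative, not Cog's (and in Python the "heap" is just a dictionary).

```python
# Hypothetical sketch of the tag-test-then-fetch-class sequence: the load
# from the header must be guarded by a branch, because a conditional move
# would still fault on the tagged value reinterpreted as an address.

SMALL_INTEGER_CLASS = "SmallInteger"

def fetch_class(value, heap):
    if value & 1:                       # tagged: no memory access at all
        return SMALL_INTEGER_CLASS
    return heap[value]["class"]         # oop: load class from object header

heap = {0x100: {"class": "Point"}}      # toy heap with one object
```

A cmov that suppressed the fault when the condition is false would collapse the branch into straight-line code.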
> 
>> - having a status register, or status bits, or bits in the tagged arithmetic instruction
>> itself, which define what is the valid tag pattern for tagged arithmetic instructions
>> would be nice.  VW on SPARC never used the tagged arithmetic instructions because
>> they mandated 00 as the tag pattern for tagged integers.  Sad.  Having a non-trapping
>> tagged arithmetic instruction (add tagged and skip if untagged) would be nice
>> (because traps are . 
> 
> In the first few versions of SiliconSqueak I had configurable tag
> hardware. This is the description of the tagConfig register:
> 
> # Tag configuration defines the operation of the two detagging units (associated with operands
> # A and B) and the retagging unit (associated with the destination).  The lowest 16 bits of the
> # register indicate valid SmallInteger combinations of d31, d30, d1 and d0.  The next higher 4
> # bits are ANDed to d31, d30, d1 and d0 when converting from tagged to untagged SmallInteger
> # while 4 other bits are ORed to d31, d30, d1 and d0 when converting from untagged to tagged
> # SmallIntegers.  2 bits indicate how much to shift right when converting from tagged to untagged
> # SmallIntegers and the same bits indicate how much to shift left for the reverse operation.  The
> # top 6 bits are undefined.
> #
> # For Squeak the bottom bit is set to 1 for SmallIntegers, so this register must be set to hex
> # value 011EAAAA.  The AAAA defines all odd values as valid SmallIntegers.  The E will clear
> # d0 when converting to raw bits and the bottom 1 will set it when retagging.  The top 1 will divide
> # the tagged value by 2 and multiply it back when retagging.  For Self the bottom two bits are 0
> # for SmallIntegers, so this register must be set to hex value 020F1111.  An option that works well
> # in hardware but is complicated to deal with in software is when the top two bits must match in
> # SmallIntegers.  This can be handled by setting this register to hex value 000FF00F.
> 
> For the 2016 version I decided this was a costly complication and just
> made it always use the last configuration (where top two bits of 00 or
> 11 indicate a SmallInteger). In software this is very complicated to do
> but in hardware it is just a single XOR gate. This meant converting
> when loading or saving V3 or Spur images, of course.
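
The fixed 2016 check is compact enough to show directly: a 32-bit word is a SmallInteger when its top two bits match, which in hardware is the single XOR gate on d31 and d30 mentioned above.

```python
# Sketch of the "top two bits must match" SmallInteger test: one XOR of
# bits 31 and 30 of the 32-bit word.

def is_small_integer(word):
    d31 = (word >> 31) & 1
    d30 = (word >> 30) & 1
    return (d31 ^ d30) == 0     # 00 or 11 at the top: SmallInteger
```

In software the awkwardness is that the SmallInteger range is split across the top and bottom of the unsigned 32-bit space, which is why it is complicated to deal with outside hardware.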
> 
>> Putting the immediate floating point encode/decode sequences in hardware would be nice.
> 
> With the D and F extensions you already have to let 32 and 64 bit floats
> share the same registers. So it wouldn't add much to let tagged floats
> be an option. One of the proposals is the L extension which would allow
> decimal floats. That is a lot more complicated.
> 
>> But I'm intrigued by your statistics gathering suggestion.  It would be great to have
>> finer grained stats.  Although I wonder how valid those stats would be in a processor
>> with extensive support for OO.
> 
> Hardly my idea - one of the RISC-V creators wrote a book about using a
> Quantitative Approach to designing processors, after all. In fact, the
> various papers and theses published for RISC-III (SOAR) are a great
> example of what we need to do. And they showed how following your
> intuition can have bad results. For example: all their instructions were
> tagged but in the end only the tagged ADD made the slightest difference.
> They also automatically filled registers with Nil but it didn't help.
> 
> They didn't have the statistics gathering I want, however. What they did
> was to turn features on and off in their compiler and look at total
> execution time for a bunch of benchmarks.
> 
> In my case it would be like running the benchmarks where we take
> advantage of using x15 to push/pop stuff and then run them again but
> with explicit stack pointer manipulation sequences. That just gives you
> an overall number.
> 
> -- Jecel
