[Vm-dev] measurements and multiple bytecode sets (was: SiliconSqueak and RISC-V J Extension)

Thu Mar 29 02:42:00 UTC 2018

Eliot,

> Right, exactly.  We want to know what is cheap on current hardware
> and what is expensive, and what could be made cheaper.

The last one is not easy. That is my complaint about Urs Hölzle's 1995
ECOOP paper where he concluded that the special hardware in Sparc didn't
help Self. Part of his results were due to how costly traps were on
Sparc 8 and Solaris (more than 1000 clock cycles), so using them for
handling tags and register window overflow/underflow wasn't a good
idea. He compared the code generated by his Self compiler with that
generated by a C compiler and didn't see any difference (hardly
surprising since Sparc was optimized for C and his compiler was
optimized for Sparc). The problem is that he couldn't measure the effect
that hardware features Sparc didn't include would have had if they had
been added. He could make his compiler use register windows or not and
report less than 1% difference. But he couldn't guess what would happen
if a PIC instruction like the one I proposed were added.

http://www.cs.ucsb.edu/~urs/oocsb/papers/ecoop95-arch.pdf

> Right.  The Cog JIT generates abstract instructions in an assembly-like
> style.  We can add state to the abstract instructions easily.  So if we
> modified some generation routines to set a "context" flag, such as
> "doing allocation", "doing dispatch", "doing store check", "doing
> marshalling", etc, we could label each instruction in the sequences we're
> interested in with that flag.

Great idea! I had not considered looking at the Smalltalk code executing
while generating an instruction instead of looking at the generated
instructions. And I was thinking of Slang to C, which matters for the
Interpreter but not for code generated by Cog.

> Then, for example, when we generate we could reduce that flag to a
> set of bit vectors, one bit per byte of instruction space, one bit vector
> per "interesting code type", and one bit vector that ors them together.

If you have no more than 8 interesting categories you could have one
label byte per instruction byte, with one bit per category. That would
be easy but wasteful, since only the label byte corresponding to the
start of an instruction would matter.
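
A minimal sketch of the label-byte idea, in Python just for illustration
(the real thing would live in the Cog simulator). The category names
come from the flags quoted above; the bit assignments and the helper
names are my assumptions, not anything that exists in Cog.

```python
# Hypothetical category bits, one per "interesting code type".
ALLOCATION = 1 << 0
DISPATCH = 1 << 1
STORE_CHECK = 1 << 2
MARSHALLING = 1 << 3


def make_labels(code_size):
    """One label byte per byte of instruction space, all initially zero."""
    return bytearray(code_size)


def label_instruction(labels, address, flags):
    """Record the categories of the instruction that starts at `address`.

    Only the label byte at the start of the instruction is written, which
    is why the remaining label bytes of a multi-byte instruction are wasted.
    """
    labels[address] |= flags


# Toy code space: a 4-byte instruction at 0 and another one at 4.
labels = make_labels(16)
label_instruction(labels, 0, ALLOCATION)
label_instruction(labels, 4, DISPATCH | STORE_CHECK)
```

An instruction can belong to more than one category at once, which is
the reason for using one bit per category instead of a category number.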

>  Then on simulating each instruction we can test its address in the bit
> vectors and find out what kind it is.  It will slow down simulation, but
> we're happy to pay that price when we're gathering statistics.

You could increment counters corresponding to each of the bits.
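
Something like the following, again as a Python sketch and not actual
simulator code. It assumes a label byte per instruction byte with one
bit per category, and that the simulator can hand us the address of
each instruction it executes; `count_categories` and the trace are
made up for the example.

```python
# Hypothetical category names, bit 0 first.
CATEGORIES = ("allocation", "dispatch", "store check", "marshalling")


def count_categories(labels, executed_addresses):
    """Increment one counter per category bit set for each executed instruction."""
    counts = [0] * len(CATEGORIES)
    for address in executed_addresses:
        bits = labels[address]
        for bit in range(len(CATEGORIES)):
            if bits & (1 << bit):
                counts[bit] += 1
    return dict(zip(CATEGORIES, counts))


# Toy labels: the instruction at 0 is allocation, the one at 4 is both
# dispatch and store check.
labels = bytearray(8)
labels[0] = 0b0001
labels[4] = 0b0110

# Addresses of the instructions the simulator stepped through.
trace = [0, 4, 4, 0, 4]
totals = count_categories(labels, trace)
```

The inner loop over the bits is where the simulation slows down, but it
only runs while statistics are being gathered.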

> > About instruction counts, they are certainly very important even if less
> > helpful than cycle counts.
> 
> Useful enough data for not too much effort.  Accurate cycle counts
> will be a lot more work and much harder to prove correct.  In any case
> they depend on processor implementation, and given that those
> implementations evolve quickly I have never thought it worthwhile
> doing micro measurements to find out what are fast instruction
> sequences on specific versions.  People do do this and get great
> results.  I've simply never worked in a situation where I felt I could
> afford the effort.

For my own project I want to focus on very simple pipelined
implementations, so instruction counts would be good enough. But for the
J Extension work group we would need to know how the proposals would
affect a more advanced implementation like BOOM (Berkeley Out of Order
Machine).

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-157.html

> > I have not yet looked at how multiple bytecode sets are handled in Cog.
> 
> Claus's scheme is to maintain a single variable (at interpretation and JIT
> time) called bytecodeSetOffset, which has values 0, 256, 512, 768 etc,
> and this is added to the byte fetched.  bytecodeSetOffset must be set
> on activating a method and returning from one.  It is essentially the
> same idea as maintaining BCTableBase as a variable.

It would be trivial to convert one to the other (just add or subtract a
constant). In fact, if you arrange it so that the table is at address
zero in code space then they are the same.
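
To make the conversion concrete, here is a small Python sketch. It
assumes the dispatch tables for all bytecode sets are laid out back to
back, 256 entries per set, and that addresses count table entries and
not bytes. The function names and the toy code space are mine;
bytecodeSetOffset and BCTableBase are the only names taken from the
discussion.

```python
SET_SIZE = 256  # bytecodes per set, matching the offsets 0, 256, 512, ...


def dispatch_with_offset(code_space, table_start, bytecode_set_offset, byte):
    """bytecodeSetOffset style: the offset is added to the byte fetched."""
    return code_space[table_start + bytecode_set_offset + byte]


def dispatch_with_base(code_space, bc_table_base, byte):
    """BCTableBase style: the variable already points at the current table."""
    return code_space[bc_table_base + byte]


def base_from_offset(table_start, bytecode_set_offset):
    """Converting one to the other is just adding a constant..."""
    return table_start + bytecode_set_offset


def offset_from_base(table_start, bc_table_base):
    """...or subtracting it."""
    return bc_table_base - table_start


# Toy code space: 100 unrelated slots, then two sets of 256 handlers each.
TABLE_START = 100
code_space = [None] * TABLE_START + [
    f"set{s}_bytecode{b}" for s in range(2) for b in range(SET_SIZE)
]

offset = 1 * SET_SIZE  # second bytecode set
base = base_from_offset(TABLE_START, offset)
via_offset = dispatch_with_offset(code_space, TABLE_START, offset, 0x10)
via_base = dispatch_with_base(code_space, base, 0x10)
```

With TABLE_START set to 0 the two variables hold the same value, so no
conversion is needed at all.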

-- Jecel