Eliot,
Right, exactly. We want to know what is cheap on current hardware, what is expensive, and what could be made cheaper.
The last one is not easy. That is my complaint about Urs Hölzle's 1995 ECOOP paper, where he concluded that the special hardware in Sparc didn't help Self. Part of his results were due to how costly traps were on Sparc V8 under Solaris (more than 1000 clock cycles), so using them to handle tags and register window overflow/underflow wasn't a good idea. He compared the code generated by his Self compiler with that generated by C and saw no difference (hardly surprising, since Sparc was optimized for C and his compiler was optimized for Sparc). The problem is that he couldn't measure the effect that any hardware features Sparc didn't include would have had if they were added. He could make his compiler use register windows or not, and report less than 1% difference. But he couldn't guess what would happen if a PIC instruction like the one I proposed were added.
http://www.cs.ucsb.edu/~urs/oocsb/papers/ecoop95-arch.pdf
Right. The Cog JIT generates abstract instructions in an assembly-like style. We can easily add state to the abstract instructions. So if we modified some generation routines to set a "context" flag, such as "doing allocation", "doing dispatch", "doing store check", "doing marshalling", etc., we could label each instruction in the sequences we're interested in with that flag.
Great idea! I had not considered looking at the Smalltalk code executing while generating an instruction instead of looking at the generated instructions. And I was thinking of Slang-to-C, which matters for the Interpreter but not for code generated by Cog.
Then, for example, when we generate the code we could reduce that flag to a set of bit vectors: one bit per byte of instruction space, one bit vector per "interesting code type", and one bit vector that ors them all together.
If you have fewer than 8 interesting categories you could instead have one label byte per instruction byte. That would be easy but wasteful, since only the bits in the byte corresponding to the start of an instruction would matter.
Then on simulating each instruction we can test its address against the bit vectors and find out what kind it is. It will slow down simulation, but we're happy to pay that price when we're gathering statistics.
You could increment counters corresponding to each of the bits.
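To make the bookkeeping concrete, here is a rough C sketch of the whole scheme. All the names (CAT_ALLOCATION, markRange, countInstruction) and the code-zone size are made up for illustration; the real thing would be Smalltalk inside the Cog simulator:

    #include <stdint.h>

    enum { CAT_ALLOCATION, CAT_DISPATCH, CAT_STORE_CHECK,
           CAT_MARSHALLING, NUM_CATEGORIES };

    #define CODE_SPACE_BYTES (1 << 20)   /* assumed size of the code zone */

    /* One bit per byte of instruction space, one vector per category,
       plus one vector that ors them all together. */
    static uint8_t categoryBits[NUM_CATEGORIES][CODE_SPACE_BYTES / 8];
    static uint8_t anyBits[CODE_SPACE_BYTES / 8];
    static uint64_t counters[NUM_CATEGORIES];

    /* Generation time: label every byte of a flagged instruction sequence. */
    static void markRange(uint32_t start, uint32_t len, int category)
    {
        for (uint32_t a = start; a < start + len; a++) {
            categoryBits[category][a / 8] |= 1 << (a % 8);
            anyBits[a / 8] |= 1 << (a % 8);
        }
    }

    /* Simulation time: only the address at which an instruction starts
       (the simulated pc) is ever tested, which is why the byte-per-byte
       label scheme above would waste most of its bits. */
    static void countInstruction(uint32_t pc)
    {
        if (!(anyBits[pc / 8] & (1 << (pc % 8))))  /* fast common case */
            return;
        for (int c = 0; c < NUM_CATEGORIES; c++)
            if (categoryBits[c][pc / 8] & (1 << (pc % 8)))
                counters[c]++;
    }

Testing the or'd vector first keeps the common case, an instruction that belongs to no interesting category, down to a single test per simulated instruction.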
As for instruction counts, they are certainly very important, even if less helpful than cycle counts.
Useful enough data for not too much effort. Even very inaccurate cycle counts would be a lot more work and a much harder set of data to prove correct. In any case they depend on the processor implementation, and given that those implementations evolve quickly I have never thought it worthwhile to do micro-measurements to find out which instruction sequences are fast on specific versions. People do do this and get great results; I've simply never worked in a situation where I felt I could afford the effort.
For my own project I want to focus on very simple pipelined implementations, so instruction counts would be good enough. But for the J Extension working group we would need to know how the proposals would affect a more advanced implementation like BOOM (the Berkeley Out-of-Order Machine).
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-157.html
I have not yet looked at how multiple bytecode sets are handled in Cog.
Claus's scheme is to maintain a single variable (at both interpretation and JIT time) called bytecodeSetOffset, which has values 0, 256, 512, 768, etc., and this is added to the byte fetched. bytecodeSetOffset must be set on activating a method and on returning from one. It is essentially the same idea as maintaining BCTableBase as a variable.
It would be trivial to convert one to the other (just add or subtract a constant). In fact, if you arrange things so that the table is at address zero in code space, then they are the same.
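A toy C rendering of the equivalence, assuming four bytecode sets of 256 entries concatenated into one dispatch table (the layout and the helper setBytecodeSet are mine; only the two variable names come from the discussion above):

    #include <stdint.h>

    typedef void (*BytecodeHandler)(void);

    /* Four bytecode sets of 256 handlers each, in one flat table. */
    static BytecodeHandler table[4 * 256];

    /* Claus's scheme: one offset with values 0, 256, 512, 768. */
    static uint32_t bytecodeSetOffset;
    #define DISPATCH_OFFSET(byte) (table[bytecodeSetOffset + (byte)])

    /* The variable-base scheme: a pointer moved to the set's first entry. */
    static BytecodeHandler *BCTableBase = table;
    #define DISPATCH_BASE(byte) (BCTableBase[(byte)])

    /* On activating a method or returning to one, select its set.
       Converting between the two schemes is adding a constant: */
    static void setBytecodeSet(uint32_t setIndex)  /* 0..3 */
    {
        bytecodeSetOffset = setIndex << 8;         /* 0, 256, 512, 768 */
        BCTableBase = table + bytecodeSetOffset;
    }

And if the table were placed at address zero in code space, BCTableBase and the (scaled) offset would hold the very same value.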
-- Jecel