Hi David,

    The difference looks to me to be due to the fact that successFlag is flat and primFailCode is in the VM struct.  Try generating a VM where either primFailCode is also flat or, better still, all variables are flat.  In my experience the flat form is faster on x86 (with both the Intel and gcc compilers; not tested with LLVM yet).

    BTW, if you use the Cog generator, it generates accesses to variables that might be in the VM struct as GIV(theVariableInQuestion), where GIV stands for global interpreter variable.  This lets one choose at compile time, rather than at generation time, whether these variables are kept in a struct or as separate variables.  The choice is controlled by the USE_GLOBAL_STRUCT compile-time constant, e.g. gcc -DUSE_GLOBAL_STRUCT=0 gcc3x-interp.c.
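
A minimal sketch of how a GIV-style macro can switch between the two layouts at compile time. This is an illustration under assumptions, not the generated Cog source: the struct name VMGlobals, the variable theVMGlobals, the helper givDemo and the default value of USE_GLOBAL_STRUCT are all hypothetical. Only GIV, USE_GLOBAL_STRUCT, foo and primFailCode come from the discussion.

```c
#include <assert.h>

/* Assumption: default to the struct layout unless overridden,
   e.g. with gcc -DUSE_GLOBAL_STRUCT=0 */
#ifndef USE_GLOBAL_STRUCT
# define USE_GLOBAL_STRUCT 1
#endif

#if USE_GLOBAL_STRUCT
/* Struct form: accesses go through a pointer to one VM struct.
   The struct and variable names here are hypothetical. */
struct VMGlobals { int primFailCode; };
struct VMGlobals theVMGlobals;
struct VMGlobals *foo = &theVMGlobals;
# define GIV(var) (foo->var)
#else
/* Flat form: each interpreter variable is an ordinary global. */
int primFailCode;
# define GIV(var) var
#endif

/* Code written against GIV() compiles unchanged under either layout. */
int givDemo(void)
{
    GIV(primFailCode) = 0;              /* start in the success state */
    assert(GIV(primFailCode) == 0);
    if (GIV(primFailCode) == 0) {       /* record a failure once */
        GIV(primFailCode) = 1;
    }
    return GIV(primFailCode);
}
```

Building the same file once with -DUSE_GLOBAL_STRUCT=1 and once with -DUSE_GLOBAL_STRUCT=0, then comparing timings, should show whether the layout accounts for the difference.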


On Sun, May 22, 2011 at 8:54 AM, David T. Lewis <lewis@mail.msen.com> wrote:

I have been trying to gradually update trunk VMMaker to better align
with oscog VMMaker (an admittedly slow process, but hopefully still
worthwhile).  I have gotten the interpreter primitives moved into class
InterpreterPrimitives and verified no changes to generated code. This
greatly reduces the clutter in class Interpreter, so it's a nice change
I think.

My next step was to update all of the primitives to use the #primitiveFailFor:
idiom, in which the successFlag variable is replaced with primFailCode
(integer value, 0 for success, 1, 2, 3... for failure codes). This would
get us closer to the point where the standard interpreter and stack/cog
would use a common set of primitives. A lot of changes were required for
this, but the resulting VM works fine ... except for performance.

On a standard interpreter, use of primFailCode seems to result in a
nearly 12% reduction in bytecode performance as measured by tinyBenchmarks:

Standard interpreter (using successFlag):
 0 tinyBenchmarks. '439108061 bytecodes/sec; 15264622 sends/sec'
 0 tinyBenchmarks. '433164128 bytecodes/sec; 14740358 sends/sec'
 0 tinyBenchmarks. '445993031 bytecodes/sec; 15040691 sends/sec'
 0 tinyBenchmarks. '440999138 bytecodes/sec; 15052960 sends/sec'
 0 tinyBenchmarks. '445993031 bytecodes/sec; 14485815 sends/sec'

After updating the standard interpreter (using primFailCode):
 0 tinyBenchmarks. '393241167 bytecodes/sec; 14066256 sends/sec'
 0 tinyBenchmarks. '392036753 bytecodes/sec; 15040691 sends/sec'
 0 tinyBenchmarks. '393846153 bytecodes/sec; 14272953 sends/sec'
 0 tinyBenchmarks. '400625978 bytecodes/sec; 14991818 sends/sec'
 0 tinyBenchmarks. '393846153 bytecodes/sec; 15176750 sends/sec'

This is a much larger performance difference than I expected to see.
Actually I expected no measurable difference at all, and I was just
testing to verify this. But 12% is a lot, so I want to ask if I'm
missing something?

The changes to generated code generally take the form of:

Testing success status, original:
       if (successFlag) { ... }

Testing success status, new:
       if (foo->primFailCode == 0) { ... }

Setting failure status, original:
       successFlag = 0;

Setting failure status, new:
       if (foo->primFailCode == 0) {
               foo->primFailCode = 1;
       }

My approach to doing the updates was as follows:
- Replace all occurrences of "successFlag := true" with "self initPrimCall",
 which initializes primFailCode to 0.
- Replace all "successFlag := false" with "self primitiveFail".
- Replace all "successFlag ifTrue: [] ifFalse: []" with
 "self successful ifTrue: [] ifFalse: []".
- Update #primitiveFail, #failed and #success: to use primFailCode rather
 than successFlag.
- Remove successFlag variable.
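
In C terms, the updated helpers might look roughly like the sketch below. It is an assumption-laden illustration, not the generated interp.c: primFailCode is shown as a flat variable for brevity, the function names mirror the Smalltalk selectors, and the behavior of primitiveFailFor (overwriting any earlier code) is assumed rather than taken from the actual sources.

```c
#include <assert.h>

/* 0 means success; 1, 2, 3... are failure codes. */
int primFailCode = 0;

/* Replaces "successFlag := true". */
void initPrimCall(void) { primFailCode = 0; }

/* Replaces reads of successFlag. */
int successful(void) { return primFailCode == 0; }
int failed(void)     { return primFailCode != 0; }

/* Replaces "successFlag := false": keep the first failure code. */
void primitiveFail(void)
{
    if (primFailCode == 0) {
        primFailCode = 1;
    }
}

/* Assumed behavior: set a specific failure code unconditionally. */
void primitiveFailFor(int reasonCode) { primFailCode = reasonCode; }

/* success: only ever moves the state toward failure. */
void success(int ok)
{
    if (!ok) {
        primitiveFail();
    }
}

/* Small usage check exercising the helpers together. */
void primFailSelfCheck(void)
{
    initPrimCall();
    assert(successful());
    success(0);
    assert(failed() && primFailCode == 1);
    initPrimCall();
}
```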

Obviously I don't want to publish the code on SqS/VMMaker, but I can mail
an interp.c if anyone wants to see the gory details (it is too large to
post on this mailing list, though).

Any advice appreciated. I suspect I'm missing something basic here.
