[Win32] VM update (3.2 release candidate)

Mon May 6 21:48:00 UTC 2002

John,

> The change is to use a pointer to the GCC jumptable, versus the GCC 
> jumptable array itself. It removes for the powerpc 2 instructions per 
> jump, usually it would load the local address of the jumptable then 
> the index offset, now it just loads the index offset. However it does 
> something for intel under gcc.

I added the jump table dispatch (but see below for some implications)
and got:

Before:
 '120982986 bytecodes/sec; 3318063 sends/sec'
 '120868744 bytecodes/sec; 3305475 sends/sec'
 '120868744 bytecodes/sec; 3315538 sends/sec'

After:
 '94186902 bytecodes/sec; 3070202 sends/sec'
 '94325718 bytecodes/sec; 3072367 sends/sec'
 '94395280 bytecodes/sec; 3076706 sends/sec'

(on a 1.2GHz P3, WinXP) i.e., a loss in speed of roughly 22% in
bytecodes/sec. The problem with the modification is that it requires an
extra register, and all of those that can be used more or less freely
are already assigned (localSP in SP_REG, localIP in IP_REG, and
currentBytecode in CB_REG). What you see above is basically equivalent
to trading currentBytecode's register for the jump table's register.
Which makes me wonder: can you say what the register allocation is like
(e.g., for IP_REG, SP_REG, and CB_REG) on your machine?

Also, when I look at the assembly output, the code generated for holding
the current bytecode in a register is both faster and more compact (see
below).

[from:]
> 29962546 bytecodes/sec; 952275 sends/sec
[to:]
> 34725990 bytecodes/sec; 1002394 sends/sec
[on:]
> However it's a 350Mhz dual pentium II box, or was that III, 
> with 512MB of ram.

The reason I was interested in the MHz is that it gives you a rough idea
of how many "cycles per bytecode" you spend. This has been in the
ten-cycle range for some time now on Windows (and I believe on Linux as
well). Your values _do_ seem to get into the right range but there's
something funny about it. If I'm not misinterpreting your measurements
and mine, we see exactly the opposite effect when turning the jump table
pointer on and off. That sounds _very_ strange, and considering the
assembly output I can't quite imagine that the speed improvement is due
to avoiding the indirection for the jump table. Is it possible to send
me some of the assembly code of interpret() so we can compare the code
directly?
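
To make the "cycles per bytecode" estimate concrete, here is a small
sketch of the arithmetic. It simply divides the nominal clock rate by the
measured bytecode rate; the function name is made up for illustration,
and taking the nominal clock (1.2GHz here, 350MHz for your box) as the
effective rate is an assumption that ignores everything else the machine
is doing.

```c
#include <stdio.h>

/* Rough estimate only: nominal clock rate divided by measured
 * bytecodes/sec. Hypothetical helper, not part of the VM sources. */
double cycles_per_bytecode(double clock_hz, double bytecodes_per_sec)
{
    return clock_hz / bytecodes_per_sec;
}

/* Print the estimate for the four measurements quoted in this mail. */
void print_estimates(void)
{
    printf("P3 1.2GHz, CB_REG:      %.1f\n",
           cycles_per_bytecode(1.2e9, 120982986.0));
    printf("P3 1.2GHz, JP_REG:      %.1f\n",
           cycles_per_bytecode(1.2e9, 94186902.0));
    printf("350MHz box, from:       %.1f\n",
           cycles_per_bytecode(350e6, 29962546.0));
    printf("350MHz box, to:         %.1f\n",
           cycles_per_bytecode(350e6, 34725990.0));
}
```

Calling print_estimates() from a main() should put all four figures
somewhere between about 10 and 13 cycles per bytecode, which is why the
numbers look plausible in absolute terms even though the direction of
the change differs between the two machines.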

The only off-hand interpretation I have is that there may be some
compiler differences (but IIRC you said you were using 2.95.2, which
should be fine). FWIW, below is the relevant part of a common bytecode
which is affected by the changes:

	currentBytecode = byteAt(++localIP)

using JP_REG                  using CB_REG (current)
------------------            ------------------
incl %esi                     incl %esi
movzbl (%esi),%eax            movzbl (%esi),%ebx
movl %eax,104(%esp)

	break; /* e.g., goto *jumpTable[currentBytecode] */

movl 104(%esp),%edx
jmp *(%ebx,%edx,4)            jmp *_jumpTable.328(,%ebx,4)

Note that when using JP_REG instead of CB_REG we're really referencing
currentBytecode as a stack variable and that seems to hurt a lot more
than the extra indirection through jumpTable.
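
For anyone following along without the VM sources at hand, the two
dispatch styles look roughly like this at the C level. This is only a
minimal sketch using GCC's labels-as-values extension: the three opcodes
and the function names are made up, and it does not pin anything to
registers the way the real interpret() does with SP_REG, IP_REG, and
CB_REG, so the compiler is free to allocate registers differently from
the listing above.

```c
/* Requires GCC (or a compatible compiler): uses labels as values. */
enum { OP_INC, OP_DEC, OP_HALT };   /* hypothetical opcodes */

/* Dispatch by indexing the static jump table directly. */
int run_direct(const unsigned char *localIP)
{
    static void *const jumpTable[] = { &&op_inc, &&op_dec, &&op_halt };
    int acc = 0;
    unsigned char currentBytecode = *localIP;
    goto *jumpTable[currentBytecode];
op_inc:
    acc++;
    currentBytecode = *++localIP;
    goto *jumpTable[currentBytecode];
op_dec:
    acc--;
    currentBytecode = *++localIP;
    goto *jumpTable[currentBytecode];
op_halt:
    return acc;
}

/* Dispatch through a local pointer to the same kind of table. The
 * pointer is one more live value competing for a register. */
int run_via_pointer(const unsigned char *localIP)
{
    static void *const jumpTable[] = { &&op_inc, &&op_dec, &&op_halt };
    void *const *jumpTablePtr = jumpTable;
    int acc = 0;
    unsigned char currentBytecode = *localIP;
    goto *jumpTablePtr[currentBytecode];
op_inc:
    acc++;
    currentBytecode = *++localIP;
    goto *jumpTablePtr[currentBytecode];
op_dec:
    acc--;
    currentBytecode = *++localIP;
    goto *jumpTablePtr[currentBytecode];
op_halt:
    return acc;
}
```

Both functions compute the same result for the same bytecode sequence;
the only difference is whether the table address is a link-time constant
or a value held in a variable. Comparing the output of gcc -S for the
two on each platform would show whether that variable ends up in a
register or on the stack.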

Cheers,
  - Andreas