[Vm-dev] Cog status & FFI directions [was rearchitecting the FFI implementation for reentrancy]

Andreas Raab andreas.raab at gmx.de
Fri Aug 7 07:19:10 UTC 2009


Eliot Miranda wrote:
> The first incarnation of the Cog JIT is complete (for x86 only) and in 
> use at Qwaq.  We are gearing up for a new server release and the Cog VM 
> is the Vm beneath it.  The next client release will include it also. 
>  This VM has a naive code generator (every push or pop in the bytecode 
> results in a push or pop in machine code) but good inline cacheing. 
>  Performance is as high as 5x the current interpreter for certain 
> computer-language-shootout benchmarks.  The naive code generator means 
> there is poor loop performance (1 to: n do: ... style code can be 4 
> times slower than VisualWorks) and the object model means there is no 
> machine code instance creation and no machine code at:put: primitive. 
>  But send performance is good and block activation almost as fast as 
> VisualWorks.  In our real-world experience we were last week able to run 
> almost three times as many Qwaq Forums clients against a QF server 
> running on the Cog VM as we were able to with the interpreter VM.  So 
> the Cog JIT is providing significant speedups in real-world use.

Indeed. Here are some numbers that I took earlier this year:

VM version           bc/sec  sends/sec  Macro1  Macro2  Macro5    Total
Closure(3.11.2) 198,295,894  5,801,773  3124ms  79333ms 9935ms  92411ms
Stack (2.0.10)  178,521,617  8,141,165  2136ms  43081ms 6874ms  52117ms
Cog (current)   199,221,789 17,509,420   982ms  29392ms 4053ms  34445ms
Stack vs. Closure      0.9        1.4     1.46     1.84   1.45     1.77
Cog vs. Stack          1.12       2.16    2.17     1.46   1.69     1.51
Cog vs. Closure        1.0        3.0     3.18     2.7    2.45     2.68

As a total improvement, Cog comes in at approx. 2.7x faster in the 
macro benchmarks than what we started from. That's a pretty decent 
speedup for real-world applications.
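The ratio rows in the table follow directly from the raw timings; as a quick sanity check, here is the "Cog vs. Closure" row recomputed from the macro-benchmark times given above (a sketch, using only numbers copied from the table):

```python
# Macro-benchmark times in ms, copied from the table above.
closure = {"Macro1": 3124, "Macro2": 79333, "Macro5": 9935, "Total": 92411}
cog     = {"Macro1":  982, "Macro2": 29392, "Macro5": 4053, "Total": 34445}

# Speedup of Cog over the Closure interpreter for each benchmark.
for name in closure:
    ratio = closure[name] / cog[name]
    print(f"{name}: {ratio:.2f}x")
# The Total row works out to ~2.68x, i.e. the "approx. 2.7x" figure.
```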

Compare this (for example) with j3 [1], which despite a 6x speedup in 
microbenchmarks only provided a 2x speedup in the macros.

[1] http://aspn.activestate.com/ASPN/Mail/Message/squeak-list/2369033:

"Of course, that was 2001. Revisiting the benchmarks is kind of
interesting...

Interp:     '43805612 bytecodes/sec; 1325959 sends/sec'
J3:         '135665076 bytecodes/sec; 8100691 sends/sec'

Today: (PowerBookG4 1.5GHz), interp:

             '114387846 bytecodes/sec; 5152891 sends/sec'

But the microBenchmarks don't tell the whole story: Even with a speedup
of factor 6 in sends, we only saw the performance doubled on real world
benchmarks (e.g. the MacroBenchmarks)."
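The gap between micro- and macro-benchmark speedups is what Amdahl's law predicts: a 6x improvement in send performance only helps the fraction of total runtime actually spent in sends. A minimal illustration follows; note the 60% fraction is purely hypothetical, chosen to reproduce the 6x-becomes-2x observation, not a measured figure from these benchmarks:

```python
def overall_speedup(fraction_improved, local_speedup):
    """Amdahl's law: overall speedup when only part of the runtime benefits."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)

# If (hypothetically) 60% of macro-benchmark time were spent in message
# sends, then a 6x send speedup yields only a 2x overall speedup:
print(overall_speedup(0.6, 6))  # -> 2.0
```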


Cheers,
   - Andreas