[Vm-dev] Cog status & FFI directions [was rearchitecting the FFI implementation for reentrancy]
Igor Stasenko
siguctua at gmail.com
Fri Aug 7 13:04:47 UTC 2009
2009/8/7 Andreas Raab <andreas.raab at gmx.de>:
>
> Eliot Miranda wrote:
>>
>> The first incarnation of the Cog JIT is complete (for x86 only) and in use
>> at Qwaq. We are gearing up for a new server release and the Cog VM is the
>> Vm beneath it. The next client release will include it also. This VM has a
>> naive code generator (every push or pop in the bytecode results in a push or
>> pop in machine code) but good inline caching. Performance is as high as 5x
>> the current interpreter for certain computer-language-shootout benchmarks.
>> The naive code generator means there is poor loop performance (1 to: n do:
>> ... style code can be 4 times slower than VisualWorks) and the object model
>> means there is no machine code instance creation and no machine code at:put:
>> primitive. But send performance is good and block activation almost as fast
>> as VisualWorks. In our real-world experience we were last week able to run
>> almost three times as many Qwaq Forums clients against a QF server running
>> on the Cog VM as we were able to atop the interpreter. So the Cog JIT
>> is providing significant speedups in real-world use.
>
> Indeed. Here some numbers that I took earlier this year:
>
> VM version       bc/sec       sends/sec    Macro1  Macro2   Macro5  Total
> Closure(3.11.2)  198,295,894   5,801,773   3124ms  79333ms  9935ms  92411ms
> Stack (2.0.10)   178,521,617   8,141,165   2136ms  43081ms  6874ms  52117ms
It was always confusing to me how it is possible to have a higher send
rate and a lower bytecode-execution rate at the same time.
The way tinyBenchmarks calculates these figures is a tricky one.
> Cog (current)    199,221,789  17,509,420    982ms  29392ms  4053ms  34445ms
>
> Stack vs. Closure  0.9   1.4   1.46  1.84  1.45  1.77
> Cog vs. Stack      1.12  2.16  2.17  1.46  1.69  1.51
> Cog vs. Closure    1.0   3.0   3.18  2.7   2.45  2.68
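The seemingly paradoxical Stack-vs.-Closure row (bytecodes slower, sends faster) is possible because the two tinyBenchmarks figures come from two independent microbenchmarks: a bytecode-heavy loop for bytecodes/sec and a send-heavy recursive fib for sends/sec. A VM change that makes message activation cheaper while adding a little per-bytecode overhead moves the two rates in opposite directions. A minimal sketch (Python rather than Squeak; the variable names are mine), just dividing the numbers from the table above:

```python
# The two tinyBenchmarks figures are measured by separate workloads,
# so they need not move together. Figures taken from the table above.
closure = {"bc_per_sec": 198_295_894, "sends_per_sec": 5_801_773}
stack   = {"bc_per_sec": 178_521_617, "sends_per_sec": 8_141_165}

# Ratios of the independently measured rates:
bc_ratio    = stack["bc_per_sec"]    / closure["bc_per_sec"]
sends_ratio = stack["sends_per_sec"] / closure["sends_per_sec"]

print(f"Stack vs. Closure, bytecodes: {bc_ratio:.2f}x")    # ~0.90: slightly slower
print(f"Stack vs. Closure, sends:     {sends_ratio:.2f}x") # ~1.40: much faster
```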
>
> As a total improvement in performance Cog ranks at approx. 2.7x faster in
> macro benchmarks than what we started from. That's a pretty decent bit of
> speedup for real-world applications.
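For the record, the ratio rows in the table are simply quotients of the corresponding times; the "Total" column can be checked directly from the macro totals (a trivial Python check, values in ms from the table above):

```python
# Macro-benchmark totals from the table (ms):
closure_total, stack_total, cog_total = 92_411, 52_117, 34_445

# speedup = old total / new total
print(round(closure_total / cog_total, 2))    # 2.68  (Cog vs. Closure)
print(round(stack_total / cog_total, 2))      # 1.51  (Cog vs. Stack)
print(round(closure_total / stack_total, 2))  # 1.77  (Stack vs. Closure)
```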
>
> Compare this (for example) with j3 [1] which despite a speedup of 6x in
> microbenchmarks only provided a 2x speedup in the macros.
>
> [1] http://aspn.activestate.com/ASPN/Mail/Message/squeak-list/2369033:
>
> "Of course, that was 2001. Revisiting the benchmarks is kind of
> interesting...
>
> Interp: '43805612 bytecodes/sec; 1325959 sends/sec'
> J3: '135665076 bytecodes/sec; 8100691 sends/sec'
>
> Today: (PowerBookG4 1.5GHz), interp:
>
> '114387846 bytecodes/sec; 5152891 sends/sec'
>
> But the microBenchmarks don't tell the whole story: Even with a speedup
> of factor 6 in sends, we only saw the performance doubled on real world
> benchmarks (e.g. the MacroBenchmarks)."
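The gap between micro and macro speedups is just Amdahl's law: if sends account for a fraction f of macro-benchmark runtime and get s times faster, the overall speedup is 1/((1-f) + f/s). A sketch (the 60% figure below is back-calculated from the quoted 6x/2x numbers, not a measured value):

```python
def overall_speedup(f, s):
    """Amdahl's law: fraction f of runtime sped up by factor s,
    the remaining (1 - f) left untouched."""
    return 1.0 / ((1.0 - f) + f / s)

# Back-solving: a 2x overall result with s = 6 implies f = 0.6,
# i.e. sends accounted for roughly 60% of macro runtime (inferred).
print(overall_speedup(0.6, 6))  # 2.0
```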
>
>
> Cheers,
> - Andreas
>
--
Best regards,
Igor Stasenko AKA sig.