Hi Levente,
On Tue, Jun 22, 2010 at 2:28 PM, Levente Uzonyi <leves@elte.hu> wrote:
Hi,
I was curious how much speedup Cog gives when the code has only a few message sends, so I ran the following "benchmark":
| s1 s2 |
Smalltalk garbageCollect.
s1 := String streamContents: [ :stream |
    1000 timesRepeat: [ 'aab' do: [ :e | stream nextPut: e; cr ] ] ].
s2 := String streamContents: [ :stream |
    1000 timesRepeat: [ 'abb' do: [ :e | stream nextPut: e; cr ] ] ].
[ TextDiffBuilder from: s1 to: s2 ] timeToRun.
The above pattern makes TextDiffBuilder >> #lcsFor:and: run for a while. My results are a bit surprising:
CogVM: 2914
SqueakVM: 1900
MessageTally (I wonder if it's accurate with Cog at all) shows that CogVM's garbage collector does a bit less work, but CogVM runs the code slower than SqueakVM:
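For context, tallies like the ones below can be produced with the standard MessageTally entry point (a usage sketch; it assumes the s1 and s2 built in the earlier snippet are still in scope):

```smalltalk
"Profile the diff under MessageTally and report the send tree
and GC statistics. s1 and s2 are the strings built above."
MessageTally spyOn: [ TextDiffBuilder from: s1 to: s2 ]
```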
CogVM:
**Leaves**
60.6% {1886ms} TextDiffBuilder>>lcsFor:and:
36.2% {1127ms} DiffElement>>=
1.8% {56ms} ByteString(String)>>=
**GCs**
full        1 totalling 153ms (5.0% uptime), avg 153.0ms
incr        21 totalling 76ms (2.0% uptime), avg 4.0ms
tenures     13 (avg 1 GCs/tenure)
root table  0 overflows
SqueakVM:
**Leaves**
46.8% {888ms} TextDiffBuilder>>lcsFor:and:
35.3% {670ms} DiffElement>>=
9.8% {186ms} ByteString(String)>>compare:with:collated:
6.9% {131ms} ByteString(String)>>=
**GCs**
full        3 totalling 254ms (13.0% uptime), avg 85.0ms
incr        301 totalling 110ms (6.0% uptime), avg 0.0ms
tenures     272 (avg 1 GCs/tenure)
root table  0 overflows
Is Cog slower because #to:do: loops are not optimized, or is there some other reason for the slowdown?
I can't say for sure without profiling (you'll find a good VM profiler, QVMProfiler, in the image in the tarball, though as yet it works only on Mac OS). But I expect the reason is the cost of invoking interpreter primitives from machine code. Cog implements only a few primitives in machine code (arithmetic, at: and block value); for all others (e.g. nextPut: above) it executes the interpreter primitives. lcsFor:and: uses at:put: heavily, and Cog is using the interpreter version of it. But invoking an interpreter primitive from machine code costs more than invoking it from the interpreter, because of the system-call-like glue between the machine-code stack pages and the C stack on which the interpreter primitive runs.
Three primitives that are currently interpreter primitives but must be implemented in machine code for better performance are new/basicNew, new:/basicNew: and at:put:. I've avoided implementing these in machine code because the current object representation is so complex; instead I'm about to start work on a simpler object representation. Once I have that I'll implement these primitives, and then the speed difference should tilt the other way.
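A quick way to probe this explanation (a hypothetical microbenchmark, not from the thread) is to time a loop dominated by at:put: on both VMs; if the interpreter-primitive call-out is the bottleneck, Cog's advantage should shrink on this snippet too:

```smalltalk
"Time an at:put:-heavy loop. The arithmetic and #to:do: are cheap
under Cog; the at:put: sends go through the interpreter primitive."
| a |
a := Array new: 1000.
[ 1 to: 2000000 do: [ :i |
    a at: i \\ 1000 + 1 put: i ] ] timeToRun
```

(Smalltalk binary operators associate left to right, so the index expression is (i \\ 1000) + 1, keeping it in bounds for a 1-based 1000-element Array.)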
Of course if anyone would like to implement these in the context of the current object representation be my guest and report back asap...
best,
Eliot