Hi Levente,
On Tue, Jun 22, 2010 at 2:28 PM, Levente Uzonyi <leves@elte.hu> wrote:
Hi,
I was curious how much speedup Cog gives when the code has only a few message sends, so I ran the following "benchmark":
| s1 s2 |
Smalltalk garbageCollect.
s1 := String streamContents: [ :stream |
    1000 timesRepeat: [ 'aab' do: [ :e | stream nextPut: e; cr ] ] ].
s2 := String streamContents: [ :stream |
    1000 timesRepeat: [ 'abb' do: [ :e | stream nextPut: e; cr ] ] ].
[ TextDiffBuilder from: s1 to: s2 ] timeToRun.
The above pattern makes TextDiffBuilder >> #lcsFor:and: run for a while. My results are a bit surprising:
CogVM: 2914
SqueakVM: 1900
MessageTally (I wonder if it's accurate with Cog at all) shows that CogVM's garbage collector does a bit less work, but CogVM runs the code slower than SqueakVM:
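For context, tallies like the ones below can be produced with the standard MessageTally entry point (a usage sketch; it assumes the s1 and s2 built in the earlier snippet are still in scope):

```smalltalk
"Profile the diff under MessageTally and report the send tree
and GC statistics. s1 and s2 are the strings built above."
MessageTally spyOn: [ TextDiffBuilder from: s1 to: s2 ]
```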
CogVM:
**Leaves**
60.6% {1886ms} TextDiffBuilder>>lcsFor:and:
36.2% {1127ms} DiffElement>>=
1.8% {56ms} ByteString(String)>>=
**GCs**
full        1 totalling 153ms (5.0% uptime), avg 153.0ms
incr        21 totalling 76ms (2.0% uptime), avg 4.0ms
tenures     13 (avg 1 GCs/tenure)
root table  0 overflows
SqueakVM:
**Leaves**
46.8% {888ms} TextDiffBuilder>>lcsFor:and:
35.3% {670ms} DiffElement>>=
9.8% {186ms} ByteString(String)>>compare:with:collated:
6.9% {131ms} ByteString(String)>>=
**GCs**
full        3 totalling 254ms (13.0% uptime), avg 85.0ms
incr        301 totalling 110ms (6.0% uptime), avg 0.0ms
tenures     272 (avg 1 GCs/tenure)
root table  0 overflows
Is Cog slower because #to:do: loops are not optimized, or is there some other reason for the slowdown?
I can't say for sure without profiling (you'll find a good VM profiler, QVMProfiler, in the image in the tarball, though as yet it works only on Mac OS). But I expect the reason is the cost of invoking interpreter primitives from machine code. Cog implements only a few primitives in machine code (arithmetic, at: and block value); for all others (e.g. nextPut: above) it executes the interpreter primitives. lcsFor:and: uses at:put: heavily, and Cog is using the interpreter version of it. But invoking an interpreter primitive from machine code costs more than invoking it from the interpreter, because of the system-call-like glue between the machine-code stack pages and the C stack on which the interpreter primitive runs.
Three primitives that are currently interpreter primitives but must be implemented in machine code for better performance are new/basicNew, new:/basicNew: and at:put:. I've avoided implementing these in machine code because the current object representation is so complex; instead I'm about to start work on a simpler object representation. Once I have that I'll implement these primitives, and then the speed difference should tilt the other way.
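A quick way to probe this explanation (a hypothetical microbenchmark, not from the thread) is to time a loop dominated by at:put: on both VMs; if the interpreter-primitive call-out is the bottleneck, Cog's advantage should shrink on this snippet too:

```smalltalk
"Time an at:put:-heavy loop. The arithmetic and #to:do: are cheap
under Cog; the at:put: sends go through the interpreter primitive."
| a |
a := Array new: 1000.
[ 1 to: 2000000 do: [ :i |
    a at: i \\ 1000 + 1 put: i ] ] timeToRun
```

(Smalltalk binary operators associate left to right, so the index expression is (i \\ 1000) + 1, keeping it in bounds for a 1-based 1000-element Array.)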
Of course if anyone would like to implement these in the context of the current object representation be my guest and report back asap...
best,
Eliot