[squeak-dev] Cog performance

Levente Uzonyi leves at elte.hu
Sun Jul 4 01:07:44 UTC 2010


On Thu, 24 Jun 2010, Eliot Miranda wrote:

> On Thu, Jun 24, 2010 at 3:19 PM, Levente Uzonyi <leves at elte.hu> wrote:
>
>> On Tue, 22 Jun 2010, Eliot Miranda wrote:
>>
>> <snip>
>>
>>
>>  I can't say for sure without profiling (you'll find a good VM profiler
>>> QVMProfiler in the image in the tarball, which as yet works on MacOS
>>> only).
>>>
>>
>> This looks promising, I (or someone else :)) just have to implement
>> #primitiveExecutableModulesAndOffsets under win32 (and un*x), but that
>> doesn't seem to be easy (at least the win32 part).
>
>
> If you look at platforms/win32/vm/sqWin32Backtrace.c you'll find code that
> extracts symbols from dlls for constructing a symbolic backtrace on crashes.
> The code also uses a Teleplace.map file generated by the VM makefile which
> contains the symbols for the VM.  From this code you ought to be able to be
> able to implement a QVMProfilerWin32SymbolsManager almost entirely out of
> primitives.

Thanks for the pointers. I took the "easy" way and did pretty much the 
same you did for mac. A primitive lists the loaded modules and the 
symbols are extracted by external programs (nm or the dependency walker 
if nm didn't work) via OSProcess (actually ProcessWrapper, because 
OSProcess doesn't work on windows). So now I have a version of the 
QVMProfiler which seems to be working on win32.

And now I can answer my original question: Why is the following code 
slower on the CogVM than on the SqueakVM?

| s1 s2 |
Smalltalk garbageCollect.
s1 := String streamContents: [ :stream |
         1000 timesRepeat: [
                 'aab' do: [ :e | stream nextPut: e; cr ] ] ].
s2 := String streamContents: [ :stream |
         1000 timesRepeat: [
                 'abb' do: [ :e | stream nextPut: e; cr ] ] ].
[ TextDiffBuilder from: s1 to: s2 ] timeToRun.

CogVM: 2914
SqueakVM: 1900

The answer is: Cog didn't compile the code to machine code, it's mostly 
interpreted:

2.859 seconds; sampling frequency 997 hz
2742 samples in the VM	(2851 samples in the entire program)  96.18% of total

243 samples in generated vm code   8.86% of entire vm (  8.52% of total)
2499 samples in vanilla vm code 91.14% of entire vm (87.65% of total)

% of generated vm code (% of total)								(samples) (cumulative)
32.10%    (  2.74%)	DiffElement>>=							(78)		(32.10%)
11.93%    (  1.02%)	ceReturnToInterpreterTrampoline			(29)		(44.03%)
11.11%    (  0.95%)	String>>compare:with:collated:				(27)		(55.14%)
   9.88%    (  0.84%)	Object>>at:put:							(24)		(65.02%)
   9.88%    (  0.84%)	String>>=									(24)		(74.90%)
   6.17%    (  0.53%)	SmallInteger>>=							(15)		(81.07%)
   4.94%    (  0.42%)	cePrimReturnEnterCogCode				(12)		(86.01%)
   3.70%    (  0.32%)	ByteString class>>compare:with:collated:	(9)		(89.71%)
   2.47%    (  0.21%)	PIC isString									(6)		(92.18%)
   2.47%    (  0.21%)	ceEnterCogCodePopReceiverReg			(6)		(94.65%)
   2.47%    (  0.21%)	DiffElement>>hash							(6)		(97.12%)
   0.82%    (  0.07%)	PIC isByteString							(2)		(97.94%)
   0.41%    (  0.04%)	ByteString>>isByteString					(1)		(98.35%)
   0.41%    (  0.04%)	DiffElement>>string						(1)		(98.77%)
   0.41%    (  0.04%)	SequenceableCollection>>copyFrom:to:	(1)		(99.18%)
   0.41%    (  0.04%)	String>>isString							(1)		(99.59%)
   0.41%    (  0.04%)	WriteStream>>nextPut:						(1)		(100.0%)


% of vanilla vm code (% of total)									(samples) (cumulative)
84.23%    (73.83%)	_interpret									(2105)	(84.23%)
   2.60%    (  2.28%)	_installinAtCacheatstring					(65)		(86.83%)
   1.20%    (  1.05%)	_assertValidExecutionPointersimbar		(30)		(88.04%)
etc.

Whenever this test is repeated, the result is similar. Somehow the CogVM 
doesn't compile the code to machine code. But we can force it. Just repeat 
the test a few times:

| s1 s2 |
Smalltalk garbageCollect.
s1 := String streamContents: [ :stream |
         1000 timesRepeat: [
                 'aab' do: [ :e | stream nextPut: e; cr ] ] ].
s2 := String streamContents: [ :stream |
         1000 timesRepeat: [
                 'abb' do: [ :e | stream nextPut: e; cr ] ] ].
5 timesRepeat: [
 	Transcript show: [ TextDiffBuilder from: s1 to: s2 ] timeToRun; cr ]


Transcript:
2905
845
850
843
850

So, when the CogVM generates native code, things get a bit different. The 
benchmark is evaluated 2.2x faster than with SqueakVM (instead of being 
1.5x slower). Here's the new vmreport from QVMProfiler:

0.782 seconds; sampling frequency 996 hz
779 samples in the VM	(779 samples in the entire program)  100.0% of total

545 samples in generated vm code 69.96% of entire vm (69.96% of total)
234 samples in vanilla vm code 30.04% of entire vm (30.04% of total)

% of generated vm code (% of total)								(samples) (cumulative)
27.34%    (19.13%)	TextDiffBuilder>>lcsFor:and:				(149)	(27.34%)
16.15%    (11.30%)	Object>>at:								(88)		(43.49%)
15.60%    (10.91%)	SmallInteger>>+							(85)		(59.08%)
10.46%    (  7.32%)	DiffElement>>=							(57)		(69.54%)
   5.32%    (  3.72%)	SmallInteger>>=							(29)		(74.86%)
   4.22%    (  2.95%)	Object>>at:put:							(23)		(79.08%)
   3.49%    (  2.44%)	String>>compare:with:collated:				(19)		(82.57%)
   2.75%    (  1.93%)	SmallInteger>><							(15)		(85.32%)
   2.75%    (  1.93%)	String>>=									(15)		(88.07%)
   1.83%    (  1.28%)	cePrimReturnEnterCogCode				(10)		(89.91%)
   1.65%    (  1.16%)	SmallInteger>>-							(9)		(91.56%)
   1.47%    (  1.03%)	ByteString class>>compare:with:collated:	(8)		(93.03%)
   1.47%    (  1.03%)	String>>isString							(8)		(94.50%)
etc.

% of vanilla vm code (% of total) 
(samples) (cumulative)
24.36%    (  7.32%)	_stObjectatput							(57)		(24.36%)
13.68%    (  4.11%)	_noAtCacheCommonAtPut				(32)		(38.03%)
   8.97%    (  2.70%)	_primitiveCompareString				(21)		(47.01%)
   8.55%    (  2.57%)	_ceNewArraySlotSize					(20)		(55.56%)
   8.12%    (  2.44%)	_markAndTrace							(19)		(63.68%)
   7.69%    (  2.31%)	_isWordsOrBytesNonInt				(18)		(71.37%)
   5.13%    (  1.54%)	_sizeOfSTArrayFromCPrimitive			(12)		(76.50%)
   4.70%    (  1.41%)	_stackValue							(11)		(81.20%)
   3.42%    (  1.03%)	_arrayValueOf							(8)		(84.62%)
etc.

This also shows that the lack of the machine code version of the #at:put: 
primitive is not significant in this case.


Levente

>
>
> But I expect that the reason is the cost of invoking interpreter primitives
>>> from machine code.  Cog only implements a few primitives in machine code
>>> (arithmetic, at: & block value) and for all others (e.g. nextPut: above)
>>> it
>>> executes the interpreter primitives.  lcsFor:and: uses at:put: heavily and
>>> Cog is using the interpreter version.  But the cost of invoking an
>>> interpreter primitive from machine code is higher than invoking it from
>>> the
>>> interpreter because of the system-call-like glue between the machine-code
>>> stack pages and the C stack on which the interpreter primitive runs.
>>>
>>> Three primitives that are currently interpreter primitives but must be
>>> implemented in machine code for better performance are new/basicNew,
>>> new:/basicNew: and at:put:.  I've avoided implementing these in machine
>>> code
>>> because the object representation is so complex and am instead about to
>>> start work on a simpler object representation.  When I have that I'll
>>> implement these primitives and then the speed difference should tilt the
>>> other way.
>>>
>>
>> This sounds reasonable. #lcsFor:and: uses #at:put: twice in the inner loop.
>> One of them (lcss at: max + k + 1 put: lcs) can be eliminated without
>> affecting the computation, because that just stores the results. So without
>> only one #at:put: it took me 2423ms to run the benchmark. Which is still a
>> bit too high. I think only the profiler can help here.
>>
>> Btw, is MessageTally less accurate with CogVM than with the SqueakVM?
>>
>
> I'm not sure.  We use a variant written by Andreas that is more accurate
> than MessageTally but that may use different plumbing.
>
> best
> Eliot
>
>
>>
>> Levente
>>
>>
>>
>>> Of course if anyone would like to implement these in the context of the
>>> current object representation be my guest and report back asap...
>>>
>>> best
>>> Eliot
>>>
>>>
>>>>
>>>> Levente
>>>>
>>>>
>>>>
>>>
>>
>



More information about the Squeak-dev mailing list