Hi Levente,<br><br><div class="gmail_quote">On Sat, Jul 3, 2010 at 6:07 PM, Levente Uzonyi <span dir="ltr">&lt;<a href="mailto:leves@elte.hu">leves@elte.hu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="im">On Thu, 24 Jun 2010, Eliot Miranda wrote:<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On Thu, Jun 24, 2010 at 3:19 PM, Levente Uzonyi &lt;<a href="mailto:leves@elte.hu" target="_blank">leves@elte.hu</a>&gt; wrote:<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On Tue, 22 Jun 2010, Eliot Miranda wrote:<br>

<br>

&lt;snip&gt;<br>

<br>

<br>

 I can&#39;t say for sure without profiling (you&#39;ll find a good VM profiler<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

QVMProfiler in the image in the tarball, which as yet works on MacOS<br>

only).<br>

<br>

</blockquote>

<br>

This looks promising, I (or someone else :)) just have to implement<br>

#primitiveExecutableModulesAndOffsets under win32 (and un*x), but that<br>

doesn&#39;t seem to be easy (at least the win32 part).<br>

</blockquote>

<br>

<br>

If you look at platforms/win32/vm/sqWin32Backtrace.c you&#39;ll find code that<br>

extracts symbols from dlls for constructing a symbolic backtrace on crashes.<br>

The code also uses a Teleplace.map file generated by the VM makefile which<br>

contains the symbols for the VM.  From this code you ought to be able to be<br>

able to implement a QVMProfilerWin32SymbolsManager almost entirely out of<br>

primitives.<br>

</blockquote>

<br></div>

Thanks for the pointers. I took the &quot;easy&quot; way and did pretty much the same you did for mac. A primitive lists the loaded modules and the symbols are extracted by external programs (nm or the dependency walker if nm didn&#39;t work) via OSProcess (actually ProcessWrapper, because OSProcess doesn&#39;t work on windows). So now I have a version of the QVMProfiler which seems to be working on win32.<br>

</blockquote><div><br></div><div>First, FANTASTIC!  Thanks!  Can you send me your code?  How are you running the external programs, with OSProcess?</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

And now I can answer my original question: Why is the following code slower on the CogVM than on the SqueakVM?<div class="im"><br>

<br>

| s1 s2 |<br>

Smalltalk garbageCollect.<br>

s1 := String streamContents: [ :stream |<br>

        1000 timesRepeat: [<br>

                &#39;aab&#39; do: [ :e | stream nextPut: e; cr ] ] ].<br>

s2 := String streamContents: [ :stream |<br>

        1000 timesRepeat: [<br>

                &#39;abb&#39; do: [ :e | stream nextPut: e; cr ] ] ].<br>

[ TextDiffBuilder from: s1 to: s2 ] timeToRun.<br>

<br></div>

CogVM: 2914<br>

SqueakVM: 1900<br>

<br>

The answer is: Cog didn&#39;t compile the code to machine code, it&#39;s mostly interpreted:<br>

<br>

2.859 seconds; sampling frequency 997 hz<br>

2742 samples in the VM  (2851 samples in the entire program)  96.18% of total<br>

<br>

243 samples in generated vm code   8.86% of entire vm (  8.52% of total)<br>

2499 samples in vanilla vm code 91.14% of entire vm (87.65% of total)<br>

<br>

% of generated vm code (% of total)                                                             (samples) (cumulative)<br>

32.10%    (  2.74%)     DiffElement&gt;&gt;=                                                  (78)            (32.10%)<br>

11.93%    (  1.02%)     ceReturnToInterpreterTrampoline                 (29)            (44.03%)<br>

11.11%    (  0.95%)     String&gt;&gt;compare:with:collated:                          (27)            (55.14%)<br>

  9.88%    (  0.84%)    Object&gt;&gt;at:put:                                                 (24)            (65.02%)<br>

  9.88%    (  0.84%)    String&gt;&gt;=                                                                       (24)            (74.90%)<br>

  6.17%    (  0.53%)    SmallInteger&gt;&gt;=                                                 (15)            (81.07%)<br>

  4.94%    (  0.42%)    cePrimReturnEnterCogCode                                (12)            (86.01%)<br>

  3.70%    (  0.32%)    ByteString class&gt;&gt;compare:with:collated:        (9)             (89.71%)<br>

  2.47%    (  0.21%)    PIC isString                                                                    (6)             (92.18%)<br>

  2.47%    (  0.21%)    ceEnterCogCodePopReceiverReg                    (6)             (94.65%)<br>

  2.47%    (  0.21%)    DiffElement&gt;&gt;hash                                                       (6)             (97.12%)<br>

  0.82%    (  0.07%)    PIC isByteString                                                        (2)             (97.94%)<br>

  0.41%    (  0.04%)    ByteString&gt;&gt;isByteString                                        (1)             (98.35%)<br>

  0.41%    (  0.04%)    DiffElement&gt;&gt;string                                             (1)             (98.77%)<br>

  0.41%    (  0.04%)    SequenceableCollection&gt;&gt;copyFrom:to:    (1)             (99.18%)<br>

  0.41%    (  0.04%)    String&gt;&gt;isString                                                        (1)             (99.59%)<br>

  0.41%    (  0.04%)    WriteStream&gt;&gt;nextPut:                                           (1)             (100.0%)<br>

<br>

<br>

% of vanilla vm code (% of total)                                                                       (samples) (cumulative)<br>

84.23%    (73.83%)      _interpret                                                                      (2105)  (84.23%)<br>

  2.60%    (  2.28%)    _installinAtCacheatstring                                       (65)            (86.83%)<br>

  1.20%    (  1.05%)    _assertValidExecutionPointersimbar              (30)            (88.04%)<br>

etc.<br>

<br>

Whenever this test is repeated, the result is similar. Somehow the CogVM doesn&#39;t compile the code to machine code. But we can force it. Just repeat the test a few times:<div class="im"><br>

<br>

| s1 s2 |<br>

Smalltalk garbageCollect.<br>

s1 := String streamContents: [ :stream |<br>

        1000 timesRepeat: [<br>

                &#39;aab&#39; do: [ :e | stream nextPut: e; cr ] ] ].<br>

s2 := String streamContents: [ :stream |<br>

        1000 timesRepeat: [<br>

                &#39;abb&#39; do: [ :e | stream nextPut: e; cr ] ] ].<br></div>

5 timesRepeat: [<br>

        Transcript show: [ TextDiffBuilder from: s1 to: s2 ] timeToRun; cr ]<br>

<br>

<br>

Transcript:<br>

2905<br>

845<br>

850<br>

843<br>

850<br></blockquote><div><br></div><div>I suspected as much.  At the moment the heuristic the VM uses to decide whether to compile to native code is as follows:</div><div><br></div><div>a) if a method is found in the lookup cache rather than through a class hierarchy lookup (and it has less than L literals) compile it to machine code (i.e. JIT it on the second send)</div>

<div><br></div><div>b) if a block is about to be evaluated interpretively and it is the same as the most recently interpretively evaluated block,  (and its method has less than L literals) compile the method to machine code (i.e. if a block is repeatedly evaluated JIT it on the second repetition)</div>

<div><br></div><div>c) if a method is evaluated through primitiveExecuteMethod[ArgsArray] (and it has less than L literals) compile it to machine code (i.e. always compile doits, because they never get repeated).</div><div>

<br></div><div>So the above gets faster on the second time.  I thought of adding another heuristic for long-running interpreted methods which would be</div><div>d) if an interpreted backwards jump occurs within the same method N times in a row (and its containing method has less than L literals) then compile to machine code and map the current frame from an interpreted one to a machine code one.</div>

<div><br></div><div>This is doable, but it requires mapping of bytecode pcs to machine code pcs, whereas the current code only requires mapping of machine-code pcs to bytecode pcs, and could obscure benchmarks that didn&#39;t repeat twice, because some time would be lost before compiling.  But this heuristic would be the only way things like the finalization loop would ever get jitted and so I think its worth-while doing.  Alas its not my highest priority right now.</div>

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">So, when the CogVM generates native code, things get a bit different. The benchmark is evaluated 2.2x faster than with SqueakVM (instead of being 1.5x slower). Here&#39;s the new vmreport from QVMProfiler:<br>


<br>

0.782 seconds; sampling frequency 996 hz<br>

779 samples in the VM   (779 samples in the entire program)  100.0% of total<br>

<br>

545 samples in generated vm code 69.96% of entire vm (69.96% of total)<br>

234 samples in vanilla vm code 30.04% of entire vm (30.04% of total)<br>

<br>

% of generated vm code (% of total)                                                             (samples) (cumulative)<br>

27.34%    (19.13%)      TextDiffBuilder&gt;&gt;lcsFor:and:                            (149)   (27.34%)<br>

16.15%    (11.30%)      Object&gt;&gt;at:                                                             (88)            (43.49%)<br>

15.60%    (10.91%)      SmallInteger&gt;&gt;+                                                 (85)            (59.08%)<br>

10.46%    (  7.32%)     DiffElement&gt;&gt;=                                                  (57)            (69.54%)<br>

  5.32%    (  3.72%)    SmallInteger&gt;&gt;=                                                 (29)            (74.86%)<br>

  4.22%    (  2.95%)    Object&gt;&gt;at:put:                                                 (23)            (79.08%)<br>

  3.49%    (  2.44%)    String&gt;&gt;compare:with:collated:                          (19)            (82.57%)<br>

  2.75%    (  1.93%)    SmallInteger&gt;&gt;&lt;                                                 (15)            (85.32%)<br>

  2.75%    (  1.93%)    String&gt;&gt;=                                                                       (15)            (88.07%)<br>

  1.83%    (  1.28%)    cePrimReturnEnterCogCode                                (10)            (89.91%)<br>

  1.65%    (  1.16%)    SmallInteger&gt;&gt;-                                                 (9)             (91.56%)<br>

  1.47%    (  1.03%)    ByteString class&gt;&gt;compare:with:collated:        (8)             (93.03%)<br>

  1.47%    (  1.03%)    String&gt;&gt;isString                                                        (8)             (94.50%)<br>

etc.<br>

<br>

% of vanilla vm code (% of total) (samples) (cumulative)<br>

24.36%    (  7.32%)     _stObjectatput                                                  (57)            (24.36%)<br>

13.68%    (  4.11%)     _noAtCacheCommonAtPut                           (32)            (38.03%)<br>

  8.97%    (  2.70%)    _primitiveCompareString                         (21)            (47.01%)<br>

  8.55%    (  2.57%)    _ceNewArraySlotSize                                     (20)            (55.56%)<br>

  8.12%    (  2.44%)    _markAndTrace                                                   (19)            (63.68%)<br>

  7.69%    (  2.31%)    _isWordsOrBytesNonInt                           (18)            (71.37%)<br>

  5.13%    (  1.54%)    _sizeOfSTArrayFromCPrimitive                    (12)            (76.50%)<br>

  4.70%    (  1.41%)    _stackValue                                                     (11)            (81.20%)<br>

  3.42%    (  1.03%)    _arrayValueOf                                                   (8)             (84.62%)<br>

etc.<br>

<br>

This also shows that the lack of the machine code version of the #at:put: primitive is not significant in this case.<br></blockquote><div><br></div><div>On the contrary, fully 24.36% +  13.68%, or 38% of entire execution time is going into at:put: (stObjectatput &amp; noAtCacheCommonAtPut are the guts of the interpreter&#39;s at:put: implementation).  The code would probably be netter than 3x faster than the interpreter (up from 2.2x) with a machine-code at:put:.</div>

<div><br></div><div>great that you&#39;ve done this Levente!  Thanks!</div><div><br></div><div>best,</div><div>Eliot</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<font color="#888888">

<br>

<br>

Levente</font><div><div></div><div class="h5"><br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

But I expect that the reason is the cost of invoking interpreter primitives<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

from machine code.  Cog only implements a few primitives in machine code<br>

(arithmetic, at: &amp; block value) and for all others (e.g. nextPut: above)<br>

it<br>

executes the interpreter primitives.  lcsFor:and: uses at:put: heavily and<br>

Cog is using the interpreter version.  But the cost of invoking an<br>

interpreter primitive from machine code is higher than invoking it from<br>

the<br>

interpreter because of the system-call-like glue between the machine-code<br>

stack pages and the C stack on which the interpreter primitive runs.<br>

<br>

Three primitives that are currently interpreter primitives but must be<br>

implemented in machine code for better performance are new/basicNew,<br>

new:/basicNew: and at:put:.  I&#39;ve avoided implementing these in machine<br>

code<br>

because the object representation is so complex and am instead about to<br>

start work on a simpler object representation.  When I have that I&#39;ll<br>

implement these primitives and then the speed difference should tilt the<br>

other way.<br>

<br>

</blockquote>

<br>

This sounds reasonable. #lcsFor:and: uses #at:put: twice in the inner loop.<br>

One of them (lcss at: max + k + 1 put: lcs) can be eliminated without<br>

affecting the computation, because that just stores the results. So without<br>

only one #at:put: it took me 2423ms to run the benchmark. Which is still a<br>

bit too high. I think only the profiler can help here.<br>

<br>

Btw, is MessageTally less accurate with CogVM than with the SqueakVM?<br>

<br>

</blockquote>

<br>

I&#39;m not sure.  We use a variant written by Andreas that is more accurate<br>

than MessageTally but that may use different plumbing.<br>

<br>

best<br>

Eliot<br>

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Levente<br>

<br>

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Of course if anyone would like to implement these in the context of the<br>

current object representation be my guest and report back asap...<br>

<br>

best<br>

Eliot<br>

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Levente<br>

<br>

<br>

<br>

</blockquote>

<br>

</blockquote>

<br>

</blockquote>

<br>

</blockquote>

<br>

</div></div></blockquote></div><br>