Hi Klaus,<br><br><div><span class="gmail_quote">On 12/17/06, <b class="gmail_sendername">Klaus D. Witzel</b> &lt;<a href="mailto:klaus.witzel@cobss.com">klaus.witzel@cobss.com</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Thank you David for answering my question<br><br>&gt;&gt; Can somebody reproduce the figures, any other results? Have I done<br>&gt;&gt; something wrong?<br><br>and thank you also for the explanations. I understand that PICs in

<br>Strongtalk are [in the current incarnation] limited to 4 entries, that's<br>good to know.<br><br>Just a minor adjustment: the #at: on the array was never in doubt and the<br>integer loop was by intention because (I think) on all three systems it's

<br>compiled away already at the bytecode level and the #at: is expected to be<br>subsumed at the primitive level. I've seen walkbacks in Strongtalk in<br>which the source code #to:do: was inlined with #whileTrue sans block, like

in Squeak.</blockquote><div> Yes, #to:do: is treated specially by the bytecode compiler, although it doesn't really have to be, since type-feedback would be able to inline and eliminate the block.&nbsp; The only reason it is treated specially is so that it still runs reasonably fast in the interpreter, before methods are compiled, because it is so important in inner loops.&nbsp; #at:, on the other hand, is not treated specially in Strongtalk, unlike in most other Smalltalks.
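<br><br>As a rough illustration (a hand-written sketch, not actual compiler output; the variable names are made up for the example), a loop written as &quot;1 to: n do: [:i | sum := sum + i]&quot; is compiled roughly as if it had been written with an explicit counter and an inlined #whileTrue:, so no block object is created for the loop body:<br><br>--------------<br>&nbsp;&nbsp; | n sum i |<br>&nbsp;&nbsp; n := 10.<br>&nbsp;&nbsp; sum := 0.<br>&nbsp;&nbsp; i := 1.<br>&nbsp;&nbsp; &quot;explicit counter loop, equivalent to 1 to: n do: [:i | sum := sum + i]&quot;<br>&nbsp;&nbsp; [i &lt;= n] whileTrue: [<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sum := sum + i.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i := i + 1].<br>&nbsp;&nbsp; ^ sum<br>--------------<br><br>Evaluating it gives 55, the same as the #to:do: form.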

<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">As to your figures, will retry with a &quot;warmer&quot; image :)<br><br>And I have nothing against people calling my test a poor benchmark. I

<br>wanted to compare the performance at this particular level and according<br>to your report even there [unoptimized at this level] Strongtalk is<br>close to VW. And no, I would never say that megamorphic sends are all that

<br>Smalltalk is about.<br><br>Let me comment on this one<br><br>&gt; ...&nbsp;&nbsp;How much of your code really looks like that?<br><br>Well, at that level almost all users of collection #do: look like that. I<br>just made the level below an O(1) constant, otherwise the polymorphic

nature of &quot;(array at: i) doSomethingPolymorphically&quot; would perhaps have gone unnoticed.</blockquote><div> #do: loops are significantly different, because 1) they are not treated specially by the bytecode compiler, so there is a real block and usually a closure in most Smalltalks, 2) the implementation of #do:, which is where the inner loop might be, does not literally contain the body of the loop, so loop unrolling can't be applied by a non-inlining Smalltalk.&nbsp; Array bounds-check elimination might apply, but when the loop contains more than a few sends (including the additional Block&gt;&gt;value: send), the benefits rapidly become minor.

So in fact, a #do: benchmark (with a block that needs a closure, since all real #do: sends need a closure) would be a much better benchmark, because it's the way people actually write code. Sure enough, Strongtalk can both inline the #do: implementation and inline the block into the loop, so it would show much bigger advantages compared to other Smalltalks.&nbsp; And even that would understate the potential Strongtalk advantage, because if the compiler were tuned, it would be able to do bounds-check elimination and loop unrolling even for #do:, because it can inline the block, whereas VisualWorks would never be able to.
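<br><br>A minimal sketch of such a #do: benchmark, assuming the Squeak/VW-style &quot;Time millisecondsToRun:&quot; used in Klaus's code. The four-element array and the count variable are placeholders made up for the example (a real run would use Klaus's instances array). The block assigns to count, which is declared outside the block, so in most Smalltalks it needs a real closure:<br><br>--------------<br>&nbsp;&nbsp; | instances count |<br>&nbsp;&nbsp; &quot;placeholder receivers of four different classes&quot;<br>&nbsp;&nbsp; instances := Array with: 3 with: 'abc' with: #sym with: 2.5.<br>&nbsp;&nbsp; count := 0.<br>&nbsp;&nbsp; ^ Time millisecondsToRun: [<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1234567 timesRepeat: [<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; instances do: [:each |<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; each yourself.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; count := count + 1]]]<br>--------------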

<br><br>Cheers,<br>Dave<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Thanks again, very insightful.<br><br>/Klaus<br><br>On Mon, 18 Dec 2006 00:08:08 +0100, David Griswold

<br>&lt;<a href="mailto:david.griswold.256@gmail.com">david.griswold.256@gmail.com</a>&gt; wrote:<br><br>&gt; Klaus,<br>&gt;<br>&gt; There are three issues here:<br>&gt;<br>&gt; 1) You did *not* run it enough under Strongtalk to compile the

<br>&gt; benchmark, so<br>&gt; you are measuring interpreted performance.&nbsp;&nbsp;You need to run it until the<br>&gt; performance speeds up and stabilizes.&nbsp;&nbsp; When it is compiled, on my<br>&gt; machine<br>&gt; (Sonoma Pentium M 

1.7GHz), Squeak 3.1 runs the benchmark in 60453 ms, and<br>&gt; Strongtalk runs it in 22139 ms.&nbsp;&nbsp;That's not the latest Squeak but I doubt it<br>&gt; has changed much.&nbsp;&nbsp;I don't have a recent VisualWorks installed, but from<br>&gt; my

<br>&gt; knowledge of how the various systems work, I would expect VisualWorks to<br>&gt; be<br>&gt; a bit faster than Strongtalk at this (very poor) microbenchmark, for<br>&gt; reasons<br>&gt; explained below.<br>&gt;<br>

&gt; 2) Andreas Raab was right in his comments.&nbsp;&nbsp;The performance you are<br>&gt; measuring is *not* general Smalltalk performance, it is specifically the<br>&gt; performance of megamorphic sends, which are one of the few cases where

<br>&gt; Strongtalk's type-feedback doesn't help at all.<br>&gt;<br>&gt; Here is how sends work in Strongtalk:<br>&gt;<br>&gt; Monomorphic and slightly polymorphic sends (1 or 2 receiver classes at<br>&gt; the<br>&gt; send site) can be inlined, which is the common case (over 90% of sends

<br>&gt; fall<br>&gt; in this category), and that is where Strongtalk can give you big<br>&gt; speedups.<br>&gt;<br>&gt; Sends that have between 2 and 4 receiver classes are usually handled<br>&gt; with a<br>&gt; polymorphic inline cache (PIC), which is still a real dispatch and call,

<br>&gt; and<br>&gt; is only slightly faster (if at all) than in other Smalltalks, since that<br>&gt; is<br>&gt; the most highly optimized piece of code in any normal Smalltalk<br>&gt; implementation.&nbsp;&nbsp;PICs are not primarily for optimization; their real

<br>&gt; role is<br>&gt; to gather type information for the inlining compiler.&nbsp;&nbsp;Note that<br>&gt; VisualWorks<br>&gt; now has PICs, so it uses the same technology for non-inlined sends as<br>&gt; Strongtalk.<br>&gt;<br>&gt; Sends that have more than 4 receiver types, such as your micro-benchmark,

<br>&gt; can't even use PICs or any kind of inline cache, so these are a full<br>&gt; megamorphic send in Strongtalk, which is implemented as an actual hashed<br>&gt; lookup, which is the slowest case of all.&nbsp;&nbsp;You might say that is what

<br>&gt; Smalltalk is all about, but in reality megamorphic sends are relatively<br>&gt; rare<br>&gt; as a percentage of sends.&nbsp;&nbsp;Compilers aren't magic- no one can eliminate<br>&gt; the<br>&gt; fundamental computation that a truly megamorphic send has to do- it

<br>&gt; *has* to<br>&gt; do some kind of real lookup, and a call, so the performance will<br>&gt; naturally<br>&gt; be similar across all Smalltalks.<br>&gt;<br>&gt; Every Smalltalk has that overhead.&nbsp;&nbsp;What Strongtalk does is eliminate

<br>&gt; that<br>&gt; overhead when you don't really need it, when a send doesn't actually have<br>&gt; many receiver classes.&nbsp;&nbsp;That is what other Smalltalks can't do: they<br>&gt; make<br>&gt; you pay the cost of a dispatch and call all the time, even if you don't

<br>&gt; need<br>&gt; it, which is the common case.<br>&gt;<br>&gt; So your 'picBench' isn't even measuring PIC performance.<br>&gt;<br>&gt; 3) I would expect VisualWorks to be about the same speed or a bit faster<br>&gt; than Strongtalk on this atypical benchmark because of several factors.

<br>&gt; We<br>&gt; have established that type-feedback doesn't help this benchmark, so from<br>&gt; the<br>&gt; point of view of sends, VisualWorks and Strongtalk would be doing<br>&gt; basically<br>&gt; the same kind of things.&nbsp;&nbsp;The reason VisualWorks would probably be a bit

<br>&gt; faster on this benchmark is because it probably does array bounds-check<br>&gt; elimination and maybe even loop unrolling, which aren't yet implemented<br>&gt; in<br>&gt; Strongtalk, and I'm sure aren't implemented in Squeak.&nbsp;&nbsp;We did those in

<br>&gt; the<br>&gt; Java VM, but hadn't yet gotten to that for Strongtalk; Strongtalk hasn't<br>&gt; even really been tuned, and VisualWorks has been tuned for many years.<br>&gt; Your<br>&gt; benchmark consists of a tight inner loop that does only two things: a

<br>&gt; megamorphic send, and an array lookup.&nbsp;&nbsp;So the array bounds check and<br>&gt; loop<br>&gt; overhead are a significant factor, and if VisualWorks can optimize<br>&gt; those, it<br>&gt; would make a real difference.

<br>&gt;<br>&gt; But once again, this is not even remotely typical Smalltalk code.&nbsp;&nbsp;Array<br>&gt; bounds-check elimination and loop unrolling are rarely applicable optimizations that<br>&gt; generally only help when you have a very tight inner loop that does

<br>&gt; almost<br>&gt; nothing and where the loop itself is a literal SmallInteger&gt;&gt;to:do: send,<br>&gt; you are accessing an array, and the array access is literally imbedded in<br>&gt; the loop, not in a called method.&nbsp;&nbsp;How much of your code really looks

<br>&gt; like<br>&gt; that?<br>&gt;<br>&gt; -Dave<br>&gt;<br>&gt; On 12/17/06, Klaus D. Witzel &lt;<a href="mailto:klaus.witzel@cobss.com">klaus.witzel@cobss.com</a>&gt; wrote:<br>&gt;&gt;<br>&gt;&gt; Folks,<br>&gt;&gt;<br>

&gt;&gt; I'm sorry to say that Strongtalk is NOT that fast. I followed the<br>&gt;&gt; instructions and *compiled* the following benchmark in Strongtalk,<br>&gt;&gt; evaluated the same expression in Squeak and in VW and got these

<br>&gt;&gt; results on my 1.73GHz 1.0GB WinXP notebook:<br>&gt;&gt;<br>&gt;&gt; - VisualWorks:&nbsp;&nbsp;16799 (N.C. 7.4.1)<br>&gt;&gt; - Strongtalk:&nbsp;&nbsp; 47517 (1.1.2)<br>&gt;&gt; - Squeak:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 56726 (3.9#7056)<br>&gt;&gt;

<br>&gt;&gt; Below is the Squeak/VW source code, attached is the Strongtalk source<br>&gt;&gt; code. The test is simple: a long loop around a single polymorphic call<br>&gt;&gt; site &quot;(instances at: i) yourself&quot;, straight forward inlineable and with

<br>&gt;&gt; intentionally unpredictable type information at the call site (modeled<br>&gt;&gt; after the Thue-Morse sequence).<br>&gt;&gt;<br>&gt;&gt; I'm disappointed, Strongtalk was always advertised as being the fastest

<br>&gt;&gt; Smalltalk available &quot;...executes Smalltalk much faster than any other<br>&gt;&gt; Smalltalk implementation...&quot;, and now it shows to be in almost the same<br>&gt;&gt; class as Squeak is :) :(<br>&gt;&gt;

<br>&gt;&gt; Can somebody reproduce the figures, any other results? Have I done<br>&gt;&gt; something wrong?<br>&gt;&gt;<br>&gt;&gt; BTW: congrats to the implementors of Squeak and, of course, to Cincom!<br>&gt;&gt; (uhm, and also to the Strongtalk team!)

<br>&gt;&gt;<br>&gt;&gt; /Klaus<br>&gt;&gt;<br>&gt;&gt; --------------<br>&gt;&gt;&nbsp;&nbsp; | instances base |<br>&gt;&gt;&nbsp;&nbsp; base := (Array<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; with: OrderedCollection basicNew<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; with: SequenceableCollection basicNew

<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; with: Collection basicNew<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; with: Object basicNew) ,<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (Array<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; with: Character space<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; with: Date basicNew<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; with: Time basicNew

<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; with: Magnitude basicNew).<br>&gt;&gt;&nbsp;&nbsp; instances := OrderedCollection with: (base at: 1).<br>&gt;&gt;&nbsp;&nbsp; 2 to: base size do: [:i |<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;instances := instances , instances reverse.<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;instances addLast: (base at: i)].

<br>&gt;&gt;&nbsp;&nbsp; instances := (instances , instances reverse) asArray.<br>&gt;&gt;&nbsp;&nbsp; ^ Time millisecondsToRun: [<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1234567 timesRepeat: [<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1 to: instances size do: [:i |<br>&gt;&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (instances at: i) yourself]]]

<br>&gt;&gt; --------------<br>&gt;&gt;<br><br><br><br></blockquote></div><br>