Folks,
I'm sorry to report that Strongtalk is NOT that fast. I followed the instructions and *compiled* the following benchmark in Strongtalk, evaluated the same expression in Squeak and in VW, and got these results on my 1.73GHz 1.0GB WinXP notebook:
- VisualWorks: 16799 (N.C. 7.4.1)
- Strongtalk: 47517 (1.1.2)
- Squeak: 56726 (3.9#7056)
(times in milliseconds, lower is better)
Below is the Squeak/VW source code; attached is the Strongtalk source code. The test is simple: a long loop around a single polymorphic call site "(instances at: i) yourself", straightforwardly inlineable and with intentionally unpredictable type information at the call site (modeled after the Thue-Morse sequence).
I'm disappointed, Strongtalk was always advertised as being the fastest Smalltalk available "...executes Smalltalk much faster than any other Smalltalk implementation...", and now it shows to be in almost the same class as Squeak is :) :(
Can somebody reproduce the figures, any other results? Have I done something wrong?
BTW: congrats to the implementors of Squeak and, of course, to Cincom! (uhm, and also to the Strongtalk team!)
/Klaus
--------------
| instances base |
base := (Array
        with: OrderedCollection basicNew
        with: SequenceableCollection basicNew
        with: Collection basicNew
        with: Object basicNew)
    , (Array
        with: Character space
        with: Date basicNew
        with: Time basicNew
        with: Magnitude basicNew).
instances := OrderedCollection with: (base at: 1).
2 to: base size do: [:i |
    instances := instances , instances reverse.
    instances addLast: (base at: i)].
instances := (instances , instances reverse) asArray.
^ Time millisecondsToRun: [
    1234567 timesRepeat: [
        1 to: instances size do: [:i |
            (instances at: i) yourself]]]
--------------
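For readers without a Smalltalk image at hand, the sequence construction can be transliterated to Python; this is my own sketch (placeholder strings stand in for the eight instances), not code from the thread:

```python
# Mirror of the benchmark's construction: repeatedly append the reversed
# sequence, then add the next base element. The eight strings are
# placeholders for the eight Smalltalk instances.
base = ["OrderedCollection", "SequenceableCollection", "Collection", "Object",
        "Character", "Date", "Time", "Magnitude"]

instances = [base[0]]
for i in range(1, len(base)):
    # instances := instances , instances reverse. instances addLast: (base at: i)
    instances = instances + instances[::-1]
    instances.append(base[i])
instances = instances + instances[::-1]
```

The resulting sequence of 510 "receivers" cycles through all eight types with no short repeating pattern, which is what defeats a small inline cache's type prediction.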
Klaus D. Witzel wrote:
I'm disappointed, Strongtalk was always advertised as being the fastest Smalltalk available "...executes Smalltalk much faster than any other Smalltalk implementation...", and now it shows to be in almost the same class as Squeak is :) :(
Can somebody reproduce the figures, any other results? Have I done something wrong?
Yes. First, you are equating the result of a single micro-benchmark with overall system performance. Micro-benchmarks are used to measure specific aspects of a particular implementation. Your benchmark measures highly polymorphic send performance, which is not typical of Smalltalk code to begin with.
In other words, your claim is based on measuring a single atypical performance characteristic. This has *nothing* to do with "Smalltalk performance". If you want to measure "Smalltalk performance" you should run a number of the standard benchmarks (Richards, Slopstone) that come with Strongtalk and compare those.
Quite honestly, I'm surprised to see a person like you, who obviously understands enough about dynamic systems to measure PIC effects, make such unsubstantiated claims. I would expect that you know how to evaluate the results of a micro-benchmark, and I would in particular expect that you know that 80-90% of all call-sites in realistic code are mono-morphic to begin with, which renders your benchmark results absolutely useless for "Smalltalk code".
Cheers, - Andreas
Hi Andreas,
on Sun, 17 Dec 2006 11:52:36 +0100, you wrote:
Klaus D. Witzel wrote:
I'm disappointed, Strongtalk was always advertised as being the fastest Smalltalk available "...executes Smalltalk much faster than any other Smalltalk implementation...", and now it shows to be in almost the same class as Squeak is :) :( Can somebody reproduce the figures, any other results? Have I done something wrong?
Yes. First, you are equating the result of a single micro-benchmark with overall system performance. Micro-benchmarks are used to measure specific aspects of a particular implementation. Your benchmark measures highly polymorphic send performance. Which is not typical for Smalltalk code to begin with.
In other words, your claim is based on measuring a single atypical performance characteristic.
Not really. There are at least two motivations:
- a - being faster means ">=", and the "=" part is missing
- b - the test puts some stress on the call site; any specific suggestion from your side on how to test and compare that in typical situations (this is *not* a rhetorical question)?
And no, it's not atypical, see below.
This has *nothing* to do with "Smalltalk performance". If you want to measure "Smalltalk performance" you should run a number of the standard benchmarks (Richards, Slopstone) that come with Strongtalk and compare those.
Well, how about something new, or are you after stagnation, Andreas (no offense, really ;-)
Quite honestly, I'm surprised to see a person like you who obviously understands enough about dynamic systems to measure PIC effects to make such unsubstantiated claims.
O.K. I understand that as lack of use case. Take this (take that ;-)
| allCs |
allCs := Smalltalk allClasses.
"start timing here"
1 to: allCs size do: [:i |
    (allCs at: i) methodDict "just access the iVar"]
"note that #yourself from the previous example is now just #methodDict"
This snippet is performed on behalf of every developer who asks for senders and/or implementors. It is, IMHO, the most often used piece of code of every Smalltalk, ever.
So I limited the amount of information to be handled by PICs (as noted in the comment of Michael), otherwise the comparison would've just been a test of possible "methodCache" performance (which would've been a nice test as well [after eliminating differences with (Smalltalk size)], but I was not looking for that).
But, even for the latter I expect to find the = in >= when Strongtalk "...executes Smalltalk much faster than any other Smalltalk implementation...".
I would expect that you know how to evaluate the results of a micro-benchmarks, and I would in particular expect that you know that 80-90% of all call-sites in realistic code are mono-morphic to begin with which render your benchmark results absolutely useless for "Smalltalk code".
Absolutely not (yes, I know about these figures. no, I disagree: see above).
BTW: does anybody know about recent work in the direction of "Message Dispatch on Pipelined Processors", thanks for any pointers.
/Klaus
Klaus D. Witzel wrote:
O.K. I understand that as lack of use case. Take this (take that ;-)
| allCs |
allCs := Smalltalk allClasses.
"start timing here"
1 to: allCs size do: [:i |
    (allCs at: i) methodDict "just access the iVar"]
"note that #yourself from the previous example is now just #methodDict"
This snippet is performed on behalf of every developer who asks for senders and/or implementors. It is, IMHO, the most often used piece of code of every Smalltalk, ever.
Absolutely not. Your claim that "this snippet is performed on behalf of every developer who asks for senders and/or implementors" is misleading. This is not what is *actually* performed. What is actually done is a lot more. For every single mega-morphic send you have dozens of mono-morphic sends.
That is my whole point. Just like in your previous post, you are not using actual code but rather a specifically devised micro-benchmark that has none of the characteristics of actual code. It's not done by sending #methodDict - that is where the work starts, not where it ends. If you look at the actual code that is executed, say:
MessageTally tallySends:[Time browseAllCallsOn: #yourself]
you will find that when browsing senders there are some 50 messages sent in addition to the single mega-morphic send, and it is *those* fifty messages where the real work is - the single mega-morphic send is simply noise in the overall performance. And it's these fifty messages (which have different performance characteristics) where Strongtalk just completely rulez.
But, even for the latter I expect to find the = in >= when Strongtalk "...executes Smalltalk much faster than any other Smalltalk implementation...".
The claim is about "Smalltalk code", not about "Klaus Witzel Benchmarks" (the difference between the two should be obvious). I can always design you a benchmark that makes a particular system look bad.
I would expect that you know how to evaluate the results of a micro-benchmarks, and I would in particular expect that you know that 80-90% of all call-sites in realistic code are mono-morphic to begin with which render your benchmark results absolutely useless for "Smalltalk code".
Absolutely not (yes, I know about these figures. no, I disagree: see above).
If you know about the figures, then how can you claim that your benchmark has any validity for general code? And as I am saying in the above the *actual* code has "Smalltalk performance characteristics" whereas your made-up micro-benchmark doesn't.
Cheers, - Andreas
Hi Andreas,
on Sun, 17 Dec 2006 13:51:53 +0100, you wrote:
Klaus D. Witzel wrote:
O.K. I understand that as lack of use case. Take this (take that ;-)
| allCs |
allCs := Smalltalk allClasses.
"start timing here"
1 to: allCs size do: [:i |
    (allCs at: i) methodDict "just access the iVar"]
"note that #yourself from the previous example is now just #methodDict"
This snippet is performed on behalf of every developer who asks for senders and/or implementors. It is, IMHO, the most often used piece of code of every Smalltalk, ever.
Absolutely not. Your claim that "this snippet is performed on behalf of every developer who asks for senders and/or implementors" is misleading. This is not what is *actually* performed.
Of course it is performed, even in reality.
What is actually done is a lot more.
That's your point, accepted, agreed, out (and what about over ;-)
For every single mega-morphic send you have dozens of mono-morphic sends.
That is my whole point.
Agreed, NP.
Just like in your previous post, you are not using actual code
The previous post has a simulation of actual code, I have no doubts (see also my layers thing below, which I hope explains).
but rather a specifically devised micro-benchmark that has none of the characteristics of actual code.
Sending #methodDict to a collection of Behavior instances IS reality, sorry. Perhaps you meant something else?
It's not done sending #methodDict - this is when the work starts not when it ends.
But this is "only" your point. My point is, performance is performance. The possible dozens of mono-morphic sends do not amortize the bad (in my case) mini-morphic performance.
I agree they could've been responsible for amortization of the investment if the figures were true for ">=", but the latter is apparently not the case.
Hey man, I understand your point. But a PIC of size 8 is not a mega-morphic thing. Let's not take this one any further (if possible, please).
If you look at the actual code that is executed, say:
MessageTally tallySends:[Time browseAllCallsOn: #yourself]
you will find that when browsing senders there are some 50 messages sent in addition to the single mega-morphic send and it is *those* fifty messages are where the real work is - the single mega-morphic send is simply noise in the overall performance.
Well, I used #yourself because a) I was not interested in any particular implementation, which b) has constant response time, and c) is guaranteed not to choke the test. I was not interested in the leaves, right you are.
In my imagination a system like Smalltalk has several layers. And I timed just one of them and found the results.
And it's these fifty messages (which have different performance characteristics) where Strongtalk just completely rulez.
But, even for the latter I expect to find the = in >= when Strongtalk "...executes Smalltalk much faster than any other Smalltalk implementation...".
The claim is about "Smalltalk code", not about "Klaus Witzel Benchmarks" (the difference between the two should be obvious).
Not that I see any difference, I posted Smalltalk code (perhaps you meant something else?)
I can always design you a benchmark that makes a particular system look bad.
C'mon. It's either faster or it's not. No way out.
I would expect that you know how to evaluate the results of a micro-benchmarks, and I would in particular expect that you know that 80-90% of all call-sites in realistic code are mono-morphic to begin with which render your benchmark results absolutely useless for "Smalltalk code".
Absolutely not (yes, I know about these figures. no, I disagree: see above).
If you know about the figures, then how can you claim that your benchmark has any validity for general code?
Did I? I see this rather as a counter example which sheds some light on the performance claim (using your words, sheds some "Klaus Witzel code" light ;-)
And I found some system whose performance didn't pass a simple Thue-Morse sequence test :-)
And as I am saying in the above the *actual* code has "Smalltalk performance characteristics" whereas your made-up micro-benchmark doesn't.
C'mon. Sending messages to elements of collections _is_ characteristic for the Smalltalks.
/Klaus
Hi Klaus,
On 12/17/06, Klaus D. Witzel klaus.witzel@cobss.com wrote:
I can always design you a benchmark that makes a particular system look bad.
C'mon. It's either faster or it's not. No way out.
exactly, you're absolutely right there.
The question is which conclusions you draw from such an isolated observation. And the conclusion you initially drew - something along the lines of "I thought Strongtalk was the fastest Smalltalk, whoops, that's not true" - is far too general given the limited light your particular measurement sheds on Strongtalk performance.
I guess this is what it boils down to. Judging an entire system's performance by just one small simple point of observation just doesn't work. (I must admit that this started me in the first place.)
Best,
Michael
Hi Michael,
on Sun, 17 Dec 2006 15:58:12 +0100, you wrote:
Hi Klaus,
On 12/17/06, Klaus D. Witzel klaus.witzel@cobss.com wrote:
I can always design you a benchmark that makes a particular system look bad.
C'mon. It's either faster or it's not. No way out.
exactly, you're absolutely right there.
The question is which conclusions you draw from such an isolated observation.
Right you are, and so is Andreas.
And the conclusion you initially drew - something along the lines of "I thought Strongtalk was the fastest Smalltalk, whoops, that's not true" - is far too general in the limited light your particular measurement sheds on Strongtalk performance.
Yes, I can now see that my "and now it shows to be in almost the same class as Squeak is" is understood as a strong claim. But it just expresses my disappointment and disillusionment.
I guess this is what it boils down to. Judging an entire system's performance by just one small simple point of observation just doesn't work. (I must admit that this started me in the first place.)
No (agreeing with your "just doesn't work"). But the message is clear (reflecting Andreas' point): if you have nothing to inline (etc), then PICs (which can run out of steam) won't help.
No misunderstanding, my point did not change: even if so (there is nothing to inline), *faster than all others* means "fast even beyond PICs' capabilities". Otherwise it can be contradicted by Andreas (... can always design you a benchmark that makes a particular system look bad... :)
Perhaps there is something to learn from VW (without compromising the existing, I mean). Who knows.
From a pragmatic point of view: if you can't write inlineable (type-feedback'able, etc.) code (or, as Andreas pointed out, can't write such a test :), whatever the reason, don't expect a guarantee of superior performance.
/Klaus
On Dec 17, 2006, at 16:32 , Klaus D. Witzel wrote:
Yes, I can now see that my "and now it shows to be in almost the same class as Squeak is" is understood as a strong claim. But it just expresses my disappointment and disillusionment.
I can understand the disillusionment part, although it's hardly surprising. I mean, there are benchmarks where Squeak outperforms even standard C. But disappointment? Hardly, unless you believe in magic ;)
- Bert -
Thank you Bert, you re+setted me up (needless to say, pun intended :)
Klaus D. Witzel wrote:
The claim is about "Smalltalk code", not about "Klaus Witzel Benchmarks" (the difference between the two should be obvious).
Not that I see any difference, I posted Smalltalk code (perhaps you meant something else?)
Yes, clearly you don't see the difference, and this seems to be at the heart of the problem. You are running a micro-benchmark with specific performance characteristics that are not typical for Smalltalk code in the large. Of course, you are free to make up your own performance characteristics and measure these, but that's what I call "Klaus Witzel Benchmarks" - code that has been chosen because it has performance characteristics that you want to measure, not the performance characteristics that "Smalltalk code" *typically* has.
The Strongtalk claims are about *typical* Smalltalk performance characteristics, nobody has ever claimed that Strongtalk would run any code with any performance characteristic that anyone could ever come up with faster than other Smalltalks. In particular, there is no claim about "faster polymorphic send performance than any other Smalltalk".
Nevertheless, solely based on this benchmark (which, again, does not reflect typical Smalltalk performance characteristics) you are making outrageous claims like: "I'm sorry to tell that Strongtalk is NOT that fast." or "I'm disappointed, Strongtalk was always advertised as being the fastest Smalltalk available "...executes Smalltalk much faster than any other Smalltalk implementation...", and now it shows to be in almost the same class as Squeak is".
That's what I object to. Your benchmark is absolutely no basis for such far-reaching and (once you do some real benchmarking) obviously false claims. A single micro-benchmark is simply not enough to judge overall performance.
And as I am saying in the above the *actual* code has "Smalltalk performance characteristics" whereas your made-up micro-benchmark doesn't.
C'mon. Sending messages to elements of collections _is_ characteristic for the Smalltalks.
Yes, sending messages to elements of collections is characteristic. But sending messages to elements of *highly polymorphic* collections (which you specifically constructed for the benchmark) is not.
Fortunately, it is very easy to show just how non-characteristic your choice of collection is by looking at an actual image:
lastObj := Object new.
nextObj := nil someObject.
bag := Bag new.
[nextObj == lastObj] whileFalse:[
    nextObj isCollection ifTrue:[
        set := Set new.
        nextObj do:[:each| set add: each class].
        bag add: set size.
    ].
    nextObj := nextObj nextObject.
].
max := bag size.
bag sortedCounts do:[:assoc|
    Transcript crtab; show: assoc key.
    Transcript show: ' (', ((100.0 * assoc key / max) truncateTo: 0.01) asString,'%): '.
    Transcript show: assoc value.
].
The result of which is (in a Croquet image I'm doing my work in):
306384 (85.12%): 1
31278 (8.69%): 2
19377 (5.38%): 0
2487 (0.69%): 3
178 (0.04%): 4
51 (0.01%): 5
38 (0.01%): 6
18 (0.0%): 10
17 (0.0%): 7
14 (0.0%): 8
8 (0.0%): 9
[...etc...]
In other words, more than 90% of all the collections (some 350,000, so it's a nice big sample) have at most a single receiver type. 8% have two receiver types. Everything else is noise. If you keep in mind that a good amount of the 8% is due to monomorphic collections using Arrays with nil to indicate empty slots, the practical percentage of monomorphic collections is probably somewhere between 95-98%.
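The counting logic above can be mimicked in a few lines of Python (my own sketch; the sample data is invented, only the shape of the measurement matters):

```python
# For each container, count distinct element classes, then histogram
# those counts - the same shape of measurement as the Smalltalk snippet.
from collections import Counter

sample = [
    [1, 2, 3],    # monomorphic: one element class
    ["a", "b"],   # monomorphic
    [1, "a"],     # two element classes
    [],           # empty
]
histogram = Counter(len({type(e) for e in coll}) for coll in sample)
print(sorted(histogram.items()))  # [(0, 1), (1, 2), (2, 1)]
```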
So no, your benchmark is not characteristic for Smalltalk code.
Cheers, - Andreas
Hi Andreas,
as said earlier I understand your argument, and now I can also appreciate the figures you extracted from a Croquet image. I have something similar (in terms of extracted figures :) from Squeak 3.9 running Morphic. I assume that all morphs are used in the World's steps and looked at what is there most often (the expression's code is below). As can be seen, I concentrate just on the "collective" aspect, i.e. when #submorphs are fetched. There are 290 senders of #submorphs (and 307 accessors of the corresponding iVar, </phew>) and I collect possible call-sites where the PIC's size must be >= 3 (counting that as the non-trivial case, and to keep a safe distance from your figures).
I so found 1034 non-trivial elements (sum of distinct types [when >= 3] over the 846 morphs [objects which respond to #submorphs]) in my running image (strange things these morphs :) But these figures are not used in the next computation, just for selecting a single subject:
AlignmentMorph (which here has 159 instances and 161 users) looks to be the max. Using your (as always excellent!) piece of code shows that roughly 33% of them have non-trivial #submorphs. So much for your "noise" from the morphic side ;-)
97 (61.0%): 1
50 (31.44%): 3
6 (3.77%): 0
3 (1.88%): 2
3 (1.88%): 4
/Klaus
P.S. please note that in your post you compared all collections with distribution of omega types to my smaller collection with distribution of 8 types. We already remarked (have we?) that it makes no sense to compare [for example] the collection of all classes (b/o 100% distinct types). Same goes with the other dimension, IINM.
--------- Figures produced with: ---------
| bag max |
bag := Bag new.
AlignmentMorph allInstances asArray collect: [:each |
    bag add: (each submorphs collect: [:object | object class]) asIdentitySet size].
"Andreas' code follows"
max := bag size.
bag sortedCounts do:[:assoc|
    Transcript crtab; show: assoc key.
    Transcript show: ' (', ((100.0 * assoc key / max) truncateTo: 0.01) asString,'%): '.
    Transcript show: assoc value]
--------- AlignmentMorph seems to be max: ---------
| morphInstances subtypeMorphs distinctTypes submorphs minTypes maxTypes outliners |
morphInstances := subtypeMorphs := 0.
minTypes := Smalltalk size.
maxTypes := 0.
distinctTypes := IdentitySet new: 1000.
outliners := IdentitySet new: 100.
Smalltalk garbageCollect; garbageCollect.
"count # of submorphs containers and their distinct submorphs' type(s)"
SystemNavigation default allObjectsDo: [:object |
    ((object respondsTo: #submorphs)
        and: [(submorphs := object submorphs) notNil
        and: [submorphs isEmpty not]]) ifTrue: [
            distinctTypes do: [:key | distinctTypes remove: key].
            submorphs do: [:each | distinctTypes add: each class name].
            morphInstances := morphInstances + 1.
            subtypeMorphs := subtypeMorphs + (submorphs := distinctTypes size).
            minTypes := minTypes min: submorphs.
            maxTypes := maxTypes max: submorphs.
            submorphs >= 3 ifTrue: [outliners add: object class name].
    ]].
"determine the popularity of morphs with most # of distinct submorphs"
outliners := outliners asArray collect: [:each |
    each -> (SystemNavigation default allCallsOn:
        (distinctTypes := Smalltalk associationAt: each)) size
    -> distinctTypes value instanceCount].
^ {morphInstances. subtypeMorphs. minTypes. maxTypes} , outliners asArray
---------
Hi Klaus,
I haven't yet reproduced the benchmarks, but I'd never judge the performance of an entire implementation based on one single simplistic benchmark. Sorry, this sounds very negative, I don't mean to pick on you.
The benchmark you have run is a micro-benchmark that measures the performance of just one point of interest. Basically, it measures the performance of sending #yourself to various different objects, instances of 8 different classes. I believe that #yourself is implemented in Object and never overridden.
(Having 8 different classes doesn't exceed typical PICs; they normally have 8 entries, if I'm not mistaken. The benchmark could really stress the VM if much more than 8 different classes were chosen - but in the end, it would be more interesting to have actual different *implementations* of a message, because the VM can quite easily determine that the implementation for #yourself is the same for all objects.)
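This PIC behavior can be sketched as an illustrative Python model (the 8-entry limit and the megamorphic fallback policy are assumptions taken from the discussion, not any particular VM's implementation; all names are made up):

```python
# A toy model of a polymorphic inline cache (PIC), assuming the commonly
# cited limit of 8 entries; beyond that the send site goes "megamorphic"
# and falls back to a slower generic lookup.
PIC_LIMIT = 8

class SendSite:
    def __init__(self):
        self.cache = {}          # receiver class -> looked-up method
        self.megamorphic = False

    def send(self, receiver, selector):
        cls = type(receiver)
        if not self.megamorphic:
            hit = self.cache.get(cls)
            if hit is not None:
                return hit(receiver)          # fast path: cached dispatch
            if len(self.cache) < PIC_LIMIT:
                self.cache[cls] = getattr(cls, selector)
                return self.cache[cls](receiver)
            self.megamorphic = True           # cache full: stop caching
        return getattr(receiver, selector)()  # slow generic lookup

class A:
    def yourself(self): return self

site = SendSite()
obj = A()
assert site.send(obj, "yourself") is obj
```

With one receiver class the site stays monomorphic; feeding it more than eight classes flips it into the slow megamorphic regime, which is exactly the pressure the benchmark's type sequence applies.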
In a nutshell, micro-benchmarks are fine but should be more diverse. Measure
- monomorphic call sites (just one target),
- polymorphic call sites (small number of different targets), and
- megamorphic call sites (very large number of different targets).
The results of all of these together would tell more.
Also, an optimising VM normally takes some time to start optimising - before the adaptive optimisation logic sees that there are some "hot spots", usually the interpreter has to execute stuff for some time. Of course, this doesn't hold for Squeak.
And once the VM has started optimising, there is still some impact due to optimisation (it consumes time as well!). You normally let the benchmark run several times until you can be sure that the VM has applied all optimisations and measure the performance yielded by this "steady state". This results in numbers that report only actual performance instead of VM and optimisation interference.
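The warm-up procedure described above might look like this in Python (a minimal sketch; the convergence criterion and the workload are my own illustrative choices, not a prescribed methodology):

```python
# Repeat the workload until successive timings stop improving, then
# report the best run as the "steady state" figure.
import time

def steady_state_time(workload, max_runs=20, tolerance=0.05):
    timings = []
    for _ in range(max_runs):
        t0 = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - t0)
        # stop once the last two runs agree to within the tolerance
        if len(timings) >= 3 and abs(timings[-1] - timings[-2]) <= tolerance * timings[-2]:
            break
    return min(timings)

def workload():
    sum(i * i for i in range(100_000))

elapsed = steady_state_time(workload)
assert elapsed > 0
```

Reporting the steady-state number filters out interpreter warm-up and compilation overhead, which is the effect being described here.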
I wonder whether there is something like SPECjvm98 for Smalltalk systems.
Of course, we also shouldn't forget that Strongtalk has not been developed for some 10 years now, whereas VisualWorks has been constantly maintained by at least one VM guru. ;-)
Best,
Michael
Thank you Michael for your illustrative response.
I had taken most of the steps you mention before posting, but the one with stress on a small PIC size was rather unexpected. Perhaps it would be interesting to find out the actual limit and try again. But the performance of this mini-morphic situation is not convincing.
I intentionally used instances of classes which inherit from each other: this is the typical situation when processing collections - regardless of using Traits. And yes, as you mention, it can be interesting to have actual different *implementations* of a message. But I doubt that there will be a remarkable difference, since the methodCache is per receiver class (and so IMO there is no change for the example in my previous post).
Thanks again.
/Klaus
Hi Klaus,
On 12/17/06, Klaus D. Witzel klaus.witzel@cobss.com wrote:
But I doubt that there will be remarkable difference, since the methodCache is per receiver class
I don't understand that bit. What do you mean by it? In case of PICs, the cache is per send site.
Best,
Michael
Hi Michael,
on Sun, 17 Dec 2006 14:03:24 +0100, you wrote:
Hi Klaus,
On 12/17/06, Klaus D. Witzel klaus.witzel@cobss.com wrote:
But I doubt that there will be remarkable difference, since the methodCache is per receiver class
I don't understand that bit. What do you mean by it? In case of PICs, the cache is per send site.
In VW and in Squeak there's no PIC, and the corresponding thing which contributes to performance I called "methodCache" (still in the context of comparison).
I also believe (without having searched for it) that, when the PIC is exhausted (or not in use at bytecode time), the comparison (Squeak/VW/Strongtalk) reflects the performance of that "methodCache" thing.
/Klaus
Hi Klaus,
On 12/17/06, Klaus D. Witzel klaus.witzel@cobss.com wrote:
In VW and in Squeak there's no PIC, and the corresponding thing which contributes to performance I called "methodCache" (still in the context of comparison).
Squeak of course doesn't have PICs, but I was pretty sure VisualWorks had.
Best,
Michael
Hi Michael,
on Sun, 17 Dec 2006 14:44:01 +0100, you wrote:
Hi Klaus,
On 12/17/06, Klaus D. Witzel klaus.witzel@cobss.com wrote:
In VW and in Squeak there's no PIC, and the corresponding thing which contributes to performance I called "methodCache" (still in the context of comparison).
Squeak of course doesn't have PICs, but I was pretty sure VisualWorks had.
Will have a look if that is responsible for the figures :)
/Klaus
Klaus,
There are three issues here:
1) You did *not* run it enough under Strongtalk to compile the benchmark, so you are measuring interpreted performance. You need to run it until the performance speeds up and stabilizes. When it is compiled, on my machine (Sonoma Pentium M 1.7GHz), Squeak 3.1 runs the benchmark in 60453 ms, and Strongtalk runs it in 22139 ms. That's not the latest Squeak, but I doubt it has changed much. I don't have a recent VisualWorks installed, but from my knowledge of how the various systems work, I would expect VisualWorks to be a bit faster than Strongtalk at this (very poor) microbenchmark, for reasons explained below.
2) Andreas Raab was right in his comments. The performance you are measuring is *not* general Smalltalk performance, it is specifically the performance of megamorphic sends, which are one of the few cases where Strongtalk's type-feedback doesn't help at all.
Here is how sends work in Strongtalk:
Monomorphic and slightly polymorphic sends (1 or 2 receiver classes at the send site) can be inlined, which is the common case (over 90% of sends fall in this category), and that is where Strongtalk can give you big speedups.
Sends that have between 2 and 4 receiver classes are usually handled with a polymorphic inline cache (PIC), which is still a real dispatch and call, and is only slightly faster (if at all) than in other Smalltalks, since that is the most highly optimized piece of code in any normal Smalltalk implementation. PICs are not primarily for optimization; their real role is to gather type information for the inlining compiler. Note that VisualWorks now has PICs, so it uses the same technology for non-inlined sends as Strongtalk.
Sends that have more than 4 receiver types, such as your micro-benchmark, can't even use PICs or any kind of inline cache, so these are a full megamorphic send in Strongtalk, which is implemented as an actual hashed lookup, which is the slowest case of all. You might say that is what Smalltalk is all about, but in reality megamorphic sends are relatively rare as a percentage of sends. Compilers aren't magic: no one can eliminate the fundamental computation that a truly megamorphic send has to do. It *has* to do some kind of real lookup, and a call, so the performance will naturally be similar across all Smalltalks.
Every Smalltalk has that overhead. What Strongtalk does is eliminate that overhead when you don't really need it, when a send doesn't actually have many receiver classes. That is what other Smalltalks can't do: they make you pay the cost of a dispatch and call all the time, even if you don't need it, which is the common case.
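Conceptually, the unavoidable work of a megamorphic send is something like this (class, selector)-keyed cache probe. This is only a sketch in Smalltalk terms - methodCache and receiver are placeholders, and the real cache lives inside the VM:

```smalltalk
"Hash on (receiver class, selector), probe the cache, and fall back to a
 full lookup up the class hierarchy on a miss - then activate the method.
 The probe and the call happen on every megamorphic send."
method := methodCache
    at: receiver class -> #yourself
    ifAbsentPut: [receiver class lookupSelector: #yourself].
```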
So your 'picBench' isn't even measuring PIC performance.
3) I would expect VisualWorks to be about the same speed or a bit faster than Strongtalk on this atypical benchmark because of several factors. We have established that type-feedback doesn't help this benchmark, so from the point of view of sends, VisualWorks and Strongtalk would be doing basically the same kind of things. The reason VisualWorks would probably be a bit faster on this benchmark is because it probably does array bounds-check elimination and maybe even loop unrolling, which aren't yet implemented in Strongtalk, and I'm sure aren't implemented in Squeak. We did those in the Java VM, but hadn't yet gotten to that for Strongtalk; Strongtalk hasn't even really been tuned, and VisualWorks has been tuned for many years. Your benchmark consists of a tight inner loop that does only two things: a megamorphic send, and an array lookup. So the array bounds check and loop overhead are a significant factor, and if VisualWorks can optimize those, it would make a real difference.
But once again, this is not even remotely typical Smalltalk code. Array bounds-check elimination and loop unrolling are rarely applicable optimizations that generally only help when you have a very tight inner loop that does almost nothing, where the loop itself is a literal SmallInteger>>to:do: send, you are accessing an array, and the array access is literally embedded in the loop, not in a called method. How much of your code really looks like that?
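For concreteness, the rare pattern in question, with the array access embedded directly in a literal to:do: loop, has this shape:

```smalltalk
| arr sum |
arr := (1 to: 1000) asArray.
sum := 0.
"The #at: bounds check is provably redundant inside this loop (i always lies
 within 1..arr size), so a tuned compiler could hoist or eliminate it."
1 to: arr size do: [:i | sum := sum + (arr at: i)].
```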
-Dave
Thank you David for answering my question
Can somebody reproduce the figures, any other results? Have I done something wrong?
and thank you also for the explanations. I understand that PICs in Strongtalk are [in the current incarnation] limited to 4 entries, that's good to know.
Just a minor adjustment: the #at: on the array was never in doubt, and the integer loop was by intention because (I think) on all three systems it's compiled away already at the bytecode level, and the #at: is expected to be subsumed at the primitive level. I've seen walkbacks in Strongtalk in which the source code #to:do: was inlined with #whileTrue sans block, like in Squeak.
As to your figures, will retry with a "warmer" image :)
And I have nothing against people calling my test a poor benchmark. I wanted to compare the performance at this particular level, and according to your report, even there [i.e. at this level, unoptimized] Strongtalk is close to VW. And no, I would never say that megamorphic sends are all that Smalltalk is about.
Let me comment this one
... How much of your code really looks like that?
Well, at that level almost all users of collection #do: look like that. I just made the level below an O(1) constant, otherwise the polymorphic nature of "(array at: i) doSomethingPolymorphically" would perhaps have gone unnoticed.
Thanks again, very insightful.
/Klaus
Hi Klaus,
On 12/17/06, Klaus D. Witzel klaus.witzel@cobss.com wrote:
Thank you David for answering my question
Can somebody reproduce the figures, any other results? Have I done something wrong?
and thank you also for the explanations. I understand that PICs in Strongtalk are [in the current incarnation] limited to 4 entries, that's good to know.
Just a minor adjustment: the #at: on the array was never in doubt, and the integer loop was by intention because (I think) on all three systems it's compiled away already at the bytecode level, and the #at: is expected to be subsumed at the primitive level. I've seen walkbacks in Strongtalk in which the source code #to:do: was inlined with #whileTrue sans block, like in Squeak.
Yes, #to:do: is treated specially by the bytecode compiler, although it doesn't really have to be, since type-feedback would be able to inline and eliminate the block. The only reason it is treated specially is just so it still runs reasonably fast in the interpreter, before methods are compiled, because it is so important in inner loops. #at:, on the other hand, is not treated specially in Strongtalk, unlike most other Smalltalks.
As to your figures, will retry with a "warmer" image :)
And I have nothing against people calling my test a poor benchmark. I wanted to compare the performance at this particular level, and according to your report, even there [i.e. at this level, unoptimized] Strongtalk is close to VW. And no, I would never say that megamorphic sends are all that Smalltalk is about.
Let me comment this one
... How much of your code really looks like that?
Well, at that level almost all users of collection #do: look like that. I just made the level below an O(1) constant, otherwise the polymorphic nature of "(array at: i) doSomethingPolymorphically" would perhaps have gone unnoticed.
#do: loops are significantly different, because 1) they are not treated specially by the bytecode compiler, so there is a real block and usually a closure in most Smalltalks, 2) the implementation of #do:, which is where the inner loop might be, does not literally contain the body of the loop, so loop unrolling can't be applied by a non-inlining Smalltalk. Array bounds-check elimination might apply, but when the loop contains more than a few sends (including the additional Block>>value: send), the benefits rapidly become minor.
So in fact, a #do: benchmark (with a block that needs a closure, since all real #do: sends need a closure) would be a much better benchmark, because it's the way people actually write code, and sure enough Strongtalk can both inline the #do: implementation, and inline the block into the loop, so it would show much bigger advantages compared to other Smalltalks. And even that would understate the potential Strongtalk advantage, because if the compiler was tuned, it would be able to do bounds-check elimination and loop unrolling even for #do:, because it can inline the block, whereas VisualWorks would never be able to.
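Such a #do: variant of the benchmark could read as follows; it assumes the instances array has been built exactly as in the original post, and the counter exists only to force a real closure:

```smalltalk
"'count' is captured and mutated by the block, so a full closure is required."
| count |
count := 0.
Time millisecondsToRun: [
    1234567 timesRepeat: [
        instances do: [:each | count := count + 1. each yourself]]]
```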
Cheers, Dave
David Griswold writes:
Sends that have more than 4 receiver types, such as your micro-benchmark, can't even use PICs or any kind of inline cache, so these are a full megamorphic send in Strongtalk, which is implemented as an actual hashed lookup, which is the slowest case of all. You might say that is what Smalltalk is all about, but in reality megamorphic sends are relatively rare as a percentage of sends. Compilers aren't magic- no one can eliminate the fundamental computation that a truly megamorphic send has to do- it *has* to do some kind of real lookup, and a call, so the performance will naturally be similar across all Smalltalks.
I'm fairly sure that VisualWorks has a hash PIC that it uses for megamorphic sends. Eliot talked about this at Smalltalk Solutions. I also doubt that VW does any advanced optimizations such as global code motion (moving type checks out of loops) or loop unrolling. If it did, it would be faster than Exupery for the bytecode benchmark.
However, in this case if you're actually compiling your benchmark in Strongtalk it's possible that the performance difference between VW and Strongtalk is the method specialization done by Strongtalk.
Strongtalk, AFAIK, compiles a version for each receiver for a method. This is an optimization because it allows more precise type information to be gathered as it's not polluted by other classes use of an inherited method. Specializing methods by receiver should also allow faster inlining of self sends as they can be fully resolved at compile time. (1)
Having a separate compiled method for every receiver may be doing bad things to your CPU's instruction cache. That could be where Strongtalk's lack of performance here is coming from. First-level instruction caches are small; the largest on a desktop CPU is only 64KB. If you want to find out, it is possible to measure cache misses; unfortunately, I only know how to do this under Linux.
Microbenchmarks are getting less reliable as compilers and hardware become smarter.
Bryce
(1) Exupery also compiles a version of each method for each receiver. It does this to allow it to compile specialised versions of the #at: and #new primitives. Specialising is often the right thing to do, especially if you plan to inline methods.
A fully tuned compiler might, but might not, only specialise methods when it helps. However in general it may cost more to figure out when it helps to specialise than it costs to always specialise. Without extensive macro benchmarking it is dangerous to guess.
On 12/18/06, bryce@kampjes.demon.co.uk bryce@kampjes.demon.co.uk wrote:
I'm fairly sure that VisualWorks has a hash PIC that it uses for mega-morphic sends. Eliot talked about this at Smalltalk Solutions. I also doubt that VW does any advanced optimizations such as global code motion (moving type-checks out of loops) or loop unrolling. If it did it would be faster than Exupery for the bytecode benchmark.
I don't know exactly the details on how VW's hash PICs work, but I think my original comment holds: since both Strongtalk and VW do hashing for megamorphic sends, and type-feedback doesn't help Strongtalk for this case, I would expect them to be fairly similar in performance, modulo standard code quality issues that would reflect the level of tuning in the compiler.
It is quite possible that VW doesn't do loop unrolling (which is why my original post put less confidence on that), but I am pretty sure they do array bounds-check removal, which as I said I would expect would account for a good chunk of any performance difference (although at the moment we don't actually have comparative numbers, since no one has run both VW and compiled Strongtalk on the same machine on this benchmark).
Strongtalk should be able to move the Array access type-test out of the loop; I had assumed that VW could do that too, since it seems like a relatively easy thing to do.
However, in this case if you're actually compiling your benchmark in
Strongtalk it's possible that the performance difference between VW and Strongtalk is the method specialization done by Strongtalk.
Strongtalk, AFAIK, compiles a version for each receiver for a method. This is an optimization because it allows more precise type information to be gathered as it's not polluted by other classes use of an inherited method. Specializing methods by receiver should also allow faster inlining of self sends as they can be fully resolved at compile time. (1)
Having a separate compiled method for every receiver may be doing bad things to your CPU's instruction cache. That could be where Strongtalk's lack of performance here is coming from. First level instruction caches are small, the largest on a desktop CPU is only 64Kb. If you want to find out then it is possible to measure cache misses unfortunately I only know how to do this under Linux.
I doubt the instruction cache is the issue here, since the only customized methods involved are a few different versions of #yourself, which does nothing but return self, so the methods should only be a few instructions long. It should take a lot more than that to thrash the instruction cache.
And in general in Strongtalk, the code duplication caused by customization is counteracted by the fact that only hotspot code is compiled in the first place, unlike VW. The entire compiled code cache in Strongtalk for all code in the image is rarely bigger than 2-4 megabytes total, which is probably smaller than VW's code cache. There is probably a bit more instruction cache pressure in Strongtalk, but we've never seen anything that looked like a performance hit because of it, since all that really matters is whether the inner-loop working set of the moment thrashes or not, not the whole code cache.
Microbenchmarks are getting less reliable as compilers and hardware
become smarter.
Absolutely!
Cheers, Dave