Hi,
I'm working on profiling Squeak/Etoys for the OLPC project.
It seems to me that the bytecode cases in the interpret function (in /platform/unix/src/vm/interp.c) are not ordered with respect to the probability of their execution.
Here is a graph for reference (more games in Etoys have the same pattern): http://www.bodhidharma.info/instructions.pdf
What I did so far is change the `256-case switch' statement to an array of function pointers, invoked like (*exec_bytecode_funcs[bytecode_id])( ...shared data... ). The process was automated with a Python script and needed a little bit of code cleaning; after some testing I couldn't trigger any sort of bug in the new code.
If we assume a constant time T both for the execution of each [if !(right_case) jump next_case] test in the old code and for the function-pointer dereference in the new code, there is a ~10000% gain for the task of dispatching to the right action for a given bytecode (given the distribution shown in the graph linked above).
In my tests this is a 20-30% win over the interpret routine timings for different games in Etoys.
I'd appreciate any comments on this work. Maybe it could be done better than this?
Thanks, Riccardo
Riccardo Lucchese wrote:
Hi,
I'm working on profiling Squeak/Etoys for the OLPC project.
It seems to me that the bytecode cases in the interpret function (in /platform/unix/src/vm/interp.c) are not ordered with respect to the probability of their execution.
Here is a graph for reference (more games in Etoys have the same pattern): http://www.bodhidharma.info/instructions.pdf
What I did so far is change the `256-case switch' statement to an array of function pointers, invoked like (*exec_bytecode_funcs[bytecode_id])( ...shared data... ). The process was automated with a Python script and needed a little bit of code cleaning; after some testing I couldn't trigger any sort of bug in the new code.
If we assume a constant time T both for the execution of each [if !(right_case) jump next_case] test in the old code and for the function-pointer dereference in the new code, there is a ~10000% gain for the task of dispatching to the right action for a given bytecode (given the distribution shown in the graph linked above).
In my tests this is a 20-30% win over the interpret routine timings for different games in Etoys.
Is that a 20-30% overall speedup? If so, it is great!
Karl
I'd appreciate any comments on this work. Maybe it could be done better than this?
Thanks, Riccardo
On Sat, Apr 19, 2008 at 1:48 PM, karl karl.ramberg@comhem.se wrote:
Is that a 20-30% overall speedup? If so, it is great!
Karl
I'm working on one function at a time, as they step up in the profiling results.
I can't say exactly what the overall improvement is; in the patched code, execution time is dominated by primitiveWarpBits instead of the interpret function. As those two functions are by far the most CPU-intensive ones, I think the overall speedup is quite significant.
Is anybody already working on performance in Squeak? I'd really like to help with this part of the platform!
Riccardo
Riccardo Lucchese wrote:
On Sat, Apr 19, 2008 at 1:48 PM, karl karl.ramberg@comhem.se wrote:
Is that a 20-30% overall speedup? If so, it is great!
Karl
I'm working on one function at a time, as they step up in the profiling results.
I can't say exactly what the overall improvement is; in the patched code, execution time is dominated by primitiveWarpBits instead of the interpret function. As those two functions are by far the most CPU-intensive ones, I think the overall speedup is quite significant.
Is anybody already working on performance in Squeak? I'd really like to help with this part of the platform!
Your effort is greatly appreciated; a faster VM is a good thing. I think the VM list is where primitive and plugin questions are mostly addressed: http://lists.squeakfoundation.org/mailman/listinfo/vm-dev Karl
Hi!
Quoting Riccardo Lucchese riccardo.lucchese@gmail.com:
Hi,
I'm working on profiling Squeak/Etoys for the OLPC project.
It seems to me that the bytecode cases in the interpret function (in /platform/unix/src/vm/interp.c) are not ordered with respect to the probability of their execution.
Since the switch-case statement in C is compiled into a jump table (http://en.wikipedia.org/wiki/Jump_table), the order of the branches shouldn't affect the speed of execution (you can even start with the default branch).
Here is a graph for reference (more games in Etoys have the same pattern): http://www.bodhidharma.info/instructions.pdf
What I did so far is change the `256-case switch' statement to an array of function pointers, invoked like (*exec_bytecode_funcs[bytecode_id])( ...shared data... );
Actually it should be slower, because function calls tend to be slower than simple jumps or arithmetic instructions.
The process was automated with a Python script and needed a little bit of code cleaning; after some testing I couldn't trigger any sort of bug in the new code.
AFAIK interp.c is generated with VMMaker, so for a new VM you would have to run your script again. A pure Squeak solution might be better.
If we assume a constant time T both for the execution of each [if !(right_case) jump next_case] test in the old code and for the function-pointer dereference in the new code, there is a ~10000% gain for the task of dispatching to the right action for a given bytecode (given the distribution shown in the graph linked above).
This gain didn't come because of the jump table.
In my tests this is a 20-30% win over the interpret routine timings for different games in Etoys.
Are you sure that the performance gain came from these changes? If so, then the only reason for the speedup I can think of is that the most common bytecodes' code is scattered across memory pages and there are more page faults with the switch-case implementation. I wonder if you could check how many page faults the two implementations have.
Cheers, Levente
I'd appreciate any comments on this work. Maybe it could be done better than this?
Thanks, Riccardo
2008/4/19 Levente Uzonyi leves@elte.hu:
Hi!
Quoting Riccardo Lucchese riccardo.lucchese@gmail.com:
Hi,
I'm working on profiling Squeak/Etoys for the OLPC project.
It seems to me that the bytecode cases in the interpret function (in /platform/unix/src/vm/interp.c) are not ordered with respect to the probability of their execution.
Since the switch-case statement in C is compiled into a jump table (http://en.wikipedia.org/wiki/Jump_table), the order of the branches shouldn't affect the speed of execution (you can even start with the default branch).
Here is a graph for reference (more games in Etoys have the same pattern): http://www.bodhidharma.info/instructions.pdf
What I did so far is change the `256-case switch' statement to an array of function pointers, invoked like (*exec_bytecode_funcs[bytecode_id])( ...shared data... );
Actually it should be slower, because function calls tend to be slower than simple jumps or arithmetic instructions.
The process was automated with a Python script and needed a little bit of code cleaning; after some testing I couldn't trigger any sort of bug in the new code.
AFAIK interp.c is generated with VMMaker, so for a new VM you would have to run your script again. A pure Squeak solution might be better.
Moreover, there is already a script which does exactly the same as what Riccardo did. See the gnuify step, which transforms interp.c into gnu-interp.c, replaces all cases with jump labels, and converts the code to use a jump table of label addresses. The reason it might fail could be that you made some modifications to the code, which make the script fail to find the exact patterns in the generated sources. But if you follow the common procedure for generating the VM (VMMaker -> make), it should work OK.
If we assume a constant time T both for the execution of each [if !(right_case) jump next_case] test in the old code and for the function-pointer dereference in the new code, there is a ~10000% gain for the task of dispatching to the right action for a given bytecode (given the distribution shown in the graph linked above).
This gain didn't come because of the jump table.
In my tests this is a 20-30% win over the interpret routine timings for different games in Etoys.
Are you sure that the performance gain came from these changes? If so, then the only reason for the speedup I can think of is that the most common bytecodes' code is scattered across memory pages and there are more page faults with the switch-case implementation. I wonder if you could check how many page faults the two implementations have.
Cheers, Levente
I'd appreciate any comments on this work. Maybe it could be done better than this?
Thanks, Riccardo
Since the switch-case statement in C is compiled into a jump table (http://en.wikipedia.org/wiki/Jump_table), the order of the branches shouldn't affect the speed of execution (you can even start with the default branch).
Many thanks for the pointer! I didn't know about this one!
The process was automated with a Python script and needed a little bit of code cleaning; after some testing I couldn't trigger any sort of bug in the new code.
AFAIK interp.c is generated with VMMaker, so for a new VM you would have to run your script again. A pure Squeak solution might be better.
I don't understand what you mean. I just patched the C file and rebuilt Squeak (interp.c doesn't get regenerated at every build).
Are you sure that the performance gain came from these changes? If so, then the only reason for the speedup I can think of is that the most common bytecodes' code is scattered across memory pages and there are more page faults with the switch-case implementation. I wonder if you could check how many page faults the two implementations have.
Is there some automated test that I can run to verify just the VM throughput? A cachegrind run on that test would give more interesting results.
Thanks, Riccardo
2008/4/19 Riccardo Lucchese riccardo.lucchese@gmail.com:
Since the switch-case statement in C is compiled into a jump table (http://en.wikipedia.org/wiki/Jump_table), the order of the branches shouldn't affect the speed of execution (you can even start with the default branch).
Many thanks for the pointer! I didn't know about this one!
The process was automated with a Python script and needed a little bit of code cleaning; after some testing I couldn't trigger any sort of bug in the new code.
AFAIK interp.c is generated with VMMaker, so for a new VM you would have to run your script again. A pure Squeak solution might be better.
I don't understand what you mean. I just patched the C file and rebuilt Squeak (interp.c doesn't get regenerated at every build).
Are you sure that the performance gain came from these changes? If so, then the only reason for the speedup I can think of is that the most common bytecodes' code is scattered across memory pages and there are more page faults with the switch-case implementation. I wonder if you could check how many page faults the two implementations have.
Is there some automated test that I can run to verify just the VM throughput? A cachegrind run on that test would give more interesting results.
I suggest that before proceeding further you read here: http://wiki.squeak.org/squeak/2105 and here: http://squeakvm.org
Thanks, Riccardo
I suggest that before proceeding further you read here: http://wiki.squeak.org/squeak/2105 and here: http://squeakvm.org
Thanks for the links. I'm just pulling the source being used on the OLPC platform.
Is there no test tool to measure throughput?
Thanks, Riccardo
2008/4/19 Riccardo Lucchese riccardo.lucchese@gmail.com:
I suggest that before proceeding further you read here: http://wiki.squeak.org/squeak/2105 and here: http://squeakvm.org
Thanks for the links. I'm just pulling the source being used on the OLPC platform.
Is there no test tool to measure throughput?
Most Smalltalkers measure throughput with the following Smalltalk code:
0 tinyBenchmarks
:) There are also some macro benchmarks; if you search the mail archive, there was a discussion concerning them lately.
Thanks, Riccardo
Open transcript then do it on:
10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr]
I run this before any mac carbon VM is shipped and compare the results to previous runs to ensure the compiler hasn't broken the expected assembler optimization.
On Apr 19, 2008, at 9:22 AM, Riccardo Lucchese wrote:
Most Smalltalkers measure throughput with the following Smalltalk code:
0 tinyBenchmarks
Hmm.. how do I run it? :)
... couldn't find any pointer in 40 minutes of googling. I've never worked with Smalltalk and that single line is driving me crazy :)
--
John M. McIntosh johnmci@smalltalkconsulting.com
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
Hi!
Open transcript then do it on:
10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr]
What I'm doing is:
- run startsqueak
- drag a Transcript window from the tools bar to the 'desktop'
- write the line you suggested
- wait for the magic, but nothing happens :)
Shame on me... hmm, how do kids actually sort it out? :)
Thank you for your help. I'd just like to see some numbers and then shut up if I'm all wrong with my initial assumptions.
Riccardo
Hi!
Try pressing alt + d (if you have an alt key) or open the menu and select "do it"
Levente
Quoting Riccardo Lucchese riccardo.lucchese@gmail.com:
Hi!
Open transcript then do it on:
10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr]
What I'm doing is:
- run startsqueak
- dragging a Transcript window from the tools bar to the 'desktop'
- write the line you suggested me
- wait for the magic, but nothing happens :)
Shame on me... hmm, how do kids actually sort it out? :)
Thank you for your help. I'd just like to see some numbers and then shut up if I'm all wrong with my initial assumptions.
Riccardo
"Riccardo" == Riccardo Lucchese riccardo.lucchese@gmail.com writes:
Riccardo> Hi!
Open transcript then do it on:
10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr]
Riccardo> What I'm doing is:
Riccardo> - run startsqueak
Riccardo> - dragging a Transcript window from the tools bar to the 'desktop'
Riccardo> - write the line you suggested me
Riccardo> * wait for the magic, but nothing happens :)
Riccardo> Shame on me... mm how do kids actually sort it out ? :)
Transcript is not Workspace.
Transcript is where you see things.
Workspace is where you type things, then use do-it or print-it or debug-it.
2008/4/20 Randal L. Schwartz merlyn@stonehenge.com:
"Riccardo" == Riccardo Lucchese riccardo.lucchese@gmail.com writes:
Riccardo> Hi!
Open transcript then do it on:
10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr]
Riccardo> What I'm doing is:
Riccardo> - run startsqueak
Riccardo> - dragging a Transcript window from the tools bar to the 'desktop'
Riccardo> - write the line you suggested me
Riccardo> * wait for the magic, but nothing happens :)
Riccardo> Shame on me... mm how do kids actually sort it out ? :)
Transcript is not Workspace.
Transcript is where you see things.
Workspace is where you type things, then use do-it or print-it or debug-it.
Actually there is little difference between a Transcript and a Workspace, or any other place where you can enter text and press 'do it' :) The Inspector and Debugger are related to a context, so you can send messages to inst vars or use a method's temps or arguments in an expression. But that makes little difference for expressions which are context-free, like '0 tinyBenchmarks'.
-- Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095 merlyn@stonehenge.com URL:http://www.stonehenge.com/merlyn/ Perl/Unix/security consulting, Technical writing, Comedy, etc. etc. See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
On 20.04.2008, at 01:09, Riccardo Lucchese wrote:
Hi!
Open transcript then do it on:
10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr]
What I'm doing is:
- run startsqueak
- dragging a Transcript window from the tools bar to the 'desktop'
- write the line you suggested me
- wait for the magic, but nothing happens :)
Shame on me... hmm, how do kids actually sort it out? :)
Thank you for your help. I'd just like to see some numbers and then shut up if I'm all wrong with my initial assumptions.
Riccardo,
you posted to squeak-dev so we assumed you were familiar with Squeak. For beginners' questions we have a very friendly mailing list:
http://lists.squeakfoundation.org/mailman/listinfo/beginners
You should get somewhat accustomed to programming in Squeak/Smalltalk before digging into the virtual machine. If you had to improve the performance of a bash script, you wouldn't start by hacking the Linux kernel either, would you? ;-)
Anyway, we greatly appreciate help, in particular on the OLPC platform. Great to have a summer-of-code student on board! I am one of the core developers for the OLPC version of Etoys. We have an OLPC-specific mailing list at
http://lists.laptop.org/listinfo/etoys
This list here is for general Squeak topics, there's nothing too hard to ask, the fine folks around actually like answering tough questions (just not the kind of very basic "what do I have to do to execute code" type questions). The virtual machine hackers have their own list, as most folks here are quite happy not having to deal with C anymore:
http://lists.squeakfoundation.org/mailman/listinfo/vm-dev
For general improvements, I tried last week to compile a Geode-optimized virtual machine using Rob Savoye's tool chain. It seemed to have worked, but I saw no improvement at all. Maybe I did something wrong in installing his tools, though.
Most optimization possibilities I see are on the Smalltalk level anyway. And it's possible to get quite some speed improvements there - like what Ties Stuij of OLE Nepal did, who is new to Squeak, too, and already improved project loading time quite a bit (granted, he was "initiated" by Luke Gorrie, who had hands-on training by me - your best bet to get up to speed would be to find a Squeak hacker near you. Where are you located?)
On the VM-level I see some improvement possibilities in particular in drawing speed. We use a generic BitBlt implementation for everything. Rotated bitmap display ("WarpBlt") is taking quite some time on the XO due to its incredibly high-resolution display. 1200x900 pixels make a lot of bits to push around for the little Geode CPU. Recoding the inner WarpBlt loop in MMX for example could gain quite a lot, also the color-format conversion (Squeak is 5-5-5, the X server uses 5-6-5).
- Bert -
Dear Bert,
Anyway, we greatly appreciate help, in particular on the OLPC platform. Great to have a summer-of-code student on board! I am one of the core developers for the OLPC version of Etoys. We have an OLPC-specific mailing list at
subscribed :)
This list here is for general Squeak topics, there's nothing too hard to ask, the fine folks around actually like answering tough questions (just not the kind of very basic "what do I have to do to execute code" type questions). The virtual machine hackers have their own list, as most folks here are quite happy not having to deal with C anymore:
It's ok, they were very nice to me :). I just stopped posting as I was trying to understand more of squeak/etoys before giving any feedback.
Most optimization possibilities I see are on the Smalltalk level anyway. And it's possible to get quite some speed improvements there - like what Ties Stuij of OLE Nepal did, who is new to Squeak, too, and already improved project loading time quite a bit (granted, he was "initiated" by Luke Gorrie, who had hands-on training by me - your best bet to get up to speed would be to find a Squeak hacker near you. Where are you located?)
I live near Venice in Italy.
On the VM-level I see some improvement possibilities in particular in drawing speed. We use a generic BitBlt implementation for everything. Rotated bitmap display ("WarpBlt") is taking quite some time on the XO due to its incredibly high-resolution display. 1200x900 pixels make a lot of bits to push around for the little Geode CPU. Recoding the inner WarpBlt loop in MMX for example could gain quite a lot, also the color-format conversion (Squeak is 5-5-5, the X server uses 5-6-5).
My main goal is profiling Sugar and I don't think I'll be able to work on the Smalltalk part of Squeak; we'll see :)
I'll mail you directly (hope it's ok) and stop the noise here :)
Riccardo
Hi, Riccardo,
For OLPC Nepal, backward compatibility with Squeak 3.2 wouldn't be needed. So, in ImageSegment>>comeFullyUpOnReload:, the part below can be removed.
---------------------------------------
arrayOfRoots do: [:importedObject |
	((importedObject isMemberOf: WideString) or: [importedObject isMemberOf: WideSymbol]) ifTrue: [
		importedObject mutateJISX0208StringToUnicode.
		importedObject class = WideSymbol ifTrue: [
			"self halt."
			Symbol hasInterned: importedObject asString ifTrue: [:multiSymbol |
				multiSymbol == importedObject ifFalse: [
					importedObject becomeForward: multiSymbol.
				].
			].
		].
	].
	(importedObject isMemberOf: TTCFontSet) ifTrue: [
		existing _ TTCFontSet familyName: importedObject familyName
			pointSize: importedObject pointSize. "supplies default"
		existing == importedObject ifFalse: [importedObject becomeForward: existing].
	].
].
---------------------------------------
The real gain would come if we cut out the inefficiency in SmartRefStream around the class-shape change and the nested structure from DataStream. Is there something we could primitivize around it?
A simple entry point for profiling project loading would be something like:
MessageTally spyOn: [ProjectLoading loadFromDir: '/usr/share/etoys/ExampleEtoys' projectName: 'DemonCastle1'].
"Near Venice"... That almost made me want to fly there^^;
-- Yoshiki
At Mon, 21 Apr 2008 12:39:28 +0200, Riccardo Lucchese wrote:
Dear Bert,
Anyway, we greatly appreciate help, in particular on the OLPC platform. Great to have a summer-of-code student on board! I am one of the core developers for the OLPC version of Etoys. We have an OLPC-specific mailing list at
subscribed :)
This list here is for general Squeak topics, there's nothing too hard to ask, the fine folks around actually like answering tough questions (just not the kind of very basic "what do I have to do to execute code" type questions). The virtual machine hackers have their own list, as most folks here are quite happy not having to deal with C anymore:
It's ok, they were very nice to me :). I just stopped posting as I was trying to understand more of squeak/etoys before giving any feedback.
Most optimization possibilities I see are on the Smalltalk level anyway. And it's possible to get quite some speed improvements there - like what Ties Stuij of OLE Nepal did, who is new to Squeak, too, and already improved project loading time quite a bit (granted, he was "initiated" by Luke Gorrie, who had hands-on training by me - your best bet to get up to speed would be to find a Squeak hacker near you. Where are you located?)
I live near Venice in Italy.
On the VM-level I see some improvement possibilities in particular in drawing speed. We use a generic BitBlt implementation for everything. Rotated bitmap display ("WarpBlt") is taking quite some time on the XO due to its incredibly high-resolution display. 1200x900 pixels make a lot of bits to push around for the little Geode CPU. Recoding the inner WarpBlt loop in MMX for example could gain quite a lot, also the color-format conversion (Squeak is 5-5-5, the X server uses 5-6-5).
My main goal is profiling Sugar and I don't think I'll be able to work on the Smalltalk part of Squeak; we'll see :)
I'll mail you directly (hope it's ok) and stop the noise here :)
Riccardo
On Apr 19, 2008, at 8:36 AM, Igor Stasenko wrote:
There are also some macro benchmarks; if you search the mail archive, there was a discussion concerning them lately.
Smalltalk macroBenchmarks
A quick check showed the Squeak image at http://ftp.squeak.org/3.4/ contained the code; it was deleted in later versions, not sure when, but someone could look.
I note standardTime: has to be changed: the line
	tieDown _ ByteArray new: spaceLeft - 1e7. "Leave exactly 10MB free"
should be removed. Its intent was to pre-allocate memory, but this might, depending on the VM, allocate 512MB or 1024MB of virtual memory because of how the VMs work now, and not have the desired effect it had when VMs had fixed memory sizes eight or so years back.
On 19-Apr-08, at 7:22 AM, Riccardo Lucchese wrote:
AFAIK interp.c is generated with VMMaker, so for a new VM you would have to run your script again. A pure Squeak solution might be better.
I don't understand what you mean. I just patched the C file and rebuilt Squeak (interp.c doesn't get regenerated at every build).
The VM code exists in two parts:
a) the handwritten C (or C++ in some places) that implements the platform-specific bits of the system, such as the interface to the OS filing system, the IO, etc. There are also some VM plugins with handwritten parts.
b) the Slang code, implemented in Smalltalk, that gets translated into C by the VMMaker tool.
VMMaker is discussed at http://wiki.squeak.org/squeak/VMMaker and related links. It translates the relevant classes, writes the translated code to files, copies some of the handwritten files to suit the platform make setup and creates some config files. Each platform then has its own idiosyncratic (not to mention frequently idiotic) process to follow to compile and link that bunch into a working VM and (possibly, depending on details not relevant here) plugins.
The central VM file is interp.c and you really shouldn't waste your time playing with it. It is barely readable, poorly structured and should be considered a temporary intermediate file; would you bother trying to read the intermediate language file(s) some compilers produce? The exception to this is when you have some really simple problem in the compiler - like complaints about a type mismatch for example - that you can very quickly edit in interp.c to see if it can be made to compile, after which you have to make the proper fix in the Slang code. Again, interp.c is a temporary file that should never ever be saved in any source control system. And yes, I know that it frequently is, and that is just plain bugfuck stupid.
With regard to your original message about the bytecode dispatch 'switch' statement - well, I don't think I've seen a compiler do something that dumb in a long time. A jump or branch table is the normal case, and as has been mentioned, the 'gnuify' step in the makefiles of most platforms even improves upon that when possible. One could consider building a table of function pointers and indexing into it with the bytecode - but that would put the table outside the run of the code and would require bytecode routines to be functions, etc. In the current system the bytecode 'functions' are textually inlined into the main interpret() function, as you would see by examining the VMMaker-related code. It is distressingly easy to come up with 'optimisations' that completely ruin performance. Further, the bytecode dispatch time is unlikely to figure anywhere in typical performance profiles. Would anyone really characterise the performance of a system on the basis of how fast it can do trivial loops? Or is it more sensible to have decent benchmarks that reflect real usage?
It is important to measure the right things the right way under the right circumstances if you want to uncover meaningful facts. In the early days of RISC development the designers would explain how they ran benchmarks to find out what a CPU really spent its time doing, and then designed the architecture to optimise for those activities - no Test Left, Shift Mask and Dim the Lights instruction for us! - so that all would be wonderful. It was rather slyly pointed out that this could only result in a CPU that could run diagnostics and Eratosthenes' Sieve *really* fast, and left everything else to exception handling.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Futuristic: It will only run on a next generation supercomputer.
Levente Uzonyi wrote:
It seems to me that the bytecode cases in the interpret function (in /platform/unix/src/vm/interp.c) are not ordered with respect to the probability of their execution.
Since the switch-case statement in C is compiled into a jump table (http://en.wikipedia.org/wiki/Jump_table), the order of the branches shouldn't affect the speed of execution (you can even start with the default branch).
In theory. In practice, the effect of branch mispredictions can be quite noticeable:
http://lists.squeakfoundation.org/pipermail/squeak-dev/2003-July/061996.html
Cheers, - Andreas
On Apr 19, 2008, at 1:28 AM, Riccardo Lucchese wrote:
Hi,
I'm working on profiling Squeak/Etoys for the OLPC project.
Well, I'll note that since you are working with a specific hardware setup, it's worth looking very carefully at the compiler optimization for the CPU in question. I'd guess the compiler settings used by the build process are fairly generic and do not take advantage of specific platform optimizations, thus producing safe and lousy code to make an 80386 happy...
Oh, say, see:
http://lists.laptop.org/pipermail/devel/2007-November/007430.html
http://wiki.laptop.org/go/Geode_optimization_effort
http://gcc.gnu.org/ml/gcc-patches/2006-08/msg00452.html
Also, setting -DUSE_INLINE_MEMORY_ACCESSORS (or NOT setting it) in the compiler options might also have a different result, since that changes how memory accessing is done for the accessors in interp.c.
The other area is variable access: you'll note the foo structure is built with the variables sorted by usage. I did this as a guess at better cache-line optimization versus a random or alphabetical distribution. However, I don't think anyone has done a careful study of how the variables are best arranged. Mind you, when I did this on the PowerPC I was *unable* to see any difference.
Lastly, ArraysToGlobalStruct-JMM.1.cs in the Squeak Mac source tree is used to modify VMMaker to put (or not put) arrays in the global structure foo. In the past, depending on the compiler, this would produce good or horrible code for accessing an array that was part of a structure.
You would need to look at this change set, the current VMMaker code, then see if using it makes a difference in the assembler code generated.
Oh and -fgcse
Fiddling with that might produce different things depending on the compiler....
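Pulling John's suggestions together, a build invocation might look like this (a hypothetical sketch: the flag names are real GCC/Squeak options mentioned above, but -march=geode needs a Geode-aware GCC and the exact configure incantation depends on the platform tree):

```shell
# Hypothetical: tune the Unix VM build for the XO's Geode CPU.
#   -march=geode                    needs the Geode GCC patches linked above
#   -fgcse                          global common-subexpression elimination
#   -DUSE_INLINE_MEMORY_ACCESSORS   toggles how interp.c accesses memory
CFLAGS="-O2 -march=geode -fgcse -DUSE_INLINE_MEMORY_ACCESSORS" ./configure
make
```

Each flag is worth A/B benchmarking on its own, since as John notes the effects vary by compiler.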