Dear Smalltalk community:
We are happy to announce, now officially, RoarVM: the first single-image manycore virtual machine for Smalltalk.
The RoarVM supports the parallel execution of Smalltalk programs on x86-compatible multicore systems and Tilera TILE64-based manycore systems. It is tested with standard Squeak 4.1 closure-enabled images, and with a stripped-down version of an MVC-based Squeak 3.7 image. Support for Pharo 1.2 is currently limited to 1 core, but we are working on it.
A small teaser:
1 core:  66286897 bytecodes/sec; 2910474 sends/sec
8 cores: 470588235 bytecodes/sec; 19825677 sends/sec
RoarVM is based on the work of David Ungar and Sam S. Adams at IBM Research. The port to x86 multicore systems was done by me. They open-sourced their VM, formerly known as the Renaissance VM (RVM), under the Eclipse Public License [1]. Official announcement of the IBM source code release: http://soft.vub.ac.be/~smarr/rvm-open-source-release/ [2]
The source code of the RoarVM has been released as open source to enable the Smalltalk community to evaluate the ideas and possibly integrate them into existing systems. So, the RoarVM is meant for experimenting with Smalltalk systems on multi- and manycore machines.
The open-source project and downloads can also be found on GitHub:
http://github.com/smarr/RoarVM
http://github.com/smarr/RoarVM/downloads
For more detailed information, please refer to the README file [3]. Instructions to compile the RoarVM on Linux and OS X can be found at [4]. Windows is currently not supported; however, there is a good chance that it will work with Cygwin or pthreads-win32, but that has not been verified in any way. If you feel brave, please give it a shot and report back.
If the community does not object, we would like to use the vm-dev@lists.squeakfoundation.org mailing list for related discussions. So, if you run into any trouble while experimenting with the RoarVM, do not hesitate to report problems and ask questions.
You can also follow us on Twitter @roarvm [5].
Best regards Stefan Marr
[1] http://www.eclipse.org/legal/epl-v10.html
[2] http://soft.vub.ac.be/~smarr/rvm-open-source-release/
[3] http://github.com/smarr/RoarVM/blob/master/README.rst
[4] http://github.com/smarr/RoarVM/blob/master/INSTALL.rst
[5] http://twitter.com/roarvm
On 3 November 2010 15:13, Stefan Marr squeak@stefan-marr.de wrote:
[snip]
If the community does not object, we would like to use the vm-dev@lists.squeakfoundation.org mailing list for related discussions. So, if you run into any trouble while experimenting with the RoarVM, do not hesitate to report problems and ask questions.
No objections from me! You are welcome there!
On 03/11/10 9:13 AM, Stefan Marr wrote:
A small teaser:
1 core:  66286897 bytecodes/sec; 2910474 sends/sec
8 cores: 470588235 bytecodes/sec; 19825677 sends/sec
I'm trying to understand what is meant by this benchmark. How does tinyBenchmarks get run on 8 cores? How does the work get distributed among the cores? But I cannot find the answer quickly - see below.
For more detailed information, please refer to the README file[3].
The links take you to the ACM portal. One promising link on the portal: DLS '09 Proceedings of the 5th symposium on Dynamic languages is a dead link to HPI (www.uni-potsdam.de). I doubt I want to subscribe to ACM portal at this point, just to find out more about RoarVM. And, given that ACM portal is the only place to find out more, I doubt I'm going to look at RoarVM for any more than the brief glance through the github source that I've already done.
Hi Yanni:
On 03 Nov 2010, at 17:45, Yanni Chiu wrote:
On 03/11/10 9:13 AM, Stefan Marr wrote:
A small teaser:
1 core:  66286897 bytecodes/sec; 2910474 sends/sec
8 cores: 470588235 bytecodes/sec; 19825677 sends/sec
I'm trying to understand what is meant by this benchmark. How does tinyBenchmarks get run on 8 cores? How does the work get distributed among the cores? But I cannot find the answer quickly - see below.
It is an adapted version of the tinyBenchmarks. I will make the code available soonish...
Anyway, the idea is that the benchmark is started n times in n Smalltalk processes, and then the overall execution time is measured. If more cores are actually scheduling the started processes for execution, you will see the speedup above; otherwise, you just see your normal sequential performance.
For more detailed information, please refer to the README file[3].
The links take you to the ACM portal. One promising link on the portal: DLS '09 Proceedings of the 5th symposium on Dynamic languages is a dead link to HPI (www.uni-potsdam.de). I doubt I want to subscribe to ACM portal at this point, just to find out more about RoarVM. And, given that ACM portal is the only place to find out more, I doubt I'm going to look at RoarVM for any more than the brief glance through the github source that I've already done.
I am sorry that there is a paywall; however, I can't really do anything about it. But you can always ask the authors of a particular paper for a copy, I suppose.
So, the question is how the work gets distributed? Well, like in Multiprocessor Smalltalk:
You have one scheduler, as in a standard image; it is only slightly adapted to accommodate the fact that there can be more than one activeProcess (see http://github.com/smarr/RoarVM/raw/master/image.st/RVM-multicore-support.mvc...).
Once a core gets to a point where it needs to reevaluate its scheduling decision, it goes to the central scheduler and picks a process to execute. Nothing fancy here; no sophisticated work stealing or distribution process.
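A rough C sketch of that scheme (hypothetical names and data structures, not RoarVM's own): one run queue shared by all cores, guarded by a single mutex that a core takes only at its scheduling points.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct Process { struct Process *next; } Process;

static Process *run_queue = NULL;                     /* one queue for all cores */
static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;

/* Make a process runnable again. */
void scheduler_add(Process *p) {
    pthread_mutex_lock(&sched_lock);
    p->next = run_queue;
    run_queue = p;
    pthread_mutex_unlock(&sched_lock);
}

/* Called by a core when it reevaluates its scheduling decision;
 * returns NULL if nothing is runnable. */
Process *scheduler_take(void) {
    pthread_mutex_lock(&sched_lock);
    Process *p = run_queue;
    if (p != NULL) run_queue = p->next;
    pthread_mutex_unlock(&sched_lock);
    return p;
}
```

The single lock is the point: no per-core queues, no work stealing, just contention on one mutex whenever a core re-picks.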
Does that answer your question?
Best regards Stefan
On 03/11/10 1:11 PM, Stefan Marr wrote:
So, the question is how the work gets distributed? Well, like in Multiprocessor Smalltalk:
You have one scheduler, like in a standard image, it is only slightly adapted to accommodate the fact that there can be more than one activeProcess (See http://github.com/smarr/RoarVM/raw/master/image.st/RVM-multicore-support.mvc...).
Once a core gets to a point where it needs to reevaluate its scheduling decision, it goes to the central scheduler and picks a process to execute. Nothing fancy here; no sophisticated work stealing or distribution process.
Does that answer your question?
Not on first reading, but combined with your other answers, I think I've got it.
On 11/3/2010 6:13 AM, Stefan Marr wrote:
We are happy to announce, now officially, RoarVM: the first single-image manycore virtual machine for Smalltalk.
Congrats! That's a great step forward.
RoarVM is based on the work of David Ungar and Sam S. Adams at IBM Research. The port to x86 multicore systems was done by me. They open-sourced their VM, formerly known as the Renaissance VM (RVM), under the Eclipse Public License [1]. Official announcement of the IBM source code release: http://soft.vub.ac.be/~smarr/rvm-open-source-release/
The source code of the RoarVM has been released as open source to enable the Smalltalk community to evaluate the ideas and possibly integrate them into existing systems. So, the RoarVM is meant to experiment with Smalltalk systems on multi- and manycore machines.
Can you give us a high-level overview about what "the trick" is? I.e., which approach did you take to make this possible?
If the community does not object, we would like to use the vm-dev@lists.squeakfoundation.org mailing list for related discussions. So, if you run into any trouble while experimenting with the RoarVM, do not hesitate to report problems and ask questions.
Sounds great. Thanks for sharing.
Cheers, - Andreas
Hi Andreas:
On 03 Nov 2010, at 17:57, Andreas Raab wrote:
The source code of the RoarVM has been released as open source to enable the Smalltalk community to evaluate the ideas and possibly integrate them into existing systems. So, the RoarVM is meant to experiment with Smalltalk systems on multi- and manycore machines.
Can you give us a high-level overview about what "the trick" is? I.e., which approach did you take to make this possible?
Ehm, sorry, I do not really know where to start. Could you be a bit more specific with your question?
A probably too condensed overview is given in the technical section of the README: http://github.com/smarr/RoarVM/blob/master/README.rst
Best regards Stefan
On 11/3/2010 10:27 AM, Stefan Marr wrote:
On 03 Nov 2010, at 17:57, Andreas Raab wrote:
The source code of the RoarVM has been released as open source to enable the Smalltalk community to evaluate the ideas and possibly integrate them into existing systems. So, the RoarVM is meant to experiment with Smalltalk systems on multi- and manycore machines.
Can you give us a high-level overview about what "the trick" is? I.e., which approach did you take to make this possible?
Ehm, sorry, I do not really know where to start. Could you be a bit more specific with your question?
Well, I was hoping you could tell us where to start. After all you know more about this stuff than we do :-)
A probably too condensed overview is given in the technical section of the README: http://github.com/smarr/RoarVM/blob/master/README.rst
Any chance you could make versions of the referenced papers available? This would surely help.
cheers, - Andreas
http://www.duke.edu/~jd135/papers/ has copies of the two papers referenced in the readme.
Cheers, -- John
Hi Andreas:
On 03 Nov 2010, at 18:39, Andreas Raab wrote:
On 11/3/2010 10:27 AM, Stefan Marr wrote:
On 03 Nov 2010, at 17:57, Andreas Raab wrote:
The source code of the RoarVM has been released as open source to enable the Smalltalk community to evaluate the ideas and possibly integrate them into existing systems. So, the RoarVM is meant to experiment with Smalltalk systems on multi- and manycore machines.
Can you give us a high-level overview about what "the trick" is? I.e., which approach did you take to make this possible?
Ehm, sorry, I do not really know where to start. Could you be a bit more specific with your question?
Well, I was hoping you could tell us where to start. After all you know more about this stuff than we do :-)
Partially, the question is also what degree of detail do you want?
The VM runs one interpreter per core. At the moment, you decide on the number of cores you want to use at startup.
Once the interpreters are initialized they try to grab some work. For this, there is the standard scheduler, just protected by a mutex. As soon as an interpreter gets hold of a process for execution it starts to execute it. And then there are the normal points when it could reevaluate the scheduling decision.
Other than that, there are some important simplifications in the VM. For instance it uses an object table to be able to move objects around easily. The GC uses a simple compacting mark/sweep stop-the-world algorithm. Many of the design decisions are made with respect to the manycore architecture we are running the VM on. So, they are not necessarily optimal for today's x86 multicore systems. For instance, the interpreters pass messages instead of using shared memory for certain information that needs to be kept consistent.
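To illustrate why an object table makes moving objects easy (a hypothetical sketch, not RoarVM's actual representation): an oop is an index into the table, so relocating an object's body only updates one table slot, and every oop held elsewhere stays valid. The cost is one extra indirection on each access.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 1024
typedef unsigned oop_t;                   /* an oop is just a table index */
static void *object_table[TABLE_SIZE];    /* slot -> current body address */

/* Every access pays one extra indirection through the table. */
void *obj_body(oop_t oop) { return object_table[oop]; }

/* Moving an object (e.g. into another core's memory region) copies the
 * body and updates a single table slot; no oops elsewhere need rewriting. */
void obj_move(oop_t oop, void *new_body, size_t nbytes) {
    memcpy(new_body, object_table[oop], nbytes);
    object_table[oop] = new_body;
}
```

The same indirection is what makes the compacting GC simpler, and also part of why plain sends are slower than on a direct-pointer VM.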
So, well, I hope that gives you some points you could ask more detailed questions about :)
A probably a bit too condensed overview is given in the technical section of the README. http://github.com/smarr/RoarVM/blob/master/README.rst
Any chance you could make versions of the referenced papers available? This would surely help.
I asked David.
Best regards Stefan
The "trick" seems to me to have been lots of blood, sweat, and tears. Debugging this thing has been some of the toughest work I've tackled at times, and I bet Stefan would agree. A separate interpreter per core, a common address space for the objects, separate memory areas for each core, each object freely referencing any other object, an object table to make it easy to move objects from core to core, a straightforward extension of ProcessorScheduler, and a too-simple-at-the-moment garbage collector.
Lots of interesting bits in the safepoint and messaging systems.
- David
On 03.11.2010, at 14:13, Stefan Marr wrote:
A small teaser:
1 core:  66286897 bytecodes/sec; 2910474 sends/sec
8 cores: 470588235 bytecodes/sec; 19825677 sends/sec
I tried your precompiled OS X VM and the Sly3 image.
1 core:  93,910,491 bytecodes/sec; 4,056,440 sends/sec
2 cores: 91,559,370 bytecodes/sec; 4,007,927 sends/sec
3 cores: can't start
4 cores: 90,844,570 bytecodes/sec; 3,935,516 sends/sec
5 cores: can't start
6 cores: can't start
7 cores: can't start
8 cores: 89,698,668 bytecodes/sec; 3,910,787 sends/sec
So it looks like you have to use a power-of-two cores?
And the benchmark invocation should be different if you want to actually use multiple cores. What's the magic incantation?
I tried something myself:
n := 16.
q := SharedQueue new.
time := Time millisecondsToRun: [
    n timesRepeat: [[q nextPut: [30 benchFib] timeToRun] fork].
    n timesRepeat: [Transcript space; show: q next]].
Transcript space; show: time; cr
1 core:  664 664 665 666 667 662 664 664 668 665 667 665 666 669 666 10700
2 cores: 675 674 672 669 677 669 669 672 678 670 668 669 674 668 668 5425
4 cores: 721 726 729 740 713 728 740 734 731 737 721 737 734 756 788 749 3030
8 cores: 786 807 837 847 865 872 916 840 800 873 792 880 846 865 829 1820
Now that scales pretty nicely :) The overhead is about 25% at 8 cores, 12% for 4 cores.
For our regular interpreter (*) I get: 1 core: 162 159 157 158 158 160 159 159 159 159 159 158 160 158 159 2585
So RoarVM is about 4 times slower in sends, and even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. The good news is that it can beat the old interpreter :) But why is it so much slower than the normal interpreter?
Btw, user interrupt didn't work on the Mac.
And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
- Bert -
(*) For comparison, a regular interpreter (not Cog) on this machine gets 789,514,263 bytecodes/sec; 17,199,374 sends/sec and Cog does 880,481,513 bytecodes/sec; 70,113,306 sends/sec
Hi Bert:
On 04 Nov 2010, at 19:07, Bert Freudenberg wrote:
So it looks like you have to use a power-of-two cores?
Yes, that is right. At the moment, the system isn't able to handle other numbers of cores.
And the benchmark invocation should be different if you want to actually use multiple cores. What's the magic incantation?
The code I used to generate the numbers isn't actually in any image yet. I pasted it below for reference; it's just a quick hack to get a parallel tinyBenchmarks version.
So RoarVM is about 4 times slower in sends, and even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. The good news is that it can beat the old interpreter :) But why is it so much slower than the normal interpreter?
Well, on the one hand, we don't use things like the GCC labels-as-values extension for threaded interpretation, which should help quite a bit. Then, the current implementation based on pthreads is quite a bit slower than our version which uses plain Unix processes. The GC is really not state of the art. And all that adds up rather quickly, I suppose...
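The labels-as-values technique Stefan mentions looks like this in miniature (a toy bytecode set of my own invention, not RoarVM's; requires GCC or Clang): each handler fetches the next opcode and jumps straight to its handler through a table of label addresses, removing the shared dispatch branch of a switch-in-a-loop interpreter.

```c
#include <assert.h>

enum { OP_PUSH1, OP_ADD, OP_HALT };

/* Threaded dispatch via the GCC/Clang labels-as-values extension:
 * && takes the address of a label, and "goto *" jumps to it. */
int interpret(const unsigned char *code) {
    static void *dispatch[] = { &&op_push1, &&op_add, &&op_halt };
    int stack[16], *sp = stack;
    const unsigned char *ip = code;
#define NEXT goto *dispatch[*ip++]
    NEXT;
op_push1: *sp++ = 1; NEXT;          /* push the constant 1 */
op_add:   --sp; sp[-1] += *sp; NEXT; /* pop two, push the sum */
op_halt:  return sp[-1];             /* result is on top of the stack */
#undef NEXT
}
```

Because every handler ends with its own indirect jump, the branch predictor sees one branch per opcode instead of one shared one, which is where the speedup comes from.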
Btw, user interrupt didn't work on the Mac.
Cmd+. ? Works for me ;) Well, can you be a bit more specific? In which situation did it not work?
And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
Yes, same here. Sorry. Any hints where to start looking to fix such issues are appreciated.
Best regards Stefan
My tiny Benchmarks:
!Integer methodsFor: 'benchmarks' stamp: 'sm 10/11/2010 22:30'!
tinyBenchmarksParallel16Processes
    "Report the results of running the two tiny Squeak benchmarks.
     ar 9/10/1999: Adjusted to run at least 1 sec to get more stable results"
    "0 tinyBenchmarks"
    "On a 292 MHz G3 Mac: 22727272 bytecodes/sec; 984169 sends/sec"
    "On a 400 MHz PII/Win98: 18028169 bytecodes/sec; 1081272 sends/sec"
    | t1 t2 r n1 n2 |
    n1 := 1.
    [t1 := Time millisecondsToRun: [n1 benchmark]. t1 < 1000]
        whileTrue: [n1 := n1 * 2]. "Note: #benchmark's runtime is about O(n)"
    "now n1 is the value for which we do the measurement"
    t1 := Time millisecondsToRun: [self run: #benchmark on: n1 times: 16].

    n2 := 28.
    [t2 := Time millisecondsToRun: [r := n2 benchFib]. t2 < 1000]
        whileTrue: [n2 := n2 + 1].
    "Note: #benchFib's runtime is about O(k^n), where k is the golden number = (1 + 5 sqrt) / 2 = 1.618...."
    "now we have our target n2 and r value. lets take the time for it"
    t2 := Time millisecondsToRun: [self run: #benchFib on: n2 times: 16].

    ^ { ((n1 * 16 * 500000 * 1000) // t1). " printString, ' bytecodes/sec; ',"
        ((r * 16 * 1000) // t2) " printString, ' sends/sec'" }! !

!Integer methodsFor: 'benchmarks' stamp: 'sm 10/11/2010 22:29'!
run: aSymbol on: aReceiver times: nTimes
    | mtx sig n |
    mtx := Semaphore forMutualExclusion.
    sig := Semaphore new.
    n := nTimes.
    nTimes timesRepeat: [
        [aReceiver perform: aSymbol.
         mtx critical: [
             n := n - 1.
             n == 0 ifTrue: [sig signal]]] fork].
    sig wait.! !
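For non-Smalltalkers: the fork/join trick in run:on:times: above (decrement a counter under a mutex; the last worker signals the waiter) is the same pattern as this C sketch (illustrative names, not RoarVM code):

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t mtx;   /* plays the role of mtx */
    pthread_cond_t  done;  /* plays the role of sig */
    int remaining;         /* plays the role of n  */
} join_counter;

void join_counter_init(join_counter *j, int n) {
    pthread_mutex_init(&j->mtx, NULL);
    pthread_cond_init(&j->done, NULL);
    j->remaining = n;
}

/* Each worker calls this when its benchmark run finishes. */
void join_counter_signal(join_counter *j) {
    pthread_mutex_lock(&j->mtx);
    if (--j->remaining == 0) pthread_cond_signal(&j->done);
    pthread_mutex_unlock(&j->mtx);
}

/* The forking process blocks here until all workers are done. */
void join_counter_wait(join_counter *j) {
    pthread_mutex_lock(&j->mtx);
    while (j->remaining > 0) pthread_cond_wait(&j->done, &j->mtx);
    pthread_mutex_unlock(&j->mtx);
}
```

The wait loop rechecks the counter, so it also copes with spurious wakeups, which the Smalltalk semaphore version does not need to worry about.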
On 04.11.2010, at 19:36, Stefan Marr wrote:
On 04 Nov 2010, at 19:07, Bert Freudenberg wrote:
So it looks like you have to use a power-of-two cores?
Yes, that is right. At the moment, the system isn't able to handle other numbers of cores.
And the benchmark invocation should be different if you want to actually use multiple cores. What's the magic incantation?
The code I used to generate the numbers isn't actually in any image yet. I pasted it below for reference, its just a quick hack to have a parallel tinyBenchmarks version.
So RoarVM is about 4 times slower in sends, even more so for bytecodes. It needs 8 cores to be faster the regular interpreter on a single core. To the good news is that it can beat the old interpreter :) But why is it so much slower than the normal interpreter?
Well, on the one hand, we don't use things like the GCC labels-as-values extension for threaded interpretation, which should help quite a bit. Then, the current implementation based on pthreads is quite a bit slower than our version which uses plain Unix processes. The GC is really not state of the art. And all that adds up rather quickly, I suppose...
Hmm, that doesn't sound like it should make it 4x slower ...
Btw, user interrupt didn't work on the Mac.
Cmd+. ? Works for me ;) Well, can you be a bit more specific? In which situation did it not work?
I was doing the equivalent of
SharedQueue new next
and that seems not interruptible. Also, when there are multiple processes, closing the window does not quit all processes, and even Ctrl-C did not quit the VM.
And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
Yes, same here. Sorry. Any hints where to start looking to fix such issues are appreciated.
Hmm. There are long freezes of many seconds and I would have no idea where to start even ...
- Bert -
Hi Bert:
On 04 Nov 2010, at 20:20, Bert Freudenberg wrote:
So RoarVM is about 4 times slower in sends, and even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. The good news is that it can beat the old interpreter :) But why is it so much slower than the normal interpreter?
Well, on the one hand, we don't use things like the GCC labels-as-values extension for threaded interpretation, which should help quite a bit. Then, the current implementation based on pthreads is quite a bit slower than our version which uses plain Unix processes. The GC is really not state of the art. And all that adds up rather quickly, I suppose...
Hmm, that doesn't sound like it should make it 4x slower ...
Do you know some numbers for the switch/case-based vs. the threaded version on the standard VM? How much do you typically gain by it?
One thing I forgot to mention in this context is the object table we use. That is also something which is not exactly making the VM faster.
Btw, user interrupt didn't work on the Mac.
Cmd+. ? Works for me ;) Well, can you be a bit more specific? In which situation did it not work?
I was doing the equivalent of
SharedQueue new next
and that seems not interruptable. Also, when there are multiple processes, closing the window does not quit all processes, and even ctrl-c did not quit the VM.
Closing the window - how does that relate to processes? Do you mean a window inside the image? Per se, processes are not really owned or managed, so there is also nobody to kill them. However, I am not sure what you are referring to exactly.
Ctrl-C doesn't work, that's true. However, closing the X11 window usually does the trick ;)
And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
Yes, same here. Sorry. Any hints where to start looking to fix such issues are appreciated.
Hmm. There are long freezes of many seconds and I would have no idea where to start even ...
Ok, several seconds, hm. That does not really sound like the GC pauses I see. But I haven't used the Squeak image a lot myself on the RoarVM. I was more thinking in the direction of what kind of tricks are currently pulled to make things fast. Perhaps, the X11 interface is already not the fastest compared to the standard approach, or there are some plugins in the VM which help performance but aren't included in RoarVM yet. I have never looked at the Squeak VM code myself, so I don't know a lot about what is actually done there.
Best regards Stefan
- Bert -
On 04.11.2010, at 21:18, Stefan Marr wrote:
Hi Bert:
On 04 Nov 2010, at 20:20, Bert Freudenberg wrote:
So RoarVM is about 4 times slower in sends, and even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. The good news is that it can beat the old interpreter :) But why is it so much slower than the normal interpreter?
Well, on the one hand, we don't use things like the GCC labels-as-values extension for threaded interpretation, which should help quite a bit. Then, the current implementation based on pthreads is quite a bit slower than our version which uses plain Unix processes.
Wait. What do you mean by "current version" vs. "our version"?
The GC is really not state of the art. And all that adds up rather quickly I suppose...
Hmm, that doesn't sound like it should make it 4x slower ...
Do you know some numbers for the switch/case-based vs. the threaded version on the standard VM? How much do you typically gain by it?
I don't really remember but it was well below 50%, more like 10%-20% I think.
One thing I forgot to mention in this context is the object table we use. That is also something which is not exactly making the VM faster.
Ah, yes. That could make quite a difference. You wouldn't be calling a function for each object access though I hope?
Btw, user interrupt didn't work on the Mac.
Cmd+. ? Works for me ;) Well, can you be a bit more specific? In which situation did it not work?
I was doing the equivalent of
SharedQueue new next
and that seems not interruptable. Also, when there are multiple processes, closing the window does not quit all processes, and even ctrl-c did not quit the VM.
Closing the window, how does that relate to processes? You mean a window inside the image? Per se, processes are not really owned or managed, so there is also nobody kill processes. However, I am not sure what you are referring to exactly.
The X11 window.
Ctrl-C doesn't work, that's true. However, closing the X11 window usually does the trick ;)
Well, it didn't. At least not immediately. It took several seconds and Ctrl-C's until I got my prompt back.
And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
Yes, same here. Sorry. Any hints where to start looking to fix such issues are appreciated.
Hmm. There are long freezes of many seconds and I would have no idea where to start even ...
Ok, several seconds, hm. That does not really sound like the GC pauses I see. But I haven't used the Squeak image a lot myself on the RoarVM. I was more thinking in the direction of what kind of tricks are currently pulled to make things fast. Perhaps, the X11 interface is already not the fastest compared to the standard approach, or there are some plugins in the VM which help performance but aren't included in RoarVM yet. I have never looked at the Squeak VM code myself, so I don't know a lot about what is actually done there.
Best regards Stefan
I guess some profiling is in order ...
- Bert -
On performance: we have done very little to tune it. Please feel free to pitch in! I suspect you might find some tasty, low-hanging fruit.
- David
Hi Bert:
On 04 Nov 2010, at 21:49, Bert Freudenberg wrote:
Then, the current implementation based on pthreads is quite a bit slower than our version which uses plain Unix processes.
Wait. What do you mean by "current version" vs. "our version"?
Just imprecise wording, mixed with my wish to actually experiment with that issue. On the Tilera, we are using the process model with Tilera-specific libraries. On x86 systems, we are using threads. However, that makes things quite a bit slower, and when I find the time, I will implement the necessary cross-process shared-memory libraries to get rid of the threads and use processes like on the Tilera.
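The process model Stefan describes depends on sharing the heap across Unix processes. A minimal illustration of the mechanism (my sketch, not RoarVM code) is an anonymous shared mapping created before fork(): both parent and child then see the same bytes.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the value the parent reads after the child writes 42 into the
 * shared mapping; -1 if the mapping fails. */
long shared_heap_demo(void) {
    long *heap = mmap(NULL, sizeof *heap, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (heap == MAP_FAILED) return -1;
    *heap = 0;
    pid_t pid = fork();
    if (pid == 0) {        /* child "core": write into the shared heap */
        *heap = 42;
        _exit(0);
    }
    waitpid(pid, NULL, 0); /* parent: wait, then read the child's write */
    long v = *heap;
    munmap(heap, sizeof *heap);
    return v;
}
```

Unlike threads, each process keeps its own private globals and stacks for free; only what is mapped shared is shared, which is closer to the Tilera setup.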
The GC is really not state of the art. And all that adds up rather quickly I suppose...
Hmm, that doesn't sound like it should make it 4x slower ...
Do you know some numbers for the switch/case-based vs. the threaded version on the standard VM? How much do you typically gain by it?
I don't really remember but it was well below 50%, more like 10%-20% I think.
Hm, ok, that doesn't sound like something I want to spend time on then.
One thing I forgot to mention in this context is the object table we use. That is also something which is not exactly making the VM faster.
Ah, yes. That could make quite a difference. You wouldn't be calling a function for each object access though I hope?
Well, depends on what the compiler is doing with our code. It is supposed to inline, and there are no virtual calls...
Best regards Stefan
On Thu, 4 Nov 2010, Stefan Marr wrote:
Hi Bert:
On 04 Nov 2010, at 20:20, Bert Freudenberg wrote:
So RoarVM is about 4 times slower in sends, and even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. The good news is that it can beat the old interpreter :) But why is it so much slower than the normal interpreter?
Well, on the one hand, we don't use things like the GCC labels-as-values extension for threaded interpretation, which should help quite a bit. Then, the current implementation based on pthreads is quite a bit slower than our version which uses plain Unix processes. The GC is really not state of the art. And all that adds up rather quickly, I suppose...
Hmm, that doesn't sound like it should make it 4x slower ...
Do you know some numbers for the switch/case-based vs. the threaded version on the standard VM? How much do you typically gain by it?
If threaded means gnuified (jump table instead of the linear search), then it gives ~2x speedup for the standard SqueakVM.
Levente
On 6 November 2010 17:26, Levente Uzonyi leves@elte.hu wrote:
If threaded means gnuified (jump table instead of the linear search), then it gives ~2x speedup for the standard SqueakVM.
to my own experience it gives 30%
On Sat, 6 Nov 2010, Igor Stasenko wrote:
to my own experience it gives 30%
Right, it depends on what we take into account. According to this mail: http://lists.squeakfoundation.org/pipermail/vm-dev/2010-January/003761.html tinyBenchmarks gives '248543689 bytecodes/sec; 8117987 sends/sec' without gnuification and '411244979 bytecodes/sec; 10560900 sends/sec' with gnuification.
These aren't fully optimized VMs, so the difference may be smaller or larger with better optimizations. Anyway in this case in terms of bytecodes the difference is 65%, for sends it's 30%. So the general speedup is not 2x, but it's not 30% either.
The actual performance difference may be greater depending on the used bytecodes (tinyBenchmarks uses only a few) and the compiler's capabilities. Btw I wonder why gcc can't compile switch statements like this to jump tables by itself without gnuification.
Levente
2010/11/6 Levente Uzonyi leves@elte.hu:
On Sat, 6 Nov 2010, Igor Stasenko wrote:
On 6 November 2010 17:26, Levente Uzonyi leves@elte.hu wrote:
On Thu, 4 Nov 2010, Stefan Marr wrote:
Hi Bert:
On 04 Nov 2010, at 20:20, Bert Freudenberg wrote:
> So RoarVM is about 4 times slower in sends, even more so for bytecodes. > It needs 8 cores to be faster the regular interpreter on a single core. To > the good news is that it can beat the old interpreter :) But why is it so > much slower than the normal interpreter?
Well, one the one hand, we don't use stuff like the GCC label-as-value extension to have threaded-interpretation, which should help quite a bit. Then, the current implementation based on pthreads is quite a bit slower then our version which uses plain Unix processes. The GC is really not state of the art. And all that adds up rather quickly I suppose...
Hmm, that doesn't sound like it should make it 4x slower ...
Do you know some numbers for the switch/case-based vs. the threaded version on the standard VM? How much do you typically gain by it?
If threaded means gnuified (jump table instead of the linear search), then it gives ~2x speedup for the standard SqueakVM.
to my own experience it gives 30%
Right, it depends on what we take into account. According to this mail: http://lists.squeakfoundation.org/pipermail/vm-dev/2010-January/003761.html tinyBenchmarks gives '248543689 bytecodes/sec; 8117987 sends/sec' without gnuification and '411244979 bytecodes/sec; 10560900 sends/sec' with gnuification.
These aren't fully optimized VMs, so the difference may be smaller or larger with better optimizations. Anyway in this case in terms of bytecodes the difference is 65%, for sends it's 30%. So the general speedup is not 2x, but it's not 30% either.
The actual performance difference may be greater depending on the used bytecodes (tinyBenchmarks uses only a few) and the compiler's capabilities. Btw I wonder why gcc can't compile switch statements like this to jump tables by itself without gnuification.
Maybe because C sucks? :) But seriously, if you look at the numerous places where switch statements are used, only a few of them have an ordered set of cases, and of those, only a few have a power-of-two number of cases. So it's not worth the effort.
Another way would be to use a function table, but then the compiler would have to inline all the functions in that table, which is also non-trivial from the compiler's perspective, since the call is indirect.
-- Best regards, Igor Stasenko AKA sig.
On Sat, 6 Nov 2010, Igor Stasenko wrote:
Maybe because C sucks? :) But if seriously, if you look into numerous cases where switch statement used, only few of them would have an ordered set of cases, and from them, only few will have a power of two number of cases. So, its not worth the effort.
When I learned C, my teacher said that the switch statement was designed to be compiled to jump tables in most cases. I was a bit disappointed when I realized that most compilers don't even try to do it.
I guess you're wrong about the jump table implementation. Here's a simple example:
switch (i) { case 0: ...; case 1: ...; case 2: ...; default: ...; }
It doesn't have power of two size, but it still can be implemented as a jump table. You just need an "array", that has 3 slots with the pointers of the instructions of the cases. The mapping from keys to indices is simply identity. You just check the bounds and jump to the "default" address if the key is not in the range of the array (between 0 and 2 in this case). If there are "holes" in the table, say:
switch (i) { case 0: ...; case 1: ...; case 3: ...; default :...; }
then the slot at index 2 will contain the address of the "default" case.
If the keys have a different offset, or are multiplied by a number, then the key to index mapping is a bit more complicated:
switch (i) { case 28: ...; case 20: ...; case 16: ...; default: ...; }
The mapping is (key >> 2) - 4. Also note that the cases are not ordered, but that doesn't matter. The array has slots for the keys 16, 20, 24 and 28, with the slot for the missing key 24 pointing to the "default" case.
I think these cases should be detected by the compiler. Also compilers could generate perfect hash tables for all other cases to achieve similar performance.
Levente
On 6 November 2010 20:30, Levente Uzonyi leves@elte.hu wrote:
I think these cases should be detected by the compiler. Also compilers could generate perfect hash tables for all other cases to achieve similar performance.
I suppose it's related to the fact that with a small number of cases (up to 16 or so) it's not worth using an indirect jump, because with branch prediction you can achieve the same or even better performance. That's why I think modern compilers don't use jump tables. Having more than 16 cases is an edge case, not to mention 256.
Btw, I was also thinking that compilers generate jump tables for case statements.
-- Best regards, Igor Stasenko AKA sig.
On Sat, 6 Nov 2010, Igor Stasenko wrote:
I expect a compiler to generate the fastest possible code if I ask for that. I think we agree that gcc doesn't do this.
Levente
The other cheerful thing the switch statement does is put in a range check. This causes a range check for each bytecode lookup unless you gnuify. 15 years back we actually altered the 68K binary to no-op out the range check, which improved performance by a measurable amount.
On 2010-11-06, at 1:42 PM, Levente Uzonyi wrote:
-- =========================================================================== John M. McIntosh johnmci@smalltalkconsulting.com Twitter: squeaker68882 Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com ===========================================================================
Hello:
On 06 Nov 2010, at 21:42, Levente Uzonyi wrote:
I expect a compiler to generate the fastest possible code, if I ask for that. I think we agree, that gcc doesn't do this.
I don't really follow this discussion. Are you assuming that GCC is not using jump tables? Well, if I read the assembly code correctly, it does. See below.
However, Levente seemed to assume that using jump tables is the same as threaded code. But no, threaded code is something different: http://en.wikipedia.org/wiki/Threaded_code Michael Haupt has a few nice slides visualizing what threaded code is about :) The idea is that you try to avoid the jump back to the switch/case and chain the bytecode implementations for a given method.
Best regards Stefan
void Squeak_Interpreter::dispatch(u_char currentByte) {
  if (Check_Prefetch)
    have_executed_currentBytecode = true;
  switch (currentByte) {
  case  0: case  1: case  2: case  3: case  4: case  5: case  6: case  7:
  case  8: case  9: case 10: case 11: case 12: case 13: case 14: case 15:
    pushReceiverVariableBytecode();
    break;
  case 16: case 17: case 18: case 19: case 20: case 21: case 22: case 23:
  case 24: case 25: case 26: case 27: case 28: case 29: case 30: case 31:
    pushTemporaryVariableBytecode();
    break;
GCC 4.2 -O3
0x00013b50 <+0000> push %ebp
0x00013b51 <+0001> mov %esp,%ebp
0x00013b53 <+0003> mov 0x8(%ebp),%edx
0x00013b56 <+0006> movb $0x1,0x4171(%edx)
0x00013b5d <+0013> movzbl 0xc(%ebp),%eax
0x00013b61 <+0017> jmp *0xc32ec(,%eax,4)
0x00013b68 <+0024> mov %edx,0x8(%ebp)
0x00013b6b <+0027> leave
0x00013b6c <+0028> jmp 0x42d50 <_ZN18Squeak_Interpreter15unknownBytecodeEv>
0x00013b71 <+0033> mov %edx,0x8(%ebp)
0x00013b74 <+0036> leave
0x00013b75 <+0037> jmp 0x46e10 <_ZN18Squeak_Interpreter20pushNewArrayBytecodeEv>
0x00013b7a <+0042> mov %edx,0x8(%ebp)
0x00013b7d <+0045> leave
0x00013b7e <+0046> jmp 0x43950 <_ZN18Squeak_Interpreter25pushActiveContextBytecodeEv>
0x00013b83 <+0051> mov %edx,0x8(%ebp)
0x00013b86 <+0054> leave
0x00013b87 <+0055> jmp 0x44450 <_ZN18Squeak_Interpreter20duplicateTopBytecodeEv>
0x00013b8c <+0060> mov %edx,0x8(%ebp)
0x00013b8f <+0063> leave
0x00013b90 <+0064> jmp 0x42ec0 <_ZN18Squeak_Interpreter16popStackBytecodeEv>
0x00013b95 <+0069> mov %edx,0x8(%ebp)
0x00013b98 <+0072> leave
0x00013b99 <+0073> jmp 0x45f50 <_ZN18Squeak_Interpreter26secondExtendedSendBytecodeEv>
0x00013b9e <+0078> mov %edx,0x8(%ebp)
0x00013ba1 <+0081> leave
On Nov 7, 2010, at 06:02 , Stefan Marr wrote:
Are you assuming that GCC is not using jumptables? Well, if I read the assembly code correctly, it does. See below.
GCC does produce jump tables for switch() whenever it makes sense to. The conditions & solution we need are more complex.
Given:
1. an integer driving a switch that is guaranteed to be in a known range (in this case 8-bit unsigned);
2. every case ends with a break to the end of the switch statement;
3. the switch is inside an infinite loop that generates the next integer for the switch,
then the compiler could decide to:
1. remove the range check;
2. turn off common subexpression elimination (a useless extra jump in many bytecodes);
3. combine the head of the loop (fetch) with the switch's indirect jump through the table;
4. distribute the combined fetch+jump over the entire switch as the final step in each branch (another useless jump in every bytecode).
That's an awful lot to ask of an optimiser for a general-purpose programming language.
On Nov 6, 2010, at 2:05 PM, Igor Stasenko siguctua@gmail.com wrote:
In an ideal world it would all be implemented in Smalltalk (or something similar) and dynamically translated directly into machine code.
Agreed. Critical optimisations should be part of the program specification, not the compiler implementation.
Michael Haupt has a few nice slides visualizing what threaded code is about :)
A short explanation is here, too: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.513&rep=rep1...
Cheers, Ian
Hi,
On 06.11.2010, at 22:45, Ian Piumarta piumarta@speakeasy.net wrote:
A short explanation is here, too: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.513&rep=rep1...
one of the sources extensively used in the slides. ;-)
Best,
Michael
Back when I did Berkeley Smalltalk (early 1980's), I fretted a lot about incremental performance improvements to the interpreter. Every few days I was able to get another few percent, as I recall. Then Peter and Alan did the first OO JIT ([Deutsch-Schiffman '84]) and it was so much faster than BS that it became clear to me that I didn't want to spend any more of my time optimizing an interpreter when I could be adding a simple compiler instead. But that's just my own experience.
- David
Hi Dave,
On 07.11.2010, at 02:47, ungar@mac.com wrote:
sure! Those interpreter considerations are just one part of the slides ... JIT compilers are there as well.
One interesting observation that students made with their threading implementations is that it did not significantly improve performance: today's branch predictors seem to be too advanced to be impressed by threading. (They measured this using HPM.) Of course, the VM used in that course is really simple, and also has a very simple and small bytecode set.
Best,
Michael
Michael,
Yup. I was mainly trying to save folks who might invest a lot of work in threading, etc., from repeating my own mistakes.
Interesting about today's branch predictors!
Thank you,
- David
Hi:
On 07 Nov 2010, at 08:06, Michael Haupt wrote:
today's branch predictors seem to be too advanced to be impressed by threading.
Just as a side note: the share of machines with CPUs that have awesome branch predictors is likely to decline. With all those low-powered systems and with manycore systems, the cost of spending die space/your power envelope on branch prediction, out-of-order execution, and the like does not seem to be worth it compared to just spending those transistors on additional cores. So, when you stay with an interpreter, these kinds of optimizations will remain interesting for sequential performance.
Best regards Stefan
On 07.11.2010, at 10:20, Stefan Marr wrote:
And given the current security paranoia, which leads to certain platforms preventing jitting (memory pages are writable *or* executable, but not both), interpreters might have to stay around for quite some time ...
- Bert -
Hold on... don't we run on 56 cores on Tilera?
On Nov 4, 2010, at 11:36 AM, Stefan Marr wrote:
Hi Bert:
On 04 Nov 2010, at 19:07, Bert Freudenberg wrote:
So it looks like you have to use a power-of-two number of cores?
Yes, that is right. At the moment, the system isn't able to handle other numbers of cores.
And the benchmark invocation should be different if you want to actually use multiple cores. What's the magic incantation?
The code I used to generate the numbers isn't actually in any image yet. I pasted it below for reference; it's just a quick hack at a parallel tinyBenchmarks version.
So RoarVM is about 4 times slower in sends, even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. The good news is that it can beat the old interpreter :) But why is it so much slower than the normal interpreter?
Well, on the one hand, we don't use stuff like the GCC labels-as-values extension to get a threaded interpretation, which should help quite a bit. Then, the current implementation based on pthreads is quite a bit slower than our version which uses plain Unix processes. The GC is really not state of the art. And all that adds up rather quickly, I suppose...
Btw, user interrupt didn't work on the Mac.
Cmd+. ? Works for me ;) Well, can you be a bit more specific? In which situation did it not work?
And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
Yes, same here. Sorry. Any hints where to start looking to fix such issues are appreciated.
Best regards Stefan
My tiny Benchmarks:
!Integer methodsFor: 'benchmarks' stamp: 'sm 10/11/2010 22:30'!
tinyBenchmarksParallel16Processes
	"Report the results of running the two tiny Squeak benchmarks.
	 ar 9/10/1999: Adjusted to run at least 1 sec to get more stable results"
	"0 tinyBenchmarks"
	"On a 292 MHz G3 Mac: 22727272 bytecodes/sec; 984169 sends/sec"
	"On a 400 MHz PII/Win98: 18028169 bytecodes/sec; 1081272 sends/sec"
	| t1 t2 r n1 n2 |
	n1 := 1.
	[t1 := Time millisecondsToRun: [n1 benchmark]. t1 < 1000]
		whileTrue: [n1 := n1 * 2]. "Note: #benchmark's runtime is about O(n)"

	"now n1 is the value for which we do the measurement"
	t1 := Time millisecondsToRun: [self run: #benchmark on: n1 times: 16].

	n2 := 28.
	[t2 := Time millisecondsToRun: [r := n2 benchFib]. t2 < 1000]
		whileTrue: [n2 := n2 + 1].
	"Note: #benchFib's runtime is about O(k^n), where k is the golden number = (1 + 5 sqrt) / 2 = 1.618...."
	"now we have our target n2 and r value. lets take the time for it"
	t2 := Time millisecondsToRun: [self run: #benchFib on: n2 times: 16].

	^ { ((n1 * 16 * 500000 * 1000) // t1). "printString, ' bytecodes/sec; '"
	    ((r * 16 * 1000) // t2) "printString, ' sends/sec'" }! !

!Integer methodsFor: 'benchmarks' stamp: 'sm 10/11/2010 22:29'!
run: aSymbol on: aReceiver times: nTimes
	| mtx sig n |
	mtx := Semaphore forMutualExclusion.
	sig := Semaphore new.
	n := nTimes.
	nTimes timesRepeat: [
		[aReceiver perform: aSymbol.
		 mtx critical: [
			n := n - 1.
			(n == 0) ifTrue: [sig signal]]] fork].
	sig wait.! !
-- Stefan Marr Software Languages Lab Vrije Universiteit Brussel Pleinlaan 2 / B-1050 Brussels / Belgium http://soft.vub.ac.be/~smarr Phone: +32 2 629 2974 Fax: +32 2 629 3525
Hi David:
On 05 Nov 2010, at 05:07, ungar@mac.com wrote:
So it looks like you have to use a power-of-two number of cores?
Yes, that is right. At the moment, the system isn't able to handle other numbers of cores.
Hold on... don't we run on 56 cores on Tilera?
Yes, however, on OSX and standard Linux there are problems with certain configurations. If I remember correctly mmap or something fails with certain kinds of requests. Something related to the number of cores.
I have it on my todo list.
Best regards Stefan
On 4 November 2010 20:07, Bert Freudenberg bert@freudenbergs.de wrote:
On 03.11.2010, at 14:13, Stefan Marr wrote:
A small teaser: 1 core 66286897 bytecodes/sec; 2910474 sends/sec 8 cores 470588235 bytecodes/sec; 19825677 sends/sec
I tried your precompiled OS X VM and the Sly3 image.
1 core:  93,910,491 bytecodes/sec; 4,056,440 sends/sec
2 cores: 91,559,370 bytecodes/sec; 4,007,927 sends/sec
3 cores: can't start
4 cores: 90,844,570 bytecodes/sec; 3,935,516 sends/sec
5 cores: can't start
6 cores: can't start
7 cores: can't start
8 cores: 89,698,668 bytecodes/sec; 3,910,787 sends/sec
So it looks like you have to use a power-of-two number of cores?
And the benchmark invocation should be different if you want to actually use multiple cores. What's the magic incantation?
I tried something myself:
n := 16.
q := SharedQueue new.
time := Time millisecondsToRun: [
	n timesRepeat: [[q nextPut: [30 benchFib] timeToRun] fork].
	n timesRepeat: [Transcript space; show: q next]].
Transcript space; show: time; cr
1 core:  664 664 665 666 667 662 664 664 668 665 667 665 666 669 666 10700
2 cores: 675 674 672 669 677 669 669 672 678 670 668 669 674 668 668 5425
4 cores: 721 726 729 740 713 728 740 734 731 737 721 737 734 756 788 749 3030
8 cores: 786 807 837 847 865 872 916 840 800 873 792 880 846 865 829 1820
Now that scales pretty nicely :) The overhead is about 25% at 8 cores, 12% for 4 cores.
I don't like this tendency: for 16 cores it will be 50%, and for 32, 100% :) That doesn't sound like 'designed for manycore systems'. But I suspect it's because the code you are running doesn't take the new VM capabilities into account.
For our regular interpreter (*) I get: 1 core: 162 159 157 158 158 160 159 159 159 159 159 158 160 158 159 2585
So RoarVM is about 4 times slower in sends, even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. The good news is that it can beat the old interpreter :) But why is it so much slower than the normal interpreter?
I would not care much about single-core performance for now. Once you have the potential of hundreds of cores at your disposal, you can even run things at a lower clock rate, because it doesn't really matter anymore.
And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
Not a surprise. An image and code which are not aware of the new VM capabilities usually win nothing, and even lose compared to the 'standard' VM.
HydraVM was able to run multiple interpreters in a single process space, with an overhead of 5-10% performance degradation. But given that you can run N interpreters in parallel, such a slowdown can be neglected.
- Bert -
(*) For comparison, a regular interpreter (not Cog) on this machine gets 789,514,263 bytecodes/sec; 17,199,374 sends/sec and Cog does 880,481,513 bytecodes/sec; 70,113,306 sends/sec
On Wednesday, November 3, 2010, Stefan Marr squeak@stefan-marr.de wrote:
What great news!
The main question that comes to mind for me is concurrency. Does this VM do anything special to preserve the concurrency semantics of Smalltalk processes scheduled on a single core? As I'm sure most people are aware, the existing Squeak library isn't written with thread safety in mind... and as such, even on a single core a naive implementation can have issues when executing with concurrent processes. The solution is generally to try to isolate the objects running in different, concurrent processes and use message passing and replication as needed. If this VM doesn't do anything in this regard, I would expect that it would present an even greater risk of issues stemming from concurrency, and it would make it all the more important to keep objects accessible in different processes cleanly separated.
Another question that comes to mind is how much of a performance hit you might see from the architecture trying to maintain cache consistency in the face of multiple processes simultaneously updating shared memory. Is that something you've found to be an issue? Is this something you would have to be careful about when crafting code? And, if it is a problem, is it something where you'd need to be concerned not just with shared objects, but also with shared pages (ie. would you need some measure of control over pages being updated from multiple concurrent processes to effectively deal with this issue)?
Lastly, could you summarize how the design of this VM differs from the designs of other multithreaded VMs that have been implemented in the past?
It's exciting to see this kind of innovation happening in the Squeak community!
-Stephen
Hello Stephen:
On 06 Nov 2010, at 18:49, Stephen Pair wrote:
The main question that comes to mind for me is concurrency. Does this VM do anything special to preserve the concurrency semantics of smalltalk processes scheduled on a single core? As I'm sure most people are aware, the existing squeak library isn't written with thread safety in mind...and as such, even on a single core a naive implementation can have issues when executing with concurrent processes.
The VM doesn't do anything; thus, you have to take care of ensuring the semantics you want yourself. The idea is to preserve the standard Smalltalk programming model without introducing new facilities. Thus, what you get with the RoarVM is what you had before, i.e., processes + mutexes/semaphores, and in addition to that your processes can be executed in parallel.
I agree that this is a very low-level programming model, but it is the right foundation for us to experiment with ideas like Ly and Sly. Ly aims to be a language in which you program by embracing non-determinism; the goal is to find ways to benefit from the parallelism and the natural non-determinism of the system.
The usual solution is to isolate the objects running in different concurrent processes and to use message passing and replication as needed. If this VM doesn't do anything in this regard, I would expect it to present an even greater risk of concurrency issues, making it all the more important to keep objects that are accessible from different processes cleanly separated.
There are several actor implementations for Smalltalk; I would guess you could get them working on the RoarVM. At the moment there isn't any special VM support for these programming models. However, there are ideas and plans to enable language designers to build such languages efficiently by providing VM support for a flexible notion of encapsulation.
Another question that comes to mind is how much of a performance hit you might see from the architecture trying to maintain cache coherence when multiple processes simultaneously update shared memory. Is that something you've found to be an issue?
Sure, that is a big issue. It is already problematic for performance on x86 systems with a few cores, and even worse on Tilera.
Just imagine a simple counter that needs to be updated atomically by 63 other cores... There is no way to make such a counter scale on any system. The only thing you can do is not use such a counter, and in 99% of cases you don't need it anyway.
As Igor pointed out, if you want performance, you will avoid shared mutable state. The solution for such a counter would be to keep core-local counters and to synchronize only to compute a global sum, and only when that is really necessary. But the optimal solution is very application-specific.
Is this something you would have to be careful about when crafting code? And, if it is a problem, is it something where you'd need to be concerned not just with shared objects, but also with shared pages (ie. would you need some measure of control over pages being updated from multiple concurrent processes to effectively deal with this issue)?
Locality is what I would call the problem, and together with the notion of encapsulation it is something I am currently looking into.
A brief description of what I am up to can be found here: http://soft.vub.ac.be/~smarr/2010/07/doctoral-symposium-at-splash-2010/ Or, in a more accessible form, here: http://soft.vub.ac.be/~smarr/2010/08/poster-at-splash10/
Lastly, could you summarize how the design of this VM differs from the designs of other multithreaded VMs that have been implemented in the past?
Compared to your standard JVM or .NET CLR, the RoarVM provides a similar programming model, but the implementation is designed for experimentation on the TILE architecture, and to reach that goal the VM is kept simple. The heap is divided into number_of_cores parts, each owned by a single core, and each part is split into a read-mostly and a read-write heap. That is an optimization for the TILE64 chip with its restricted caching scheme. A few more details have already been discussed in this thread.
Best regards, Stefan