Hi Dave,
I dug a bit deeper into this problem, and I found that it's been around for ages. I compared the two primitives in three older images. The first one is Squeak 4.2 running on Cog r2714:
[ 7432154326465436 digitCompare: 8432154326465436 ] bench. '6,880,000 per second.'
| order |
order := (0 to: 255) as: ByteArray.
[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench. '11,200,000 per second.'
The next one was an even older Cog VM running an image updated from 3.10.2. The VM didn't have the primitives required to fetch the VM information. The results were '8.22459348130374e6 per second.' and '1.19677724910036e7 per second.', respectively.
The last image was a closure-enabled 3.10.2 running on the classic Interpreter VM. The results were '3.911442911417717e6 per second.' and '4.84567866426715e6 per second.', respectively.
Since in all VMs the seemingly more complex code (primitiveCompareString with a subtraction) was quicker than the simpler code (primitiveDigitCompare), I suspect that LargeIntegersPlugin is compiled with less aggressive optimization than MiscPrimitivePlugin.
What's also interesting is that, based on these benchmarks, performance has gotten worse over the years. I think invoking a primitive carries a real cost in newer VMs.
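The algorithmic asymmetry is easy to model outside Smalltalk. The following Python sketch is my own illustrative re-implementation of the two inner loops (not the actual Slang code of either plugin); it shows that the collated compare does strictly more work per byte (one table lookup extra), so when it still wins, the build flags are a plausible suspect.

```python
def digit_compare(a: bytes, b: bytes) -> int:
    """Illustrative model of primitiveDigitCompare: plain byte-wise
    magnitude comparison, most significant digit first; returns -1/0/1."""
    if len(a) != len(b):
        return -1 if len(a) < len(b) else 1
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            return -1 if x < y else 1
    return 0

def compare_collated(a: bytes, b: bytes, order: bytes) -> int:
    """Illustrative model of primitiveCompareString: maps every byte
    through a collation table first; returns 1/2/3 for </=/>."""
    for x, y in zip(a, b):
        cx, cy = order[x], order[y]        # the extra lookup per byte
        if cx != cy:
            return 1 if cx < cy else 3
    if len(a) == len(b):
        return 2
    return 1 if len(a) < len(b) else 3

# LargePositiveIntegers store their digits little-endian, like digitAt:.
a = (7432154326465436).to_bytes(7, 'little')
b = (8432154326465436).to_bytes(7, 'little')
order = bytes(range(256))                  # identity table, as in the bench
print(digit_compare(a, b))                 # -1
print(compare_collated(a, b, order) - 2)   # -1 after the "- 2" adjustment
```

Both functions make the same scan with the same early exit; the collated one just adds a lookup, which is why one would expect it to be the slower of the two.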
Levente
On Thu, 14 Apr 2016, David T. Lewis wrote:
Oops, I just realized that my "interpreter VM" results were from a debugging VM with compiler optimization off (too many VMs, sorry). Here are the results for a more representative interpreter VM. I can't explain the variation, but one conclusion is clear: my suggested "optimization" in the inbox is a bad idea (and I will move it to the treated inbox).
"Base performance in the trunk level V3 image with interpreter VM"
Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 26669"
Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]]. "==> 25052"
Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 25275"

"Performance after adding LargePositiveInteger>>="
Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 59224"
Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]]. "==> 27824"
Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 44324"
Dave
On Wed, Apr 13, 2016 at 08:25:33PM -0400, David T. Lewis wrote:
Hi Levente,
I think you may be right. I repeated my test with an interpreter VM on a 32-bit image (with loop count smaller because the interpreter VM is slower than Spur). The change that I put in the inbox does not have any benefit for the interpreter VM:
a := 7432154326465436. "a big number"
b := a + 1. "low order digit changed"
c := 8432154326465436. "high order digit changed"
d := 7432154026465436. "a digit in the middle changed"
"Base performance in the trunk level V3 image with interpreter VM"
Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3844"
Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3786"
Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3800"

"Performance after adding LargePositiveInteger>>="
Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3868"
Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3775"
Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3770"
So yes, it is something related to the VM. But I do not understand how #primDigitCompare: could be so slow. As you say, maybe something is wrong with it.
Thank you, I am glad that I asked for a review :-)
Dave
On Thu, Apr 14, 2016 at 01:45:42AM +0200, Levente Uzonyi wrote:
Hi Dave,
I guess this is a VM-related issue. On 64-bit Spur,

[ 7432154326465436 digitCompare: 8432154326465436 ] bench.

returns '5,420,000 per second. 184 nanoseconds per run.'.
After removing the primitive call from #digitCompare:, the number goes up: '20,200,000 per second. 49.4 nanoseconds per run.'.
You might think that it's okay because the JIT is so good. But we have another primitive to compare two bytes objects, one which ought to be slower than #primDigitCompare:, because it maps the bytes before doing the comparison. I even subtracted two from its result to match the result of #digitCompare:
| order |
order := (0 to: 255) as: ByteArray.
[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench.
But it's still about twice as quick as #primDigitCompare:. '9,590,000 per second. 104 nanoseconds per run.'. So, something must be wrong with #primDigitCompare:.
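For readers who don't have an image at hand, here is a rough Python analogue of what Squeak's #bench reports (an assumption about its behavior: it runs the block repeatedly for a fixed wall-clock interval and then prints throughput plus per-run time, in the format quoted above):

```python
import time

def bench(block, seconds=0.2):
    """Rough analogue of #bench: run `block` for about `seconds` of
    wall-clock time, then report runs per second and time per run."""
    runs = 0
    start = time.perf_counter()
    deadline = start + seconds
    while time.perf_counter() < deadline:
        block()
        runs += 1
    elapsed = time.perf_counter() - start
    return "{:,.0f} per second. {:.3g} nanoseconds per run.".format(
        runs / elapsed, elapsed / runs * 1e9)

print(bench(lambda: 7432154326465436 == 8432154326465436))
```

The absolute numbers will of course differ from the Squeak figures; only the reporting shape is meant to match.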
Levente
On Wed, 13 Apr 2016, David T. Lewis wrote:
I would appreciate a review before moving this to trunk.
Background:
In UTCDateAndTime, most things are much faster than the trunk version. However, the DateAndTime equality check did not improve for 32-bit images, so I ran it under AndreasSystemProfiler. Profiling showed that large integer equality checks spend their time mostly in primDigitCompare, which is inefficient when only a simple byte comparison is needed.
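The shortcut being described can be sketched in Python (illustrative only; it treats a LargePositiveInteger as its little-endian byte digits, the way digitAt: exposes them): equality needs just a size check plus a byte scan with early exit, not a full three-way magnitude comparison.

```python
def large_int_equal(a_digits: bytes, b_digits: bytes) -> bool:
    """Equality fast path: size check, then byte-wise scan (high digit
    first, as in the inbox method) with early exit on any mismatch."""
    if len(a_digits) != len(b_digits):
        return False
    for i in range(len(a_digits) - 1, -1, -1):
        if a_digits[i] != b_digits[i]:
            return False
    return True

a = (7432154326465436).to_bytes(7, 'little')
b = (7432154326465437).to_bytes(7, 'little')   # low order digit changed
print(large_int_equal(a, a))   # True
print(large_int_equal(a, b))   # False
```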
Here is the performance difference that I see on my system, 32-bit trunk Spur on Linux:
| a b c d |
a := 7432154326465436. "a big number"
b := a + 1. "low order digit changed"
c := 8432154326465436. "high order digit changed"
d := 7432154026465436. "a digit in the middle changed"
"Base performance in the trunk image"
Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 63733"
Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 63152"
Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 63581"

"Performance after adding LargePositiveInteger>>="
Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 4676"
Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 4883"
Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 4512"
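As a sanity check, the speedup implied by these timings (milliseconds for 500,000,000 iterations) works out to roughly the factor claimed in the commit message:

```python
base  = {'a = b': 63733, 'a = c': 63152, 'a = d': 63581}   # trunk, ms
after = {'a = b': 4676,  'a = c': 4883,  'a = d': 4512}    # with LargePositiveInteger>>=
for k in base:
    print(k, round(base[k] / after[k], 1))   # 13.6, 12.9, 14.1
```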
Dave
On Wed, Apr 13, 2016 at 10:57:28PM +0000, commits@source.squeak.org wrote:
David T. Lewis uploaded a new version of Kernel to project The Inbox: http://source.squeak.org/inbox/Kernel-dtl.1015.mcz
==================== Summary ====================
Name: Kernel-dtl.1015
Author: dtl
Time: 13 April 2016, 6:57:22.56608 pm
UUID: bd849f91-9b00-45c5-b2ab-891b420bde5e
Ancestors: Kernel-mt.1014
Make large integer equality test be about 13 times faster. Implement #= in LargePositiveInteger, and use digitAt: (primitive 60) for the comparison.
=============== Diff against Kernel-mt.1014 ===============
Item was added:
+ ----- Method: LargePositiveInteger>>= (in category 'comparing') -----
+ = aNumber
+ 
+ 	aNumber class == self class ifTrue: [
+ 		aNumber size = self size ifFalse: [ ^false ].
+ 		self size to: 1 by: -1 do: [ :i | (aNumber digitAt: i) = (self digitAt: i) ifFalse: [ ^ false ] ].
+ 		^ true ].
+ 	aNumber isInteger ifTrue: [ ^false ].
+ 	aNumber isNumber ifFalse: [ ^false ].
+ 	^aNumber adaptToInteger: self andCompare: #=!
I don't know if the apparent slowdown in newer VMs is significant. It might be that some inefficiencies have crept into the primitive calling, but it may just be differences in the compilers or compiler optimization settings that were used when they were compiled.
Overall, I would expect that primitive calling will be faster in Cog, and primitive execution time should be reasonably similar in any version of the VM. The differences between Cog and classic interpreter suggest that the cost of calling a primitive (especially a small primitive) can be relatively high.
I have a feeling that we are reaching a point at which the jit optimization may exceed the performance of compiled primitives. It seems quite reasonable to me that a jitted method might outperform the compiled primitive version of the same method, because the jit version can completely avoid the overhead of setting up the call to the compiled primitive.
If that is so, then we may have primitives that actually slow the system down when running on Cog.
What if we find that some primitives are useful if and only if we are running without the jit? In that case, we might want to call the primitives when running on Cog, and not call them when running on a plain StackInterpreter.
This is probably a dumb idea, but I'll toss it out anyway. I wonder if there might be a way to cause certain primitives to be called only if they are needed for a non-jit VM? For example, in Squeak we have Environments, and we have pragmas. So I wonder if there might be some way to do something like this:
<primitive: 'primitiveFindSubstring' module: 'MiscPrimitivePlugin' callUnless: #cog>
Dave
On Thu, Apr 14, 2016 at 07:04:47PM +0200, Levente Uzonyi wrote:
Hi Dave,
On Apr 14, 2016, at 5:09 PM, David T. Lewis lewis@mail.msen.com wrote:
Actually, I think it's a good idea, but we don't need an additional keyword and state in primitive methods. The JIT VM could maintain a "do not call" list, or simply not include certain primitives. What we need are accurate measurements with and without; I can then update the "do-not-call" list.
One issue is that for simple invocations (e.g. invocations with short strings as arguments) the JIT code may indeed be faster, and hence the overhead of the call makes the no-primitive code win; but longer arguments will be favored by the simpler indexing code in the C primitive, which at a certain point will overtake the JIT. Once Sista arrives this shouldn't be an issue, and all of MiscPrimitivePlugin should hopefully be redundant.
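The crossover being described can be put into a toy cost model. All constants below are my own illustrative assumptions, not measurements; the point is only the shape: a fixed call overhead amortizes over longer arguments.

```python
CALL_OVERHEAD_NS = 100.0   # assumed fixed cost of invoking the C primitive
JIT_NS_PER_BYTE  = 3.0     # assumed per-byte cost of the jitted in-image loop
PRIM_NS_PER_BYTE = 1.0     # assumed per-byte cost inside the primitive

def jit_cost(n_bytes):
    return JIT_NS_PER_BYTE * n_bytes

def prim_cost(n_bytes):
    return CALL_OVERHEAD_NS + PRIM_NS_PER_BYTE * n_bytes

# The primitive starts winning once overhead < (jit - prim) * n:
crossover = CALL_OVERHEAD_NS / (JIT_NS_PER_BYTE - PRIM_NS_PER_BYTE)
print(crossover)                          # 50.0 bytes with these constants
print(jit_cost(8) < prim_cost(8))         # True: short arguments favor the jit
print(jit_cost(1000) > prim_cost(1000))   # True: long arguments favor the primitive
```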
_,,,^..^,,,_ (phone)
On Thu, Apr 14, 2016 at 08:01:41PM -0700, Eliot Miranda wrote:
Actually, I think it's a good idea, but we don't need an additional keyword and state in primitive methods. The JIT VM could maintain a "do not call" list, or simply not include certain primitives. What we need are accurate measurements with and without; I can then update the "do-not-call" list.
One issue is that for simple invocations (e.g. invocations with short strings as arguments) the JIT code may indeed be faster, and hence the overhead of the call makes the no-primitive code win; but longer arguments will be favored by the simpler indexing code in the C primitive, which at a certain point will overtake the JIT. Once Sista arrives this shouldn't be an issue, and all of MiscPrimitivePlugin should hopefully be redundant.
<meta> It is really quite remarkable that we could be having this discussion at all. An open source virtual machine for Smalltalk that is so fast that it is not even worth the trouble to call out to C primitives? Really?!? Wow. </meta>
Dave
On Fri, Apr 15, 2016 at 11:27 AM, David T. Lewis lewis@mail.msen.com wrote:
<meta> It is really quite remarkable that we could be having this discussion at all. An open source virtual machine for Smalltalk that is so fast that it is not even worth the trouble to call out to C primitives? Really?!? Wow. </meta>
If it turns out to be true, it would certainly be a good meme for the marketing department to work with.
cheers -ben
On 15.04.2016, at 05:27, David T. Lewis lewis@mail.msen.com wrote:
You would get a kick out of taking a closer look at the RSqueak VM. It doesn’t even use the primitives for BitBlt but runs the Smalltalk code.
- Bert -
On Fri, Apr 15, 2016 at 11:51:21AM +0200, Bert Freudenberg wrote:
You're right, I would like to take a closer look at RSqueak. It's getting hard to keep up with all of this progress :-)
Dave
What's RSqueak VM?
Where to find it?
Phil
Hi,
RSqueakVM is a research Squeak VM written in RPython, with a tracing JIT compiler supporting 32- and 64-bit x86, armv6, armv7, and PowerPC (although we have deactivated the non-x86-32 builds in the CI right now, because it takes forever to run through qemu-chroot and nobody has time to work on those platforms at the moment).
RSqueak supports all image formats from Squeak 2 all the way up to current trunk. So you can have a jitted mini image, if you find a file-in with the BitBlt code in Smalltalk, because we don't have the BitBlt plugin ;)
The JIT is fast enough to omit almost all non-essential primitives, and to simulate many primitives (like Balloon and BitBlt) by running the Slang code instead of a plugin. In fact, we don't have the large integer, misc, BitBlt, or Balloon plugins, relying instead on the fallback and VMMaker code to make execution fast. For many cases this works quite well, although warmup is still a big issue for us (the framerate of the image keeps steadily climbing for the first few *minutes* that you use it - there is just a lot of stuff that needs to be jitted before BitBlt, Balloon, and font rendering get fast).
We are currently preparing an alpha release as a one-click bundle for people to try it out.
Development is currently here: https://github.com/HPI-SWA-Lab/RSqueak. Binaries are built automatically on each commit, but you also need SDL2 and a current image with VMMaker (we need a number of fixes in VMMaker and the fallback code - older images need these fixes filed in somehow, too - we're also preparing changesets for those).
-- View this message in context: http://forum.world.st/primitiveDigitCompare-is-slow-was-Re-squeak-dev-The-In... Sent from the Squeak VM mailing list archive at Nabble.com.
(changing the subject line)
Thanks very much for this overview of RSqueakVM!
Dave
On Fri, Apr 15, 2016 at 06:22:03AM -0700, timfelgentreff wrote:
This thread got derailed, and I feel like we should go back to its main point, which was that #digitCompare: using #primitiveDigitCompare is slow. It's so slow that #primitiveCompareString is about twice as quick in Spur VMs, which is odd.
Levente
On Thu, 14 Apr 2016, Levente Uzonyi wrote:
On Wed, Apr 13, 2016 at 08:25:33PM -0400, David T. Lewis wrote:
Hi Levente,
I think you may be right. I repeated my test with an interpreter VM on a 32-bit image (with a smaller loop count, because the interpreter VM is slower than Spur). The change that I put in the inbox does not have any benefit for the interpreter VM:
a := 7432154326465436. "a big number" b := a + 1. "low order digit changed" c := 8432154326465436. "high order digit changed" d := 7432154026465436. "a digit in the middle changed"
"Base performance in the trunk level V3 image with interpreter VM" Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3844" Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3786" Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3800"
"Performance after adding LargePositiveInteger>>=" Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3868" Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3775" Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3770"
So yes, it is something related to the VM. But I do not understand how #primDigitCompare could be so slow. As you say, maybe something is wrong with it.
Thank you, I am glad that I asked for a review :-)
Dave
On Thu, Apr 14, 2016 at 01:45:42AM +0200, Levente Uzonyi wrote:
Hi Dave,
I guess this is a VM related issue. On 64-bit Spur [ 7432154326465436 digitCompare: 8432154326465436 ] bench. returns '5,420,000 per second. 184 nanoseconds per run.'.
Removing the primitive call from #digitCompare: the number goes up: '20,200,000 per second. 49.4 nanoseconds per run.'.
You might think that it's okay because the JIT is so good. But we have another primitive to compare two bytes-objects, one which ought to be slower than #primDigitCompare:, because it maps the bytes before doing the comparison. I even subtracted two from its result to match the result of #digitCompare:
| order | order := (0 to: 255) as: ByteArray. [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench.
But it's still about twice as quick as #primDigitCompare:. '9,590,000 per second. 104 nanoseconds per run.'. So, something must be wrong with #primDigitCompare:.
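For reference, the primitive answers 1, 2, or 3 for less/equal/greater (hence the `- 2` above to line it up with #digitCompare:'s -1/0/1), and its inner loop does a collation-table lookup on each byte before comparing. A rough C sketch of that loop (an illustration only, not the MiscPrimitivePlugin source):

```c
#include <stddef.h>

/* Sketch of a collated byte comparison: map each byte through the
   order table, then compare.  Answers 1, 2, or 3 (less, equal,
   greater), matching the primitive's contract described above. */
static int compareCollated(const unsigned char *s1, size_t n1,
                           const unsigned char *s2, size_t n2,
                           const unsigned char *order)
{
    size_t n = n1 < n2 ? n1 : n2;
    for (size_t i = 0; i < n; i++) {
        unsigned char c1 = order[s1[i]];
        unsigned char c2 = order[s2[i]];
        if (c1 != c2)
            return c1 < c2 ? 1 : 3;
    }
    if (n1 == n2)
        return 2;           /* common prefix, same length */
    return n1 < n2 ? 1 : 3; /* shorter operand sorts first */
}
```

Even with the extra table lookup per byte, this kind of loop beat the plain byte comparison in the benchmarks above, which suggests the difference lies outside the loop itself.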
Levente
On Wed, 13 Apr 2016, David T. Lewis wrote:
I would appreciate a review before moving this to trunk.
Background:
In UTCDateAndTime, most things are much faster than the trunk version. However, the DateAndTime equality check did not improve for 32-bit images, so I ran it under AndreasSystemProfiler. Profiling showed that large integer equality checks spend most of their time in primDigitCompare, which is inefficient when only a simple byte comparison is needed.
Here is the performance difference that I see on my system, 32-bit trunk Spur on Linux:
| a b c d | a := 7432154326465436. "a big number" b := a + 1. "low order digit changed" c := 8432154326465436. "high order digit changed" d := 7432154026465436. "a digit in the middle changed"
"Base performance in the trunk image" Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 63733" Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 63152" Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 63581"
"Performance after adding LargePositiveInteger>>=" Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 4676" Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 4883" Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 4512"
Dave
On Wed, Apr 13, 2016 at 10:57:28PM +0000, commits@source.squeak.org wrote:
David T. Lewis uploaded a new version of Kernel to project The Inbox: http://source.squeak.org/inbox/Kernel-dtl.1015.mcz
==================== Summary ====================
Name: Kernel-dtl.1015
Author: dtl
Time: 13 April 2016, 6:57:22.56608 pm
UUID: bd849f91-9b00-45c5-b2ab-891b420bde5e
Ancestors: Kernel-mt.1014
Make large integer equality test be about 13 times faster. Implement #= in LargePositiveInteger, and use digitAt: (primitive 60) for the comparison.
=============== Diff against Kernel-mt.1014 ===============
Item was added:
+ ----- Method: LargePositiveInteger>>= (in category 'comparing') -----
+ = aNumber
+ 	aNumber class == self class ifTrue: [
+ 		aNumber size = self size ifFalse: [ ^false ].
+ 		self size to: 1 by: -1 do: [ :i |
+ 			(aNumber digitAt: i) = (self digitAt: i) ifFalse: [ ^false ] ].
+ 		^true ].
+ 	aNumber isInteger ifTrue: [ ^false ].
+ 	aNumber isNumber ifFalse: [ ^false ].
+ 	^aNumber adaptToInteger: self andCompare: #=!
It does seem odd that it would be different on Spur. You suggested that compiler optimization might be set differently for the two plugins. I cannot check (I'm on an Ubuntu laptop, cannot build Spur due to some kind of autotools problem that I can't figure out right now), but if you are able to do a VM compile and save the console output to a file, then I think you can look at the CFLAGS setting for the two plugins ($ nohup ./mvm, but first comment out the part of mvm that asks if you want to clean).
Aside from that, the actual generated code for primDigitCompare is doing quite a lot of stuff before it ever gets around to comparing the digits. I suspect it is not a very efficient primitive.
Dave
On Fri, Apr 15, 2016 at 05:52:02PM +0200, Levente Uzonyi wrote:
Hi, the primitive does nothing very special:
1) check for the SmallInteger cases first, for a quick return
2) check the LargeInteger lengths, if both receiver & argument are Large
3) compare the LargeInteger digits, if both have the same length
None of these 3 steps is expected to be slow.
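For concreteness, step 3 in isolation looks roughly like this in C (a sketch, not the actual LargeIntegersPlugin code; it assumes step 2 has already established equal lengths, and that the digit bytes are stored least significant first):

```c
#include <stddef.h>

/* Compare two large-integer magnitudes of equal byte length, scanning
   from the most significant digit down.  Returns -1, 0, or 1. */
static int digitCompare(const unsigned char *a, const unsigned char *b,
                        size_t len)
{
    size_t i = len;
    while (i-- > 0) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0; /* all digits equal */
}
```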
A bleeding edge VM does the 3rd step using 32-bit limbs instead of 8-bit ones, but this does not change a thing about the performance ratio; it could only make a difference for giant integers, and these ones fit in 56 bits... I got these results with 32-bit Cog Spur on MacOSX:
[ 7432154326465436 digitCompare: 8432154326465436 ] bench. '8,980,000 per second. 111 nanoseconds per run.'
| order | order := (0 to: 255) as: ByteArray. [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench. '16,400,000 per second. 60.8 nanoseconds per run.'
Obviously, most of the time is spent elsewhere; it would be interesting to know where exactly.
Nicolas
2016-04-16 18:52 GMT+02:00 David T. Lewis lewis@mail.msen.com:
Ah but maybe the "Smart" side of the plugin is causing trouble:
firstInteger := self primitive: 'primDigitCompare' parameters: #(#Integer ) receiver: #Integer.
translates as:
success(isKindOf(stackValue(0), "Integer"));
secondInteger = stackValue(0);
/* missing DebugCode */
success(isKindOf(stackValue(1), "Integer"));
firstInteger = stackValue(1);
It might be faster to just check the 3 cases:
(interpreterProxy isIntegerObject: oop)
	or: [oopClass := interpreterProxy fetchClassOf: oop.
		oopClass == interpreterProxy classLargeNegativeInteger
			or: [oopClass == interpreterProxy classLargePositiveInteger]].
Moreover, we already test isIntegerObject further on in the primitive code. Since every LargeIntegersPlugin primitive is going through this isKindOf check, there might be some low-hanging fruit here.
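To make the suggested shortcut concrete, here is a toy C sketch of the three-case check. The tag encoding and class ids are made up for illustration; the real plugin would go through interpreterProxy (isIntegerObject, fetchClassOf, classLargeNegativeInteger, classLargePositiveInteger, as in the Smalltalk snippet above):

```c
typedef long oop_t;

/* Made-up encoding for this sketch only: odd oops are immediate
   SmallIntegers; otherwise the class id lives in the high bits. */
enum { ClassLargePositiveInteger = 100,
       ClassLargeNegativeInteger = 101,
       ClassSomethingElse        = 102 };

static int  isIntegerObject(oop_t oop) { return (int)(oop & 1); }
static long fetchClassOf(oop_t oop)    { return oop >> 8; }

/* The three cheap cases, instead of a generic isKindOf: walk over the
   class hierarchy: immediate SmallInteger, LargeNegativeInteger, or
   LargePositiveInteger. */
static int isAnyInteger(oop_t oop)
{
    long cls;
    if (isIntegerObject(oop))
        return 1;
    cls = fetchClassOf(oop);
    return cls == ClassLargeNegativeInteger
        || cls == ClassLargePositiveInteger;
}
```

The check is a couple of compares instead of a hierarchy walk, which is exactly the kind of per-call overhead that could account for the benchmark gap.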
2016-04-16 19:37 GMT+02:00 Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com>:
Hi, the primitive does nothing very special
- check for SmallInteger cases first, for quick return
- check for LargeIntegers length then if both receiver & argument are
Large 3) check for LargeIntegers digits then if both have same length
None of these 3 steps is expected to be slow.
A bleeding age VM does the 3rd step using 32 bits limbs instead of 8bits but this does not change a thing about performances ratio, it could only make a difference for giant integers, these ones fit on 56 bits... I got these with 32bits Cog Sput MacOSX :
[ 7432154326465436 digitCompare: 8432154326465436 ] bench. '8,980,000 per second. 111 nanoseconds per run.'
| order | order := (0 to: 255) as: ByteArray. [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench. '16,400,000 per second. 60.8 nanoseconds per run.'
Obviously, most time is spent elsewhere, it would be interesting to know where exactly.
Nicolas
2016-04-16 18:52 GMT+02:00 David T. Lewis lewis@mail.msen.com:
It does seem odd that it would be different on Spur. You suggested that compiler optimization might be set differently for the two plugins. I cannot check (I'm on an Ubuntu laptop, cannot build Spur due to some kind of autotools problem that I can't figure out right now), but if you are able to do a VM compile and save the console output to a file, then I think you can look at the CFLAGS setting for the two plugins ($ nohup ./mvm, but first comment out the part of mvm that asks if you want to clean).
Aside from that, the actual generated code for primDigitCompare is doing quite a lot of stuff before it ever gets around to comparing the digits. I suspect it is not a very efficient primitive.
Dave
On Fri, Apr 15, 2016 at 05:52:02PM +0200, Levente Uzonyi wrote:
This thread got derailed, and I feel like we should go back to its main point, which was that #digitCompare: using #primitiveDigitCompare
is
slow. It's so slow, that #primitiveCompareString is about twice as quick in Spur VMs, which is odd.
Levente
On Thu, 14 Apr 2016, Levente Uzonyi wrote:
Hi Dave,
I dug a bit deeper into this problem, and I found that it's been around for ages. I compared the two primitives in three older images. The first one is Squeak 4.2 running on Cog r2714:
[ 7432154326465436 digitCompare: 8432154326465436 ] bench. '6,880,000 per second.'
| order | order := (0 to: 255) as: ByteArray. [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench. '11,200,000 per second.'
The next one was an even older Cog VM running an image updated from 3.10.2. The VM didn't have the primitives required to fetch the VM information. The results were '8.22459348130374e6 per second.' and '1.19677724910036e7 per second.', respectively.
The last image was a closure-enabled 3.10.2 running on the classic Interpreter VM. The results were '3.911442911417717e6 per second.' and '4.84567866426715e6 per second.', respectively.
Since in all VMs the seemingly more complex code (primitiveCompareString with a subtraction) was quicker than the simpler code (primitiveDigitCompare), I suspect that LargeIntegersPlugin is compiled with less aggressive optimization than MiscPrimitivePlugin.
What's also interesting is that, based on these benchmarks, the performance got worse over the years. I think invoking a primitive has its cost in newer VMs.
Levente
On Thu, 14 Apr 2016, David T. Lewis wrote:
Oops, I just realized that my "interpreter VM" results were from a debugging VM with compiler optimization off (too many VMs, sorry). Here are the results for a more representative interpreter VM. I can't explain the variation, but one conclusion is clear: my suggested "optimization" in the inbox is a bad idea (and I will move it to the treated inbox).
"Base performance in the trunk level V3 image with interpreter VM" Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==>
26669"
Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]]. "==>
25052"
Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==>
25275"
"Performance after adding LargePositiveInteger>>=" Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==>
59224"
Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]]. "==>
27824"
Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==>
44324"
Dave
On Wed, Apr 13, 2016 at 08:25:33PM -0400, David T. Lewis wrote:
Hi Levente,
I think you may be right. I repeated my test with an interpreter VM on a 32-bit image (with loop count smaller because the interpreter VM is slower than Spur). The change that I put in the inbox does not have any benefit for the interpreter VM:
a := 7432154326465436. "a big number"
b := a + 1. "low order digit changed"
c := 8432154326465436. "high order digit changed"
d := 7432154026465436. "a digit in the middle changed"

"Base performance in the trunk level V3 image with interpreter VM"
Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3844"
Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3786"
Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3800"

"Performance after adding LargePositiveInteger>>="
Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3868"
Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3775"
Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3770"
So yes, it is something related to the VM. But I do not understand how #primDigitCompare could be so slow. As you say, maybe something is wrong with it.
Thank you, I am glad that I asked for a review :-)
Dave
On Thu, Apr 14, 2016 at 01:45:42AM +0200, Levente Uzonyi wrote:
>Hi Dave,
>
>I guess this is a VM related issue. On 64-bit Spur
>[ 7432154326465436 digitCompare: 8432154326465436 ] bench.
>returns '5,420,000 per second. 184 nanoseconds per run.'.
>
>Removing the primitive call from #digitCompare: the number goes up:
>'20,200,000 per second. 49.4 nanoseconds per run.'.
>
>You might think that it's okay because the JIT is so good. But we have
>another primitive to compare two bytes-objects. One which ought to be
>slower than #primDigitCompare:, because it maps the bytes before doing
>the comparison. I even subtracted two from its result to match the
>result of #digitCompare:
>
>| order |
>order := (0 to: 255) as: ByteArray.
>[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated:
>order) - 2 ] bench.
>
>But it's still about twice as quick as #primDigitCompare:.
>'9,590,000 per second. 104 nanoseconds per run.'.
>So, something must be wrong with #primDigitCompare:.
>
>Levente
>
>On Wed, 13 Apr 2016, David T. Lewis wrote:
>
>>I would appreciate a review before moving this to trunk.
>>
>>Background:
>>
>>In UTCDateAndTime, most things are much faster than the trunk version.
>>However, the DateAndTime equality check did not improve for 32-bit images,
>>so I ran it under AndreasSystemProfiler. Profiling showed that large
>>integer equality checks spend time mostly in primDigitCompare, which
>>is inefficient when only a simple byte comparison is needed.
>>
>>Here is the performance difference that I see on my system, 32-bit trunk
>>Spur on Linux:
>>
>>| a b c d |
>>a := 7432154326465436. "a big number"
>>b := a + 1. "low order digit changed"
>>c := 8432154326465436. "high order digit changed"
>>d := 7432154026465436. "a digit in the middle changed"
>>
>>"Base performance in the trunk image"
>>Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 63733"
>>Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 63152"
>>Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 63581"
>>
>>"Performance after adding LargePositiveInteger>>="
>>Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 4676"
>>Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 4883"
>>Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 4512"
>>
>>Dave
>>
>>On Wed, Apr 13, 2016 at 10:57:28PM +0000, commits@source.squeak.org wrote:
>>>David T. Lewis uploaded a new version of Kernel to project The Inbox:
>>>http://source.squeak.org/inbox/Kernel-dtl.1015.mcz
>>>
>>>==================== Summary ====================
>>>
>>>Name: Kernel-dtl.1015
>>>Author: dtl
>>>Time: 13 April 2016, 6:57:22.56608 pm
>>>UUID: bd849f91-9b00-45c5-b2ab-891b420bde5e
>>>Ancestors: Kernel-mt.1014
>>>
>>>Make large integer equality test be about 13 times faster. Implement #=
>>>in LargePositiveInteger, and use digitAt: (primitive 60) for the
>>>comparison.
>>>
>>>=============== Diff against Kernel-mt.1014 ===============
>>>
>>>Item was added:
>>>+ ----- Method: LargePositiveInteger>>= (in category 'comparing') -----
>>>+ = aNumber
>>>+ 	aNumber class == self class ifTrue: [
>>>+ 		aNumber size = self size ifFalse: [ ^false ].
>>>+ 		self size to: 1 by: -1 do: [ :i |
>>>+ 			(aNumber digitAt: i) = (self digitAt: i) ifFalse: [ ^ false ] ].
>>>+ 		^ true ].
>>>+ 	aNumber isInteger ifTrue: [ ^false ].
>>>+ 	aNumber isNumber ifFalse: [ ^false ].
>>>+ 	^aNumber adaptToInteger: self andCompare: #=!
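[Editorial note: for readers following along, the digit loop in the quoted LargePositiveInteger>>= can be modeled in plain C. This is a hypothetical stand-alone sketch over byte arrays; the LargeInt struct and its field names are illustrative, not the VM's actual object representation.]

```c
#include <stddef.h>

/* Sketch: a large positive integer modeled as a little-endian array of
   base-256 digits, mirroring the proposed LargePositiveInteger>>= above. */
typedef struct {
    const unsigned char *digits;  /* digits[0] is the least significant byte */
    size_t size;                  /* number of digits */
} LargeInt;

/* Equality: sizes must match, then compare digits from the most
   significant end down, short-circuiting on the first mismatch. */
int largeIntEquals(const LargeInt *a, const LargeInt *b)
{
    if (a->size != b->size)
        return 0;
    for (size_t i = a->size; i > 0; i--)
        if (a->digits[i - 1] != b->digits[i - 1])
            return 0;
    return 1;
}
```

Scanning from the most significant digit down lets unequal numbers fail fast, which is what the b, c and d benchmark cases (low, high and middle digit changed) probe.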
Bingo, here are the results without isKindOf checks:
[ 7432154326465436 digitCompare: 8432154326465436 ] bench. '16,100,000 per second. 62.1 nanoseconds per run.'
that is more or less the speed of ByteArray comparison.
I think I will make a new pass on LargeIntegersPlugin :)
2016-04-16 22:54 GMT+02:00 Nicolas Cellier <nicolas.cellier.aka.nice@gmail.com>:
Ah but maybe the "Smart" side of the plugin is causing trouble:
firstInteger := self primitive: 'primDigitCompare' parameters: #(#Integer ) receiver: #Integer.
translates as:
success(isKindOf(stackValue(0), "Integer"));
secondInteger = stackValue(0);
/* missing DebugCode */;
success(isKindOf(stackValue(1), "Integer"));
firstInteger = stackValue(1);
It might be faster to just check the 3 cases:
(interpreterProxy isIntegerObject: oop) or:
	[oopClass := interpreterProxy fetchClassOf: oop.
	oopClass == interpreterProxy classLargeNegativeInteger
		or: [oopClass == interpreterProxy classLargePositiveInteger]].
Moreover, we already test isIntegerObject further in the primitive code. Since every LargeIntegersPlugin primitive goes through this isKindOf check, there might be some low-hanging fruit.
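[Editorial note: as a sketch of why the three-case test should beat a generic isKindOf: in both the V3 and Spur object formats a SmallInteger is an immediate oop marked by a low tag bit, so the common case reduces to one AND plus at most two pointer comparisons. The following C model is illustrative only; the tag layout, ObjectHeader struct and names are assumptions, not the VM's definitions.]

```c
#include <stdint.h>

typedef uintptr_t oop;

/* Hypothetical tag scheme: immediate SmallIntegers carry a 1 in the low
   bit, so the check is a single AND, with no function call needed. */
static inline int isIntegerObject(oop o)
{
    return (o & 1) != 0;
}

/* Heap objects modeled as a header holding just a class reference. */
typedef struct { const void *classOop; } ObjectHeader;

/* The three-case test Nicolas proposes: SmallInteger, or an object whose
   class is one of the two LargeInteger classes. */
int isAnyInteger(oop o, const void *classLargeNeg, const void *classLargePos)
{
    if (isIntegerObject(o))
        return 1;
    const void *cls = ((const ObjectHeader *)o)->classOop;
    return cls == classLargeNeg || cls == classLargePos;
}
```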
2016-04-16 19:37 GMT+02:00 Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com>:
Hi, the primitive does nothing very special:
1) check for SmallInteger cases first, for quick return
2) check for LargeIntegers length then, if both receiver & argument are Large
3) check for LargeIntegers digits then, if both have same length
None of these 3 steps is expected to be slow.
A bleeding edge VM does the 3rd step using 32-bit limbs instead of 8-bit ones, but this does not change a thing about the performance ratio; it could only make a difference for giant integers, and these ones fit in 56 bits... I got these with 32-bit Cog Spur on Mac OS X:
[ 7432154326465436 digitCompare: 8432154326465436 ] bench. '8,980,000 per second. 111 nanoseconds per run.'
| order | order := (0 to: 255) as: ByteArray. [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench. '16,400,000 per second. 60.8 nanoseconds per run.'
Obviously, most time is spent elsewhere, it would be interesting to know where exactly.
Nicolas
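[Editorial note: steps 2) and 3) above amount to only a few instructions per digit. A hypothetical stand-alone model in C, assuming little-endian base-256 digit arrays with no leading zero digits and the 1/0/-1 result convention of #digitCompare::]

```c
#include <stddef.h>

/* Sketch of steps 2 and 3: compare magnitudes stored as little-endian
   base-256 digit arrays. Answers 1, 0 or -1, like #digitCompare:. */
int digitCompare(const unsigned char *a, size_t lenA,
                 const unsigned char *b, size_t lenB)
{
    /* Step 2: a longer digit string means a larger magnitude,
       assuming no leading zero digits. */
    if (lenA != lenB)
        return lenA > lenB ? 1 : -1;
    /* Step 3: same length, so compare digits from the most
       significant end down. */
    for (size_t i = lenA; i > 0; i--)
        if (a[i - 1] != b[i - 1])
            return a[i - 1] > b[i - 1] ? 1 : -1;
    return 0;
}
```

The loop body is a couple of loads and a compare, which is why the overhead before the loop (argument validation, plugin dispatch) dominates for the 7-byte integers benchmarked in this thread.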
Hi Nicolas,
IMO we need a different API for checking classes, something that, via macros, compiles to direct class access for internal plugins.
I was also going to suggest using the VM Profiler:
"The VMProfiler will tell you exactly where: http://source.squeak.org/VMMaker/CogTools then screen menu open... VM Profiler then put an expression in the bottom right hand box, and hit Profile, then in the left=hand list of function names, select VM report"
But the UnixOSProcessPlugin as compiled by my mac build scripts and run under Spur on Mac OS X does not appear to load. The VM Profiler depends on OS Process to extract addresses from shared libraries. I'm looking at it now :-(
2016-04-16 23:54 GMT+02:00 Eliot Miranda eliot.miranda@gmail.com:
Great, one more tool to look at :)
-- _,,,^..^,,,_ best, Eliot
I just played around a bit with the code generator, and I think there are other reasons why the code is slow:
- the use of function calls instead of macros for isIntegerObject, slotSizeOf, success, failed, stackValue. I suspect that the VM internally uses macros for these in numbered primitives.
- #cDigitCompare:with:len: doesn't get inlined, even though it's a tail call. That's probably a limitation of the Slang inliner, or a bug.
Some functions are generated as static, even though they are never called, because they have been inlined during the code generation. E.g.: #digitCompareLarge:with:. While this doesn't have a direct effect on execution time, it may affect the CPU cache. I think the best would be to not generate any C code from such methods.
Levente
On Sun, 17 Apr 2016, Levente Uzonyi wrote:
I just played around a bit with the code generator and I think there are other reasons why the code is slow:
- the use of function calls instead of macros for isIntegerObject, slotSizeOf, success, failed, stackValue. I suspect that the VM internally uses macros for these in numbered primitives.
Perhaps gcc can automatically inline these with the -flto switch.
Levente
2016-04-17 0:36 GMT+02:00 Levente Uzonyi leves@caesar.elte.hu:
I just played around a bit with the code generator and I think there are other reasons why the code is slow:
- the use of function calls instead of macros for isIntegerObject, slotSizeOf, success, failed, stackValue. I suspect that the VM internally uses macros for these in numbered primitives.
Hi Levente,
But currently we generate a single source code for plugins, whatever the VM flavour, so the macro, if any, should be defined in the VM-specific repository... That sounds a bit like what Eliot was suggesting for fetching the class, though (if I understood correctly).
My opinion is that suppressing function call overhead will make little difference since these calls are pre-conditions sent once or twice per primitive.
- #cDigitCompare:with:len: doesn't get inlined, even though it's a tail call. That's probably a limitation of the Slang inliner, or a bug.
I'm working on Slang type inference and inlining (they are closely bound), but I don't think it really matters for this specific call. More so for unsafeDigitAt & co which might be used in tight loops.
Some functions are generated as static, even though they are never called, because they have been inlined during the code generation. E.g.: #digitCompareLarge:with:. While this doesn't have a direct effect on execution time, it may affect the CPU cache. I think the best would be to not generate any C code from such methods.
I wonder if I did not see code for eliminating these... I'll check again.
Levente
Hi Nicolas,
It's been a while since I last optimized C programs, but I'm pretty sure function calls cost a lot compared to a few direct instructions (e.g. isIntegerObject).
Levente
AFAICT, cDigitCompare:with:len: is inlined by clang. But the interpreterProxy messages are not inlined, which is not really surprising, since they live in a different compilation unit. The procedure I used for checking was:
vim ../common/Makefile.flags   (add option -S to CFLAGS)
touch ../../src/plugins/LargeIntegers/LargeIntegers.c
mvm -f less build/LargeIntegers/LargeIntegers.o
And you are certainly right: compared to a few bit ops, the function call/return/stack handling is expensive. So inlining should make a measurable difference for "small" large integers. I was biased toward giant integers, which are more what I'm after (tight loops).
Nicolas
2016-04-17 22:43 GMT+02:00 Nicolas Cellier <nicolas.cellier.aka.nice@gmail.com>:
AFAICT, cDigitCompare:with:len: is inlined by clang. But the interpreterProxy messages are not inlined, but it's not really amazing, these are different units of compilation.
Ah, and this is the purpose of -flto, link-time-optimization, sorry for being slow myself ;)
2016-04-17 22:51 GMT+02:00 Nicolas Cellier <nicolas.cellier.aka.nice@gmail.com>:
2016-04-17 22:43 GMT+02:00 Nicolas Cellier <nicolas.cellier.aka.nice@gmail.com>:
AFAICT, cDigitCompare:with:len: is inlined by clang. But the interpreterProxy messages are not inlined, but it's not really amazing, these are different units of compilation.
Ah, and this is the purpose of -flto, link-time-optimization, sorry for being slow myself ;)
Hmm, I'm slow, but is -flto stable? If I add it to CFLAGS and to LDFLAGS then I cannot compile Spur on Mac, and I cannot decode the error message easily:
0  0x108ee297e  __assert_rtn + 144
1  0x108f709d8  ld::tool::HeaderAndLoadCommandsAtom<x86>::sectionFlags(ld::Internal::FinalSection*) const + 782
2  0x108f703a5  ld::tool::HeaderAndLoadCommandsAtom<x86>::copySegmentLoadCommands(unsigned char*) const + 955
3  0x108f6f58c  ld::tool::HeaderAndLoadCommandsAtom<x86>::copyRawContent(unsigned char*) const + 146
4  0x108f46e0f  ld::tool::OutputFile::writeAtoms(ld::Internal&, unsigned char*) + 465
5  0x108f3fdda  ld::tool::OutputFile::writeOutputFile(ld::Internal&) + 822
6  0x108f39e94  ld::tool::OutputFile::write(ld::Internal&) + 178
7  0x108ee38fa  main + 1311
A linker snapshot was created at: /tmp/Squeak-2016-03-17-230136.ld-snapshot
ld: Assertion failed: (0 && "typeTempLTO should not make it to final linked image"), function sectionFlags, file /Library/Caches/com.apple.xbs/Sources/ld64/ld64-264.3.101/src/ld/HeaderAndLoadCommands.hpp, line 780.
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [build/vm/Squeak] Error 1
You have to add -flto to both CFLAGS and LDFLAGS.
I tried to compile it on Ubuntu 14.04, but there's some problem with autoconf: the -flto flag probably optimizes something away, so the script that detects libdl fails. Instead it sets the HAVE_DYLD flag, which is Mac-only, and sqUnixExternalPrims.c won't compile.
Levente
On Mon, 18 Apr 2016, Levente Uzonyi wrote:
You have to add -flto to both CFLAGS and LDFLAGS.
Nevermind. You did.
Levente
With a bit of manual tweaking I managed to build a vm using the -flto flag (from the 3666 sources). gcc managed to inline stackValue, failed, isIntegerObject, integerObjectOf, but failed to do the same for success, isKindOf, popThenPush, integerValueOf and digitCompareLargewith.
The overall speedup was ~10% for SmallIntegers and ~4-5% for equal 64-bit LargeIntegers, and ~13% for unequal ones (no bytes match).
I added -flto to CFLAGS and -Wl,-O1 -Wl,-flto to LDFLAGS.
Levente
Hi Levente,
On Mon, Apr 18, 2016 at 8:39 AM, Levente Uzonyi leves@caesar.elte.hu wrote:
With a bit of manual tweaking I managed to build a vm using the -flto flag (from the 3666 sources). gcc managed to inline stackValue, failed, isIntegerObject, integerObjectOf, but failed to do the same for success, isKindOf, popThenPush, integerValueOf and digitCompareLargewith.
The overall speedup was ~10% for SmallIntegers and ~4-5% for equal 64-bit LargeIntegers, and ~13% for unequal ones (no bytes match).
I added -flto to CFLAGS and -Wl,-O1 -Wl,-flto to LDFLAGS.
Excuse me for being lazy and asking you rather than looking, but is there a way to specify the set of functions considered? It would be nice to apply link-time optimisation only to the plain primitive API. For debugging I'd rather /not/ apply link-time optimisation to the interface between the CoInterpreter and the Cogit. But the plugin API is performance-critical enough that I'll put up with reduced debuggability for that part of the system.
Levente
Hi Eliot,
Unfortunately that's not easily doable, but files compiled without the -flto flag will not be subject to link-time optimization. So, one way to achieve what you want is to compile those methods in a separate file with the -flto flag, and the rest in another file without the flag. This applies to every function you want to be subject to link-time optimization, be it the function you want inlined into another or the one you want another function inlined into. Also, the -flto flag doesn't work well with -g, so it shouldn't be a problem during debugging, if that was what you meant by debugging.
You can find the details here: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-flto-882
Levente
vm-dev@lists.squeakfoundation.org