[Vm-dev] primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Fri Apr 15 00:09:08 UTC 2016

I don't know if the apparent slowdown in newer VMs is significant.
It might be that some inefficiences have crept in to the primitive
calling, but it may just be differences in the compilers or compiler
optimization settings that were used when they were compiled.

Overall, I would expect that primitive calling will be faster in
Cog, and primitive execution time should be reasonably similar in
any version of the VM. The differences between Cog and classic
interpreter suggest that the cost of calling a primitive (especially
a small primitive) can be relatively high.

I have a feeling that we are reaching a point in which the jit
optimization may exceed the performance of compiled primitives. It 
seems quite reasonable to me that a jitted method might outperform
the compiled primitive version of the same method, because the jit
version can completely avoid the overhead of setting up the call to the
compiled primitive.

If that is so, then we may have primitives that actually slow the
system down when running on Cog.

What if we find that some primitives are useful if and only if we are
running without the jit? In that case, we might want to call the
primitives when running on Cog, and not call them when running on
a plain StackInterpreter.

This is probably a dumb idea, but I'll toss it out anyway. I wonder
if there might be a way to cause certain primitives to be called
only if they are needed for a non-jit VM? For example, in Squeak
we have Environments, and we have pragmas. So I wonder if there
might be some way to do something like this:

  <primitive: 'primitiveFindSubstring' module: 'MiscPrimitivePlugin' callUnless: #cog>

Dave

On Thu, Apr 14, 2016 at 07:04:47PM +0200, Levente Uzonyi wrote:
> 
> Hi Dave,
> 
> I dag a bit deeper into this problem, and I found that it's been around 
> for ages. I compared the two primitives in three older images.
> The first one is Squeak 4.2 running on Cog r2714:
> 
> [ 7432154326465436 digitCompare: 8432154326465436 ] bench.
> '6,880,000 per second.'
> 
> | order |
> order := (0 to: 255) as: ByteArray.
> [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: 
> order) - 2 ] bench.
> '11,200,000 per second.'
> 
> The next one was an even older Cog VM running an image updated from 
> 3.10.2. The VM didn't have the primitives required to fetch the VM 
> information. The result were '8.22459348130374e6 per second.' and
> '1.19677724910036e7 per second.', respectively.
> 
> The last image was a closure-enabled 3.10.2 running on the classic 
> Interpreter VM. The results were '3.911442911417717e6 per second.' and
> '4.84567866426715e6 per second.', respectively.
> 
> Since in all VMs the seemingly more complex code (primitiveCompareString 
> with a subtraction) was quicker than the simpler code 
> (primitiveDigitCompare), I suspect that LargeIntegersPlugin is compiled
> with less aggressive optimization than MiscPrimitivePlugin.
> 
> What's also interesting is that, based on these benchmarks, the 
> performance got worse over the years. I think invoking a primitive has its 
> cost in newer VMs.
> 
> Levente
> 
> On Thu, 14 Apr 2016, David T. Lewis wrote:
> 
> >Oops, I just realized that my "interpreter VM" results were from a 
> >debugging
> >VM with compiler optimization off (too many VMs, sorry). Here are the 
> >results
> >for a more representative interpreter VM. I can't explain the variation, 
> >but
> >one conclusion is clear: my suggested "optimization" in the inbox is a bad 
> >idea
> >(and I will move it to the treated inbox).
> >
> > "Base performance in the trunk level V3 image with interpreter VM"
> > Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 26669"
> > Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]]. "==> 25052"
> > Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 25275"
> >
> > "Performance after adding LargePositiveInteger>>="
> > Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 59224"
> > Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]].  "==> 27824"
> > Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 44324"
> >
> >Dave
> >
> >
> >
> >On Wed, Apr 13, 2016 at 08:25:33PM -0400, David T. Lewis wrote:
> >>Hi Levente,
> >>
> >>I think you may be right. I repeated my test with an interpreter VM on a
> >>32-bit image (with loop count smaller because the interpreter VM is slower
> >>than Spur). The change that I put in the inbox does not have any benefit
> >>for the interpreter VM:
> >>
> >>
> >>  a := 7432154326465436. "a big number"
> >>  b := a + 1. "low order digit changed"
> >>  c := 8432154326465436. "high order digit changed"
> >>  d := 7432154026465436. "a digit in the middle changed"
> >>
> >>   "Base performance in the trunk level V3 image with interpreter VM"
> >>  Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3844"
> >>  Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3786"
> >>  Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]].. "==> 3800"
> >>
> >>  "Performance after adding LargePositiveInteger>>="
> >>  Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3868"
> >>  Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3775"
> >>  Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3770"
> >>
> >>So yes it is something related to the VM. But I do not understand
> >>how #primDigitCompare could be so slow? As you say, maybe something
> >>is wrong with it.
> >>
> >>Thank you, I am glad that I asked for a review :-)
> >>
> >>Dave
> >>
> >>
> >>On Thu, Apr 14, 2016 at 01:45:42AM +0200, Levente Uzonyi wrote:
> >>>Hi Dave,
> >>>
> >>>I guess this is a VM related issue. On 64-bit Spur
> >>>[ 7432154326465436 digitCompare: 8432154326465436 ] bench.
> >>>returns '5,420,000 per second. 184 nanoseconds per run.'.
> >>>
> >>>Removing the primitive call from #digitCompare: the number goes up:
> >>>'20,200,000 per second. 49.4 nanoseconds per run.'.
> >>>
> >>>You might think that it's okay because the JIT is so good. But we have
> >>>another primitive to compare two bytes-objects. One which ought be
> >>>slower than #primDigitCompare:, because it maps the bytes before doing 
> >>>the
> >>>comparison. I even subtracted two from its result to match the result of
> >>>#digitCompare:
> >>>
> >>>| order |
> >>>order := (0 to: 255) as: ByteArray.
> >>>[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated:
> >>>order) - 2 ] bench.
> >>>
> >>>But it's still about twice as quick as #primDigitCompare:.
> >>>'9,590,000 per second. 104 nanoseconds per run.'.
> >>>So, something must be wrong with #primDigitCompare:.
> >>>
> >>>Levente
> >>>
> >>>On Wed, 13 Apr 2016, David T. Lewis wrote:
> >>>
> >>>>I would appreciate a review before moving this to trunk.
> >>>>
> >>>>Background:
> >>>>
> >>>>In UTCDateAndTime, most things are much faster than the trunk version.
> >>>>However, DataAndTime equality check did not improve for 32-bit images,
> >>>>so I ran it under AndreasSystemProfiler. Profiling showed that large
> >>>>integer equality checks spends time mostly in primDigitCompare, which
> >>>>is inefficient when only a simple byte comparison is needed.
> >>>>
> >>>>Here is the performance difference that I see on my system, 32-bit trunk
> >>>>Spur on Linux:
> >>>>
> >>>>| a b c d |
> >>>>a := 7432154326465436. "a big number"
> >>>>b := a + 1. "low order digit changed"
> >>>>c := 8432154326465436. "high order digit changed"
> >>>>d := 7432154026465436. "a digit in the middle changed"
> >>>>
> >>>>"Base performance in the trunk image"
> >>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 63733"
> >>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 63152"
> >>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 63581"
> >>>>
> >>>>"Performance after adding LargePositiveInteger>>="
> >>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 4676"
> >>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 4883"
> >>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 4512"
> >>>>
> >>>>Dave
> >>>>
> >>>>
> >>>>
> >>>>On Wed, Apr 13, 2016 at 10:57:28PM +0000, commits at source.squeak.org 
> >>>>wrote:
> >>>>>David T. Lewis uploaded a new version of Kernel to project The Inbox:
> >>>>>http://source.squeak.org/inbox/Kernel-dtl.1015.mcz
> >>>>>
> >>>>>==================== Summary ====================
> >>>>>
> >>>>>Name: Kernel-dtl.1015
> >>>>>Author: dtl
> >>>>>Time: 13 April 2016, 6:57:22.56608 pm
> >>>>>UUID: bd849f91-9b00-45c5-b2ab-891b420bde5e
> >>>>>Ancestors: Kernel-mt.1014
> >>>>>
> >>>>>Make large integer equality test be about 13 times faster. Implement #=
> >>>>>in LargePositiveInteger, and use digitAt: (primitive 60) for the
> >>>>>comparison.
> >>>>>
> >>>>>=============== Diff against Kernel-mt.1014 ===============
> >>>>>
> >>>>>Item was added:
> >>>>>+ ----- Method: LargePositiveInteger>>= (in category 'comparing') -----
> >>>>>+ = aNumber
> >>>>>+
> >>>>>+ 	aNumber class == self class ifTrue: [
> >>>>>+ 		aNumber size = self size ifFalse: [ ^false ].
> >>>>>+ 		self size to: 1 by: -1 do: [ :i | (aNumber digitAt: 
> >>>>>i) =
> >>>>>(self digitAt: i) ifFalse: [ ^ false ] ].
> >>>>>+ 		^ true ].
> >>>>>+ 	aNumber isInteger ifTrue: [ ^false ].
> >>>>>+ 	aNumber isNumber ifFalse: [ ^false ].
> >>>>>+ 	^aNumber adaptToInteger: self andCompare: #=!
> >>>>>
> >>>>
> >>>>
> >
> >