Hi Dave,
I dug a bit deeper into this problem, and I found that it's been around for ages. I compared the two primitives in three older images. The first one is Squeak 4.2 running on Cog r2714:
[ 7432154326465436 digitCompare: 8432154326465436 ] bench. '6,880,000 per second.'
| order |
order := (0 to: 255) as: ByteArray.
[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench. '11,200,000 per second.'
The next one was an even older Cog VM running an image updated from 3.10.2. The VM didn't have the primitives required to fetch the VM information. The results were '8.22459348130374e6 per second.' and '1.19677724910036e7 per second.', respectively.
The last image was a closure-enabled 3.10.2 running on the classic Interpreter VM. The results were '3.911442911417717e6 per second.' and '4.84567866426715e6 per second.', respectively.
Since in all VMs the seemingly more complex code (primitiveCompareString with a subtraction) was quicker than the simpler code (primitiveDigitCompare), I suspect that LargeIntegersPlugin is compiled with less aggressive optimization than MiscPrimitivePlugin.
What's also interesting is that, based on these benchmarks, performance has gotten worse over the years. I think invoking a primitive carries a real cost in newer VMs.
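The algorithmic asymmetry is easy to model outside Smalltalk. The following Python sketch is my own illustrative re-implementation of the two inner loops (not the actual Slang code of either plugin); it shows that the collated compare does strictly more work per byte (one table lookup extra), so when it still wins, the build flags are a plausible suspect.

```python
def digit_compare(a: bytes, b: bytes) -> int:
    """Illustrative model of primitiveDigitCompare: plain byte-wise
    magnitude comparison, most significant digit first; returns -1/0/1."""
    if len(a) != len(b):
        return -1 if len(a) < len(b) else 1
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            return -1 if x < y else 1
    return 0

def compare_collated(a: bytes, b: bytes, order: bytes) -> int:
    """Illustrative model of primitiveCompareString: maps every byte
    through a collation table first; returns 1/2/3 for </=/>."""
    for x, y in zip(a, b):
        cx, cy = order[x], order[y]        # the extra lookup per byte
        if cx != cy:
            return 1 if cx < cy else 3
    if len(a) == len(b):
        return 2
    return 1 if len(a) < len(b) else 3

# LargePositiveIntegers store their digits little-endian, like digitAt:.
a = (7432154326465436).to_bytes(7, 'little')
b = (8432154326465436).to_bytes(7, 'little')
order = bytes(range(256))                  # identity table, as in the bench
print(digit_compare(a, b))                 # -1
print(compare_collated(a, b, order) - 2)   # -1 after the "- 2" adjustment
```

Both functions make the same scan with the same early exit; the collated one just adds a lookup, which is why one would expect it to be the slower of the two.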
Levente
On Thu, 14 Apr 2016, David T. Lewis wrote:
Oops, I just realized that my "interpreter VM" results were from a debugging VM with compiler optimization off (too many VMs, sorry). Here are the results for a more representative interpreter VM. I can't explain the variation, but one conclusion is clear: my suggested "optimization" in the inbox is a bad idea (and I will move it to the treated inbox).
"Base performance in the trunk level V3 image with interpreter VM"
Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 26669"
Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]]. "==> 25052"
Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 25275"

"Performance after adding LargePositiveInteger>>="
Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 59224"
Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]]. "==> 27824"
Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 44324"
Dave
On Wed, Apr 13, 2016 at 08:25:33PM -0400, David T. Lewis wrote:
Hi Levente,
I think you may be right. I repeated my test with an interpreter VM on a 32-bit image (with loop count smaller because the interpreter VM is slower than Spur). The change that I put in the inbox does not have any benefit for the interpreter VM:
a := 7432154326465436. "a big number"
b := a + 1. "low order digit changed"
c := 8432154326465436. "high order digit changed"
d := 7432154026465436. "a digit in the middle changed"
"Base performance in the trunk level V3 image with interpreter VM"
Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3844"
Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3786"
Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3800"

"Performance after adding LargePositiveInteger>>="
Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3868"
Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3775"
Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3770"
So yes, it is something related to the VM. But I do not understand how #primDigitCompare: could be so slow. As you say, maybe something is wrong with it.
Thank you, I am glad that I asked for a review :-)
Dave
On Thu, Apr 14, 2016 at 01:45:42AM +0200, Levente Uzonyi wrote:
Hi Dave,
I guess this is a VM-related issue. On 64-bit Spur,

[ 7432154326465436 digitCompare: 8432154326465436 ] bench.

returns '5,420,000 per second. 184 nanoseconds per run.'.
After removing the primitive call from #digitCompare:, the number goes up: '20,200,000 per second. 49.4 nanoseconds per run.'.
You might think that it's okay because the JIT is so good. But we have another primitive to compare two bytes objects, one which ought to be slower than #primDigitCompare:, because it maps the bytes before doing the comparison. I even subtracted two from its result to match the result of #digitCompare:
| order |
order := (0 to: 255) as: ByteArray.
[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench.
But it's still about twice as quick as #primDigitCompare:. '9,590,000 per second. 104 nanoseconds per run.'. So, something must be wrong with #primDigitCompare:.
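For readers who don't have an image at hand, here is a rough Python analogue of what Squeak's #bench reports (an assumption about its behavior: it runs the block repeatedly for a fixed wall-clock interval and then prints throughput plus per-run time, in the format quoted above):

```python
import time

def bench(block, seconds=0.2):
    """Rough analogue of #bench: run `block` for about `seconds` of
    wall-clock time, then report runs per second and time per run."""
    runs = 0
    start = time.perf_counter()
    deadline = start + seconds
    while time.perf_counter() < deadline:
        block()
        runs += 1
    elapsed = time.perf_counter() - start
    return "{:,.0f} per second. {:.3g} nanoseconds per run.".format(
        runs / elapsed, elapsed / runs * 1e9)

print(bench(lambda: 7432154326465436 == 8432154326465436))
```

The absolute numbers will of course differ from the Squeak figures; only the reporting shape is meant to match.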
Levente
On Wed, 13 Apr 2016, David T. Lewis wrote:
I would appreciate a review before moving this to trunk.
Background:
In UTCDateAndTime, most things are much faster than the trunk version. However, the DateAndTime equality check did not improve for 32-bit images, so I ran it under AndreasSystemProfiler. Profiling showed that large integer equality checks spend their time mostly in primDigitCompare, which is inefficient when only a simple byte comparison is needed.
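The shortcut being described can be sketched in Python (illustrative only; it treats a LargePositiveInteger as its little-endian byte digits, the way digitAt: exposes them): equality needs just a size check plus a byte scan with early exit, not a full three-way magnitude comparison.

```python
def large_int_equal(a_digits: bytes, b_digits: bytes) -> bool:
    """Equality fast path: size check, then byte-wise scan (high digit
    first, as in the inbox method) with early exit on any mismatch."""
    if len(a_digits) != len(b_digits):
        return False
    for i in range(len(a_digits) - 1, -1, -1):
        if a_digits[i] != b_digits[i]:
            return False
    return True

a = (7432154326465436).to_bytes(7, 'little')
b = (7432154326465437).to_bytes(7, 'little')   # low order digit changed
print(large_int_equal(a, a))   # True
print(large_int_equal(a, b))   # False
```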
Here is the performance difference that I see on my system, 32-bit trunk Spur on Linux:
| a b c d |
a := 7432154326465436. "a big number"
b := a + 1. "low order digit changed"
c := 8432154326465436. "high order digit changed"
d := 7432154026465436. "a digit in the middle changed"
"Base performance in the trunk image"
Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 63733"
Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 63152"
Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 63581"

"Performance after adding LargePositiveInteger>>="
Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 4676"
Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 4883"
Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 4512"
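As a sanity check, the speedup implied by these timings (milliseconds for 500,000,000 iterations) works out to roughly the factor claimed in the commit message:

```python
base  = {'a = b': 63733, 'a = c': 63152, 'a = d': 63581}   # trunk, ms
after = {'a = b': 4676,  'a = c': 4883,  'a = d': 4512}    # with LargePositiveInteger>>=
for k in base:
    print(k, round(base[k] / after[k], 1))   # 13.6, 12.9, 14.1
```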
Dave
On Wed, Apr 13, 2016 at 10:57:28PM +0000, commits@source.squeak.org wrote:
David T. Lewis uploaded a new version of Kernel to project The Inbox: http://source.squeak.org/inbox/Kernel-dtl.1015.mcz
==================== Summary ====================
Name: Kernel-dtl.1015
Author: dtl
Time: 13 April 2016, 6:57:22.56608 pm
UUID: bd849f91-9b00-45c5-b2ab-891b420bde5e
Ancestors: Kernel-mt.1014
Make large integer equality test be about 13 times faster. Implement #= in LargePositiveInteger, and use digitAt: (primitive 60) for the comparison.
=============== Diff against Kernel-mt.1014 ===============
Item was added:
+ ----- Method: LargePositiveInteger>>= (in category 'comparing') -----
+ = aNumber
+ 
+ 	aNumber class == self class ifTrue: [
+ 		aNumber size = self size ifFalse: [ ^false ].
+ 		self size to: 1 by: -1 do: [ :i | (aNumber digitAt: i) = (self digitAt: i) ifFalse: [ ^ false ] ].
+ 		^ true ].
+ 	aNumber isInteger ifTrue: [ ^false ].
+ 	aNumber isNumber ifFalse: [ ^false ].
+ 	^aNumber adaptToInteger: self andCompare: #=!
I don't know if the apparent slowdown in newer VMs is significant. It might be that some inefficiencies have crept into the primitive calling, but it may just be differences in the compilers or compiler optimization settings that were used when they were compiled.
Overall, I would expect that primitive calling will be faster in Cog, and primitive execution time should be reasonably similar in any version of the VM. The differences between Cog and classic interpreter suggest that the cost of calling a primitive (especially a small primitive) can be relatively high.
I have a feeling that we are reaching a point at which the jit optimization may exceed the performance of compiled primitives. It seems quite reasonable to me that a jitted method might outperform the compiled primitive version of the same method, because the jit version can completely avoid the overhead of setting up the call to the compiled primitive.
If that is so, then we may have primitives that actually slow the system down when running on Cog.
What if we find that some primitives are useful if and only if we are running without the jit? In that case, we might want to call the primitives when running on Cog, and not call them when running on a plain StackInterpreter.
This is probably a dumb idea, but I'll toss it out anyway. I wonder if there might be a way to cause certain primitives to be called only if they are needed for a non-jit VM? For example, in Squeak we have Environments, and we have pragmas. So I wonder if there might be some way to do something like this:
<primitive: 'primitiveFindSubstring' module: 'MiscPrimitivePlugin' callUnless: #cog>
Dave
On Thu, Apr 14, 2016 at 07:04:47PM +0200, Levente Uzonyi wrote:
Hi Dave,
On Apr 14, 2016, at 5:09 PM, David T. Lewis lewis@mail.msen.com wrote:
Actually, I think it's a good idea, but we don't need an additional keyword and state in primitive methods. The JIT VM could maintain a "do not call" list, or simply not include certain primitives. What we need are accurate measurements with and without; I can then update the "do-not-call" list.
One issue is that for simple invocations (e.g. invocations with short strings as arguments) the JIT code may indeed be faster, and hence the overhead of the call makes the no-primitive code win; but longer arguments will be favored by the simpler indexing code in the C primitive, which at a certain point will overtake the JIT. Once Sista arrives this shouldn't be an issue, and all of MiscPrimitivePlugin should hopefully be redundant.
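The crossover being described can be put into a toy cost model. All constants below are my own illustrative assumptions, not measurements; the point is only the shape: a fixed call overhead amortizes over longer arguments.

```python
CALL_OVERHEAD_NS = 100.0   # assumed fixed cost of invoking the C primitive
JIT_NS_PER_BYTE  = 3.0     # assumed per-byte cost of the jitted in-image loop
PRIM_NS_PER_BYTE = 1.0     # assumed per-byte cost inside the primitive

def jit_cost(n_bytes):
    return JIT_NS_PER_BYTE * n_bytes

def prim_cost(n_bytes):
    return CALL_OVERHEAD_NS + PRIM_NS_PER_BYTE * n_bytes

# The primitive starts winning once overhead < (jit - prim) * n:
crossover = CALL_OVERHEAD_NS / (JIT_NS_PER_BYTE - PRIM_NS_PER_BYTE)
print(crossover)                          # 50.0 bytes with these constants
print(jit_cost(8) < prim_cost(8))         # True: short arguments favor the jit
print(jit_cost(1000) > prim_cost(1000))   # True: long arguments favor the primitive
```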
_,,,^..^,,,_ (phone)
On Thu, Apr 14, 2016 at 08:01:41PM -0700, Eliot Miranda wrote:
Actually, I think it's a good idea, but we don't need an additional keyword and state in primitive methods. The JIT VM could maintain a "do not call" list, or simply not include certain primitives. What we need are accurate measurements with and without; I can then update the "do-not-call" list.
One issue is that for simple invocations (e.g. invocations with short strings as arguments) the JIT code may indeed be faster, and hence the overhead of the call makes the no-primitive code win; but longer arguments will be favored by the simpler indexing code in the C primitive, which at a certain point will overtake the JIT. Once Sista arrives this shouldn't be an issue, and all of MiscPrimitivePlugin should hopefully be redundant.
<meta> It is really quite remarkable that we could be having this discussion at all. An open source virtual machine for Smalltalk that is so fast that it is not even worth the trouble to call out to C primitives? Really?!? Wow. </meta>
Dave
On Fri, Apr 15, 2016 at 11:27 AM, David T. Lewis lewis@mail.msen.com wrote:
<meta> It is really quite remarkable that we could be having this discussion at all. An open source virtual machine for Smalltalk that is so fast that it is not even worth the trouble to call out to C primitives? Really?!? Wow. </meta>
If it turns out to be true, it would certainly be a good meme for the marketing department to work with.
cheers -ben
On 15.04.2016, at 05:27, David T. Lewis lewis@mail.msen.com wrote:
You would get a kick out of taking a closer look at the RSqueak VM. It doesn’t even use the primitives for BitBlt but runs the Smalltalk code.
- Bert -
On Fri, Apr 15, 2016 at 11:51:21AM +0200, Bert Freudenberg wrote:
You're right, I would like to take a closer look at RSqueak. It's getting hard to keep up with all of this progress :-)
Dave
What's RSqueak VM?
Where to find it?
Phil
Hi,
RSqueakVM is a research Squeak VM written in RPython, with a tracing JIT compiler supporting 32- and 64-bit x86, armv6, armv7, and PowerPC (although we have deactivated the non-x86-32 builds in the CI right now, because it takes forever to run through qemu-chroot and nobody has time to work on those platforms at the moment).
RSqueak supports all image formats from Squeak 2 all the way up to current trunk. So you can have a jitted mini image, if you find a file-in with the BitBlt code in Smalltalk, because we don't have the BitBlt plugin ;)
The JIT is fast enough to omit almost all non-essential primitives, and to simulate many primitives (like Balloon and BitBlt) by running the Slang code instead of a plugin. In fact, we don't have the large integer, misc, BitBlt, or Balloon plugins, relying instead on the fallback and VMMaker code to make execution fast. For many cases this works quite well, although warmup is still a big issue for us (the framerate of the image keeps steadily climbing for the first few *minutes* that you use it - there is just a lot of stuff that needs to be jitted before BitBlt, Balloon, and font rendering get fast).
We are currently preparing an alpha release as a one-click bundle for people to try it out.
Development is currently here: https://github.com/HPI-SWA-Lab/RSqueak. Binaries are built automatically on each commit, but you also need SDL2 and a current image with VMMaker (we need a number of fixes in VMMaker and the fallback code - older images need these fixes filed in somehow, too - we're also preparing changesets for those).
-- View this message in context: http://forum.world.st/primitiveDigitCompare-is-slow-was-Re-squeak-dev-The-In... Sent from the Squeak VM mailing list archive at Nabble.com.
(changing the subject line)
Thanks very much for this overview of RSqueakVM!
Dave
On Fri, Apr 15, 2016 at 06:22:03AM -0700, timfelgentreff wrote:
This thread got derailed, and I feel like we should go back to its main point, which was that #digitCompare: using #primitiveDigitCompare is slow. It's so slow that #primitiveCompareString is about twice as quick in Spur VMs, which is odd.
Levente
On Thu, 14 Apr 2016, Levente Uzonyi wrote:
On Wed, Apr 13, 2016 at 08:25:33PM -0400, David T. Lewis wrote:
Hi Levente,
I think you may be right. I repeated my test with an interpreter VM on a 32-bit image (with a smaller loop count, because the interpreter VM is slower than Spur). The change that I put in the inbox does not have any benefit for the interpreter VM:
a := 7432154326465436. "a big number" b := a + 1. "low order digit changed" c := 8432154326465436. "high order digit changed" d := 7432154026465436. "a digit in the middle changed"
"Base performance in the trunk level V3 image with interpreter VM" Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3844" Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3786" Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3800"
"Performance after adding LargePositiveInteger>>=" Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3868" Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3775" Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3770"
So yes, it is something related to the VM. But I do not understand how #primDigitCompare could be so slow. As you say, maybe something is wrong with it.
Thank you, I am glad that I asked for a review :-)
Dave
On Thu, Apr 14, 2016 at 01:45:42AM +0200, Levente Uzonyi wrote:
Hi Dave,
I guess this is a VM related issue. On 64-bit Spur [ 7432154326465436 digitCompare: 8432154326465436 ] bench. returns '5,420,000 per second. 184 nanoseconds per run.'.
Removing the primitive call from #digitCompare: the number goes up: '20,200,000 per second. 49.4 nanoseconds per run.'.
You might think that it's okay because the JIT is so good. But we have another primitive to compare two bytes-objects, one which ought to be slower than #primDigitCompare:, because it maps the bytes before doing the comparison. I even subtracted two from its result to match the result of #digitCompare:
| order | order := (0 to: 255) as: ByteArray. [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench.
But it's still about twice as quick as #primDigitCompare:. '9,590,000 per second. 104 nanoseconds per run.'. So, something must be wrong with #primDigitCompare:.
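For reference, the primitive answers 1, 2, or 3 for less/equal/greater (hence the `- 2` above to line it up with #digitCompare:'s -1/0/1), and its inner loop does a collation-table lookup on each byte before comparing. A rough C sketch of that loop (an illustration only, not the MiscPrimitivePlugin source):

```c
#include <stddef.h>

/* Sketch of a collated byte comparison: map each byte through the
   order table, then compare.  Answers 1, 2, or 3 (less, equal,
   greater), matching the primitive's contract described above. */
static int compareCollated(const unsigned char *s1, size_t n1,
                           const unsigned char *s2, size_t n2,
                           const unsigned char *order)
{
    size_t n = n1 < n2 ? n1 : n2;
    for (size_t i = 0; i < n; i++) {
        unsigned char c1 = order[s1[i]];
        unsigned char c2 = order[s2[i]];
        if (c1 != c2)
            return c1 < c2 ? 1 : 3;
    }
    if (n1 == n2)
        return 2;           /* common prefix, same length */
    return n1 < n2 ? 1 : 3; /* shorter operand sorts first */
}
```

Even with the extra table lookup per byte, this kind of loop beat the plain byte comparison in the benchmarks above, which suggests the difference lies outside the loop itself.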
Levente
On Wed, 13 Apr 2016, David T. Lewis wrote:
I would appreciate a review before moving this to trunk.
Background:
In UTCDateAndTime, most things are much faster than the trunk version. However, the DateAndTime equality check did not improve for 32-bit images, so I ran it under AndreasSystemProfiler. Profiling showed that large integer equality checks spend most of their time in primDigitCompare, which is inefficient when only a simple byte comparison is needed.
Here is the performance difference that I see on my system, 32-bit trunk Spur on Linux:
| a b c d | a := 7432154326465436. "a big number" b := a + 1. "low order digit changed" c := 8432154326465436. "high order digit changed" d := 7432154026465436. "a digit in the middle changed"
"Base performance in the trunk image" Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 63733" Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 63152" Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 63581"
"Performance after adding LargePositiveInteger>>=" Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 4676" Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 4883" Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 4512"
Dave
On Wed, Apr 13, 2016 at 10:57:28PM +0000, commits@source.squeak.org wrote:
David T. Lewis uploaded a new version of Kernel to project The Inbox: http://source.squeak.org/inbox/Kernel-dtl.1015.mcz
==================== Summary ====================
Name: Kernel-dtl.1015
Author: dtl
Time: 13 April 2016, 6:57:22.56608 pm
UUID: bd849f91-9b00-45c5-b2ab-891b420bde5e
Ancestors: Kernel-mt.1014
Make large integer equality test be about 13 times faster. Implement #= in LargePositiveInteger, and use digitAt: (primitive 60) for the comparison.
=============== Diff against Kernel-mt.1014 ===============
Item was added:
+ ----- Method: LargePositiveInteger>>= (in category 'comparing') -----
+ = aNumber
+ 	aNumber class == self class ifTrue: [
+ 		aNumber size = self size ifFalse: [ ^false ].
+ 		self size to: 1 by: -1 do: [ :i |
+ 			(aNumber digitAt: i) = (self digitAt: i) ifFalse: [ ^false ] ].
+ 		^true ].
+ 	aNumber isInteger ifTrue: [ ^false ].
+ 	aNumber isNumber ifFalse: [ ^false ].
+ 	^aNumber adaptToInteger: self andCompare: #=!
It does seem odd that it would be different on Spur. You suggested that compiler optimization might be set differently for the two plugins. I cannot check (I'm on an Ubuntu laptop, cannot build Spur due to some kind of autotools problem that I can't figure out right now), but if you are able to do a VM compile and save the console output to a file, then I think you can look at the CFLAGS setting for the two plugins ($ nohup ./mvm, but first comment out the part of mvm that asks if you want to clean).
Aside from that, the actual generated code for primDigitCompare is doing quite a lot of stuff before it ever gets around to comparing the digits. I suspect it is not a very efficient primitive.
Dave
On Fri, Apr 15, 2016 at 05:52:02PM +0200, Levente Uzonyi wrote:
Hi, the primitive does nothing very special:
1) check for the SmallInteger cases first, for a quick return
2) check the LargeInteger lengths, if both receiver & argument are Large
3) compare the LargeInteger digits, if both have the same length
None of these 3 steps is expected to be slow.
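For concreteness, step 3 in isolation looks roughly like this in C (a sketch, not the actual LargeIntegersPlugin code; it assumes step 2 has already established equal lengths, and that the digit bytes are stored least significant first):

```c
#include <stddef.h>

/* Compare two large-integer magnitudes of equal byte length, scanning
   from the most significant digit down.  Returns -1, 0, or 1. */
static int digitCompare(const unsigned char *a, const unsigned char *b,
                        size_t len)
{
    size_t i = len;
    while (i-- > 0) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0; /* all digits equal */
}
```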
A bleeding edge VM does the 3rd step using 32-bit limbs instead of 8-bit ones, but this does not change a thing about the performance ratio; it could only make a difference for giant integers, and these ones fit in 56 bits... I got these results with 32-bit Cog Spur on MacOSX:
[ 7432154326465436 digitCompare: 8432154326465436 ] bench. '8,980,000 per second. 111 nanoseconds per run.'
| order | order := (0 to: 255) as: ByteArray. [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench. '16,400,000 per second. 60.8 nanoseconds per run.'
Obviously, most of the time is spent elsewhere; it would be interesting to know where exactly.
Nicolas
2016-04-16 18:52 GMT+02:00 David T. Lewis lewis@mail.msen.com:
Ah but maybe the "Smart" side of the plugin is causing trouble:
firstInteger := self primitive: 'primDigitCompare' parameters: #(#Integer ) receiver: #Integer.
translates as:
success(isKindOf(stackValue(0), "Integer"));
secondInteger = stackValue(0);
/* missing DebugCode */
success(isKindOf(stackValue(1), "Integer"));
firstInteger = stackValue(1);
It might be faster to just check the 3 cases:
(interpreterProxy isIntegerObject: oop)
	or: [oopClass := interpreterProxy fetchClassOf: oop.
		oopClass == interpreterProxy classLargeNegativeInteger
			or: [oopClass == interpreterProxy classLargePositiveInteger]].
Moreover, we already test isIntegerObject further on in the primitive code. Since every LargeIntegersPlugin primitive is going through this isKindOf check, there might be some low-hanging fruit here.
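To make the suggested shortcut concrete, here is a toy C sketch of the three-case check. The tag encoding and class ids are made up for illustration; the real plugin would go through interpreterProxy (isIntegerObject, fetchClassOf, classLargeNegativeInteger, classLargePositiveInteger, as in the Smalltalk snippet above):

```c
typedef long oop_t;

/* Made-up encoding for this sketch only: odd oops are immediate
   SmallIntegers; otherwise the class id lives in the high bits. */
enum { ClassLargePositiveInteger = 100,
       ClassLargeNegativeInteger = 101,
       ClassSomethingElse        = 102 };

static int  isIntegerObject(oop_t oop) { return (int)(oop & 1); }
static long fetchClassOf(oop_t oop)    { return oop >> 8; }

/* The three cheap cases, instead of a generic isKindOf: walk over the
   class hierarchy: immediate SmallInteger, LargeNegativeInteger, or
   LargePositiveInteger. */
static int isAnyInteger(oop_t oop)
{
    long cls;
    if (isIntegerObject(oop))
        return 1;
    cls = fetchClassOf(oop);
    return cls == ClassLargeNegativeInteger
        || cls == ClassLargePositiveInteger;
}
```

The check is a couple of compares instead of a hierarchy walk, which is exactly the kind of per-call overhead that could account for the benchmark gap.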
2016-04-16 19:37 GMT+02:00 Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com>:
Hi, the primitive does nothing very special
- check for SmallInteger cases first, for quick return
- check for LargeIntegers length then if both receiver & argument are
Large 3) check for LargeIntegers digits then if both have same length
None of these 3 steps is expected to be slow.
A bleeding age VM does the 3rd step using 32 bits limbs instead of 8bits but this does not change a thing about performances ratio, it could only make a difference for giant integers, these ones fit on 56 bits... I got these with 32bits Cog Sput MacOSX :
[ 7432154326465436 digitCompare: 8432154326465436 ] bench. '8,980,000 per second. 111 nanoseconds per run.'
| order | order := (0 to: 255) as: ByteArray. [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench. '16,400,000 per second. 60.8 nanoseconds per run.'
Obviously, most time is spent elsewhere, it would be interesting to know where exactly.
Nicolas
2016-04-16 18:52 GMT+02:00 David T. Lewis lewis@mail.msen.com:
It does seem odd that it would be different on Spur. You suggested that compiler optimization might be set differently for the two plugins. I cannot check (I'm on an Ubuntu laptop, cannot build Spur due to some kind of autotools problem that I can't figure out right now), but if you are able to do a VM compile and save the console output to a file, then I think you can look at the CFLAGS setting for the two plugins ($ nohup ./mvm, but first comment out the part of mvm that asks if you want to clean).
Aside from that, the actual generated code for primDigitCompare is doing quite a lot of stuff before it ever gets around to comparing the digits. I suspect it is not a very efficient primitive.
Dave
On Fri, Apr 15, 2016 at 05:52:02PM +0200, Levente Uzonyi wrote:
This thread got derailed, and I feel like we should go back to its main point, which was that #digitCompare: using #primitiveDigitCompare
is
slow. It's so slow, that #primitiveCompareString is about twice as quick in Spur VMs, which is odd.
Levente
On Thu, 14 Apr 2016, Levente Uzonyi wrote:
Hi Dave,
I dug a bit deeper into this problem, and I found that it's been around for ages. I compared the two primitives in three older images. The first one is Squeak 4.2 running on Cog r2714:
[ 7432154326465436 digitCompare: 8432154326465436 ] bench. '6,880,000 per second.'
| order | order := (0 to: 255) as: ByteArray. [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench. '11,200,000 per second.'
The next one was an even older Cog VM running an image updated from 3.10.2. The VM didn't have the primitives required to fetch the VM information. The results were '8.22459348130374e6 per second.' and '1.19677724910036e7 per second.', respectively.
The last image was a closure-enabled 3.10.2 running on the classic Interpreter VM. The results were '3.911442911417717e6 per second.' and '4.84567866426715e6 per second.', respectively.
Since in all VMs the seemingly more complex code (primitiveCompareString with a subtraction) was quicker than the simpler code (primitiveDigitCompare), I suspect that LargeIntegersPlugin is compiled with less aggressive optimization than MiscPrimitivePlugin.
What's also interesting is that, based on these benchmarks, the performance got worse over the years. I think invoking a primitive has its cost in newer VMs.
Levente
On Thu, 14 Apr 2016, David T. Lewis wrote:
Oops, I just realized that my "interpreter VM" results were from a debugging VM with compiler optimization off (too many VMs, sorry). Here are the results for a more representative interpreter VM. I can't explain the variation, but one conclusion is clear: my suggested "optimization" in the inbox is a bad idea (and I will move it to the treated inbox).
"Base performance in the trunk level V3 image with interpreter VM" Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==>
26669"
Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]]. "==>
25052"
Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==>
25275"
"Performance after adding LargePositiveInteger>>=" Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==>
59224"
Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]]. "==>
27824"
Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==>
44324"
Dave
On Wed, Apr 13, 2016 at 08:25:33PM -0400, David T. Lewis wrote:
Hi Levente,
I think you may be right. I repeated my test with an interpreter VM on a 32-bit image (with loop count smaller because the interpreter VM is slower than Spur). The change that I put in the inbox does not have any benefit for the interpreter VM:
a := 7432154326465436. "a big number"
b := a + 1. "low order digit changed"
c := 8432154326465436. "high order digit changed"
d := 7432154026465436. "a digit in the middle changed"

"Base performance in the trunk level V3 image with interpreter VM"
Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3844"
Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3786"
Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3800"

"Performance after adding LargePositiveInteger>>="
Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3868"
Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3775"
Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3770"
So yes, it is something related to the VM. But I do not understand how #primDigitCompare could be so slow. As you say, maybe something is wrong with it.
Thank you, I am glad that I asked for a review :-)
Dave
On Thu, Apr 14, 2016 at 01:45:42AM +0200, Levente Uzonyi wrote:
>Hi Dave,
>
>I guess this is a VM related issue. On 64-bit Spur
>[ 7432154326465436 digitCompare: 8432154326465436 ] bench.
>returns '5,420,000 per second. 184 nanoseconds per run.'.
>
>Removing the primitive call from #digitCompare: the number goes up:
>'20,200,000 per second. 49.4 nanoseconds per run.'.
>
>You might think that it's okay because the JIT is so good. But we have
>another primitive to compare two bytes-objects. One which ought to be
>slower than #primDigitCompare:, because it maps the bytes before doing
>the comparison. I even subtracted two from its result to match the
>result of #digitCompare:
>
>| order |
>order := (0 to: 255) as: ByteArray.
>[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated:
>order) - 2 ] bench.
>
>But it's still about twice as quick as #primDigitCompare:.
>'9,590,000 per second. 104 nanoseconds per run.'.
>So, something must be wrong with #primDigitCompare:.
>
>Levente
>
>On Wed, 13 Apr 2016, David T. Lewis wrote:
>
>>I would appreciate a review before moving this to trunk.
>>
>>Background:
>>
>>In UTCDateAndTime, most things are much faster than the trunk version.
>>However, the DateAndTime equality check did not improve for 32-bit images,
>>so I ran it under AndreasSystemProfiler. Profiling showed that large
>>integer equality checks spend time mostly in primDigitCompare, which
>>is inefficient when only a simple byte comparison is needed.
>>
>>Here is the performance difference that I see on my system, 32-bit trunk
>>Spur on Linux:
>>
>>| a b c d |
>>a := 7432154326465436. "a big number"
>>b := a + 1. "low order digit changed"
>>c := 8432154326465436. "high order digit changed"
>>d := 7432154026465436. "a digit in the middle changed"
>>
>>"Base performance in the trunk image"
>>Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 63733"
>>Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 63152"
>>Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 63581"
>>
>>"Performance after adding LargePositiveInteger>>="
>>Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 4676"
>>Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 4883"
>>Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 4512"
>>
>>Dave
>>
>>On Wed, Apr 13, 2016 at 10:57:28PM +0000, commits@source.squeak.org wrote:
>>>David T. Lewis uploaded a new version of Kernel to project The Inbox:
>>>http://source.squeak.org/inbox/Kernel-dtl.1015.mcz
>>>
>>>==================== Summary ====================
>>>
>>>Name: Kernel-dtl.1015
>>>Author: dtl
>>>Time: 13 April 2016, 6:57:22.56608 pm
>>>UUID: bd849f91-9b00-45c5-b2ab-891b420bde5e
>>>Ancestors: Kernel-mt.1014
>>>
>>>Make large integer equality test be about 13 times faster. Implement #=
>>>in LargePositiveInteger, and use digitAt: (primitive 60) for the
>>>comparison.
>>>
>>>=============== Diff against Kernel-mt.1014 ===============
>>>
>>>Item was added:
>>>+ ----- Method: LargePositiveInteger>>= (in category 'comparing') -----
>>>+ = aNumber
>>>+ 	aNumber class == self class ifTrue: [
>>>+ 		aNumber size = self size ifFalse: [ ^false ].
>>>+ 		self size to: 1 by: -1 do: [ :i |
>>>+ 			(aNumber digitAt: i) = (self digitAt: i) ifFalse: [ ^ false ] ].
>>>+ 		^ true ].
>>>+ 	aNumber isInteger ifTrue: [ ^false ].
>>>+ 	aNumber isNumber ifFalse: [ ^false ].
>>>+ 	^aNumber adaptToInteger: self andCompare: #=!
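[Editorial note: for readers following along, the digit loop in the quoted LargePositiveInteger>>= can be modeled in plain C. This is a hypothetical stand-alone sketch over byte arrays; the LargeInt struct and its field names are illustrative, not the VM's actual object representation.]

```c
#include <stddef.h>

/* Sketch: a large positive integer modeled as a little-endian array of
   base-256 digits, mirroring the proposed LargePositiveInteger>>= above. */
typedef struct {
    const unsigned char *digits;  /* digits[0] is the least significant byte */
    size_t size;                  /* number of digits */
} LargeInt;

/* Equality: sizes must match, then compare digits from the most
   significant end down, short-circuiting on the first mismatch. */
int largeIntEquals(const LargeInt *a, const LargeInt *b)
{
    if (a->size != b->size)
        return 0;
    for (size_t i = a->size; i > 0; i--)
        if (a->digits[i - 1] != b->digits[i - 1])
            return 0;
    return 1;
}
```

Scanning from the most significant digit down lets unequal numbers fail fast, which is what the b, c and d benchmark cases (low, high and middle digit changed) probe.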
Bingo, here are the results without isKindOf checks:
[ 7432154326465436 digitCompare: 8432154326465436 ] bench. '16,100,000 per second. 62.1 nanoseconds per run.'
that is more or less the speed of ByteArray comparison.
I think I will make a new pass on LargeIntegersPlugin :)
2016-04-16 22:54 GMT+02:00 Nicolas Cellier <nicolas.cellier.aka.nice@gmail.com>:
Ah but maybe the "Smart" side of the plugin is causing trouble:
firstInteger := self primitive: 'primDigitCompare' parameters: #(#Integer ) receiver: #Integer.
translates as:
success(isKindOf(stackValue(0), "Integer"));
secondInteger = stackValue(0);
/* missing DebugCode */;
success(isKindOf(stackValue(1), "Integer"));
firstInteger = stackValue(1);
It might be faster to just check the 3 cases:
(interpreterProxy isIntegerObject: oop) or:
	[oopClass := interpreterProxy fetchClassOf: oop.
	oopClass == interpreterProxy classLargeNegativeInteger
		or: [oopClass == interpreterProxy classLargePositiveInteger]].
Moreover, we already test isIntegerObject further in the primitive code. Since every LargeIntegersPlugin primitive goes through this isKindOf check, there might be some low-hanging fruit.
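[Editorial note: as a sketch of why the three-case test should beat a generic isKindOf: in both the V3 and Spur object formats a SmallInteger is an immediate oop marked by a low tag bit, so the common case reduces to one AND plus at most two pointer comparisons. The following C model is illustrative only; the tag layout, ObjectHeader struct and names are assumptions, not the VM's definitions.]

```c
#include <stdint.h>

typedef uintptr_t oop;

/* Hypothetical tag scheme: immediate SmallIntegers carry a 1 in the low
   bit, so the check is a single AND, with no function call needed. */
static inline int isIntegerObject(oop o)
{
    return (o & 1) != 0;
}

/* Heap objects modeled as a header holding just a class reference. */
typedef struct { const void *classOop; } ObjectHeader;

/* The three-case test Nicolas proposes: SmallInteger, or an object whose
   class is one of the two LargeInteger classes. */
int isAnyInteger(oop o, const void *classLargeNeg, const void *classLargePos)
{
    if (isIntegerObject(o))
        return 1;
    const void *cls = ((const ObjectHeader *)o)->classOop;
    return cls == classLargeNeg || cls == classLargePos;
}
```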
2016-04-16 19:37 GMT+02:00 Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com>:
Hi, the primitive does nothing very special:
1) check for SmallInteger cases first, for quick return
2) check for LargeIntegers length then, if both receiver & argument are Large
3) check for LargeIntegers digits then, if both have same length
None of these 3 steps is expected to be slow.
A bleeding edge VM does the 3rd step using 32-bit limbs instead of 8-bit ones, but this does not change a thing about the performance ratio; it could only make a difference for giant integers, and these ones fit in 56 bits... I got these with 32-bit Cog Spur on Mac OS X:
[ 7432154326465436 digitCompare: 8432154326465436 ] bench. '8,980,000 per second. 111 nanoseconds per run.'
| order | order := (0 to: 255) as: ByteArray. [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench. '16,400,000 per second. 60.8 nanoseconds per run.'
Obviously, most time is spent elsewhere, it would be interesting to know where exactly.
Nicolas
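[Editorial note: steps 2) and 3) above amount to only a few instructions per digit. A hypothetical stand-alone model in C, assuming little-endian base-256 digit arrays with no leading zero digits and the 1/0/-1 result convention of #digitCompare::]

```c
#include <stddef.h>

/* Sketch of steps 2 and 3: compare magnitudes stored as little-endian
   base-256 digit arrays. Answers 1, 0 or -1, like #digitCompare:. */
int digitCompare(const unsigned char *a, size_t lenA,
                 const unsigned char *b, size_t lenB)
{
    /* Step 2: a longer digit string means a larger magnitude,
       assuming no leading zero digits. */
    if (lenA != lenB)
        return lenA > lenB ? 1 : -1;
    /* Step 3: same length, so compare digits from the most
       significant end down. */
    for (size_t i = lenA; i > 0; i--)
        if (a[i - 1] != b[i - 1])
            return a[i - 1] > b[i - 1] ? 1 : -1;
    return 0;
}
```

The loop body is a couple of loads and a compare, which is why the overhead before the loop (argument validation, plugin dispatch) dominates for the 7-byte integers benchmarked in this thread.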
Hi Nicolas,
IMO we need a different API for checking classes, something that, via macros, compiles to direct class access for internal plugins.
I was also going to suggest using the VM Profiler:
"The VMProfiler will tell you exactly where: http://source.squeak.org/VMMaker/CogTools then screen menu open... VM Profiler then put an expression in the bottom right hand box, and hit Profile, then in the left=hand list of function names, select VM report"
But the UnixOSProcessPlugin as compiled by my mac build scripts and run under Spur on Mac OS X does not appear to load. The VM Profiler depends on OS Process to extract addresses from shared libraries. I'm looking at it now :-(
2016-04-16 23:54 GMT+02:00 Eliot Miranda eliot.miranda@gmail.com:
Great, one more tool to look at :)
-- _,,,^..^,,,_ best, Eliot
I just played around a bit with the code generator, and I think there are other reasons why the code is slow:
- the use of function calls instead of macros for isIntegerObject, slotSizeOf, success, failed, stackValue. I suspect that the VM internally uses macros for these in numbered primitives.
- #cDigitCompare:with:len: doesn't get inlined, even though it's a tail call. That's probably a limitation of the Slang inliner, or a bug.
Some functions are generated as static, even though they are never called, because they have been inlined during the code generation. E.g.: #digitCompareLarge:with:. While this doesn't have a direct effect on execution time, it may affect the CPU cache. I think the best would be to not generate any C code from such methods.
Levente
On Sun, 17 Apr 2016, Levente Uzonyi wrote:
I just played around a bit with the code generator and I think there are other reasons why the code is slow:
- the use of function calls instead of macros for isIntegerObject, slotSizeOf, success, failed, stackValue. I suspect that the VM internally uses macros for these in numbered primitives.
Perhaps gcc can automatically inline these with the -flto switch.
Levente
2016-04-17 0:36 GMT+02:00 Levente Uzonyi leves@caesar.elte.hu:
I just played around a bit with the code generator and I think there are other reasons why the code is slow:
- the use of function calls instead of macros for isIntegerObject, slotSizeOf, success, failed, stackValue. I suspect that the VM internally uses macros for these in numbered primitives.
Hi Levente,
But currently we generate a single source code for plugins, whatever the VM flavour, so the macro, if any, should be defined in the VM-specific repository... That sounds a bit like what Eliot was suggesting for fetching the class, though (if I understood correctly).
My opinion is that suppressing function call overhead will make little difference since these calls are pre-conditions sent once or twice per primitive.
- #cDigitCompare:with:len: doesn't get inlined, even though it's a tail call. That's probably a limitation of the Slang inliner, or a bug.
I'm working on Slang type inference and inlining (they are closely bound), but I don't think it really matters for this specific call. More so for unsafeDigitAt & co which might be used in tight loops.
Some functions are generated as static, even though they are never called, because they have been inlined during the code generation. E.g.: #digitCompareLarge:with:. While this doesn't have a direct effect on execution time, it may affect the CPU cache. I think the best would be to not generate any C code from such methods.
I wonder if I did not see code for eliminating these... I'll check again.
Levente
Hi Nicolas,
It's been a while since I last optimized C programs, but I'm pretty sure function calls cost a lot compared to a few direct instructions (e.g. isIntegerObject).
Levente
AFAICT, cDigitCompare:with:len: is inlined by clang. But the interpreterProxy messages are not inlined, which is not really surprising, since they live in a different compilation unit. The procedure I used for checking was:
vim ../common/Makefile.flags   (add option -S to CFLAGS)
touch ../../src/plugins/LargeIntegers/LargeIntegers.c
mvm -f less build/LargeIntegers/LargeIntegers.o
And you are certainly right: compared to a few bit ops, the function call/return/stack handling is expensive. So inlining should make a measurable difference for "small" large integers. I was biased toward giant integers, which are more what I'm after (tight loops).
Nicolas
2016-04-17 22:43 GMT+02:00 Nicolas Cellier <nicolas.cellier.aka.nice@gmail.com>:
AFAICT, cDigitCompare:with:len: is inlined by clang. But the interpreterProxy messages are not inlined, but it's not really amazing, these are different units of compilation.
Ah, and this is the purpose of -flto, link-time-optimization, sorry for being slow myself ;)
2016-04-17 22:51 GMT+02:00 Nicolas Cellier <nicolas.cellier.aka.nice@gmail.com>:
2016-04-17 22:43 GMT+02:00 Nicolas Cellier <nicolas.cellier.aka.nice@gmail.com>:
AFAICT, cDigitCompare:with:len: is inlined by clang. But the interpreterProxy messages are not inlined, but it's not really amazing, these are different units of compilation.
Ah, and this is the purpose of -flto, link-time-optimization, sorry for being slow myself ;)
Hmm, I'm slow, but is -flto stable? If I add it to CFLAGS and to LDFLAGS then I cannot compile Spur on Mac, and I cannot decode the error message easily:
0  0x108ee297e  __assert_rtn + 144
1  0x108f709d8  ld::tool::HeaderAndLoadCommandsAtom<x86>::sectionFlags(ld::Internal::FinalSection*) const + 782
2  0x108f703a5  ld::tool::HeaderAndLoadCommandsAtom<x86>::copySegmentLoadCommands(unsigned char*) const + 955
3  0x108f6f58c  ld::tool::HeaderAndLoadCommandsAtom<x86>::copyRawContent(unsigned char*) const + 146
4  0x108f46e0f  ld::tool::OutputFile::writeAtoms(ld::Internal&, unsigned char*) + 465
5  0x108f3fdda  ld::tool::OutputFile::writeOutputFile(ld::Internal&) + 822
6  0x108f39e94  ld::tool::OutputFile::write(ld::Internal&) + 178
7  0x108ee38fa  main + 1311
A linker snapshot was created at: /tmp/Squeak-2016-03-17-230136.ld-snapshot
ld: Assertion failed: (0 && "typeTempLTO should not make it to final linked image"), function sectionFlags, file /Library/Caches/com.apple.xbs/Sources/ld64/ld64-264.3.101/src/ld/HeaderAndLoadCommands.hpp, line 780.
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [build/vm/Squeak] Error 1
You have to add -flto to both CFLAGS and LDFLAGS.
I tried to compile it on Ubuntu 14.04, but there's some problem with autoconf: the -flto flag probably optimizes something away, so the script that detects libdl fails. Instead it sets the HAVE_DYLD flag, which is Mac-only, and sqUnixExternalPrims.c won't compile.
Levente
On Mon, 18 Apr 2016, Levente Uzonyi wrote:
You have to add -flto to both CFLAGS and LDFLAGS.
Nevermind. You did.
Levente
With a bit of manual tweaking I managed to build a vm using the -flto flag (from the 3666 sources). gcc managed to inline stackValue, failed, isIntegerObject, integerObjectOf, but failed to do the same for success, isKindOf, popThenPush, integerValueOf and digitCompareLargewith.
The overall speedup was ~10% for SmallIntegers and ~4-5% for equal 64-bit LargeIntegers, and ~13% for unequal ones (no bytes match).
I added -flto to CFLAGS and -Wl,-O1 -Wl,-flto to LDFLAGS.
Levente
Hi Levente,
On Mon, Apr 18, 2016 at 8:39 AM, Levente Uzonyi leves@caesar.elte.hu wrote:
With a bit of manual tweaking I managed to build a vm using the -flto flag (from the 3666 sources). gcc managed to inline stackValue, failed, isIntegerObject, integerObjectOf, but failed to do the same for success, isKindOf, popThenPush, integerValueOf and digitCompareLargewith.
The overall speedup was ~10% for SmallIntegers and ~4-5% for equal 64-bit LargeIntegers, and ~13% for unequal ones (no bytes match).
I added -flto to CFLAGS and -Wl,-O1 -Wl,-flto to LDFLAGS.
Excuse me for being lazy and asking you rather than looking, but is there a way to specify the set of functions considered? It would be nice to apply link-time optimisation only to the plain primitive API. For debugging I'd rather /not/ apply link-time optimisation to the interface between the CoInterpreter and the Cogit. But the plugin API is performance-critical enough that I'll put up with reduced debuggability for that part of the system.
Levente
Hi Eliot,
Unfortunately that's not easily doable, but files compiled without the -flto flag will not be subject to link-time optimization. So, one way to achieve what you want is to compile those methods in a separate file with the -flto flag, and the rest in another file without the flag. This applies to every function you want to be subject to link-time optimization, be it the function you want inlined into another or the one you want another function inlined into. Also, the -flto flag doesn't work well with -g, so it shouldn't be a problem during debugging, if that was what you meant by debugging.
You can find the details here: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-flto-882
Levente
vm-dev@lists.squeakfoundation.org