[squeakdev] Integer arithmetic and bit operations in Squeak and Pharo (32bit & 64bit)
Eliot Miranda
eliot.miranda at gmail.com
Mon Jan 9 02:24:32 UTC 2017
...and by the way, I got 3.4e9 on 32bit (down from 2.7e9) when I
launched it after the 64bit system and ran the benchmark while the 64bit
system was still running. So run on as quiet a machine as possible.
On Sun, Jan 8, 2017 at 4:07 PM, Eliot Miranda <eliot.miranda at gmail.com>
wrote:
> Hi Benoit,
>
> On Fri, Jan 6, 2017 at 11:59 PM, Benoit StJean <bstjean at yahoo.com> wrote:
>
>> Hello guys,
>>
>> A few questions/comments/remarks about integer arithmetic and bit
>> operations. And I guess many of my interrogations most likely concern the
>> guys building the VM/primitives...
>>
>> I'm working on a chess engine that will rely heavily on bitboard
>> manipulations, i.e. lots of integer arithmetic with bit operations. I was
>> comparing the 32bit & 64bit Spur VM for Squeak & Pharo (I want the chess
>> engine to be portable on Squeak & Pharo). As I pointed out earlier on on
>> the Squeakdev mailing list in october 2016, there are a few things that
>> look strange to me and/or that I do not understand.
>>
>> 1) #bitXor: performs the same on the 32bit & 64bit VM for
>> LargePositiveInteger. BUT, on average, #bitXor: is 15 (fifteen) times
>> *slower* on LargePositiveInteger than with SmallInteger, for both 32 & 64
>> bit VM/Image. Besides, with SmallInteger, #bitXor: is 4.5 faster on 32bit
>> as compared to 64bit!!
>>
>> You can test (on 32bit & 64bit) that with the following snippet:
>>
>>  small1 small2 small3 large1 large2 n timeSmall timeLarge 
>> small1 := 2r1111111111111111. "65535, (1<<16)1"
>> small2 := (1 << 30)  1. "Largest SmallInteger on 32bit VM/Image,
>> SmallInteger maxVal"
>> small3 := 1152921504606846975. "Largest Integer on 64bit VM/Image,
>> SmallInteger maxVal"
>> large1 := (1 << 64)1. "64 bits set to 1"
>> large2 := (1 << 80)1. "80 bits set to 1"
>>
>> n := 100000000.
>> timeSmall := Time millisecondsToRun: [n timesRepeat: [ small1 bitXor:
>> small2]].
>> timeLarge := Time millisecondsToRun: [n timesRepeat: [ large1 bitXor:
>> large2]].
>> Transcript cr;show: 'Time LargePositiveInteger : ', timeLarge printString.
>> Transcript cr;show: 'Time SmallInteger : ', timeSmall printString.
>>
>
> It's really important when running microbenchmarks like this to minimize
> or eliminate other costs. So you should unwind your inner loops, to at
> least to ten repetitions or more. And it's wise to measure the time taken
> for an empty loop. In these cases I'd consider comparing a loop containing
> 10 instances of the operation against one containing one, e.g. for a time
> in seconds you could use
>
>
>  small1 small2 n 
> n := 1000000.
> small1 := SmallInteger minVal.
> small2 := SmallInteger maxVal.
> (Time millisecondsToRun: [1 to: n do: [:i small1 bitXor: small2. small1
> bitXor: small2. small1 bitXor: small2. small1 bitXor: small2. small1
> bitXor: small2. small1 bitXor: small2. small1 bitXor: small2. small1
> bitXor: small2. small1 bitXor: small2. small1 bitXor: small2. small1
> bitXor: small2]])  (Time millisecondsToRun: [1 to: n do: [:i small1
> bitXor: small2]]) / 10000.0 / n
>
> When I do this I get 2.4e9 for 64bit Squeak and 2.7e9 for 32bit
> Squeak. Note that I used 1 to: n do: [:ignored rather than timesRepeat:
> because I know its inlined and timesRepeat: isn't always. But in any case
> I don't see the large difference in SmallInteger operation times that you
> do. Could you repeat your measurements taking something similar to my
> approach?
>
>
>>
>> As was pointed out by Nicolas Cellier in a private communication, one
>> workaround is to add the method #bitXor: in the class LargePositiveInteger
>> to get a 3x performance boost. Nicolas recommended the following code:
>>
>> LargePositiveInteger>>bitXor: arg
>> <primitive:36>
>> ^super bitXor: arg
>>
>> It's all good and nice (and I do get a 3x speedup, in 32&64 bit) but then
>> I wonder in which respect primitive 36 is different from the one
>> LargePositiveInteger usually inherits from Integer :
>>
>> (Integer>>bitXor:)
>> <primitive: 'primDigitBitXor' module:'LargeIntegers'>
>>
>> Why does Nicolas' workaround is able to do the job (calling primitive 36)
>> when there is already a primitive for the exact thing (<primitive:
>> 'primDigitBitXor' module:'LargeIntegers'>) ? Is this duplicate code or
>> there is a reason for this? And why is the primDigitBitXor primitive so
>> slow as compared to primitive 36?
>>
>
> Primitive 36 deals with only 64bit values (up to 8 byte LargeIntegers). <primitive:
> 'primDigitBitXor' module:'LargeIntegers'> deals with arbitrary sized large
> integers.
>
>>
>> 2) One would normally expect the 64bit VM to be faster than the 32bit
>> VM. That's true in all cases except... Well, first, some numbers (Test
>> case attached to this email if you want to reproduce)
>>
>
> It depends on the computation. One should expect the 64bit VM to be
> faster for certain kinds of arithmetic because it can do more in one
> operation, e.g. 60bit arithmetic should be much faster in 64bits than in
> 32bits, most floating point should be much faster because of the immediate
> floatingpoint representation. But consider a purely symbolic computation,
> say tree walking. The 64bit version has to move twice as much data as the
> 32bit version, so if the working set is large but amenable to execution in
> 32bits one would expect the 32bit application to run faster because it
> accesses half the data.
>
>
>
>>
>> 32bit
>> Number of #allMask: per second : 6.20M
>> Number of #anyMask: per second : 7.17M
>> Number of #bitAnd: per second : 8.45M
>> Number of #bitAt: per second : 55.15M
>> Number of #bitAt:put: per second : 37.22M
>> Number of #bitClear: per second : 5.18M
>> Number of #bitInvert per second : 6.53M
>> Number of #bitOr: per second : 9.18M
>> Number of #bitXor: per second : 8.97M
>> Number of #highBit per second : 43.23M
>> Number of #<< per second : 11.34M
>> Number of #lowBit per second : 69.44M
>> Number of #noMask: per second : 7.40M
>> Number of #>> per second : 12.36M
>>
>> 64bit
>> Number of #allMask: per second : 10.26M
>> Number of #anyMask: per second : 10.37M
>> Number of #bitAnd: per second : 17.00M
>> Number of #bitAt: per second : 15.89M
>> Number of #bitAt:put: per second : 10.36M
>> Number of #bitClear: per second : 6.44M
>> Number of #bitInvert per second : 9.11M
>> Number of #bitOr: per second : 18.45M
>> Number of #bitXor: per second : 15.38M
>> Number of #highBit per second : 7.66M
>> Number of #<< per second : 10.50M
>> Number of #lowBit per second : 13.49M
>> Number of #noMask: per second : 11.47M
>> Number of #>> per second : 10.52M
>>
>> There are a few surprises here. First, #bitAt:, #bitAt:put:, #highBit
>> and #lowBit are a *lot faster* on the 32bit VM than on the 64bit VM! Does
>> anyone have an explanation for this? Is this normal?
>>
>
> Can you rerun the numbers using my approach before we attempt to answer
> this?
>
>
>>
>> The other surprise is that #>> and #<< seem to be about the same speed on
>> both VM. Any reason why the 64bit version isn't faster than the 32bit one?
>>
>> 3) I was surprised to find out that SmallInteger>>#maxVal wasn't a
>> constant (or the same on both VM). It's (1<<60)1 on the 64bit VM and
>> (1<<30)1 on the 32bit VM. So, be warned if your code depends on
>> SmallInteger>>#maxVal for some reason!
>>
>> 4) Is there any rationale/reason/explanation as to why, in Pharo,
>> LargePositiveInteger inherits from the class LargeInteger (which doesn't
>> exist in Squeak) ?
>>
>> Thanks in advance.
>>
>>
>> 
>> BenoĆ®t StJean
>> Yahoo! Messenger: bstjean
>> Twitter: @BenLeChialeux
>> Pinterest: benoitstjean
>> Instagram: Chef_Benito
>> IRC: lamneth
>> Blogue: endormitoire.wordpress.com
>> "A standpoint is an intellectual horizon of radius zero". (A. Einstein)
>>
>>
>>
>>
>
>
> 
> _,,,^..^,,,_
> best, Eliot
>

_,,,^..^,,,_
best, Eliot
 next part 
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/squeakdev/attachments/20170108/5dc2b0b7/attachment.html>
More information about the Squeakdev
mailing list
