[squeak-dev] ByteArray accessors for 64-bit manipulation

Eliot Miranda eliot.miranda at gmail.com
Wed Sep 9 23:25:38 UTC 2015


Hi Levente,

Sent from my iPhone

> On Sep 9, 2015, at 2:11 PM, Levente Uzonyi <leves at elte.hu> wrote:
> 
> After thinking a bit more about this, I came to the conclusion that while I paid attention to handle reader methods in 64-bit images correctly, some of the writer methods are incorrect for "large" SmallInteger inputs.
> First I implemented the missing parts for 61-bit SmallIntegers using a general purpose loop, but it turned out that the loop is significantly faster for smaller numbers, and only slightly slower for larger ones, so I decided to nuke the special case for 30-bit SmallIntegers.
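> 
> A rough sketch of the kind of general-purpose write loop described above (the selector and all details here are illustrative assumptions, not the actual trunk code): store the value one byte at a time with shifts and masks, which works uniformly for 61-bit SmallIntegers and LargeIntegers alike.
> 
> ```smalltalk
> "Hypothetical little-endian 64-bit write loop; illustrative only.
> Each iteration stores the low byte, then shifts the value right by 8."
> ByteArray >> exampleUint64At: index put: anInteger
> 	| value |
> 	value := anInteger.
> 	0 to: 7 do: [ :i |
> 		self at: index + i put: (value bitAnd: 16rFF).
> 		value := value bitShift: -8 ].
> 	^anInteger
> ```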
> I also found further possibilities to optimize writes, so the average is down from 259 to 236 milliseconds. Chris Muller's benchmark gives
> 
> {
> 'cbc smallN write'->'12,200,000 per second. 81.7 nanoseconds per run.' .
> 'cbc smallN access'->'10,300,000 per second. 96.8 nanoseconds per run.' .
> 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.' .
> 'cbc largeN access'->'5,270,000 per second. 190 nanoseconds per run.'}
> 
> That means +20% speed for smallN writes, and +18% for largeN access.
> 
> Levente
> 
> P.S.: I still couldn't test the code in my 64-bit Spur image, because it won't respond to any input after startup.

IIRC that's due to a bug in earlier versions of the bootstrap in converting negative integers, or perhaps larger positive SmallIntegers.  Anyway I *think* that the latest Spur 64-bit VM and image combination works properly now.  Apologies.

> 
>> On Tue, 8 Sep 2015, Levente Uzonyi wrote:
>> 
>> Hi Chris,
>> 
>> I added the #normalize sends to avoid the creation of spurious LargeIntegers in 64-bit images (there are two places where I relied on #-'s ability to work on unnormalized input). I didn't have a chance to test it, but I expect it to work correctly. Even if the code is sub-optimal in 64-bit, it shouldn't be any slower than in 32-bit.
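>> 
>> A hedged illustration of why the #normalize sends matter (assumed mechanics, based on how LargeInteger digit primitives generally behave): a result assembled digit-by-digit can remain a LargePositiveInteger even when its value fits in a 61-bit SmallInteger, and #normalize returns the canonical representation.
>> 
>> ```smalltalk
>> "Sketch only: build an 8-digit LargePositiveInteger from bytes (byteArray
>> is assumed to hold at least 8 bytes), then normalize so values in
>> SmallInteger range come back as SmallIntegers."
>> | raw |
>> raw := LargePositiveInteger new: 8.
>> 1 to: 8 do: [ :i | raw digitAt: i put: (byteArray at: i) ].
>> ^raw normalize
>> ```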
>> 
>> Levente
>> 
>>> On Tue, 8 Sep 2015, Chris Cunningham wrote:
>>> 
>>> Levente,
>>> Interesting.  I have a question and a concern about your implementation, though:
>>> Question: why, in the micro checks, is the Write faster than the Access:
>>> {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
>>> 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
>>> 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
>>> 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
>>> yet in your more thorough benchmark, the Write is twice as slow as the Access?
>>> "unsigned long 64"
>>> (put, or Write) "average asFloat 536.109375".
>>> (Access) "average asFloat 259.359375"
>>> any ideas?
>>> the concern is that your code is nicely optimized for our current 32-bit VM - but once we go to 64-bit, I think it will fail.  Should we be
>>> concerned?
>>> -cbc
>>> On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi <leves at elte.hu> wrote:
>>>      Hi All,
>>> 
>>>      A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the
>>>      "fastest implementation".
>>>      I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following:
>>> 
>>>      Before:
>>>      {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'.
>>>      'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'.
>>>      'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'.
>>>      'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}.
>>> 
>>>      After:
>>>      {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
>>>      'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
>>>      'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
>>>      'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
>>> 
>>>      As you can see, everything became faster except for smallN access. This is the side-effect of optimizing for the average case instead
>>>      of specific cases - like zero bytes. I decided not to use that trick, because it decreased the overall performance.
>>> 
>>>      I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given
>>>      number of bits. The result is an array of run times where values having an odd index belong to big-endian access, and even ones to
>>>      little-endian.
>>> 
>>>      | byteArray inputs random storageBits unsigned |
>>>      Smalltalk garbageCollect.
>>>      random := Random seed: 36rSqueak.
>>>      storageBits := 64.
>>>      unsigned := true.
>>>      byteArray := ByteArray new: storageBits // 8 * 2.
>>>      inputs := Array new: 100000.
>>>      (2 to: storageBits * 2 + 1) collect: [ :descriptor |
>>>              "lowest bit describes endianness, the rest the number of bits."
>>>              | limit bigEndian offset |
>>>              bigEndian := descriptor odd.
>>>              limit := 1 << (descriptor >> 1) - 1.
>>>              unsigned
>>>                      ifTrue: [ offset := -1 ]
>>>                      ifFalse: [ offset := -1 - (limit >> 1) ].
>>>              inputs replace: [ :each | (random nextInt: limit) + offset ].
>>>              [ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex |
>>>                      1 to: inputs size do: [ :inputIndex |
>>>                              byteArray
>>>                                      unsignedLong64At: startIndex
>>>                                              put: (inputs at: inputIndex)
>>>                                              bigEndian: bigEndian;
>>>                                      unsignedLong64At: startIndex
>>>                                              bigEndian: bigEndian ] ] ] timeToRun ].
>>> 
>>>      I ran it with various accessors and got the following results:
>>> 
>>>      "short"
>>>      #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625".
>>>      #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".
>>> 
>>>      "long"
>>>      #(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74
>>>      72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125".
>>>      #(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74
>>>      74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625".
>>> 
>>>      "unsigned short"
>>>      #(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5".
>>>      #(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125".
>>> 
>>>      "unsigned long"
>>>      #(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58
>>>      60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625".
>>>       #(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64
>>>      64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375".
>>> 
>>>      "unsigned long 64"
>>>      #(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318
>>>      316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618
>>>      636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812
>>>      822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat
>>>      536.109375".
>>>      #(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164
>>>      170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344
>>>      284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304
>>>      388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396). "average asFloat
>>>      259.359375"
>>> 
>>>      Levente
>>> 
>>>      On Sun, 30 Aug 2015, Chris Muller wrote:
>>> 
>>>            Hi Chris, I think these methods belong in the image with the fastest
>>>            implementation we can do.
>>> 
>>>            I implemented 64-bit unsigned access for Ma Serializer back in 2005.
>>>            I modeled my implementation after Andreas' original approach which
>>>            tries to avoid LI arithmetic.  I was curious whether your
>>>            implementations would be faster, because if they are then it could
>>>            benefit Magma.  After loading "Ma Serializer" 1.5 (or head) into a
>>>            trunk image, I used the following script to take comparison
>>>            measurements:
>>> 
>>>            | smallN largeN maBa cbBa |
>>>            smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
>>>            largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
>>>            maBa := ByteArray new: 8.
>>>            cbBa := ByteArray new: 8.
>>>            maBa maUint: 64 at: 0 put: largeN.
>>>            cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
>>>            self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false).
>>>            { 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false ] bench.
>>>            'ma smallN write' -> [ cbBa maUint: 64 at: 0 put: smallN ] bench.
>>>            'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
>>>            'ma smallN access' -> [ cbBa maUnsigned64At: 1 ] bench.
>>>            'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false ] bench.
>>>            'ma largeN write' -> [ cbBa maUint: 64 at: 0 put: largeN ] bench.
>>>            'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
>>>            'ma largeN access' -> [ cbBa maUnsigned64At: 1 ] bench.
>>>            }
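>>> 
>>>            The "avoid LargeInteger arithmetic" idea mentioned above can be sketched
>>>            roughly as follows (hypothetical code, not Andreas' or Ma Serializer's
>>>            actual implementation): split the 64-bit value into two 32-bit halves up
>>>            front, so the per-byte shifts and masks stay in SmallInteger range.
>>> 
>>>            ```smalltalk
>>>            "Illustrative selector; stores n little-endian via two 32-bit halves."
>>>            ByteArray >> exampleUint64At: index putSplit: n
>>>            	| low high |
>>>            	low := n bitAnd: 16rFFFFFFFF.
>>>            	high := n bitShift: -32.
>>>            	0 to: 3 do: [ :i |
>>>            		self at: index + i put: ((low bitShift: -8 * i) bitAnd: 16rFF).
>>>            		self at: index + 4 + i put: ((high bitShift: -8 * i) bitAnd: 16rFF) ]
>>>            ```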
>>> 
>>>            Here are the results:
>>> 
>>>            'cbc smallN write'->'3,110,000 per second.  322 nanoseconds per run.' .
>>>            'ma smallN write'->'4,770,000 per second.  210 nanoseconds per run.' .
>>>            'cbc smallN access'->'4,300,000 per second.  233 nanoseconds per run.' .
>>>            'ma smallN access'->'16,400,000 per second.  60.9 nanoseconds per run.' .
>>>            'cbc largeN write'->'907,000 per second.  1.1 microseconds per run.' .
>>>            'ma largeN write'->'6,620,000 per second.  151 nanoseconds per run.' .
>>>            'cbc largeN access'->'1,900,000 per second.  527 nanoseconds per run.' .
>>>            'ma largeN access'->'1,020,000 per second.  982 nanoseconds per run.'
>>> 
>>>            It looks like your 64-bit access is 86% faster for accessing the
>>>            high-end of the 64-bit range, but slower in the other 3 metrics.
>>>            Noticeably, it was only 14% as fast for writing the high-end of the
>>>            64-bit range, and similarly much slower for small-number access.
>>> 
>>>            On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham
>>>            <cunningham.cb at gmail.com> wrote:
>>>                  Hi.
>>> 
>>>                  I've committed a change to the inbox with changes to allow getting/putting
>>>                  64-bit values to ByteArrays (similar to the 32- and 16-bit accessors).  Could this
>>>                  be added to trunk?
>>> 
>>>                  Also, this was the first time I used the selective commit function - very nice!  The
>>>                  changes I didn't want committed didn't, in fact, get committed.  Just the
>>>                  desirable bits!
>>> 
>>>                  -cbc
> 