[squeak-dev] ByteArray accessors for 64-bit manipulation
Levente Uzonyi
leves at elte.hu
Wed Sep 9 21:11:18 UTC 2015
After thinking a bit more about this, I came to the conclusion that while
I paid attention to handling the reader methods correctly in 64-bit images,
some of the writer methods are incorrect for "large" SmallInteger inputs.
First I implemented the missing parts for 61-bit SmallIntegers using a
general-purpose loop, but it turned out that the loop is significantly
faster for smaller numbers and only slightly slower for larger ones, so I
decided to nuke the special case for 30-bit SmallIntegers.
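As a rough illustration of that approach, a general-purpose byte-at-a-time store loop can be sketched in Python (a sketch only; the actual Trunk code is Smalltalk, and `store_uint` is a hypothetical name):

```python
def store_uint(ba: bytearray, index: int, value: int,
               n_bytes: int, big_endian: bool) -> None:
    """Store an unsigned integer into n_bytes consecutive bytes.

    One loop handles every width and endianness, which is why no
    separate special case per integer width is needed."""
    for i in range(n_bytes):
        # Select which 8-bit chunk of the value lands at position i.
        shift = 8 * (n_bytes - 1 - i) if big_endian else 8 * i
        ba[index + i] = (value >> shift) & 0xFF

ba = bytearray(8)
store_uint(ba, 0, 0x0102030405060708, 8, True)
# big-endian: most significant byte first
```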
I also found further possibilities to optimize writes, so the average is
down from 259 to 236 milliseconds. Chris Muller's benchmark gives
{
'cbc smallN write'->'12,200,000 per second. 81.7 nanoseconds per run.' .
'cbc smallN access'->'10,300,000 per second. 96.8 nanoseconds per run.' .
'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.' .
'cbc largeN access'->'5,270,000 per second. 190 nanoseconds per run.'}
That means +20% speed for smallN writes and +18% for largeN access.
Levente
P.S.: I still couldn't test the code in my 64-bit Spur image, because it
won't respond to any input after startup.
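For readers unfamiliar with the accessors under discussion, their semantics (a 1-based start index plus an explicit endianness flag) can be sketched in Python; this is an illustrative analogue, not the Trunk implementation:

```python
def unsigned_long64_at_put(ba: bytearray, index: int,
                           value: int, big_endian: bool) -> None:
    # Analogue of unsignedLong64At:put:bigEndian: (index is 1-based).
    ba[index - 1:index + 7] = value.to_bytes(8, 'big' if big_endian else 'little')

def unsigned_long64_at(ba: bytearray, index: int, big_endian: bool) -> int:
    # Analogue of unsignedLong64At:bigEndian: (index is 1-based).
    return int.from_bytes(ba[index - 1:index + 7], 'big' if big_endian else 'little')

ba = bytearray(16)
unsigned_long64_at_put(ba, 2, 2 ** 61 - 1, False)  # write at index 2, little-endian
```

A round trip with `unsigned_long64_at` returns the stored value; the Smalltalk writers additionally have to cope with SmallInteger versus LargePositiveInteger representations, which is where the 61-bit cases above come in.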
On Tue, 8 Sep 2015, Levente Uzonyi wrote:
> Hi Chris,
>
> I added the #normalize sends to avoid the creation of spurious LargeIntegers
> in 64-bit images (there are two places where I relied on #-'s ability to work
> on unnormalized input). I didn't have a chance to test it, but I expect it to
> work correctly. Even if the code is sub-optimal in 64-bit, it shouldn't be
> any slower than in 32-bit.
>
> Levente
>
> On Tue, 8 Sep 2015, Chris Cunningham wrote:
>
>> Levente,
>> Interesting. I have a question and a concern about your implementation,
>> though:
>>
>> Question: why, in the micro checks, is the write faster than the access:
>> {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
>> 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
>> 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
>> 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
>> yet in your more thorough benchmark, the write is twice as slow as the access?
>> "unsigned long 64"
>> (put, or Write) "average asFloat 536.109375".
>> (Access) "average asFloat 259.359375"
>> any ideas?
>>
>> The concern is that your code is nicely optimized for our current 32-bit
>> VM - but once we go to 64-bit, I think it will fail. Should we be
>> concerned?
>>
>> -cbc
>>
>> On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi <leves at elte.hu> wrote:
>> Hi All,
>>
>> A bit later than I wanted to, but I've finally uploaded my versions
>> to the Trunk. I guess I went as far as possible with getting the
>> "fastest implementation".
>> I modified your benchmark to use the same numbers, so that the
>> measurements could be repeated. I got the following:
>>
>> Before:
>> {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'.
>> 'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'.
>> 'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'.
>> 'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}.
>>
>> After:
>> {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
>> 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
>> 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
>> 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
>>
>> As you can see, everything became faster except for smallN access.
>> This is the side-effect of optimizing for the average case instead
>> of specific cases - like zero bytes. I decided not to use that trick,
>> because it decreased the overall performance.
>>
>> I also wrote a benchmark which measures reads and writes together. It
>> generates random numbers which can be represented using a given
>> number of bits. The result is an array of run times where values
>> having an odd index belong to big-endian access, and even ones to
>> little-endian.
>>
>> | byteArray inputs random storageBits unsigned |
>> Smalltalk garbageCollect.
>> random := Random seed: 36rSqueak.
>> storageBits := 64.
>> unsigned := true.
>> byteArray := ByteArray new: storageBits // 8 * 2.
>> inputs := Array new: 100000.
>> (2 to: storageBits * 2 + 1) collect: [ :descriptor |
>>     "lowest bit describes endianness, the rest the number of bits."
>>     | limit bigEndian offset |
>>     bigEndian := descriptor odd.
>>     limit := 1 << (descriptor >> 1) - 1.
>>     unsigned
>>         ifTrue: [ offset := -1 ]
>>         ifFalse: [ offset := -1 - (limit >> 1) ].
>>     inputs replace: [ :each | (random nextInt: limit) + offset ].
>>     [ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex |
>>         1 to: inputs size do: [ :inputIndex |
>>             byteArray
>>                 unsignedLong64At: startIndex
>>                 put: (inputs at: inputIndex)
>>                 bigEndian: bigEndian;
>>                 unsignedLong64At: startIndex
>>                 bigEndian: bigEndian ] ] ] timeToRun ].
>>
>> I ran it with various accessors and got the following results:
>>
>> "short"
>> #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28
>> 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625".
>> #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20
>> 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".
>>
>> "long"
>> #(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72
>> 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74
>> 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584)
>> "average asFloat 87.28125".
>> #(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72
>> 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74
>> 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122)
>> "average asFloat 73.625".
>>
>> "unsigned short"
>> #(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20
>> 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5".
>> #(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
>> 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125".
>>
>> "unsigned long"
>> #(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54
>> 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58
>> 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694)
>> "average asFloat 82.40625".
>> #(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62
>> 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64
>> 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298)
>> "average asFloat 69.09375".
>>
>> "unsigned long 64"
>> #(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312
>> 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318
>> 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324
>> 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618
>> 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626
>> 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812
>> 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834
>> 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat
>> 536.109375".
>> #(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166
>> 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164
>> 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170
>> 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344
>> 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294
>> 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304
>> 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400
>> 504 402 298 402 298 398 302 398 294 400 298 396). "average asFloat
>> 259.359375"
>>
>>
>> Levente
>>
>> On Sun, 30 Aug 2015, Chris Muller wrote:
>>
>> Hi Chris, I think these methods belong in the image with the
>> fastest implementation we can do.
>>
>> I implemented 64-bit unsigned access for Ma Serializer back in 2005.
>> I modeled my implementation after Andreas' original approach which
>> tries to avoid LI arithmetic. I was curious whether your
>> implementations would be faster, because if they are then it could
>> benefit Magma. After loading "Ma Serializer" 1.5 (or head) into a
>> trunk image, I used the following script to take comparison
>> measurements:
>>
>> | smallN largeN maBa cbBa |
>> smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
>> largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
>> maBa := ByteArray new: 8.
>> cbBa := ByteArray new: 8.
>> maBa maUint: 64 at: 0 put: largeN.
>> cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
>> self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false).
>> { 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false ] bench.
>> 'ma smallN write' -> [ cbBa maUint: 64 at: 0 put: smallN ] bench.
>> 'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
>> 'ma smallN access' -> [ cbBa maUnsigned64At: 1 ] bench.
>> 'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false ] bench.
>> 'ma largeN write' -> [ cbBa maUint: 64 at: 0 put: largeN ] bench.
>> 'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
>> 'ma largeN access' -> [ cbBa maUnsigned64At: 1 ] bench.
>> }
>>
>> Here are the results:
>>
>> 'cbc smallN write'->'3,110,000 per second. 322 nanoseconds per run.' .
>> 'ma smallN write'->'4,770,000 per second. 210 nanoseconds per run.' .
>> 'cbc smallN access'->'4,300,000 per second. 233 nanoseconds per run.' .
>> 'ma smallN access'->'16,400,000 per second. 60.9 nanoseconds per run.' .
>> 'cbc largeN write'->'907,000 per second. 1.1 microseconds per run.' .
>> 'ma largeN write'->'6,620,000 per second. 151 nanoseconds per run.' .
>> 'cbc largeN access'->'1,900,000 per second. 527 nanoseconds per run.' .
>> 'ma largeN access'->'1,020,000 per second. 982 nanoseconds per run.'
>>
>> It looks like your 64-bit access is 86% faster for accessing the
>> high end of the 64-bit range, but slower in the other three metrics.
>> Noticeably, it was only 14% as fast for writing the high end of the
>> 64-bit range, and roughly as much slower for small-number access.
>>
>>
>> On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham <cunningham.cb at gmail.com> wrote:
>> Hi.
>>
>> I've committed a change to the inbox with changes to allow
>> getting/putting 64-bit values to ByteArrays (similar to the 32- and
>> 16-bit accessors). Could this be added to trunk?
>>
>> Also, first time I used the selective commit function - very nice!
>> The changes I didn't want committed didn't, in fact, get committed.
>> Just the desirable bits!
>>
>> -cbc
>>
More information about the Squeak-dev mailing list