[squeak-dev] ByteArray accessors for 64-bit manipulation

Chris Cunningham cunningham.cb at gmail.com
Tue Sep 8 22:11:33 UTC 2015


Ok, it makes sense now.  Well, it made sense before, I was just slow to see
it.

-cbc

On Tue, Sep 8, 2015 at 9:25 AM, Levente Uzonyi <leves at elte.hu> wrote:

> I forgot to answer your other questions.
>
> On Tue, 8 Sep 2015, Chris Cunningham wrote:
>
> Levente,
>> Interesting.  I have a question and a concern about your implementation,
>> though:
>>
>> Question: why in the micro checks, is the Write faster than the Access:
>>
>
> Because Access means allocation of new objects (LargeIntegers), while
> Write simply means copying bytes.
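
[Editorial sketch] Levente's point translates outside Smalltalk as well: a write only scatters the integer's bytes into the array, while a read must assemble a fresh integer object from those bytes; in Squeak that means a newly allocated LargeInteger once the value leaves the SmallInteger range. A little-endian Python sketch of the two directions (illustrative only, not the Squeak implementation; the function names are made up):

```python
def write_u64_le(ba, index, value):
    """Store value into bytearray ba at 0-based index, little-endian.
    Decomposing the integer is just eight plain byte stores."""
    for i in range(8):
        ba[index + i] = (value >> (8 * i)) & 0xFF

def read_u64_le(ba, index):
    """Rebuild the integer from 8 bytes. This direction constructs a
    new integer object, which is where the allocation cost lives."""
    value = 0
    for i in range(7, -1, -1):
        value = (value << 8) | ba[index + i]
    return value

ba = bytearray(8)
write_u64_le(ba, 0, 0xFEDCBA9876543210)
assert read_u64_le(ba, 0) == 0xFEDCBA9876543210
```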
>
> {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
>> 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
>> 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
>> 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
>> yet in your more thorough benchmark, the Write is twice as slow as the
>> Access?
>>
>
> For each pair of lines, the first one was measured _before_ my changes and
> the second one _after_ them. The benchmark measures read and write
> together, so 536.109375 is the average number of milliseconds it takes to
> read and write nine million numbers using your variant of the 64-bit
> methods on my machine.
>
>> "unsigned long 64"
>> (put, or Write) "average asFloat 536.109375".
>> (Access) "average asFloat 259.359375"
>> any ideas?
>>
>
> So this is
>
> (read and write Before) "average asFloat 536.109375"
> (read and write After) "average asFloat 259.359375"
>
>
> Levente
>
>
>
>> The concern is that your code is nicely optimized for our current 32-bit
>> VM - but once we go to 64-bit, I think it will fail.  Should we be
>> concerned?
>>
>> -cbc
>>
>> On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi <leves at elte.hu> wrote:
>>       Hi All,
>>
>>       A bit later than I wanted, but I've finally uploaded my versions
>> to the Trunk. I guess I went as far as possible toward the
>>       "fastest implementation".
>>       I modified your benchmark to use the same numbers, so that the
>> measurements could be repeated. I got the following:
>>
>>       Before:
>>       {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'.
>>       'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'.
>>       'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'.
>>       'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}.
>>
>>       After:
>>       {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
>>       'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
>>       'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
>>       'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
>>
>>       As you can see, everything became faster except for smallN access.
>> This is the side-effect of optimizing for the average case instead
>>       of specific cases - like zero bytes. I decided not to use that
>> trick, because it decreased the overall performance.
>>
>>       I also wrote a benchmark which measures reads and writes together.
>> It generates random numbers which can be represented using a given
>>       number of bits. The result is an array of run times where values
>> having an odd index belong to big-endian access, and even ones to
>>       little-endian.
>>
>>       | byteArray inputs random storageBits unsigned |
>>       Smalltalk garbageCollect.
>>       random := Random seed: 36rSqueak.
>>       storageBits := 64.
>>       unsigned := true.
>>       byteArray := ByteArray new: storageBits // 8 * 2.
>>       inputs := Array new: 100000.
>>       (2 to: storageBits * 2 + 1) collect: [ :descriptor |
>>               "lowest bit describes endianness, the rest the number of bits."
>>               | limit bigEndian offset |
>>               bigEndian := descriptor odd.
>>               limit := 1 << (descriptor >> 1) - 1.
>>               unsigned
>>                       ifTrue: [ offset := -1 ]
>>                       ifFalse: [ offset := -1 - (limit >> 1) ].
>>               inputs replace: [ :each | (random nextInt: limit) + offset ].
>>               [ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex |
>>                       1 to: inputs size do: [ :inputIndex |
>>                               byteArray
>>                                       unsignedLong64At: startIndex
>>                                               put: (inputs at: inputIndex)
>>                                               bigEndian: bigEndian;
>>                                       unsignedLong64At: startIndex
>>                                               bigEndian: bigEndian ] ] ] timeToRun ].
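
[Editorial sketch] The descriptor encoding used by the loop above (the lowest bit selects endianness, the remaining bits give the number of value bits) can be unpacked in a quick Python sketch; the names here are illustrative and not part of the benchmark:

```python
# Enumerate the same descriptor range as the Smalltalk (2 to: storageBits * 2 + 1),
# decoding each descriptor into (bit width, big-endian?, max random value).
storage_bits = 64
cases = []
for descriptor in range(2, storage_bits * 2 + 2):  # inclusive upper bound in Smalltalk
    big_endian = descriptor % 2 == 1       # "bigEndian := descriptor odd"
    limit = (1 << (descriptor >> 1)) - 1   # largest value representable in that width
    cases.append((descriptor >> 1, big_endian, limit))

# 64 widths, each measured once per endianness = 128 timed runs.
assert len(cases) == 128
assert cases[0] == (1, False, 1)
assert cases[-1] == (64, True, (1 << 64) - 1)
```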
>>
>>       I ran it with various accessors and got the following results:
>>
>>       "short"
>>       #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28
>> 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625".
>>       #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20
>> 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".
>>
>>       "long"
>>       #(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72
>> 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74
>>       72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584)
>> "average asFloat 87.28125".
>>       #(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72
>> 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74
>>       74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122)
>> "average asFloat 73.625".
>>
>>       "unsigned short"
>>       #(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20
>> 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5".
>>       #(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
>> 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125".
>>
>>       "unsigned long"
>>       #(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54
>> 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58
>>       60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694)
>> "average asFloat 82.40625".
>>        #(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64
>> 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64
>>       64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298)
>> "average asFloat 69.09375".
>>
>>       "unsigned long 64"
>>       #(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300
>> 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318
>>       316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324
>> 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618
>>       636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626
>> 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812
>>       822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880
>> 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat
>>       536.109375".
>>       #(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172
>> 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164
>>       170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170
>> 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344
>>       284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294
>> 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304
>>       388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400
>> 504 402 298 402 298 398 302 398 294 400 298 396). "average asFloat
>>       259.359375"
>>
>>
>>       Levente
>>
>>       On Sun, 30 Aug 2015, Chris Muller wrote:
>>
>>             Hi Chris, I think these methods belong in the image with the
>> fastest
>>             implementation we can do.
>>
>>             I implemented 64-bit unsigned access for Ma Serializer back
>> in 2005.
>>             I modeled my implementation after Andreas' original approach
>> which
>>             tries to avoid LI arithmetic.  I was curious whether your
>>             implementations would be faster, because if they are then it
>> could
>>             benefit Magma.  After loading "Ma Serializer" 1.5 (or head)
>> into a
>>             trunk image, I used the following script to take comparison
>>             measurements:
>>
>>             | smallN largeN maBa cbBa |
>>             smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
>>             largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
>>             maBa := ByteArray new: 8.
>>             cbBa := ByteArray new: 8.
>>             maBa maUint: 64 at: 0 put: largeN.
>>             cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
>>             self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false).
>>             { 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false ] bench.
>>             'ma smallN write' -> [ cbBa maUint: 64 at: 0 put: smallN ] bench.
>>             'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
>>             'ma smallN access' -> [ cbBa maUnsigned64At: 1 ] bench.
>>             'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false ] bench.
>>             'ma largeN write' -> [ cbBa maUint: 64 at: 0 put: largeN ] bench.
>>             'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
>>             'ma largeN access' -> [ cbBa maUnsigned64At: 1 ] bench.
>>             }
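
[Editorial sketch] For readers unfamiliar with the selectors being benchmarked: `unsignedLong64At: 1 put: largeN bigEndian: false` stores `largeN` as eight little-endian bytes starting at (1-based) index 1. A rough Python analogue of that store and the matching read, under those assumed semantics (this is not Squeak code):

```python
import struct

ba = bytearray(8)
large_n = (1 << 63) + 12345                 # a value in the [2^63, 2^64) range

# "<Q" = little-endian unsigned 64-bit; offset 0 matches Smalltalk index 1.
struct.pack_into('<Q', ba, 0, large_n)      # the write side
read_back, = struct.unpack_from('<Q', ba, 0)  # the access side
assert read_back == large_n
```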
>>
>>             Here are the results:
>>
>>             'cbc smallN write'->'3,110,000 per second. 322 nanoseconds per run.'.
>>             'ma smallN write'->'4,770,000 per second. 210 nanoseconds per run.'.
>>             'cbc smallN access'->'4,300,000 per second. 233 nanoseconds per run.'.
>>             'ma smallN access'->'16,400,000 per second. 60.9 nanoseconds per run.'.
>>             'cbc largeN write'->'907,000 per second. 1.1 microseconds per run.'.
>>             'ma largeN write'->'6,620,000 per second. 151 nanoseconds per run.'.
>>             'cbc largeN access'->'1,900,000 per second. 527 nanoseconds per run.'.
>>             'ma largeN access'->'1,020,000 per second. 982 nanoseconds per run.'
>>
>>             It looks like your 64-bit access is 86% faster for accessing
>>             the high end of the 64-bit range, but slower in the other 3
>>             metrics.  Noticeably, it was only 14% as fast for writing the
>>             high end of the 64-bit range, and about as much slower for
>>             small-number access.
>>
>>
>>             On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham
>>             <cunningham.cb at gmail.com> wrote:
>>                   Hi.
>>
>>                   I've committed a change to the inbox that allows
>>                   getting/putting 64-bit values to ByteArrays (similar
>>                   to the 32- and 16-bit accessors).  Could this be added
>>                   to trunk?
>>
>>                   Also, this was the first time I used the selective
>>                   commit function - very nice!  The changes I didn't want
>>                   committed didn't, in fact, get committed.  Just the
>>                   desirable bits!
>>
>>                   -cbc
>>
>>