[Vm-dev] Re: [squeak-dev] ByteArray accessors for 64-bit manipulation

Tue Sep 1 03:42:43 UTC 2015

Ok.  Committed Collections-cbc.652.mcz to the inbox.  It has a faster
unsignedLong64At:put:bigEndian:.  It essentially uses the ma code, but
checks the endianness so that it works for both byte orders.
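
As a rough illustration of the idea only (this is not the committed method;
the temporaries are made up for the sketch, and it assumes Integer>>digitAt:
answers the i-th byte of the receiver, least significant first), a
byte-at-a-time put that honours the endianness flag without building any
intermediate LargeInteger could look like this in a workspace:

```smalltalk
| ba value bigEndian |
ba := ByteArray new: 8.
value := 16r0102030405060708.
bigEndian := true.
"digitAt: 1 is the least significant byte. For big endian it goes
 into the last slot, for little endian into the first."
1 to: 8 do: [:i |
	ba
		at: (bigEndian ifTrue: [9 - i] ifFalse: [i])
		put: (value digitAt: i)].
ba
```

With bigEndian set to true the most significant byte (16r01) lands at index
1; with false the order is reversed.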

Speed test:

smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
maBa := ByteArray new: 8.
cbBa := ByteArray new: 8.
maBa maUint: 64 at: 0 put: largeN.
cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1
bigEndian: false).
self assert: (cbBa maUnsigned64At: 1) = (cbBa unsignedLong64At: 1
bigEndian: false).

{
'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian:
false] bench.
'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN ] bench.
'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian:
false] bench.
'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN ] bench.
 }
 {
'cbc smallN write'->'3,770,000 per second. 266 nanoseconds per run.' .
'ma smallN write'->'3,700,000 per second. 270 nanoseconds per run.' .
'cbc largeN write'->'4,190,000 per second. 238 nanoseconds per run.' .
'ma largeN write'->'4,120,000 per second. 243 nanoseconds per run.'
}

I would like to have this pushed to Trunk so that we have shared 64-bit
access to ByteArrays.  I know of at least 3 places where this has been coded:
my code, VMMaker, and Ma Serializer.  Several other places probably exist.

The put code (benchmarked above) should not be affected when run in a 64-bit
image; if anything it will be just a bit faster.

The get code (also in the committed code) should not get any slower in a
64-bit image.  If SmallInteger has more bits in its representation there
than in the 32-bit image, the get code could be optimized to run faster
there, at the cost of running slower on 32-bit images, so that is probably
not worth doing in this code.  If you* do make primitives for these
accesses, they would obviously be significantly faster, and this could be
the fallback code.

(* having never made a primitive nor compiled the image - yet - I probably
won't be adding these. And I don't need that speed myself at this point,
either.)
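
For reference, here is a simplified workspace sketch of the chunking the get
code relies on (little-endian case only; the temporaries lo, mid and hi are
names chosen for this sketch, not the ones in the method).  The bytes are
combined into two 24-bit chunks and one 16-bit chunk, each of which stays a
SmallInteger in a 32-bit image, so LargeInteger arithmetic only happens when
a non-zero upper chunk is shifted into place:

```smalltalk
| ba lo mid hi value |
ba := #[8 7 6 5 4 3 2 1].	"little-endian bytes of 16r0102030405060708"
lo := (ba at: 1) + ((ba at: 2) bitShift: 8) + ((ba at: 3) bitShift: 16).
mid := (ba at: 4) + ((ba at: 5) bitShift: 8) + ((ba at: 6) bitShift: 16).
hi := (ba at: 7) + ((ba at: 8) bitShift: 8).
value := lo.
"Only pay for LargeInteger shifts when the upper chunks are non-zero."
mid = 0 ifFalse: [value := (mid bitShift: 24) + value].
hi = 0 ifFalse: [value := (hi bitShift: 48) + value].
value = 16r0102030405060708	"should answer true"
```

In an image whose SmallIntegers are wider, the chunks could be made larger,
which is the optimization mentioned above.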

-cbc

On Mon, Aug 31, 2015 at 8:21 PM, Chris Cunningham <cunningham.cb at gmail.com>
wrote:

> Hi Andres,
>
> ByteArray currently doesn't have a primitive that handles any part of
> getting bytes from the ByteArray and forming them into an integer.  If it
> did have one, I would be happy to alter the code around that.
>
> The long drawn out method is 4x faster for small (SmallInteger) results,
> and 25% faster for LargeInteger results (those that exercise all 8
> bytes).  This is because it does at most 2 LargeInteger bitShifts, and as
> few as none.  The 'macro' version does a minimum of 1 LargeInteger
> bitShift, and up to 3 of them.
>
> For BigEndian platforms, speed may be important; in any case, it is nice.
>
> You are probably aware, but the current Squeak does not have
> #unsignedLong64At:bigEndian: in the image at all - that diff was from my
> first attempt.
>
> -cbc
>
> On Mon, Aug 31, 2015 at 7:39 PM, Andres Valloud <
> avalloud at smalltalk.comcastbiz.net> wrote:
>
>>
>> Interesting about the fading relevancy of big endian platforms.  Just in
>> case the point was lost, I meant the macro-style approach in contrast with
>> this (from Squeak-dev):
>>
>> =============== Diff against Collections-cbc.650 ===============
>>
>> Item was changed:
>>   ----- Method: ByteArray>>unsignedLong64At:bigEndian: (in category
>> 'platform independent access') -----
>>   unsignedLong64At: index bigEndian: aBool
>> +       "Avoid as much largeInteger as we can"
>> +       | b0 b2 b3 b5 b6 w n2 n3 |
>> +
>> +       aBool ifFalse: [
>> +               w := self at: index.
>> +               b6 := self at: index+1.
>> +               b5 := self at: index+2.
>> +               n2 := self at: index+3.
>> +               b3 := self at: index+4.
>> +               b2 := self at: index+5.
>> +               n3 := self at: index+6.
>> +               b0 := self at: index+7.
>> +       ] ifTrue: [
>> +               b0 := self at: index.
>> +               n3 := self at: index+1.
>> +               b2 := self at: index+2.
>> +               b3 := self at: index+3.
>> +               n2 := self at: index+4.
>> +               b5 := self at: index+5.
>> +               b6 := self at: index+6.
>> +               w := self at: index+7.
>> +               ].
>> +
>> +       "Minimize LargeInteger arithmetic"
>> +       b6 = 0 ifFalse:[w := (b6 bitShift: 8) + w].
>> +       b5 = 0 ifFalse:[w := (b5 bitShift: 16) + w].
>> +
>> +       b3 = 0 ifFalse:[n2 := (b3 bitShift: 8) + n2].
>> +       b2 = 0 ifFalse:[n2 := (b2 bitShift: 16) + n2].
>> +       n2 == 0 ifFalse: [w := (n2 bitShift: 24) + w].
>> +
>> +       b0 = 0 ifFalse:[n3 := (b0 bitShift: 8) + n3].
>> +       n3 == 0 ifFalse: [w := (n3 bitShift: 48) + w].
>> +
>> +       ^w!
>> -       | n1 n2 |
>> -       aBool
>> -               ifTrue: [
>> -                       n2 := self unsignedLongAt: index  bigEndian: true.
>> -                       n1 := self unsignedLongAt: index+4  bigEndian:
>> true.
>> -                       ]
>> -               ifFalse: [
>> -                       n1 := self unsignedLongAt: index bigEndian: false.
>> -                       n2 := self unsignedLongAt: index+4 bigEndian:
>> false.
>> -                       ].
>> -       ^(n2 bitShift: 32) + n1!
>>
>>
>> I'd rather have that pushed down enough so that the compiler intrinsic
>> becomes visible.  And at that point, all that code is reduced to a single
>> instruction.
>>
>> Andres.
>>
>>
>>
>> On 8/31/15 19:12 , Eliot Miranda wrote:
>>
>>> Hi Andres,
>>>
>>> On Aug 31, 2015, at 5:52 PM, Andres Valloud <
>>>> avalloud at smalltalk.comcastbiz.net> wrote:
>>>>
>>>> FWIW... IMO it's better to enable access to the relevant compiler
>>>> intrinsic with platform specific macros, rather than implementing
>>>> instructions such as Intel's BSWAP or MOVBE by hand.  In HPS, isolating
>>>> endianness concerns from the large integer arithmetic primitives with such
>>>> macros enabled 25-40% faster performance on big endian platforms. Just as
>>>> importantly, the intrinsic approach takes significantly less code to
>>>> implement.
>>>>
>>>
>>> Makes sense, and the performance increases are impressive.  The only
>>> issue I have is that the Cog JIT (which would have the easiest time
>>> generating those intrinsics) currently runs only on little-endian
>>> platforms and I seriously doubt it will run on a big-endian platform in
>>> the next five years.  PowerPC is the only possibility I see.  Yes, ARM is
>>> biendian but all the popular applications I know of are little endian.
>>>
>>> VW's a different beast; significant big endian legacy.
>>>
>>> But what you say about isolating makes perfect sense.  Thanks
>>>
>>>
>>>> On 8/31/15 10:25 , Eliot Miranda wrote:
>>>>> Hi Chrises,
>>>>>
>>>>>      my vote would be to write these as 12 numbered primitives, (2,4 &
>>>>> 8
>>>>> bytes) * (at: & at:put:) * (big & little endian) because they can be
>>>>> performance critical and implementing them like this means the maximum
>>>>> efficiency in both 32-bit and 64-bit Spur, plus the possibility of the
>>>>> JIT implementing the primitives.
>>>>>
>>>>> On Sun, Aug 30, 2015 at 10:01 PM, Chris Cunningham
>>>>> <cunningham.cb at gmail.com <mailto:cunningham.cb at gmail.com>> wrote:
>>>>>
>>>>>     Hi Chris,
>>>>>
>>>>>     I'm all for having the fastest version that works in the image.
>>>>>     If you could make your version handle endianness, then I'm all for
>>>>>     including it (at least in the 3 variants that are faster).  My
>>>>>     first use for this (interface for KAFKA) apparently requires
>>>>>     big-endianness, so I really want that supported.
>>>>>
>>>>>     It might be best to keep my naming, though - it follows the name
>>>>>     pattern that is already in the class.  Or will yours also support
>>>>> 128?
>>>>>
>>>>>     -cbc
>>>>>
>>>>>     On Sun, Aug 30, 2015 at 2:38 PM, Chris Muller <asqueaker at gmail.com
>>>>>     <mailto:asqueaker at gmail.com>> wrote:
>>>>>
>>>>>         Hi Chris, I think these methods belong in the image with the
>>>>> fastest
>>>>>         implementation we can do.
>>>>>
>>>>>         I implemented 64-bit unsigned access for Ma Serializer back in
>>>>> 2005.
>>>>>         I modeled my implementation after Andreas' original approach
>>>>> which
>>>>>         tries to avoid LI arithmetic.  I was curious whether your
>>>>>         implementations would be faster, because if they are then it
>>>>> could
>>>>>         benefit Magma.  After loading "Ma Serializer" 1.5 (or head)
>>>>> into a
>>>>>         trunk image, I used the following script to take comparison
>>>>>         measurements:
>>>>>
>>>>>         | smallN largeN maBa cbBa |  smallN := ((2 raisedTo: 13) to: (2
>>>>>         raisedTo: 14)) atRandom.
>>>>>         largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
>>>>>         maBa := ByteArray new: 8.
>>>>>         cbBa := ByteArray new: 8.
>>>>>         maBa maUint: 64 at: 0 put: largeN.
>>>>>         cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
>>>>>         self assert: (cbBa maUnsigned64At: 1) = (maBa
>>>>> unsignedLong64At: 1
>>>>>         bigEndian: false).
>>>>>         { 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN
>>>>>         bigEndian: false] bench.
>>>>>         'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN ]
>>>>> bench.
>>>>>         'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian:
>>>>>         false. ] bench.
>>>>>         'ma smallN access' -> [ cbBa maUnsigned64At: 1] bench.
>>>>>         'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN
>>>>>         bigEndian: false] bench.
>>>>>         'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN ]
>>>>> bench.
>>>>>         'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian:
>>>>>         false ] bench.
>>>>>         'ma largeN access' -> [ cbBa maUnsigned64At: 1] bench.
>>>>>           }
>>>>>
>>>>>         Here are the results:
>>>>>
>>>>>         'cbc smallN write'->'3,110,000 per second.  322 nanoseconds per
>>>>>         run.' .
>>>>>         'ma smallN write'->'4,770,000 per second.  210 nanoseconds per
>>>>>         run.' .
>>>>>         'cbc smallN access'->'4,300,000 per second.  233 nanoseconds
>>>>> per
>>>>>         run.' .
>>>>>         'ma smallN access'->'16,400,000 per second.  60.9 nanoseconds
>>>>>         per run.' .
>>>>>         'cbc largeN write'->'907,000 per second.  1.1 microseconds per
>>>>>         run.' .
>>>>>         'ma largeN write'->'6,620,000 per second.  151 nanoseconds per
>>>>>         run.' .
>>>>>         'cbc largeN access'->'1,900,000 per second.  527 nanoseconds
>>>>> per
>>>>>         run.' .
>>>>>         'ma largeN access'->'1,020,000 per second.  982 nanoseconds per
>>>>>         run.'
>>>>>
>>>>>         It looks like your 64-bit access is 86% faster for accessing
>>>>> the
>>>>>         high-end of the 64-bit range, but slower in the other 3
>>>>> metrics.
>>>>>         Noticeably, it was only 14% as fast for writing the high-end
>>>>> of the
>>>>>         64-bit range, and similarly as much slower for small-number
>>>>> access.
>>>>>
>>>>>
>>>>>         On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham
>>>>>         <cunningham.cb at gmail.com <mailto:cunningham.cb at gmail.com>>
>>>>> wrote:
>>>>>          > Hi.
>>>>>          >
>>>>>          > I've committed a change to the inbox with changes to allow
>>>>>         getting/putting
>>>>>          > 64bit values to ByteArrays (similar to 32 and 16 bit
>>>>>         accessors).  Could this
>>>>>          > be added to trunk?
>>>>>          >
>>>>>          > Also, first time I used the selective commit function - very
>>>>>         nice!  the
>>>>>          > changes I didn't want committed didn't, in fact, get
>>>>>         committed.  Just the
>>>>>          > desirable bits!
>>>>>          >
>>>>>          > -cbc
>>>>>          >
>>>>>          >
>>>>>          >
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> _,,,^..^,,,_
>>>>> best, Eliot
>>>>>
>>>>>
>>>>>
>>>>> .
>>>
>>>
>