[Vm-dev] Re: [squeak-dev] ByteArray accessors for 64-bit manipulation

Chris Cunningham cunningham.cb at gmail.com
Tue Sep 1 03:21:29 UTC 2015


Hi Andres,

ByteArray currently doesn't have a primitive that handles any part of
getting bytes from the ByteArray and forming them into an integer.  If it
did have one, I would be happy to restructure the code around it.
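
If such a primitive did exist, the image-side method could be shaped
roughly like the sketch below - the primitive number is just a placeholder,
and the fallback simply reuses the existing 32-bit accessors:

  unsignedLong64At: index bigEndian: aBool
        "Sketch only: 999 is a hypothetical primitive number."
        <primitive: 999>
        "Fallback when the primitive is absent or fails: assemble the value
        from the two existing 32-bit accessors."
        ^ aBool
                ifTrue: [((self unsignedLongAt: index bigEndian: true) bitShift: 32)
                                + (self unsignedLongAt: index + 4 bigEndian: true)]
                ifFalse: [((self unsignedLongAt: index + 4 bigEndian: false) bitShift: 32)
                                + (self unsignedLongAt: index bigEndian: false)]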

The long, drawn-out method is 4x faster for small (SmallInteger) results,
and 25% faster for LargeInteger results (those that exercise all 8
bytes).  This is because it does at most 2 LargeInteger bitShifts, and as
few as none.  The 'macro' version does a minimum of 1 LargeInteger
bitShift, and up to 3 of them.
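
To make the SmallInteger/LargeInteger distinction concrete, here is a small
workspace probe (the results assume a 32-bit image, where SmallInteger only
covers about 30 bits of magnitude):

  | low high |
  low := (16rFF bitShift: 16) + (16rFF bitShift: 8) + 16rFF.
  low class.       "SmallInteger - immediate, cheap to shift and add"
  high := (low bitShift: 24) + low.
  high class.      "LargePositiveInteger - boxed, so arithmetic on it allocates"

The fewer boxed intermediates a read produces, the faster it goes.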

For big-endian platforms, speed may be important; in any case, it is nice to have.

You are probably aware, but the current Squeak image does not have
#unsignedLong64At:bigEndian: at all - that diff was from my first attempt.
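
In case anyone wants to try it, a round trip in a workspace looks like this
(assuming the inbox version with both the reader and the writer is loaded):

  | ba |
  ba := ByteArray new: 8.
  ba unsignedLong64At: 1 put: 16rDEADBEEFCAFEBABE bigEndian: true.
  self assert: (ba unsignedLong64At: 1 bigEndian: true) = 16rDEADBEEFCAFEBABE.
  self assert: (ba at: 1) = 16rDE.  "big-endian stores the most significant byte first"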

-cbc

On Mon, Aug 31, 2015 at 7:39 PM, Andres Valloud <avalloud at smalltalk.comcastbiz.net> wrote:

>
> Interesting about the fading relevancy of big endian platforms.  Just in
> case the point was lost, I meant the macro-style approach in contrast with
> this (from Squeak-dev):
>
> =============== Diff against Collections-cbc.650 ===============
>
> Item was changed:
>   ----- Method: ByteArray>>unsignedLong64At:bigEndian: (in category 'platform independent access') -----
>   unsignedLong64At: index bigEndian: aBool
> +       "Avoid as much largeInteger as we can"
> +       | b0 b2 b3 b5 b6 w n2 n3 |
> +
> +       aBool ifFalse: [
> +               w := self at: index.
> +               b6 := self at: index+1.
> +               b5 := self at: index+2.
> +               n2 := self at: index+3.
> +               b3 := self at: index+4.
> +               b2 := self at: index+5.
> +               n3 := self at: index+6.
> +               b0 := self at: index+7.
> +       ] ifTrue: [
> +               b0 := self at: index.
> +               n3 := self at: index+1.
> +               b2 := self at: index+2.
> +               b3 := self at: index+3.
> +               n2 := self at: index+4.
> +               b5 := self at: index+5.
> +               b6 := self at: index+6.
> +               w := self at: index+7.
> +               ].
> +
> +       "Minimize LargeInteger arithmetic"
> +       b6 = 0 ifFalse:[w := (b6 bitShift: 8) + w].
> +       b5 = 0 ifFalse:[w := (b5 bitShift: 16) + w].
> +
> +       b3 = 0 ifFalse:[n2 := (b3 bitShift: 8) + n2].
> +       b2 = 0 ifFalse:[n2 := (b2 bitShift: 16) + n2].
> +       n2 == 0 ifFalse: [w := (n2 bitShift: 24) + w].
> +
> +       b0 = 0 ifFalse:[n3 := (b0 bitShift: 8) + n3].
> +       n3 == 0 ifFalse: [w := (n3 bitShift: 48) + w].
> +
> +       ^w!
> -       | n1 n2 |
> -       aBool
> -               ifTrue: [
> -                       n2 := self unsignedLongAt: index  bigEndian: true.
> -                       n1 := self unsignedLongAt: index+4  bigEndian: true.
> -                       ]
> -               ifFalse: [
> -                       n1 := self unsignedLongAt: index bigEndian: false.
> -                       n2 := self unsignedLongAt: index+4 bigEndian: false.
> -                       ].
> -       ^(n2 bitShift: 32) + n1!
>
>
> I'd rather have that pushed down enough so that the compiler intrinsic
> becomes visible.  And at that point, all that code is reduced to a single
> instruction.
>
> Andres.
>
>
>
> On 8/31/15 19:12 , Eliot Miranda wrote:
>
>> Hi Andres,
>>
>> On Aug 31, 2015, at 5:52 PM, Andres Valloud <avalloud at smalltalk.comcastbiz.net> wrote:
>>>
>>> FWIW... IMO it's better to enable access to the relevant compiler
>>> intrinsic with platform specific macros, rather than implementing
>>> instructions such as Intel's BSWAP or MOVBE by hand.  In HPS, isolating
>>> endianness concerns from the large integer arithmetic primitives with such
>>> macros enabled 25-40% faster performance on big endian platforms. Just as
>>> importantly, the intrinsic approach takes significantly less code to
>>> implement.
>>>
>>
>> Makes sense, and the performance increases are impressive.  The only
>> issue I have is that the Cog JIT (which would have the easiest time
>> generating those intrinsics) currently runs only on little-endian
>> platforms, and I seriously doubt it will run on a big-endian platform in
>> the next five years.  PowerPC is the only possibility I see.  Yes, ARM is
>> bi-endian, but all the popular applications I know of are little-endian.
>>
>> VW's a different beast; significant big endian legacy.
>>
>> But what you say about isolating makes perfect sense.  Thanks
>>
>>
>>> On 8/31/15 10:25 , Eliot Miranda wrote:
>>>> Hi Chrises,
>>>>
>>>>      my vote would be to write these as 12 numbered primitives,
>>>> (2, 4 & 8 bytes) * (at: & at:put:) * (big & little endian), because they
>>>> can be performance critical, and implementing them like this means
>>>> maximum efficiency in both 32-bit and 64-bit Spur, plus the possibility
>>>> of the JIT implementing the primitives.
>>>>
>>>> On Sun, Aug 30, 2015 at 10:01 PM, Chris Cunningham
>>>> <cunningham.cb at gmail.com> wrote:
>>>>
>>>>     Hi Chris,
>>>>
>>>>     I'm all for having the fastest version in the image that works.  If
>>>>     you could make your version handle endianness, then I'm all for
>>>>     including it (at least in the 3 variants that are faster).  My first
>>>>     use for this (an interface for KAFKA) apparently requires big-endian
>>>>     order, so I really want that supported.
>>>>
>>>>     It might be best to keep my naming, though - it follows the name
>>>>     pattern that is already in the class.  Or will yours also support 128?
>>>>
>>>>     -cbc
>>>>
>>>>     On Sun, Aug 30, 2015 at 2:38 PM, Chris Muller <asqueaker at gmail.com> wrote:
>>>>
>>>>         Hi Chris, I think these methods belong in the image with the
>>>>         fastest implementation we can do.
>>>>
>>>>         I implemented 64-bit unsigned access for Ma Serializer back in 2005.
>>>>         I modeled my implementation after Andreas' original approach, which
>>>>         tries to avoid LargeInteger arithmetic.  I was curious whether your
>>>>         implementations would be faster, because if they are then it could
>>>>         benefit Magma.  After loading "Ma Serializer" 1.5 (or head) into a
>>>>         trunk image, I used the following script to take comparison
>>>>         measurements:
>>>>
>>>>         | smallN largeN maBa cbBa |
>>>>         smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
>>>>         largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
>>>>         maBa := ByteArray new: 8.
>>>>         cbBa := ByteArray new: 8.
>>>>         maBa maUint: 64 at: 0 put: largeN.
>>>>         cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
>>>>         self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false).
>>>>         { 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false] bench.
>>>>         'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN ] bench.
>>>>         'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false. ] bench.
>>>>         'ma smallN access' -> [ cbBa maUnsigned64At: 1] bench.
>>>>         'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false] bench.
>>>>         'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN ] bench.
>>>>         'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
>>>>         'ma largeN access' -> [ cbBa maUnsigned64At: 1] bench.
>>>>         }
>>>>
>>>>         Here are the results:
>>>>
>>>>         'cbc smallN write'->'3,110,000 per second.  322 nanoseconds per run.' .
>>>>         'ma smallN write'->'4,770,000 per second.  210 nanoseconds per run.' .
>>>>         'cbc smallN access'->'4,300,000 per second.  233 nanoseconds per run.' .
>>>>         'ma smallN access'->'16,400,000 per second.  60.9 nanoseconds per run.' .
>>>>         'cbc largeN write'->'907,000 per second.  1.1 microseconds per run.' .
>>>>         'ma largeN write'->'6,620,000 per second.  151 nanoseconds per run.' .
>>>>         'cbc largeN access'->'1,900,000 per second.  527 nanoseconds per run.' .
>>>>         'ma largeN access'->'1,020,000 per second.  982 nanoseconds per run.'
>>>>
>>>>         It looks like your 64-bit access is 86% faster for accessing the
>>>>         high end of the 64-bit range, but slower in the other 3 metrics.
>>>>         Notably, it was only 14% as fast for writing the high end of the
>>>>         64-bit range, and about as much slower for small-number access.
>>>>
>>>>
>>>>         On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham
>>>>         <cunningham.cb at gmail.com> wrote:
>>>>          > Hi.
>>>>          >
>>>>          > I've committed a change to the inbox with changes to allow
>>>>          > getting/putting 64-bit values to ByteArrays (similar to the 32-
>>>>          > and 16-bit accessors).  Could this be added to trunk?
>>>>          >
>>>>          > Also, first time I used the selective commit function - very nice!
>>>>          > The changes I didn't want committed didn't, in fact, get committed.
>>>>          > Just the desirable bits!
>>>>          >
>>>>          > -cbc
>>>>          >
>>>>          >
>>>>          >
>>>>
>>>> --
>>>> _,,,^..^,,,_
>>>> best, Eliot
>>>>
>>>>
>>>>
>>>> .
>>
>>

