[Vm-dev] Re: [squeak-dev] ByteArray accessors for 64-bit
manipulation
Chris Cunningham
cunningham.cb at gmail.com
Tue Sep 1 03:21:29 UTC 2015
Hi Andres,
ByteArray currently doesn't have a primitive that handles any part of
getting bytes from the ByteArray and forming them into an integer. If it
did have one, I would be happy to alter the code around that.
The long drawn out method is 4x faster for small (SmallInteger) results,
and 25% faster for LargeInteger results (those that exercise all 8
bytes). This is because it does at most 2 LargeInteger bitShifts, and as
few as none. The 'macro' version does a minimum of
1 LargeInteger bitShift, and up to 3 of them.
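To make the shift counting concrete, here is a rough workspace sketch of the
little-endian case. It is not the committed method: the names lo, mid and hi
and the sample bytes are made up for illustration, and it assumes a 32-bit
image, where a 24-bit chunk always stays a SmallInteger.

```smalltalk
"Illustrative sketch only; lo, mid and hi are hypothetical names."
| bytes lo mid hi result |
bytes := #[16r88 16r77 16r66 16r55 16r44 16r33 16r22 16r11].

"Build three chunks of at most 24 bits each, so these shifts
 and additions never leave SmallInteger range."
lo := (bytes at: 1) + ((bytes at: 2) bitShift: 8) + ((bytes at: 3) bitShift: 16).
mid := (bytes at: 4) + ((bytes at: 5) bitShift: 8) + ((bytes at: 6) bitShift: 16).
hi := (bytes at: 7) + ((bytes at: 8) bitShift: 8).

"Only the two shifts below can answer a LargePositiveInteger,
 and each one is skipped when its chunk is zero."
result := lo.
mid = 0 ifFalse: [result := (mid bitShift: 24) + result].
hi = 0 ifFalse: [result := (hi bitShift: 48) + result].

"Check against the value the sample bytes encode in little-endian order."
result = 16r1122334455667788
```

When the upper five bytes are all zero, both guarded shifts are skipped and the
whole read stays in SmallInteger arithmetic, which is where the small-result
speedup comes from.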
For BigEndian platforms, speed may be important; in any case, it is nice.
You are probably aware, but current Squeak does not have
#unsignedLong64At:bigEndian: in the image at all - that diff was from my
first attempt.
-cbc
On Mon, Aug 31, 2015 at 7:39 PM, Andres Valloud <
avalloud at smalltalk.comcastbiz.net> wrote:
>
> Interesting about the fading relevancy of big endian platforms. Just in
> case the point was lost, I meant the macro-style approach in contrast with
> this (from Squeak-dev):
>
> =============== Diff against Collections-cbc.650 ===============
>
> Item was changed:
> ----- Method: ByteArray>>unsignedLong64At:bigEndian: (in category 'platform independent access') -----
> unsignedLong64At: index bigEndian: aBool
> + "Avoid as much largeInteger as we can"
> + | b0 b2 b3 b5 b6 w n2 n3 |
> +
> + aBool ifFalse: [
> + w := self at: index.
> + b6 := self at: index+1.
> + b5 := self at: index+2.
> + n2 := self at: index+3.
> + b3 := self at: index+4.
> + b2 := self at: index+5.
> + n3 := self at: index+6.
> + b0 := self at: index+7.
> + ] ifTrue: [
> + b0 := self at: index.
> + n3 := self at: index+1.
> + b2 := self at: index+2.
> + b3 := self at: index+3.
> + n2 := self at: index+4.
> + b5 := self at: index+5.
> + b6 := self at: index+6.
> + w := self at: index+7.
> + ].
> +
> + "Minimize LargeInteger arithmetic"
> + b6 = 0 ifFalse:[w := (b6 bitShift: 8) + w].
> + b5 = 0 ifFalse:[w := (b5 bitShift: 16) + w].
> +
> + b3 = 0 ifFalse:[n2 := (b3 bitShift: 8) + n2].
> + b2 = 0 ifFalse:[n2 := (b2 bitShift: 16) + n2].
> + n2 == 0 ifFalse: [w := (n2 bitShift: 24) + w].
> +
> + b0 = 0 ifFalse:[n3 := (b0 bitShift: 8) + n3].
> + n3 == 0 ifFalse: [w := (n3 bitShift: 48) + w].
> +
> + ^w!
> - | n1 n2 |
> - aBool
> - ifTrue: [
> - n2 := self unsignedLongAt: index bigEndian: true.
> - n1 := self unsignedLongAt: index+4 bigEndian: true.
> - ]
> - ifFalse: [
> - n1 := self unsignedLongAt: index bigEndian: false.
> - n2 := self unsignedLongAt: index+4 bigEndian: false.
> - ].
> - ^(n2 bitShift: 32) + n1!
>
>
> I'd rather have that pushed down enough so that the compiler intrinsic
> becomes visible. And at that point, all that code is reduced to a single
> instruction.
>
> Andres.
>
>
>
> On 8/31/15 19:12 , Eliot Miranda wrote:
>
>> Hi Andres,
>>
>> On Aug 31, 2015, at 5:52 PM, Andres Valloud <
>>> avalloud at smalltalk.comcastbiz.net> wrote:
>>>
>>> FWIW... IMO it's better to enable access to the relevant compiler
>>> intrinsic with platform specific macros, rather than implementing
>>> instructions such as Intel's BSWAP or MOVBE by hand. In HPS, isolating
>>> endianness concerns from the large integer arithmetic primitives with such
>>> macros enabled 25-40% faster performance on big endian platforms. Just as
>>> importantly, the intrinsic approach takes significantly less code to
>>> implement.
>>>
>>
>> Makes sense, and the performance increases are impressive. The only
>> issue I have is that the Cog JIT (which would have the easiest time
>> generating those intrinsics) currently runs only on little-endian
>> platforms and I seriously doubt it will run on a big-endian platform in
>> the next five years. PowerPC is the only possibility I see. Yes, ARM is
>> biendian but all the popular applications I know of are little endian.
>>
>> VW's a different beast; significant big endian legacy.
>>
>> But what you say about isolating makes perfect sense. Thanks
>>
>>
>>> On 8/31/15 10:25 , Eliot Miranda wrote:
>>>> Hi Chrises,
>>>>
>>>> my vote would be to write these as 12 numbered primitives, (2,4 & 8
>>>> bytes) * (at: & at:put:) * (big & little endian) because they can be
>>>> performance critical and implementing them like this means the maximum
>>>> efficiency in both 32-bit and 64-bit Spur, plus the possibility of the
>>>> JIT implementing the primitives.
>>>>
>>>> On Sun, Aug 30, 2015 at 10:01 PM, Chris Cunningham
>>>> <cunningham.cb at gmail.com <mailto:cunningham.cb at gmail.com>> wrote:
>>>>
>>>> Hi Chris,
>>>>
>>>> I'm all for having the fastest version in the image that works. If you
>>>> could make your version handle endianness, then I'm all for including
>>>> it (at least in the 3 variants that are faster). My first use for
>>>> this (interface for KAFKA) apparently requires big-endianness, so I
>>>> really want that supported.
>>>>
>>>> It might be best to keep my naming, though - it follows the name
>>>> pattern that is already in the class. Or will yours also support
>>>> 128?
>>>>
>>>> -cbc
>>>>
>>>> On Sun, Aug 30, 2015 at 2:38 PM, Chris Muller <asqueaker at gmail.com
>>>> <mailto:asqueaker at gmail.com>> wrote:
>>>>
>>>> Hi Chris, I think these methods belong in the image with the fastest
>>>> implementation we can do.
>>>>
>>>> I implemented 64-bit unsigned access for Ma Serializer back in 2005.
>>>> I modeled my implementation after Andreas' original approach which
>>>> tries to avoid LI arithmetic. I was curious whether your
>>>> implementations would be faster, because if they are then it could
>>>> benefit Magma. After loading "Ma Serializer" 1.5 (or head) into a
>>>> trunk image, I used the following script to take comparison
>>>> measurements:
>>>>
>>>> | smallN largeN maBa cbBa |
>>>> smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
>>>> largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
>>>> maBa := ByteArray new: 8.
>>>> cbBa := ByteArray new: 8.
>>>> maBa maUint: 64 at: 0 put: largeN.
>>>> cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
>>>> self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false).
>>>> { 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false] bench.
>>>> 'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN ] bench.
>>>> 'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false. ] bench.
>>>> 'ma smallN access' -> [ cbBa maUnsigned64At: 1] bench.
>>>> 'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false] bench.
>>>> 'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN ] bench.
>>>> 'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
>>>> 'ma largeN access' -> [ cbBa maUnsigned64At: 1] bench.
>>>> }
>>>>
>>>> Here are the results:
>>>>
>>>> 'cbc smallN write'->'3,110,000 per second. 322 nanoseconds per run.' .
>>>> 'ma smallN write'->'4,770,000 per second. 210 nanoseconds per run.' .
>>>> 'cbc smallN access'->'4,300,000 per second. 233 nanoseconds per run.' .
>>>> 'ma smallN access'->'16,400,000 per second. 60.9 nanoseconds per run.' .
>>>> 'cbc largeN write'->'907,000 per second. 1.1 microseconds per run.' .
>>>> 'ma largeN write'->'6,620,000 per second. 151 nanoseconds per run.' .
>>>> 'cbc largeN access'->'1,900,000 per second. 527 nanoseconds per run.' .
>>>> 'ma largeN access'->'1,020,000 per second. 982 nanoseconds per run.'
>>>>
>>>> It looks like your 64-bit access is 86% faster for accessing the
>>>> high-end of the 64-bit range, but slower in the other 3 metrics.
>>>> Notably, it was only 14% as fast for writing the high-end of the
>>>> 64-bit range, and similarly slower for small-number access.
>>>>
>>>>
>>>> On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham
>>>> <cunningham.cb at gmail.com <mailto:cunningham.cb at gmail.com>>
>>>> wrote:
>>>> > Hi.
>>>> >
>>>> > I've committed a change to the inbox with changes to allow getting/putting
>>>> > 64bit values to ByteArrays (similar to 32 and 16 bit accessors). Could this
>>>> > be added to trunk?
>>>> >
>>>> > Also, first time I used the selective commit function - very nice! The
>>>> > changes I didn't want committed didn't, in fact, get committed. Just the
>>>> > desirable bits!
>>>> >
>>>> > -cbc
>>>> >
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> _,,,^..^,,,_
>>>> best, Eliot
>>>>
>>>>
>>>>
>>>> .
>>
>>