Chris,
All I'm trying to say is Learn From My Fail. After dealing with such code for a while (in C --- yuck), I realized it was much better to use the compiler intrinsics. Once I had the new code running, I deleted numerous implementations of the "let's swap bytes around" business. IIRC it was a net lines-of-code loss, and less code is great.
I'm not disputing the performance gains. But consider how much faster and simpler still the compiler intrinsic approach could be.
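To make the contrast concrete, here is a rough C sketch (the function names are just for illustration, not HPS's actual macros; __builtin_bswap64 is the GCC/Clang builtin) of the kind of hand-written swapping that an intrinsic replaces:

#include <stdint.h>

/* The hand-rolled "let's swap bytes around" business I ended up deleting. */
uint64_t swap64_by_hand(uint64_t v)
{
    return ((v & 0x00000000000000FFull) << 56)
         | ((v & 0x000000000000FF00ull) << 40)
         | ((v & 0x0000000000FF0000ull) << 24)
         | ((v & 0x00000000FF000000ull) <<  8)
         | ((v & 0x000000FF00000000ull) >>  8)
         | ((v & 0x0000FF0000000000ull) >> 24)
         | ((v & 0x00FF000000000000ull) >> 40)
         | ((v & 0xFF00000000000000ull) >> 56);
}

/* Versus letting the compiler do it; GCC and Clang lower this to a single
   BSWAP, or fold it into a MOVBE load where the target supports it. */
uint64_t swap64_intrinsic(uint64_t v)
{
    return __builtin_bswap64(v);
}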
The Smalltalk code does 8 at: sends, each of which is bounds checked. There are also 7 additions, which have to be checked for overflow. Then there are several comparisons, some ifFalse: sends, more bitShift: sends (overflow checked), more additions (overflow checked), and so on. Creating large integers is going to be costly. Even with optimized bounds and overflow checks, surely that expands to tens of assembly instructions, if not a couple hundred.
I'm not saying this is an inefficient way of doing things in Smalltalk; however, I'd point out that it reimplements what's available in hardware.
In contrast, if that code were made into a primitive, a relevant compiler intrinsic would be available to help. The VM would grab a 64-bit integer with a MOVBE instruction, perform one test for overflow, and if all is well, tag the integer with a single LEA instruction and return it. If the result must be a large integer, the VM might as well create it. Surely a decent compiler can express all that in very little code, because it effectively sweeps all the complexity under MOVBE.
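A rough C-level sketch of what such a primitive could look like (assuming 64-bit oops and one tag bit; the helpers largePositiveIntegerFrom and tagSmallInteger are made up for illustration, a real VM would use its own tagging and large integer creation routines):

#include <stdint.h>
#include <string.h>

typedef uintptr_t oop;

/* Hypothetical VM helper: builds a LargePositiveInteger object. */
extern oop largePositiveIntegerFrom(uint64_t value);

/* Hypothetical tagging helper, assuming a single tag bit. */
static inline oop tagSmallInteger(uint64_t value)
{
    return (oop)((value << 1) | 1);
}

oop unsignedLong64At(const unsigned char *bytes, int bigEndian)
{
    uint64_t value;
    memcpy(&value, bytes, sizeof value);      /* unaligned-safe 8 byte load */
    if (bigEndian)
        value = __builtin_bswap64(value);     /* BSWAP, or folded into MOVBE */
    if (value >> 63)                          /* top bit set: won't fit once tagged */
        return largePositiveIntegerFrom(value);
    return tagSmallInteger(value);
}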
For the sake of illustration, and assuming only one tag bit for simplicity, the assembly would be something like:
; calculate the pointer to dereference in rax, then...
	movbe	rax, [rax]
	test	rax, rax
	js	overflowToLargeInteger
	lea	rax, [rax+rax+1]
	ret

; ok, it didn't fit, so...
overflowToLargeInteger:
	call	largePositiveIntegerFromRAX	; returning in RAX
	ret
I'd imagine the Smalltalk integer arithmetic cannot possibly fit in that space.
Andres.
On 8/31/15 20:21 , Chris Cunningham wrote:
Hi Andres,
ByteArray currently doesn't have a primitive that handles any part of getting bytes from the ByteArray and forming them into an integer. If it did have one, I would be happy to alter the code around that.
The long, drawn-out method is 4x faster for small (SmallInteger) results, and 25% faster for LargeInteger results (those that exercise all 8 bytes). This is because it does at most 2 LargeInteger bitShifts, and as few as none. The 'macro' version does a minimum of 1 LargeInteger bitShift, and up to 3 of them.
For BigEndian platforms, speed may be important; in any case, it is nice.
You are probably aware, but the current Squeak does not have #unsignedLong64At:bigEndian: in the image at all - that diff was from my first attempt.
-cbc
On Mon, Aug 31, 2015 at 7:39 PM, Andres Valloud <avalloud@smalltalk.comcastbiz.net> wrote:
Interesting about the fading relevancy of big endian platforms.  Just in case the point was lost, I meant the macro-style approach in contrast with this (from Squeak-dev):

=============== Diff against Collections-cbc.650 ===============

Item was changed:
  ----- Method: ByteArray>>unsignedLong64At:bigEndian: (in category 'platform independent access') -----
  unsignedLong64At: index bigEndian: aBool
+ 	"Avoid as much largeInteger as we can"
+ 	| b0 b2 b3 b5 b6 w n2 n3 |
+
+ 	aBool ifFalse: [
+ 		w := self at: index.
+ 		b6 := self at: index+1.
+ 		b5 := self at: index+2.
+ 		n2 := self at: index+3.
+ 		b3 := self at: index+4.
+ 		b2 := self at: index+5.
+ 		n3 := self at: index+6.
+ 		b0 := self at: index+7.
+ 	] ifTrue: [
+ 		b0 := self at: index.
+ 		n3 := self at: index+1.
+ 		b2 := self at: index+2.
+ 		b3 := self at: index+3.
+ 		n2 := self at: index+4.
+ 		b5 := self at: index+5.
+ 		b6 := self at: index+6.
+ 		w := self at: index+7.
+ 	].
+
+ 	"Minimize LargeInteger arithmetic"
+ 	b6 = 0 ifFalse:[w := (b6 bitShift: 8) + w].
+ 	b5 = 0 ifFalse:[w := (b5 bitShift: 16) + w].
+
+ 	b3 = 0 ifFalse:[n2 := (b3 bitShift: 8) + n2].
+ 	b2 = 0 ifFalse:[n2 := (b2 bitShift: 16) + n2].
+ 	n2 == 0 ifFalse: [w := (n2 bitShift: 24) + w].
+
+ 	b0 = 0 ifFalse:[n3 := (b0 bitShift: 8) + n3].
+ 	n3 == 0 ifFalse: [w := (n3 bitShift: 48) + w].
+
+ 	^w!
- 	| n1 n2 |
- 	aBool
- 		ifTrue: [
- 			n2 := self unsignedLongAt: index bigEndian: true.
- 			n1 := self unsignedLongAt: index+4 bigEndian: true.
- 		]
- 		ifFalse: [
- 			n1 := self unsignedLongAt: index bigEndian: false.
- 			n2 := self unsignedLongAt: index+4 bigEndian: false.
- 		].
- 	^(n2 bitShift: 32) + n1!

I'd rather have that pushed down enough so that the compiler intrinsic becomes visible.  And at that point, all that code is reduced to a single instruction.

Andres.

On 8/31/15 19:12 , Eliot Miranda wrote:

Hi Andres,

On Aug 31, 2015, at 5:52 PM, Andres Valloud <avalloud@smalltalk.comcastbiz.net> wrote:

FWIW... IMO it's better to enable access to the relevant compiler intrinsic with platform specific macros, rather than implementing instructions such as Intel's BSWAP or MOVBE by hand.  In HPS, isolating endianness concerns from the large integer arithmetic primitives with such macros enabled 25-40% faster performance on big endian platforms.  Just as importantly, the intrinsic approach takes significantly less code to implement.

Makes sense, and the performance increases are impressive.  The only issue I have is that the Cog JIT (which would have the easiest time generating those intrinsics) currently runs only in little-endianness platforms and I seriously doubt it will run in a big endianness platform in the next five years.  PowerPC is the only possibility I see.  Yes, ARM is biendian but all the popular applications I know of are little endian.  VW's a different beast; significant big endian legacy.

But what you say about isolating makes perfect sense.  Thanks

On 8/31/15 10:25 , Eliot Miranda wrote:

Hi Chrises,

my vote would be to write these as 12 numbered primitives, (2, 4 & 8 bytes) * (at: & at:put:) * (big & little endian) because they can be performance critical and implementing them like this means the maximum efficiency in both 32-bit and 64-bit Spur, plus the possibility of the JIT implementing the primitives.

On Sun, Aug 30, 2015 at 10:01 PM, Chris Cunningham <cunningham.cb@gmail.com> wrote:

Hi Chris,

I'm all for having the fastest that in the image that works.  If you could make your version handle endianess, then I'm all for including it (at least in the 3 variants that are faster).  My first use for this (interface for KAFKA) apparently requires bigEndianess, so I really want that supported.

It might be best to keep my naming, though - it follows the name pattern that is already in the class.  Or will yours also support 128?

-cbc

On Sun, Aug 30, 2015 at 2:38 PM, Chris Muller <asqueaker@gmail.com> wrote:

Hi Chris, I think these methods belong in the image with the fastest implementation we can do.

I implemented 64-bit unsigned access for Ma Serializer back in 2005.  I modeled my implementation after Andreas' original approach which tries to avoid LI arithmetic.  I was curious whether your implementations would be faster, because if they are then it could benefit Magma.  After loading "Ma Serializer" 1.5 (or head) into a trunk image, I used the following script to take comparison measurements:

| smallN largeN maBa cbBa |
smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
maBa := ByteArray new: 8.
cbBa := ByteArray new: 8.
maBa maUint: 64 at: 0 put: largeN.
cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false).
{ 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false] bench.
'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN ] bench.
'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false. ] bench.
'ma smallN access' -> [ cbBa maUnsigned64At: 1] bench.
'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false] bench.
'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN ] bench.
'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
'ma largeN access' -> [ cbBa maUnsigned64At: 1] bench. }

Here are the results:

'cbc smallN write'->'3,110,000 per second.  322 nanoseconds per run.' .
'ma smallN write'->'4,770,000 per second.  210 nanoseconds per run.' .
'cbc smallN access'->'4,300,000 per second.  233 nanoseconds per run.' .
'ma smallN access'->'16,400,000 per second.  60.9 nanoseconds per run.' .
'cbc largeN write'->'907,000 per second.  1.1 microseconds per run.' .
'ma largeN write'->'6,620,000 per second.  151 nanoseconds per run.' .
'cbc largeN access'->'1,900,000 per second.  527 nanoseconds per run.' .
'ma largeN access'->'1,020,000 per second.  982 nanoseconds per run.'

It looks like your 64-bit access is 86% faster for accessing the high-end of the 64-bit range, but slower in the other 3 metrics.  Noticeably, it was only 14% as fast for writing the high-end of the 64-bit range, and similarly as much slower for small-number access..

On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham <cunningham.cb@gmail.com> wrote:
> Hi.
>
> I've committed a change to the inbox with changes to allow getting/putting
> 64bit values to ByteArrays (similar to 32 and 16 bit accessors). Could this
> be added to trunk?
>
> Also, first time I used the selective commit function - very nice! the
> changes I didn't want committed didn't, in fact, get commited. Just the
> desirable bits!
>
> -cbc

--
_,,,^..^,,,_
best, Eliot