Hi.
I've committed a change to the inbox that allows getting/putting 64-bit values to ByteArrays (similar to the 32- and 16-bit accessors). Could this be added to trunk?
Also, this was the first time I used the selective commit function - very nice! The changes I didn't want committed didn't, in fact, get committed. Just the desirable bits!
-cbc
Hi Chris, I think these methods belong in the image with the fastest implementation we can do.
I implemented 64-bit unsigned access for Ma Serializer back in 2005. I modeled my implementation after Andreas' original approach, which tries to avoid LargeInteger (LI) arithmetic. I was curious whether your implementations would be faster, because if they are then it could benefit Magma. After loading "Ma Serializer" 1.5 (or head) into a trunk image, I used the following script to take comparison measurements:
| smallN largeN maBa cbBa |
smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
maBa := ByteArray new: 8.
cbBa := ByteArray new: 8.
maBa maUint: 64 at: 0 put: largeN.
cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false).
{
	'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false ] bench.
	'ma smallN write' -> [ cbBa maUint: 64 at: 0 put: smallN ] bench.
	'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
	'ma smallN access' -> [ cbBa maUnsigned64At: 1 ] bench.
	'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false ] bench.
	'ma largeN write' -> [ cbBa maUint: 64 at: 0 put: largeN ] bench.
	'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
	'ma largeN access' -> [ cbBa maUnsigned64At: 1 ] bench.
}
Here are the results:
'cbc smallN write'->'3,110,000 per second. 322 nanoseconds per run.' .
'ma smallN write'->'4,770,000 per second. 210 nanoseconds per run.' .
'cbc smallN access'->'4,300,000 per second. 233 nanoseconds per run.' .
'ma smallN access'->'16,400,000 per second. 60.9 nanoseconds per run.' .
'cbc largeN write'->'907,000 per second. 1.1 microseconds per run.' .
'ma largeN write'->'6,620,000 per second. 151 nanoseconds per run.' .
'cbc largeN access'->'1,900,000 per second. 527 nanoseconds per run.' .
'ma largeN access'->'1,020,000 per second. 982 nanoseconds per run.'
It looks like your 64-bit access is 86% faster for accessing the high end of the 64-bit range (1,900,000 vs. 1,020,000 per second), but slower in the other three metrics. Notably, it was only 14% as fast for writing the high end of the 64-bit range (907,000 vs. 6,620,000 per second), and similarly slower for small-number access.
On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham cunningham.cb@gmail.com wrote:
Hi Chris,
I'm all for having the fastest version in the image that works. If you could make your version handle endianness, then I'm all for including it (at least the 3 variants that are faster). My first use for this (an interface for Kafka) apparently requires big-endianness, so I really want that supported.
It might be best to keep my naming, though - it follows the naming pattern that is already in the class. Or will yours also support 128-bit?
-cbc
On Sun, Aug 30, 2015 at 2:38 PM, Chris Muller asqueaker@gmail.com wrote:
Hi Chrises,
my vote would be to write these as 12 numbered primitives, (2, 4 & 8 bytes) * (at: & at:put:) * (big & little endian), because they can be performance critical, and implementing them like this means maximum efficiency in both 32-bit and 64-bit Spur, plus the possibility of the JIT implementing the primitives.
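The advantage of fixed-width numbered primitives is that the width and byte order are compile-time constants. A hypothetical C sketch (illustrative only, not the actual Spur VM primitive code; the names are made up) of the unsigned 64-bit little-endian get/put pair shows the shape a compiler or JIT can collapse into a single load or store:

```c
#include <stdint.h>

/* Hypothetical fixed-width accessors: unsigned 64-bit, little-endian.
   Modern compilers recognise this byte-assembly pattern and emit a
   single 64-bit load/store (plus BSWAP for the big-endian variants). */
static uint64_t uint64AtLE(const uint8_t *p)
{
    return  (uint64_t)p[0]       | (uint64_t)p[1] <<  8
          | (uint64_t)p[2] << 16 | (uint64_t)p[3] << 24
          | (uint64_t)p[4] << 32 | (uint64_t)p[5] << 40
          | (uint64_t)p[6] << 48 | (uint64_t)p[7] << 56;
}

static void uint64AtLEPut(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (8 * i));  /* store bytes lowest first */
}
```

Because nothing here depends on a run-time width argument, each of the 12 variants stays branch-free on the hot path.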
On Sun, Aug 30, 2015 at 10:01 PM, Chris Cunningham cunningham.cb@gmail.com wrote:
Sometimes the number of bytes is only known in a variable, so would it be possible to do 4 primitives which accept the number of bits (or bytes) as an argument? (uint:at: uint:at:put:) * (big endian, little endian)
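A width-parameterised accessor like the one proposed here can be sketched in C as a per-byte loop (an illustrative helper, not any VM's actual code); the data-dependent shift per byte is also what makes it slower than a fixed-width primitive:

```c
#include <stdint.h>

/* Hypothetical flexible accessor: assemble an unsigned integer from
   nBytes bytes (1..8) in either byte order. The shift amount depends
   on nBytes and the loop index, so this cannot compile down to one
   fixed-width load the way a numbered primitive can. */
static uint64_t uintAt(const uint8_t *p, int nBytes, int bigEndian)
{
    uint64_t v = 0;
    for (int i = 0; i < nBytes; i++)
        v |= (uint64_t)p[i] << (8 * (bigEndian ? nBytes - 1 - i : i));
    return v;
}
```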
On Mon, Aug 31, 2015 at 12:25 PM, Eliot Miranda eliot.miranda@gmail.com wrote:
On Mon, Aug 31, 2015 at 11:35 AM, Chris Muller asqueaker@gmail.com wrote:
Sometimes the number of bytes is only known in a variable, so would it be possible to do 4 primitives which accept the number of bits (or bytes) as an argument? (uint:at: uint:at:put:) * (big endian, little endian)
Of course it's possible, but such an architecture can hardly be quick. If one needs the flexible primitives then use them, but don't hobble the system by providing only them. Having a real 64-bit VM means that the use of two 32-bit accesses is unnecessarily slow.
Which would you rather have, and which do you think would be faster (I don't know, but I have my suspicions):
expand the existing flexible integerAt: prims to integerAt:put:bytes:signed:bigEndian: (yuck), or implement this in terms of a wrapper, something like
ByteArray>>integerAt: index bytes: numBytes signed: signed bigEndian: bigEndian
	| value sign |
	^numBytes >= 4
		ifTrue:
			[numBytes = 8 ifTrue:
				[value := self unsignedLong64At: index.
				 bigEndian ifTrue: [value := self byteReverseEightBytes: value].
				 (sign := value bitShift: -63) ~= 0 ifTrue: "if the VM is intelligent about left shift of zero then this test is unnecessary..."
					[value := value - ((sign bitAnd: 1) bitShift: 64)].
				 ^value].
			 numBytes = 4 ifTrue:
				[value := self unsignedLong32At: index.
				 bigEndian ifTrue: [value := self byteReverseFourBytes: value].
				 (sign := value bitShift: -31) ~= 0 ifTrue: "if the VM is intelligent about left shift of zero then this test is unnecessary..."
					[value := value - ((sign bitAnd: 1) bitShift: 32)].
				 ^value].
			 ^self error: 'numBytes must be a power of two from 1 to 8']
		ifFalse: ...
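The sign-extension step in the sketch above (subtracting 2^64 when the top bit is set) has a compact, fully portable C analogue using the xor-and-subtract trick; this is an illustrative helper, not code from any VM:

```c
#include <stdint.h>

/* Sign-extend an nBytes-wide unsigned value to a signed 64-bit value.
   XORing with the sign bit and then subtracting it is equivalent to
   "value - 2^(8*nBytes) when the top bit is set", and avoids relying on
   implementation-defined right shifts of negative numbers. */
static int64_t signExtend(uint64_t v, int nBytes)
{
    uint64_t signBit = 1ULL << (8 * nBytes - 1);
    return (int64_t)((v ^ signBit) - signBit);
}
```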
On Mon, Aug 31, 2015 at 12:25 PM, Eliot Miranda eliot.miranda@gmail.com wrote:
-- _,,,^..^,,,_ best, Eliot
I would ask that someone please measure the real-world performance benefit of adding these (or any other) numbered primitives. Maybe it's a lot, maybe it's not, but when in doubt leave it out.
Dave
On Mon, Aug 31, 2015 at 10:25:59AM -0700, Eliot Miranda wrote:
FWIW... IMO it's better to enable access to the relevant compiler intrinsics with platform-specific macros, rather than implementing instructions such as Intel's BSWAP or MOVBE by hand. In HPS, isolating endianness concerns from the large-integer arithmetic primitives with such macros enabled 25-40% faster performance on big-endian platforms. Just as importantly, the intrinsic approach takes significantly less code to implement.
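The macro-isolation idea can be sketched as below. The names (BYTE_SWAP_64, byteSwap64Portable) are illustrative, not HPS's actual macros; on GCC/Clang the builtin lets the compiler emit BSWAP or MOVBE directly, while other compilers get a portable shift-based fallback:

```c
#include <stdint.h>

/* Hypothetical endianness-isolation macro: one name at the call sites,
   with the implementation chosen per platform at compile time. */
#if defined(__GNUC__) || defined(__clang__)
# define BYTE_SWAP_64(x) __builtin_bswap64(x)   /* compiles to BSWAP/REV */
#else
static inline uint64_t byteSwap64Portable(uint64_t x)
{
    /* Swap halves, then 16-bit groups, then bytes: 3 steps for 8 bytes. */
    x = (x >> 32) | (x << 32);
    x = ((x & 0xFFFF0000FFFF0000ULL) >> 16) | ((x & 0x0000FFFF0000FFFFULL) << 16);
    x = ((x & 0xFF00FF00FF00FF00ULL) >>  8) | ((x & 0x00FF00FF00FF00FFULL) <<  8);
    return x;
}
# define BYTE_SWAP_64(x) byteSwap64Portable(x)
#endif
```

Call sites stay identical on every platform, which is where the "significantly less code" benefit comes from.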
On 8/31/15 10:25, Eliot Miranda wrote:
Hi All,
A bit later than I wanted, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible toward the "fastest implementation". I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following:
Before:
{
	'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'.
	'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'.
	'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'.
	'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'
}.
After:
{
	'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
	'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
	'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
	'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'
}.
As you can see, everything became faster except smallN access. This is a side effect of optimizing for the average case instead of specific cases, like zero bytes. I decided not to use that trick, because it decreased the overall performance.
I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given number of bits. The result is an array of run times where values at an odd index belong to big-endian access, and those at an even index to little-endian.
| byteArray inputs random storageBits unsigned |
Smalltalk garbageCollect.
random := Random seed: 36rSqueak.
storageBits := 64.
unsigned := true.
byteArray := ByteArray new: storageBits // 8 * 2.
inputs := Array new: 100000.
(2 to: storageBits * 2 + 1) collect: [ :descriptor |
	"The lowest bit describes endianness, the rest the number of bits."
	| limit bigEndian offset |
	bigEndian := descriptor odd.
	limit := 1 << (descriptor >> 1) - 1.
	unsigned
		ifTrue: [ offset := -1 ]
		ifFalse: [ offset := -1 - (limit >> 1) ].
	inputs replace: [ :each | (random nextInt: limit) + offset ].
	[ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex |
		1 to: inputs size do: [ :inputIndex |
			byteArray
				unsignedLong64At: startIndex put: (inputs at: inputIndex) bigEndian: bigEndian;
				unsignedLong64At: startIndex bigEndian: bigEndian ] ] ] timeToRun ].
I ran it with various accessors and got the following results:
"short" #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625". #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".
"long" #(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125". #(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625".
"unsigned short" #(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5". #(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125".
"unsigned long" #(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625". #(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375".
"unsigned long 64" #(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat 536.109375". #(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396). "average asFloat 259.359375"
Levente
On Sun, 30 Aug 2015, Chris Muller wrote:
[snip]
It looks like your 64-bit access is 86% faster for accessing the high end of the 64-bit range, but slower in the other three metrics. Notably, it was only 14% as fast for writing the high end of the 64-bit range, and similarly slower for small-number access.
On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham cunningham.cb@gmail.com wrote:
[snip]
Levente,
Interesting. I have a question and a concern about your implementation, though:
Question: why, in the micro checks, is the Write faster than the Access: {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}, yet in your more thorough benchmark the Write is twice as slow as the Access? "unsigned long 64" (put, or Write) "average asFloat 536.109375" vs. (Access) "average asFloat 259.359375". Any ideas?
The concern is that your code is nicely optimized for our current 32-bit VM, but once we go to 64-bit, I think it will fail. Should we be concerned?
-cbc
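[Editorial note: for readers joining the thread, the semantics (not the optimized Trunk code) of the accessors under discussion can be sketched byte by byte. The following Python is only an illustration of what ByteArray>>unsignedLong64At:put:bigEndian: and ByteArray>>unsignedLong64At:bigEndian: compute, with Smalltalk's 1-based indexing; the actual Smalltalk methods are carefully written to avoid LargeInteger arithmetic, which is exactly what the 32-bit/64-bit discussion here is about.]

```python
def unsigned_long64_at_put(ba: bytearray, index: int, value: int,
                           big_endian: bool) -> int:
    """Store a 64-bit unsigned value at 1-based index, one byte at a time."""
    assert 0 <= value < 1 << 64
    for i in range(8):
        shift = 8 * (7 - i) if big_endian else 8 * i
        ba[index - 1 + i] = (value >> shift) & 0xFF
    return value

def unsigned_long64_at(ba: bytearray, index: int, big_endian: bool) -> int:
    """Read back a 64-bit unsigned value stored with the same endianness."""
    value = 0
    for i in range(8):
        shift = 8 * (7 - i) if big_endian else 8 * i
        value |= ba[index - 1 + i] << shift
    return value
```

In the image, the cost of these eight shifts is what the special-cased implementations try to dodge: on a 32-bit VM, shifting a value above the SmallInteger range produces LargeInteger intermediates on every byte.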
On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi leves@elte.hu wrote:
Hi All,
A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the "fastest implementation". I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following:
Before: {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'. 'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'. 'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'. 'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}.
After: {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
As you can see, everything became faster except for smallN access. This is the side-effect of optimizing for the average case instead of specific cases - like zero bytes. I decided not to use that trick, because it decreased the overall performance.
I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given number of bits. The result is an array of run times where values having an odd index belong to big-endian access, and even ones to little-endian.
| byteArray inputs random storageBits unsigned | Smalltalk garbageCollect. random := Random seed: 36rSqueak. storageBits := 64. unsigned := true. byteArray := ByteArray new: storageBits // 8 * 2. inputs := Array new: 100000. (2 to: storageBits * 2 + 1) collect: [ :descriptor | "lowest bit describes endianness, the rest the number of bits." | limit bigEndian offset | bigEndian := descriptor odd. limit := 1 << (descriptor >> 1) - 1. unsigned ifTrue: [ offset := -1 ] ifFalse: [ offset := -1- (limit >> 1) ]. inputs replace: [ :each | (random nextInt: limit) + offset ]. [ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex | 1 to: inputs size do: [ :inputIndex | byteArray unsignedLong64At: startIndex put: (inputs at: inputIndex) bigEndian: bigEndian; unsignedLong64At: startIndex bigEndian: bigEndian ] ] ] timeToRun ].
I ran it with various accessors and got the following results:
"short" #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625". #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".
[snip]
Hi Chris,
I added the #normalize sends to avoid the creation of spurious LargeIntegers in 64-bit images (there are two places where I relied on #-'s ability to work on unnormalized input). I didn't have a chance to test it, but I expect it to work correctly. Even if the code is sub-optimal in 64-bit, it shouldn't be any slower than in 32-bit.
Levente
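[Editorial note: a rough mental model of what the #normalize sends are for; this is a toy, not the actual Squeak implementation. A LargePositiveInteger stores its value as little-endian base-256 digits, and normalization trims leading zero digits and falls back to the immediate SmallInteger representation when the value fits. Sketched in Python, assuming the 32-bit image's SmallInteger maximum of 2^30 - 1:]

```python
SMALL_MAX = (1 << 30) - 1   # SmallInteger maxVal on a 32-bit image (assumption)

def normalize(digits):
    """Toy LargePositiveInteger normalization: digits is a little-endian
    base-256 digit list. Trim leading (high) zero digits, and return a
    plain int when the value fits in the SmallInteger range."""
    while len(digits) > 1 and digits[-1] == 0:
        digits = digits[:-1]
    value = sum(d << (8 * i) for i, d in enumerate(digits))
    return value if value <= SMALL_MAX else digits
```

An unnormalized result (for example one with leftover high zero digits after a subtraction) would otherwise keep masquerading as a LargeInteger and slow down subsequent arithmetic, which is the "spurious LargeIntegers" issue mentioned above.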
On Tue, 8 Sep 2015, Chris Cunningham wrote:
Levente, Interesting. I have a question and a concern about your implementation, though:
Question: why in the micro checks, is the Write faster than the Access: {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}. yet in your more thorough benchmark, the Write twice as slow as the Access? "unsigned long 64" (put, or Write) "average asFloat 536.109375". (Access) "average asFloat 259.359375" any ideas?
the concern is that your code is nicely optimized for our current 32bit vm - but once we go to 64 bit, I think it will fail. Should we be concerned?
-cbc
On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi leves@elte.hu wrote: Hi All,
A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the "fastest implementation". I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following: Before: {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'. 'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'. 'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'. 'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}. After: {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}. As you can see, everything became faster except for smallN access. This is the side-effect of optimizing for the average case instead of specific cases - like zero bytes. I decided not to use that trick, because it decreased the overall performance. I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given number of bits. The result is an array of run times where values having and odd index belong to big-endian access, and even ones to little-endian. | byteArray inputs random storageBits unsigned | Smalltalk garbageCollect. random := Random seed: 36rSqueak. storageBits := 64. unsigned := true. byteArray := ByteArray new: storageBits // 8 * 2. inputs := Array new: 100000. (2 to: storageBits * 2 + 1) collect: [ :descriptor | "lowest bit describes endianness, the rest the number of bits." | limit bigEndian offset | bigEndian := descriptor odd. limit := 1 << (descriptor >> 1) - 1. unsigned ifTrue: [ offset := -1 ] ifFalse: [ offset := -1- (limit >> 1) ]. inputs replace: [ :each | (random nextInt: limit) + offset ]. 
[ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex | 1 to: inputs size do: [ :inputIndex | byteArray unsignedLong64At: startIndex put: (inputs at: inputIndex) bigEndian: bigEndian; unsignedLong64At: startIndex bigEndian: bigEndian ] ] ] timeToRun ]. I ran it with various accessors and got the following results: "short" #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625". #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5". "long" #(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125". #(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625". "unsigned short" #(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5". #(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125". "unsigned long" #(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625". #(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375". 
"unsigned long 64" #(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat 536.109375". #(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396). "average asFloat 259.359375" Levente On Sun, 30 Aug 2015, Chris Muller wrote: Hi Chris, I think these methods belong in the image with the fastest implementation we can do. I implemented 64-bit unsigned access for Ma Serializer back in 2005. I modeled my implementation after Andreas' original approach which tries to avoid LI arithmetic. I was curious whether your implementations would be faster, because if they are then it could benefit Magma. After loading "Ma Serializer" 1.5 (or head) into a trunk image, I used the following script to take comparison measurements: | smallN largeN maBa cbBa | smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom. largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom. maBa := ByteArray new: 8. cbBa := ByteArray new: 8. maBa maUint: 64 at: 0 put: largeN. cbBa unsignedLong64At: 1 put: largeN bigEndian: false. 
self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false). { 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false] bench. 'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN ] bench. 'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false. ] bench. 'ma smallN access' -> [ cbBa maUnsigned64At: 1] bench. 'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false] bench. 'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN ] bench. 'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench. 'ma largeN access' -> [ cbBa maUnsigned64At: 1] bench. } Here are the results: 'cbc smallN write'->'3,110,000 per second. 322 nanoseconds per run.' . 'ma smallN write'->'4,770,000 per second. 210 nanoseconds per run.' . 'cbc smallN access'->'4,300,000 per second. 233 nanoseconds per run.' . 'ma smallN access'->'16,400,000 per second. 60.9 nanoseconds per run.' . 'cbc largeN write'->'907,000 per second. 1.1 microseconds per run.' . 'ma largeN write'->'6,620,000 per second. 151 nanoseconds per run.' . 'cbc largeN access'->'1,900,000 per second. 527 nanoseconds per run.' . 'ma largeN access'->'1,020,000 per second. 982 nanoseconds per run.' It looks like your 64-bit access is 86% faster for accessing the high-end of the 64-bit range, but slower in the other 3 metrics. Noticeably, it was only 14% as fast for writing the high-end of the 64-bit range, and similarly as much slower for small-number access.. On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham <cunningham.cb@gmail.com> wrote: Hi. I've committed a change to the inbox with changes to allow getting/putting 64bit values to ByteArrays (similar to 32 and 16 bit accessors). Could this be added to trunk? Also, first time I used the selective commit function - very nice! the changes I didn't want committed didn't, in fact, get commited. Just the desirable bits! -cbc
After thinking a bit more about this, I came to the conclusion that while I had paid attention to handling the reader methods correctly in 64-bit images, some of the writer methods are incorrect for "large" SmallInteger inputs. First I implemented the missing parts for 61-bit SmallIntegers using a general-purpose loop, but it turned out that the loop is significantly faster for smaller numbers and only slightly slower for larger ones, so I decided to nuke the special case for 30-bit SmallIntegers. I also found further possibilities to optimize writes, so the average is down from 259 to 236 milliseconds. Chris Muller's benchmark gives
{ 'cbc smallN write'->'12,200,000 per second. 81.7 nanoseconds per run.' . 'cbc smallN access'->'10,300,000 per second. 96.8 nanoseconds per run.' . 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.' . 'cbc largeN access'->'5,270,000 per second. 190 nanoseconds per run.'}
Which means +20% speed for smallN writes, and +18% for largeN access.
Levente
P.S.: I still couldn't test the code in my 64-bit Spur image, because it won't respond to any input after startup.
On Tue, 8 Sep 2015, Levente Uzonyi wrote:
[snip]
Hi Levente,
Sent from my iPhone
On Sep 9, 2015, at 2:11 PM, Levente Uzonyi leves@elte.hu wrote:
[snip]
P.S.: I still couldn't test the code in my 64-bit Spur image, because it won't respond to any input after startup.
IIRC that's due to a bug in earlier versions of the bootstrap in converting negative integers, or perhaps larger positive SmallIntegers. Anyway, I *think* that the latest Spur 64-bit VM and image combination works properly now. Apologies.
On Tue, 8 Sep 2015, Levente Uzonyi wrote:
Hi Chris,
I added the #normalize sends to avoid the creation of spurious LargeIntegers in 64-bit images (there are two places where I relied on #-'s ability to work on unnormalized input). I didn't have a chance to test it, but I expect it to work correctly. Even if the code is sub-optimal in 64-bit, it shouldn't be any slower than in 32-bit.
Levente
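For readers unfamiliar with #normalize: Squeak keeps integers either as immediate SmallIntegers or as boxed LargeIntegers, and #normalize shrinks a LargeInteger result back to a SmallInteger when the value fits, keeping later arithmetic on the fast path. Python ints do this automatically, so the idea can only be sketched with a toy class; the class name and the 61-bit bound below are illustrative assumptions, not Squeak's actual implementation:

```python
SMALL_MAX = (1 << 60) - 1  # assumed 61-bit signed SmallInteger maximum (64-bit Spur)

class LargeUInt:
    # Toy stand-in for LargePositiveInteger: the value is kept as raw
    # little-endian magnitude bytes.
    def __init__(self, magnitude_bytes):
        self.bytes = magnitude_bytes

    def normalize(self):
        # Fall back to a plain int (the "SmallInteger") when the value fits;
        # otherwise keep the boxed representation.
        value = int.from_bytes(self.bytes, 'little')
        if value <= SMALL_MAX:
            return value
        return self
```

Skipping a needed normalize step is harmless for operations that tolerate unnormalized input (as #- does), but leaks slow boxed values into subsequent arithmetic.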
On Tue, 8 Sep 2015, Chris Cunningham wrote:
Levente,

Interesting. I have a question and a concern about your implementation, though:

Question: why, in the micro checks, is the Write faster than the Access:

{'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.

yet in your more thorough benchmark, the Write is twice as slow as the Access?

"unsigned long 64" (put, or Write) "average asFloat 536.109375". (Access) "average asFloat 259.359375"

Any ideas?

The concern is that your code is nicely optimized for our current 32-bit VM - but once we go to 64-bit, I think it will fail. Should we be concerned?

-cbc

On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi leves@elte.hu wrote: Hi All,
A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the "fastest implementation". I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following:

Before: {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'. 'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'. 'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'. 'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}.

After: {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.

As you can see, everything became faster except for smallN access. This is the side-effect of optimizing for the average case instead of specific cases - like zero bytes. I decided not to use that trick, because it decreased the overall performance.

I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given number of bits. The result is an array of run times, where values at an odd index belong to big-endian access and those at an even index to little-endian.

| byteArray inputs random storageBits unsigned |
Smalltalk garbageCollect.
random := Random seed: 36rSqueak.
storageBits := 64.
unsigned := true.
byteArray := ByteArray new: storageBits // 8 * 2.
inputs := Array new: 100000.
(2 to: storageBits * 2 + 1) collect: [ :descriptor |
	"lowest bit describes endianness, the rest the number of bits."
	| limit bigEndian offset |
	bigEndian := descriptor odd.
	limit := 1 << (descriptor >> 1) - 1.
	unsigned
		ifTrue: [ offset := -1 ]
		ifFalse: [ offset := -1 - (limit >> 1) ].
	inputs replace: [ :each | (random nextInt: limit) + offset ].
	[ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex |
		1 to: inputs size do: [ :inputIndex |
			byteArray
				unsignedLong64At: startIndex put: (inputs at: inputIndex) bigEndian: bigEndian;
				unsignedLong64At: startIndex bigEndian: bigEndian ] ] ] timeToRun ].

I ran it with various accessors and got the following results:

"short"
#(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625".
#(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".

"long"
#(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125".
#(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625".

"unsigned short"
#(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5".
#(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125".

"unsigned long"
#(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625".
#(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375".
"unsigned long 64"
#(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat 536.109375".
#(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396) "average asFloat 259.359375".

Levente
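Levente's combined read/write benchmark translates fairly directly to other languages. Here is a hypothetical Python port of the unsigned case, useful for comparing against other runtimes; bench_accessors and its defaults are my names, and Python's int.to_bytes/int.from_bytes stand in for the ByteArray accessors:

```python
import random
import time

def bench_accessors(storage_bits=64, n_inputs=1000, seed=12345):
    # Rough port of the Smalltalk benchmark above (unsigned case only).
    # For each descriptor, the low bit selects endianness and the remaining
    # bits give the random number width; every alignment in a double-width
    # buffer is exercised with a write immediately followed by a read.
    rng = random.Random(seed)
    width = storage_bits // 8
    ba = bytearray(width * 2)
    times = []
    for descriptor in range(2, storage_bits * 2 + 2):
        order = 'big' if descriptor % 2 == 1 else 'little'
        limit = (1 << (descriptor >> 1)) - 1
        inputs = [rng.randint(1, limit) - 1 for _ in range(n_inputs)]
        started = time.perf_counter()
        for start in range(len(ba) - width + 1):
            for v in inputs:
                ba[start:start + width] = v.to_bytes(width, order)
                assert int.from_bytes(ba[start:start + width], order) == v
        times.append(time.perf_counter() - started)
    return times
```

As in the original, odd result indices correspond to big-endian access and even ones to little-endian.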
I forgot to answer your other questions.
On Tue, 8 Sep 2015, Chris Cunningham wrote:
Levente, Interesting. I have a question and a concern about your implementation, though:
Question: why, in the micro checks, is the Write faster than the Access:
Because Access means allocation of new objects (LargeIntegers), while Write simply means copying bytes.
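In other words: a put only copies bytes into the receiver, while a get must materialize a fresh integer object from the stored bytes before the result can be used. Python allocates on both paths, so the snippet below is only a picture of the asymmetry Levente describes, not a faithful model of the Squeak object memory:

```python
ba = bytearray(8)
big = (1 << 63) + 12345  # would need a LargePositiveInteger in 32-bit Squeak

# Write: the value's bytes are scattered into storage that already exists;
# no result object has to be built.
ba[:] = big.to_bytes(8, 'little')

# Access: the eight stored bytes must be assembled into a brand-new
# integer object, which is where the allocation cost lives.
readback = int.from_bytes(ba, 'little')
assert readback == big
```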
{'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}. yet in your more thorough benchmark, the Write is twice as slow as the Access?
For each pair of lines, the first one was measured _before_ my changes and the second one _after_ them. The benchmark measures read and write together, so 536.109375 stands for the average number of milliseconds to read and write nine million numbers using your variant of the 64-bit methods on my machine.
"unsigned long 64" (put, or Write) "average asFloat 536.109375". (Access) "average asFloat 259.359375" any ideas?
So this is
(read and write Before) "average asFloat 536.109375" (read and write After) "average asFloat 259.359375"
Levente
the concern is that your code is nicely optimized for our current 32-bit VM - but once we go to 64-bit, I think it will fail. Should we be concerned?
-cbc
Ok, it makes sense now. Well, it made sense before, I was just slow to see it.
-cbc
On Tue, Sep 8, 2015 at 9:25 AM, Levente Uzonyi leves@elte.hu wrote:
(Ha! What an amazing thread!) Levente, why am I not surprised that you managed to squeeze yet more efficiency out of it. If there were an eXtreme sport for programming, you'd win the gold medal... :)
I've got to try this on the MagmaBenchmarker right now; I'll let you know in a bit.
On Tue, Sep 8, 2015 at 4:42 AM, Levente Uzonyi leves@elte.hu wrote:
Hi All,
A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the "fastest implementation". I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following:
Before: {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'. 'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'. 'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'. 'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}.
After: {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
As you can see, everything became faster except for smallN access. This is the side-effect of optimizing for the average case instead of specific cases
- like zero bytes. I decided not to use that trick, because it decreased the
overall performance.
I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given number of bits. The result is an array of run times where values having and odd index belong to big-endian access, and even ones to little-endian.
| byteArray inputs random storageBits unsigned | Smalltalk garbageCollect. random := Random seed: 36rSqueak. storageBits := 64. unsigned := true. byteArray := ByteArray new: storageBits // 8 * 2. inputs := Array new: 100000. (2 to: storageBits * 2 + 1) collect: [ :descriptor | "lowest bit describes endianness, the rest the number of bits." | limit bigEndian offset | bigEndian := descriptor odd. limit := 1 << (descriptor >> 1) - 1. unsigned ifTrue: [ offset := -1 ] ifFalse: [ offset := -1- (limit >> 1) ]. inputs replace: [ :each | (random nextInt: limit) + offset ]. [ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex | 1 to: inputs size do: [ :inputIndex | byteArray unsignedLong64At: startIndex put: (inputs at: inputIndex) bigEndian: bigEndian; unsignedLong64At: startIndex bigEndian: bigEndian ] ] ] timeToRun ].
I ran it with various accessors and got the following results:
"short" #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625". #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".
"long" #(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125". #(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625".
"unsigned short" #(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5". #(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125".
"unsigned long" #(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625". #(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375".
"unsigned long 64" #(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat 536.109375". #(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396). "average asFloat 259.359375"
Levente
On Sun, 30 Aug 2015, Chris Muller wrote:
It looks like your 64-bit access is 86% faster for accessing the high end of the 64-bit range, but slower in the other three metrics. Notably, it was only 14% as fast for writing the high end of the 64-bit range, and similarly slower for small-number access.
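For reference, those percentages follow directly from the per-run times in the results above:

```smalltalk
"largeN access: ma 982 ns vs. cbc 527 ns per run."
982 / 527.0.   "~1.86, i.e. cbc roughly 86% faster"

"largeN write: cbc 1.1 us vs. ma 151 ns per run."
151 / 1100.0.  "~0.14, i.e. cbc roughly 14% as fast"
```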