Hi.
I've committed a change to the inbox that allows getting/putting 64-bit values to ByteArrays (similar to the 32- and 16-bit accessors). Could this be added to trunk?
Also, this was the first time I used the selective commit function - very nice! The changes I didn't want committed didn't, in fact, get committed. Just the desirable bits!
-cbc
Hi Chris, I think these methods belong in the image with the fastest implementation we can do.
I implemented 64-bit unsigned access for Ma Serializer back in 2005. I modeled my implementation after Andreas' original approach, which tries to avoid LargeInteger (LI) arithmetic. I was curious whether your implementations would be faster, because if they are then it could benefit Magma. After loading "Ma Serializer" 1.5 (or head) into a trunk image, I used the following script to take comparison measurements:
| smallN largeN maBa cbBa |
smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
maBa := ByteArray new: 8.
cbBa := ByteArray new: 8.
maBa maUint: 64 at: 0 put: largeN.
cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false).
{
	'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false ] bench.
	'ma smallN write' -> [ cbBa maUint: 64 at: 0 put: smallN ] bench.
	'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
	'ma smallN access' -> [ cbBa maUnsigned64At: 1 ] bench.
	'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false ] bench.
	'ma largeN write' -> [ cbBa maUint: 64 at: 0 put: largeN ] bench.
	'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
	'ma largeN access' -> [ cbBa maUnsigned64At: 1 ] bench.
}
Here are the results:
'cbc smallN write'->'3,110,000 per second. 322 nanoseconds per run.' .
'ma smallN write'->'4,770,000 per second. 210 nanoseconds per run.' .
'cbc smallN access'->'4,300,000 per second. 233 nanoseconds per run.' .
'ma smallN access'->'16,400,000 per second. 60.9 nanoseconds per run.' .
'cbc largeN write'->'907,000 per second. 1.1 microseconds per run.' .
'ma largeN write'->'6,620,000 per second. 151 nanoseconds per run.' .
'cbc largeN access'->'1,900,000 per second. 527 nanoseconds per run.' .
'ma largeN access'->'1,020,000 per second. 982 nanoseconds per run.'
It looks like your 64-bit access is 86% faster for accessing the high end of the 64-bit range (1,900,000 vs. 1,020,000 per second), but slower in the other three metrics. Notably, it was only 14% as fast for writing the high end of the 64-bit range (907,000 vs. 6,620,000 per second), and similarly slower for small-number access.
On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham cunningham.cb@gmail.com wrote:
Hi Chris,
I'm all for having the fastest version in the image that works. If you could make your version handle endianness, then I'm all for including it (at least the 3 variants that are faster). My first use for this (an interface for Kafka) apparently requires big-endianness, so I really want that supported.
It might be best to keep my naming, though - it follows the naming pattern that is already in the class. Or will yours also support 128-bit?
-cbc
On Sun, Aug 30, 2015 at 2:38 PM, Chris Muller asqueaker@gmail.com wrote:
Hi Chrises,
my vote would be to write these as 12 numbered primitives, (2, 4 & 8 bytes) * (at: & at:put:) * (big & little endian), because they can be performance critical, and implementing them like this means maximum efficiency in both 32-bit and 64-bit Spur, plus the possibility of the JIT implementing the primitives.
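The advantage of fixed-width numbered primitives is that the width and byte order are compile-time constants. A hypothetical C sketch (illustrative only, not the actual Spur VM primitive code; the names are made up) of the unsigned 64-bit little-endian get/put pair shows the shape a compiler or JIT can collapse into a single load or store:

```c
#include <stdint.h>

/* Hypothetical fixed-width accessors: unsigned 64-bit, little-endian.
   Modern compilers recognise this byte-assembly pattern and emit a
   single 64-bit load/store (plus BSWAP for the big-endian variants). */
static uint64_t uint64AtLE(const uint8_t *p)
{
    return  (uint64_t)p[0]       | (uint64_t)p[1] <<  8
          | (uint64_t)p[2] << 16 | (uint64_t)p[3] << 24
          | (uint64_t)p[4] << 32 | (uint64_t)p[5] << 40
          | (uint64_t)p[6] << 48 | (uint64_t)p[7] << 56;
}

static void uint64AtLEPut(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (8 * i));  /* store bytes lowest first */
}
```

Because nothing here depends on a run-time width argument, each of the 12 variants stays branch-free on the hot path.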
On Sun, Aug 30, 2015 at 10:01 PM, Chris Cunningham cunningham.cb@gmail.com wrote:
Sometimes the number of bytes is only known in a variable, so would it be possible to do 4 primitives which accept the number of bits (or bytes) as an argument? (uint:at: uint:at:put:) * (big endian, little endian)
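A width-parameterised accessor like the one proposed here can be sketched in C as a per-byte loop (an illustrative helper, not any VM's actual code); the data-dependent shift per byte is also what makes it slower than a fixed-width primitive:

```c
#include <stdint.h>

/* Hypothetical flexible accessor: assemble an unsigned integer from
   nBytes bytes (1..8) in either byte order. The shift amount depends
   on nBytes and the loop index, so this cannot compile down to one
   fixed-width load the way a numbered primitive can. */
static uint64_t uintAt(const uint8_t *p, int nBytes, int bigEndian)
{
    uint64_t v = 0;
    for (int i = 0; i < nBytes; i++)
        v |= (uint64_t)p[i] << (8 * (bigEndian ? nBytes - 1 - i : i));
    return v;
}
```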
On Mon, Aug 31, 2015 at 12:25 PM, Eliot Miranda eliot.miranda@gmail.com wrote:
On Mon, Aug 31, 2015 at 11:35 AM, Chris Muller asqueaker@gmail.com wrote:
Sometimes the number of bytes is only known in a variable, so would it be possible to do 4 primitives which accept the number of bits (or bytes) as an argument? (uint:at: uint:at:put:) * (big endian, little endian)
Of course it's possible, but such an architecture can hardly be quick. If one needs the flexible primitives then use them, but don't hobble the system by providing only them. Having a real 64-bit VM means that the use of two 32-bit accesses is unnecessarily slow.
Which would you rather have, and which do you think would be faster (I don't know, but I have my suspicions):
expand the existing flexible integerAt: prims to integerAt:put:bytes:signed:bigEndian: (yuck), or implement this in terms of a wrapper, something like
ByteArray>>integerAt: index bytes: numBytes signed: signed bigEndian: bigEndian
	| value sign |
	^numBytes >= 4
		ifTrue:
			[numBytes = 8 ifTrue:
				[value := self unsignedLong64At: index.
				 bigEndian ifTrue: [value := self byteReverseEightBytes: value].
				 (sign := value bitShift: -63) ~= 0 ifTrue: "if the VM is intelligent about left shift of zero then this test is unnecessary..."
					[value := value - ((sign bitAnd: 1) bitShift: 64)].
				 ^value].
			 numBytes = 4 ifTrue:
				[value := self unsignedLong32At: index.
				 bigEndian ifTrue: [value := self byteReverseFourBytes: value].
				 (sign := value bitShift: -31) ~= 0 ifTrue: "if the VM is intelligent about left shift of zero then this test is unnecessary..."
					[value := value - ((sign bitAnd: 1) bitShift: 32)].
				 ^value].
			 ^self error: 'numBytes must be a power of two from 1 to 8']
		ifFalse: ...
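The sign-extension step in the sketch above (subtracting 2^64 when the top bit is set) has a compact, fully portable C analogue using the xor-and-subtract trick; this is an illustrative helper, not code from any VM:

```c
#include <stdint.h>

/* Sign-extend an nBytes-wide unsigned value to a signed 64-bit value.
   XORing with the sign bit and then subtracting it is equivalent to
   "value - 2^(8*nBytes) when the top bit is set", and avoids relying on
   implementation-defined right shifts of negative numbers. */
static int64_t signExtend(uint64_t v, int nBytes)
{
    uint64_t signBit = 1ULL << (8 * nBytes - 1);
    return (int64_t)((v ^ signBit) - signBit);
}
```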
On Mon, Aug 31, 2015 at 12:25 PM, Eliot Miranda eliot.miranda@gmail.com wrote:
-- _,,,^..^,,,_ best, Eliot
I would ask that someone please measure the real-world performance benefit of adding these (or any other) numbered primitives. Maybe it's a lot, maybe it's not, but when in doubt leave it out.
Dave
On Mon, Aug 31, 2015 at 10:25:59AM -0700, Eliot Miranda wrote:
FWIW... IMO it's better to enable access to the relevant compiler intrinsics with platform-specific macros, rather than implementing instructions such as Intel's BSWAP or MOVBE by hand. In HPS, isolating endianness concerns from the large-integer arithmetic primitives with such macros enabled 25-40% faster performance on big-endian platforms. Just as importantly, the intrinsic approach takes significantly less code to implement.
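The macro-isolation idea can be sketched as below. The names (BYTE_SWAP_64, byteSwap64Portable) are illustrative, not HPS's actual macros; on GCC/Clang the builtin lets the compiler emit BSWAP or MOVBE directly, while other compilers get a portable shift-based fallback:

```c
#include <stdint.h>

/* Hypothetical endianness-isolation macro: one name at the call sites,
   with the implementation chosen per platform at compile time. */
#if defined(__GNUC__) || defined(__clang__)
# define BYTE_SWAP_64(x) __builtin_bswap64(x)   /* compiles to BSWAP/REV */
#else
static inline uint64_t byteSwap64Portable(uint64_t x)
{
    /* Swap halves, then 16-bit groups, then bytes: 3 steps for 8 bytes. */
    x = (x >> 32) | (x << 32);
    x = ((x & 0xFFFF0000FFFF0000ULL) >> 16) | ((x & 0x0000FFFF0000FFFFULL) << 16);
    x = ((x & 0xFF00FF00FF00FF00ULL) >>  8) | ((x & 0x00FF00FF00FF00FFULL) <<  8);
    return x;
}
# define BYTE_SWAP_64(x) byteSwap64Portable(x)
#endif
```

Call sites stay identical on every platform, which is where the "significantly less code" benefit comes from.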
On 8/31/15 10:25, Eliot Miranda wrote:
Hi All,
A bit later than I wanted, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible toward the "fastest implementation". I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following:
Before:
{
	'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'.
	'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'.
	'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'.
	'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'
}.
After:
{
	'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'.
	'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'.
	'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'.
	'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'
}.
As you can see, everything became faster except smallN access. This is a side effect of optimizing for the average case instead of specific cases, like zero bytes. I decided not to use that trick, because it decreased the overall performance.
I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given number of bits. The result is an array of run times where values at an odd index belong to big-endian access, and those at an even index to little-endian.
| byteArray inputs random storageBits unsigned |
Smalltalk garbageCollect.
random := Random seed: 36rSqueak.
storageBits := 64.
unsigned := true.
byteArray := ByteArray new: storageBits // 8 * 2.
inputs := Array new: 100000.
(2 to: storageBits * 2 + 1) collect: [ :descriptor |
	"The lowest bit describes endianness, the rest the number of bits."
	| limit bigEndian offset |
	bigEndian := descriptor odd.
	limit := 1 << (descriptor >> 1) - 1.
	unsigned
		ifTrue: [ offset := -1 ]
		ifFalse: [ offset := -1 - (limit >> 1) ].
	inputs replace: [ :each | (random nextInt: limit) + offset ].
	[ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex |
		1 to: inputs size do: [ :inputIndex |
			byteArray
				unsignedLong64At: startIndex put: (inputs at: inputIndex) bigEndian: bigEndian;
				unsignedLong64At: startIndex bigEndian: bigEndian ] ] ] timeToRun ].
I ran it with various accessors and got the following results:
"short" #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625". #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".
"long" #(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125". #(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625".
"unsigned short" #(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5". #(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125".
"unsigned long" #(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625". #(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375".
"unsigned long 64" #(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat 536.109375". #(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396). "average asFloat 259.359375"
Levente
On Sun, 30 Aug 2015, Chris Muller wrote:
[snip]
It looks like your 64-bit access is 86% faster for accessing the high end of the 64-bit range, but slower in the other three metrics. Notably, it was only 14% as fast for writing the high end of the 64-bit range, and similarly slower for small-number access.
On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham cunningham.cb@gmail.com wrote:
[snip]
Levente,
Interesting. I have a question and a concern about your implementation, though:
Question: why, in the micro checks, is the Write faster than the Access: {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}, yet in your more thorough benchmark the Write is twice as slow as the Access? "unsigned long 64" (put, or Write) "average asFloat 536.109375" vs. (Access) "average asFloat 259.359375". Any ideas?
The concern is that your code is nicely optimized for our current 32-bit VM, but once we go to 64-bit, I think it will fail. Should we be concerned?
-cbc
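[Editorial note: for readers joining the thread, the semantics (not the optimized Trunk code) of the accessors under discussion can be sketched byte by byte. The following Python is only an illustration of what ByteArray>>unsignedLong64At:put:bigEndian: and ByteArray>>unsignedLong64At:bigEndian: compute, with Smalltalk's 1-based indexing; the actual Smalltalk methods are carefully written to avoid LargeInteger arithmetic, which is exactly what the 32-bit/64-bit discussion here is about.]

```python
def unsigned_long64_at_put(ba: bytearray, index: int, value: int,
                           big_endian: bool) -> int:
    """Store a 64-bit unsigned value at 1-based index, one byte at a time."""
    assert 0 <= value < 1 << 64
    for i in range(8):
        shift = 8 * (7 - i) if big_endian else 8 * i
        ba[index - 1 + i] = (value >> shift) & 0xFF
    return value

def unsigned_long64_at(ba: bytearray, index: int, big_endian: bool) -> int:
    """Read back a 64-bit unsigned value stored with the same endianness."""
    value = 0
    for i in range(8):
        shift = 8 * (7 - i) if big_endian else 8 * i
        value |= ba[index - 1 + i] << shift
    return value
```

In the image, the cost of these eight shifts is what the special-cased implementations try to dodge: on a 32-bit VM, shifting a value above the SmallInteger range produces LargeInteger intermediates on every byte.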
On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi leves@elte.hu wrote:
Hi All,
A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the "fastest implementation". I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following:
Before: {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'. 'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'. 'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'. 'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}.
After: {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
As you can see, everything became faster except for smallN access. This is the side-effect of optimizing for the average case instead of specific cases - like zero bytes. I decided not to use that trick, because it decreased the overall performance.
I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given number of bits. The result is an array of run times where values having an odd index belong to big-endian access, and even ones to little-endian.
| byteArray inputs random storageBits unsigned | Smalltalk garbageCollect. random := Random seed: 36rSqueak. storageBits := 64. unsigned := true. byteArray := ByteArray new: storageBits // 8 * 2. inputs := Array new: 100000. (2 to: storageBits * 2 + 1) collect: [ :descriptor | "lowest bit describes endianness, the rest the number of bits." | limit bigEndian offset | bigEndian := descriptor odd. limit := 1 << (descriptor >> 1) - 1. unsigned ifTrue: [ offset := -1 ] ifFalse: [ offset := -1- (limit >> 1) ]. inputs replace: [ :each | (random nextInt: limit) + offset ]. [ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex | 1 to: inputs size do: [ :inputIndex | byteArray unsignedLong64At: startIndex put: (inputs at: inputIndex) bigEndian: bigEndian; unsignedLong64At: startIndex bigEndian: bigEndian ] ] ] timeToRun ].
I ran it with various accessors and got the following results:
"short" #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625". #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".
[snip]
Hi Chris,
I added the #normalize sends to avoid the creation of spurious LargeIntegers in 64-bit images (there are two places where I relied on #-'s ability to work on unnormalized input). I didn't have a chance to test it, but I expect it to work correctly. Even if the code is sub-optimal in 64-bit, it shouldn't be any slower than in 32-bit.
Levente
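[Editorial note: a rough mental model of what the #normalize sends are for; this is a toy, not the actual Squeak implementation. A LargePositiveInteger stores its value as little-endian base-256 digits, and normalization trims leading zero digits and falls back to the immediate SmallInteger representation when the value fits. Sketched in Python, assuming the 32-bit image's SmallInteger maximum of 2^30 - 1:]

```python
SMALL_MAX = (1 << 30) - 1   # SmallInteger maxVal on a 32-bit image (assumption)

def normalize(digits):
    """Toy LargePositiveInteger normalization: digits is a little-endian
    base-256 digit list. Trim leading (high) zero digits, and return a
    plain int when the value fits in the SmallInteger range."""
    while len(digits) > 1 and digits[-1] == 0:
        digits = digits[:-1]
    value = sum(d << (8 * i) for i, d in enumerate(digits))
    return value if value <= SMALL_MAX else digits
```

An unnormalized result (for example one with leftover high zero digits after a subtraction) would otherwise keep masquerading as a LargeInteger and slow down subsequent arithmetic, which is the "spurious LargeIntegers" issue mentioned above.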
On Tue, 8 Sep 2015, Chris Cunningham wrote:
Levente, Interesting. I have a question and a concern about your implementation, though:
Question: why in the micro checks, is the Write faster than the Access: {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}. yet in your more thorough benchmark, the Write twice as slow as the Access? "unsigned long 64" (put, or Write) "average asFloat 536.109375". (Access) "average asFloat 259.359375" any ideas?
the concern is that your code is nicely optimized for our current 32bit vm - but once we go to 64 bit, I think it will fail. Should we be concerned?
-cbc
On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi leves@elte.hu wrote: Hi All,
A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the "fastest implementation". I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following: Before: {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'. 'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'. 'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'. 'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}. After: {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}. As you can see, everything became faster except for smallN access. This is the side-effect of optimizing for the average case instead of specific cases - like zero bytes. I decided not to use that trick, because it decreased the overall performance. I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given number of bits. The result is an array of run times where values having and odd index belong to big-endian access, and even ones to little-endian. | byteArray inputs random storageBits unsigned | Smalltalk garbageCollect. random := Random seed: 36rSqueak. storageBits := 64. unsigned := true. byteArray := ByteArray new: storageBits // 8 * 2. inputs := Array new: 100000. (2 to: storageBits * 2 + 1) collect: [ :descriptor | "lowest bit describes endianness, the rest the number of bits." | limit bigEndian offset | bigEndian := descriptor odd. limit := 1 << (descriptor >> 1) - 1. unsigned ifTrue: [ offset := -1 ] ifFalse: [ offset := -1- (limit >> 1) ]. inputs replace: [ :each | (random nextInt: limit) + offset ]. 
[ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex | 1 to: inputs size do: [ :inputIndex | byteArray unsignedLong64At: startIndex put: (inputs at: inputIndex) bigEndian: bigEndian; unsignedLong64At: startIndex bigEndian: bigEndian ] ] ] timeToRun ]. I ran it with various accessors and got the following results: "short" #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625". #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5". "long" #(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125". #(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625". "unsigned short" #(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5". #(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125". "unsigned long" #(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625". #(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375". 
"unsigned long 64" #(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat 536.109375". #(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396). "average asFloat 259.359375" Levente On Sun, 30 Aug 2015, Chris Muller wrote: Hi Chris, I think these methods belong in the image with the fastest implementation we can do. I implemented 64-bit unsigned access for Ma Serializer back in 2005. I modeled my implementation after Andreas' original approach which tries to avoid LI arithmetic. I was curious whether your implementations would be faster, because if they are then it could benefit Magma. After loading "Ma Serializer" 1.5 (or head) into a trunk image, I used the following script to take comparison measurements: | smallN largeN maBa cbBa | smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom. largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom. maBa := ByteArray new: 8. cbBa := ByteArray new: 8. maBa maUint: 64 at: 0 put: largeN. cbBa unsignedLong64At: 1 put: largeN bigEndian: false. 
self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false). { 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false] bench. 'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN ] bench. 'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false. ] bench. 'ma smallN access' -> [ cbBa maUnsigned64At: 1] bench. 'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false] bench. 'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN ] bench. 'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench. 'ma largeN access' -> [ cbBa maUnsigned64At: 1] bench. } Here are the results: 'cbc smallN write'->'3,110,000 per second. 322 nanoseconds per run.' . 'ma smallN write'->'4,770,000 per second. 210 nanoseconds per run.' . 'cbc smallN access'->'4,300,000 per second. 233 nanoseconds per run.' . 'ma smallN access'->'16,400,000 per second. 60.9 nanoseconds per run.' . 'cbc largeN write'->'907,000 per second. 1.1 microseconds per run.' . 'ma largeN write'->'6,620,000 per second. 151 nanoseconds per run.' . 'cbc largeN access'->'1,900,000 per second. 527 nanoseconds per run.' . 'ma largeN access'->'1,020,000 per second. 982 nanoseconds per run.' It looks like your 64-bit access is 86% faster for accessing the high-end of the 64-bit range, but slower in the other 3 metrics. Noticeably, it was only 14% as fast for writing the high-end of the 64-bit range, and similarly as much slower for small-number access.. On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham <cunningham.cb@gmail.com> wrote: Hi. I've committed a change to the inbox with changes to allow getting/putting 64bit values to ByteArrays (similar to 32 and 16 bit accessors). Could this be added to trunk? Also, first time I used the selective commit function - very nice! the changes I didn't want committed didn't, in fact, get commited. Just the desirable bits! -cbc
After thinking a bit more about this, I came to the conclusion that while I had paid attention to handling the reader methods correctly in 64-bit images, some of the writer methods are incorrect for "large" SmallInteger inputs. First I implemented the missing parts for 61-bit SmallIntegers using a general-purpose loop, but it turned out that the loop is significantly faster for smaller numbers and only slightly slower for larger ones, so I decided to nuke the special case for 30-bit SmallIntegers. I also found further possibilities to optimize writes, so the average is down from 259 to 236 milliseconds. Chris Muller's benchmark gives
{ 'cbc smallN write'->'12,200,000 per second. 81.7 nanoseconds per run.' . 'cbc smallN access'->'10,300,000 per second. 96.8 nanoseconds per run.' . 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.' . 'cbc largeN access'->'5,270,000 per second. 190 nanoseconds per run.'}
Which means +20% speed for smallN writes, and +18% for largeN access.
Levente
P.S.: I still couldn't test the code in my 64-bit Spur image, because it won't respond to any input after startup.
On Tue, 8 Sep 2015, Levente Uzonyi wrote:
[snip]
Hi Levente,
Sent from my iPhone
On Sep 9, 2015, at 2:11 PM, Levente Uzonyi leves@elte.hu wrote:
[snip]
P.S.: I still couldn't test the code in my 64-bit Spur image, because it won't respond to any input after startup.
IIRC that's due to a bug in earlier versions of the bootstrap in converting negative integers, or perhaps larger positive SmallIntegers. Anyway, I *think* that the latest Spur 64-bit VM and image combination works properly now. Apologies.
On Tue, 8 Sep 2015, Levente Uzonyi wrote:
Hi Chris,
I added the #normalize sends to avoid the creation of spurious LargeIntegers in 64-bit images (there are two places where I relied on #-'s ability to work on unnormalized input). I didn't have a chance to test it, but I expect it to work correctly. Even if the code is sub-optimal in 64-bit, it shouldn't be any slower than in 32-bit.
Levente
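For readers unfamiliar with #normalize: Squeak keeps integers either as immediate SmallIntegers or as boxed LargeIntegers, and #normalize shrinks a LargeInteger result back to a SmallInteger when the value fits, keeping later arithmetic on the fast path. Python ints do this automatically, so the idea can only be sketched with a toy class; the class name and the 61-bit bound below are illustrative assumptions, not Squeak's actual implementation:

```python
SMALL_MAX = (1 << 60) - 1  # assumed 61-bit signed SmallInteger maximum (64-bit Spur)

class LargeUInt:
    # Toy stand-in for LargePositiveInteger: the value is kept as raw
    # little-endian magnitude bytes.
    def __init__(self, magnitude_bytes):
        self.bytes = magnitude_bytes

    def normalize(self):
        # Fall back to a plain int (the "SmallInteger") when the value fits;
        # otherwise keep the boxed representation.
        value = int.from_bytes(self.bytes, 'little')
        if value <= SMALL_MAX:
            return value
        return self
```

Skipping a needed normalize step is harmless for operations that tolerate unnormalized input (as #- does), but leaks slow boxed values into subsequent arithmetic.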
On Tue, 8 Sep 2015, Chris Cunningham wrote:
Levente,

Interesting. I have a question and a concern about your implementation, though:

Question: why, in the micro checks, is the Write faster than the Access:

{'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.

yet in your more thorough benchmark, the Write is twice as slow as the Access?

"unsigned long 64" (put, or Write) "average asFloat 536.109375". (Access) "average asFloat 259.359375"

Any ideas?

The concern is that your code is nicely optimized for our current 32-bit VM - but once we go to 64-bit, I think it will fail. Should we be concerned?

-cbc

On Tue, Sep 8, 2015 at 2:42 AM, Levente Uzonyi leves@elte.hu wrote: Hi All,
A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the "fastest implementation". I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following:

Before: {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'. 'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'. 'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'. 'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}.

After: {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.

As you can see, everything became faster except for smallN access. This is the side-effect of optimizing for the average case instead of specific cases - like zero bytes. I decided not to use that trick, because it decreased the overall performance.

I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given number of bits. The result is an array of run times, where values at an odd index belong to big-endian access and those at an even index to little-endian.

| byteArray inputs random storageBits unsigned |
Smalltalk garbageCollect.
random := Random seed: 36rSqueak.
storageBits := 64.
unsigned := true.
byteArray := ByteArray new: storageBits // 8 * 2.
inputs := Array new: 100000.
(2 to: storageBits * 2 + 1) collect: [ :descriptor |
	"lowest bit describes endianness, the rest the number of bits."
	| limit bigEndian offset |
	bigEndian := descriptor odd.
	limit := 1 << (descriptor >> 1) - 1.
	unsigned
		ifTrue: [ offset := -1 ]
		ifFalse: [ offset := -1 - (limit >> 1) ].
	inputs replace: [ :each | (random nextInt: limit) + offset ].
	[ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex |
		1 to: inputs size do: [ :inputIndex |
			byteArray
				unsignedLong64At: startIndex put: (inputs at: inputIndex) bigEndian: bigEndian;
				unsignedLong64At: startIndex bigEndian: bigEndian ] ] ] timeToRun ].

I ran it with various accessors and got the following results:

"short"
#(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625".
#(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".

"long"
#(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125".
#(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625".

"unsigned short"
#(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5".
#(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125".

"unsigned long"
#(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625".
#(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375".
"unsigned long 64"
#(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat 536.109375".
#(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396) "average asFloat 259.359375".

Levente
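Levente's combined read/write benchmark translates fairly directly to other languages. Here is a hypothetical Python port of the unsigned case, useful for comparing against other runtimes; bench_accessors and its defaults are my names, and Python's int.to_bytes/int.from_bytes stand in for the ByteArray accessors:

```python
import random
import time

def bench_accessors(storage_bits=64, n_inputs=1000, seed=12345):
    # Rough port of the Smalltalk benchmark above (unsigned case only).
    # For each descriptor, the low bit selects endianness and the remaining
    # bits give the random number width; every alignment in a double-width
    # buffer is exercised with a write immediately followed by a read.
    rng = random.Random(seed)
    width = storage_bits // 8
    ba = bytearray(width * 2)
    times = []
    for descriptor in range(2, storage_bits * 2 + 2):
        order = 'big' if descriptor % 2 == 1 else 'little'
        limit = (1 << (descriptor >> 1)) - 1
        inputs = [rng.randint(1, limit) - 1 for _ in range(n_inputs)]
        started = time.perf_counter()
        for start in range(len(ba) - width + 1):
            for v in inputs:
                ba[start:start + width] = v.to_bytes(width, order)
                assert int.from_bytes(ba[start:start + width], order) == v
        times.append(time.perf_counter() - started)
    return times
```

As in the original, odd result indices correspond to big-endian access and even ones to little-endian.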
I forgot to answer your other questions.
On Tue, 8 Sep 2015, Chris Cunningham wrote:
Levente, Interesting. I have a question and a concern about your implementation, though:
Question: why, in the micro checks, is the Write faster than the Access:
Because Access means allocation of new objects (LargeIntegers), while Write simply means copying bytes.
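In other words: a put only copies bytes into the receiver, while a get must materialize a fresh integer object from the stored bytes before the result can be used. Python allocates on both paths, so the snippet below is only a picture of the asymmetry Levente describes, not a faithful model of the Squeak object memory:

```python
ba = bytearray(8)
big = (1 << 63) + 12345  # would need a LargePositiveInteger in 32-bit Squeak

# Write: the value's bytes are scattered into storage that already exists;
# no result object has to be built.
ba[:] = big.to_bytes(8, 'little')

# Access: the eight stored bytes must be assembled into a brand-new
# integer object, which is where the allocation cost lives.
readback = int.from_bytes(ba, 'little')
assert readback == big
```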
{'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}. yet in your more thorough benchmark, the Write is twice as slow as the Access?
For each pair of lines, the first one was measured _before_ my changes and the second one _after_ them. The benchmark measures read and write together, so 536.109375 stands for the average number of milliseconds to read and write nine million numbers using your variant of the 64-bit methods on my machine.
"unsigned long 64" (put, or Write) "average asFloat 536.109375". (Access) "average asFloat 259.359375" any ideas?
So this is
(read and write Before) "average asFloat 536.109375" (read and write After) "average asFloat 259.359375"
Levente
the concern is that your code is nicely optimized for our current 32-bit VM - but once we go to 64-bit, I think it will fail. Should we be concerned?
-cbc
Ok, it makes sense now. Well, it made sense before, I was just slow to see it.
-cbc
On Tue, Sep 8, 2015 at 9:25 AM, Levente Uzonyi leves@elte.hu wrote:
(Ha! What an amazing thread!) Levente, why am I not surprised that you managed to squeeze yet more efficiency out of it. If there were an eXtreme sport for programming, you'd win the gold medal... :)
I've got to try this on the MagmaBenchmarker right now; I'll let you know in a bit.
On Tue, Sep 8, 2015 at 4:42 AM, Levente Uzonyi leves@elte.hu wrote:
Hi All,
A bit later than I wanted to, but I've finally uploaded my versions to the Trunk. I guess I went as far as possible with getting the "fastest implementation". I modified your benchmark to use the same numbers, so that the measurements could be repeated. I got the following:
Before: {'cbc smallN write'->'3,710,000 per second. 269 nanoseconds per run.'. 'cbc smallN access'->'12,000,000 per second. 83.4 nanoseconds per run.'. 'cbc largeN write'->'5,430,000 per second. 184 nanoseconds per run.'. 'cbc largeN access'->'1,370,000 per second. 732 nanoseconds per run.'}.
After: {'cbc smallN write'->'10,400,000 per second. 95.8 nanoseconds per run.'. 'cbc smallN access'->'10,300,000 per second. 97.4 nanoseconds per run.'. 'cbc largeN write'->'12,400,000 per second. 80.4 nanoseconds per run.'. 'cbc largeN access'->'3,920,000 per second. 255 nanoseconds per run.'}.
As you can see, everything became faster except for smallN access. This is the side-effect of optimizing for the average case instead of specific cases
- like zero bytes. I decided not to use that trick, because it decreased the
overall performance.
I also wrote a benchmark which measures reads and writes together. It generates random numbers which can be represented using a given number of bits. The result is an array of run times where values having and odd index belong to big-endian access, and even ones to little-endian.
| byteArray inputs random storageBits unsigned | Smalltalk garbageCollect. random := Random seed: 36rSqueak. storageBits := 64. unsigned := true. byteArray := ByteArray new: storageBits // 8 * 2. inputs := Array new: 100000. (2 to: storageBits * 2 + 1) collect: [ :descriptor | "lowest bit describes endianness, the rest the number of bits." | limit bigEndian offset | bigEndian := descriptor odd. limit := 1 << (descriptor >> 1) - 1. unsigned ifTrue: [ offset := -1 ] ifFalse: [ offset := -1- (limit >> 1) ]. inputs replace: [ :each | (random nextInt: limit) + offset ]. [ 1 to: byteArray size - (storageBits // 8 - 1) do: [ :startIndex | 1 to: inputs size do: [ :inputIndex | byteArray unsignedLong64At: startIndex put: (inputs at: inputIndex) bigEndian: bigEndian; unsignedLong64At: startIndex bigEndian: bigEndian ] ] ] timeToRun ].
I ran it with various accessors and got the following results:
"short" #(28 28 26 26 26 28 26 28 26 28 28 28 26 28 28 28 28 28 28 30 28 28 28 28 28 28 28 28 26 28 28 28) "average asFloat 27.625". #(16 18 18 20 18 20 20 20 18 20 18 18 20 20 20 20 20 20 20 20 18 20 20 20 20 20 20 22 20 20 20 20) "average asFloat 19.5".
"long" #(62 62 66 68 68 70 68 70 68 70 68 70 68 70 68 70 68 70 70 74 70 72 70 72 72 74 72 72 70 74 70 72 70 72 72 76 72 76 72 76 72 76 72 74 72 76 70 76 72 76 70 76 72 76 72 74 72 76 72 74 72 76 570 584) "average asFloat 87.28125". #(66 66 70 70 72 72 72 72 72 72 74 72 72 74 72 72 74 72 74 72 72 72 72 72 74 72 74 72 72 72 72 74 72 74 72 72 72 72 72 74 74 72 72 74 74 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 74 72 116 122) "average asFloat 73.625".
"unsigned short" #(18 18 18 20 16 18 18 18 18 18 18 18 18 20 18 20 18 18 18 18 18 20 20 20 20 20 18 20 18 18 18 18) "average asFloat 18.5". #(18 18 18 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18) "average asFloat 18.125".
"unsigned long" #(46 48 48 50 50 50 48 48 50 48 48 48 46 48 46 48 52 54 52 52 52 54 52 54 52 52 54 54 52 54 52 54 58 58 58 58 58 58 58 58 58 58 56 58 60 58 56 56 60 62 60 62 62 62 60 62 60 62 62 62 384 400 520 694) "average asFloat 82.40625". #(62 62 62 64 64 62 62 62 62 64 64 64 64 64 64 64 62 62 64 62 64 62 64 64 64 64 64 64 64 64 64 64 64 64 62 62 64 64 64 64 62 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 62 100 108 106 298) "average asFloat 69.09375".
"unsigned long 64" #(300 300 300 300 300 300 300 300 300 300 300 300 300 298 302 300 312 306 308 310 308 306 308 308 310 308 308 308 310 308 312 308 318 316 314 318 316 316 318 316 318 316 316 316 318 318 316 316 326 324 326 322 326 322 328 324 326 322 326 322 510 520 592 592 634 618 636 640 652 666 642 644 660 648 642 660 652 646 662 658 636 648 626 632 650 628 632 612 632 620 622 636 626 626 644 632 750 748 812 822 828 858 842 862 898 880 896 840 870 896 926 870 1034 846 880 834 876 824 860 818 848 824 826 864 820 848 820 828) "average asFloat 536.109375". #(166 174 168 174 170 176 168 172 166 172 164 170 166 170 166 172 166 170 166 172 166 172 166 170 166 170 164 170 170 170 168 176 164 170 166 172 166 172 164 174 166 170 168 172 166 172 166 172 166 170 164 170 166 172 164 172 166 172 166 170 238 272 264 484 282 344 284 356 292 362 294 364 288 362 292 366 294 368 290 364 294 374 294 374 296 370 294 374 288 370 290 366 290 368 292 364 302 382 304 388 302 390 298 392 298 384 302 388 302 390 298 386 308 398 304 400 504 402 298 402 298 398 302 398 294 400 298 396). "average asFloat 259.359375"
Levente
On Sun, 30 Aug 2015, Chris Muller wrote:
It looks like your 64-bit access is 86% faster for accessing the high end of the 64-bit range, but slower in the other three metrics. Notably, it was only 14% as fast for writing the high end of the 64-bit range, and similarly slower for small-number access.
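For reference, those percentages follow directly from the per-run times in the results above:

```smalltalk
"largeN access: ma 982 ns vs. cbc 527 ns per run."
982 / 527.0.   "~1.86, i.e. cbc roughly 86% faster"

"largeN write: cbc 1.1 us vs. ma 151 ns per run."
151 / 1100.0.  "~0.14, i.e. cbc roughly 14% as fast"
```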