Hi All,
I see that Float 32-bit word order is big-endian (PowerPC) on all platforms. This is a pain for performance and a pain for code generation in Cog. For example, using SSE2 it is trivial to swizzle a PowerPC-layout Float into an xmm register with the PSHUFD instruction, but tediously verbose to swizzle on write: one has to swizzle in an xmm register, which is destructive, so it takes three instructions (shuffle, write, unshuffle) just to write a Float result. Yes, OK, 2 extra instructions is small potatoes, but they're still starch. So I wonder what the impact would be of maintaining Floats in platform order? There are a number of possible solutions.
1. Floats are always in platform order and swizzled on image load when moving from little-endian to big-endian or vice versa. Image code must be rewritten to take the platform's endianness into account. (requires an image rewrite)
2. As for 1, but the image is isolated from the change by providing two primitives, primitiveFloatAt and primitiveFloatAtPut, which are implemented with selectors at:, basicAt:, at:put: and basicAt:put: on Float. These primitives map index 1 onto the most significant word and index 2 onto the least significant word. (requires no image rewrite, but does require a file-in of the four implementations)
3. As for 1, but the image is isolated from the change by providing four primitives: primitiveFloatLowWord, primitiveFloatLowWordPut, primitiveFloatHighWord & primitiveFloatHighWordPut. (requires as much of a rewrite of image code as 1)
4. As per 1, but provide two primitives, primitiveFloatBits and primitiveFloatBitsPut, which answer or store 64-bit non-negative integers. (requires as much of a rewrite of image code as 1 but is cleaner and scales to 128-bit floats)
5. Modify the existing at:[put:] primitives to check for Float receivers, e.g. (and in our Qwaq images Float has a compact class index of 6) from commonVariable:at:cacheIndex:
    fmt < 8 ifTrue: "Bitmap (& Float!!)"
        [(self compactClassIndexOf: oop) == ClassFloatCompactClassIndex
            ifTrue: [result := self fetchLong32: 2 - index ofObject: rcvr]
            ifFalse: [result := self fetchLong32: index - 1 ofObject: rcvr].
        ^self positive32BitIntegerFor: result].
This slows down at: access for Bitmap and complicates an already overcomplicated, and performance-critical, primitive.
6. Eat it: do the swizzling on every Float access.
Option 6 looks painless but is actually absurd, because we'd be throwing away performance for no good reason.
Option 5, ditto, except that the cost falls on Bitmap access rather than Float access (and Bitmap is used in the VM simulator ;) ).
Option 2 is my recommendation: of the solutions that provide maximum performance, it demands the least effort from adopters.
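To make the layouts concrete, here is a minimal Python sketch rather than VM code. It assumes the current layout stores the most significant 32-bit word first with each word in platform byte order, which is how I read "PowerPC word order" above. The function names are hypothetical, and float_at only mirrors the index mapping proposed for primitiveFloatAt in option 2.

```python
import struct

def high_low_words(value):
    """Split an IEEE 754 double into (most significant, least significant) 32-bit words."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    return bits >> 32, bits & 0xFFFFFFFF

def powerpc_word_order(value, byteorder):
    """Assumed current layout: high word first, each word in platform byte order."""
    high, low = high_low_words(value)
    return high.to_bytes(4, byteorder) + low.to_bytes(4, byteorder)

def platform_order(value, byteorder):
    """Proposed layout: the 8 bytes exactly as the host FPU would store them."""
    return struct.pack("<d" if byteorder == "little" else ">d", value)

def swizzle(body):
    """Exchange the two 32-bit words of an 8-byte Float body."""
    return body[4:8] + body[0:4]

def float_at(body, index, byteorder):
    """Option 2 mapping on a platform-order body: index 1 is the most
    significant word, index 2 the least significant, on either endianness."""
    offset = 4 * (2 - index) if byteorder == "little" else 4 * (index - 1)
    return int.from_bytes(body[offset:offset + 4], byteorder)
```

On a little-endian host the two layouts differ by exactly one swizzle; on a big-endian host they are already identical, which is why only little-endian platforms pay the cost today.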
Opinions & alternatives? Especially, what are the likely issues of moving to platform Float order?
Best
Eliot
Does this collide with reverseBytesInImage?
--
John M. McIntosh johnmci@smalltalkconsulting.com
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
On Sat, Apr 18, 2009 at 6:30 PM, John M McIntosh <johnmci@smalltalkconsulting.com> wrote:
> Does this collide with reverseBytesInImage?
yes, but easy to fix:
    (fmt = 6 and: [BytesPerWord = 8]) ifTrue:
        ["Object contains 32-bit half-words packed into 64-bit machine words."
        wordAddr := oop + BaseHeaderSize.
        self reverseWordsFrom: wordAddr to: oop + (self sizeBitsOf: oop)].
=>
    fmt = 6 ifTrue:
        [(self fetchClassOfNonInt: oop) = floatClass
            ifTrue: [self swapWordFrom: oop + BaseHeaderSize to: oop + BaseHeaderSize + 8]
            ifFalse:
                [BytesPerWord = 8 ifTrue:
                    ["Object contains 32-bit half-words packed into 64-bit machine words."
                    wordAddr := oop + BaseHeaderSize.
                    self reverseWordsFrom: wordAddr to: oop + (self sizeBitsOf: oop)]]].
(BTW, is reverseWordsFrom:to: broken for 64-bit images?)
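The effect of that extra word swap can be sketched in Python. This is an illustration only, assuming the loader has already byte-reversed every 32-bit word of a cross-endian image; load_float_body is a hypothetical name standing in for the Float branch above.

```python
import struct

def reverse_bytes_in_words(data):
    """Byte-reverse each 32-bit word, as a cross-endian image load is assumed to do."""
    return b"".join(data[i:i + 4][::-1] for i in range(0, len(data), 4))

def swap_words(body):
    """Exchange the two 32-bit halves of an 8-byte Float body."""
    return body[4:8] + body[0:4]

def load_float_body(saved_body):
    """Hypothetical load step for a platform-order Float saved on the
    opposite endianness: per-word byte reversal followed by a word swap."""
    return swap_words(reverse_bytes_in_words(saved_body))

# A Float saved in platform order by a big-endian VM ...
saved = struct.pack(">d", 0.1)
# ... ends up in little-endian platform order after the two steps.
loaded = load_float_body(saved)
```

Per-word byte reversal alone would leave the Float in high-word-first order, so the word swap is what completes the full 8-byte reversal.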
2009/4/19 Eliot Miranda <eliot.miranda@gmail.com>:
> Hi All, I see that Float 32-bit word order is big-endian (PowerPC) on all platforms.
> [...]
Hmm... what is the practical use of splitting a 32-bit float (or a 64-bit one) into two words? I think that, from the image side, it would be better to treat floats as black boxes without exposing their bit order anywhere. Then we need just two primitives to serialize/deserialize them in a byte array:
    ByteArray>>floatAt: index bigEndian: boolean
    ByteArray>>floatAt: index put: floatValue bigEndian: boolean
(note: endianness should be provided explicitly).
P.S. I am for the swizzling at image load.
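A rough Python analogue of the two proposed ByteArray accessors, as a sketch only. It assumes 64-bit doubles and 1-based indices as in Smalltalk; the function names are hypothetical stand-ins for floatAt:bigEndian: and floatAt:put:bigEndian:.

```python
import struct

def float_at(buf, index, big_endian):
    """Read a 64-bit float starting at 1-based byte index, with explicit endianness."""
    fmt = ">d" if big_endian else "<d"
    return struct.unpack_from(fmt, buf, index - 1)[0]

def float_at_put(buf, index, value, big_endian):
    """Store a 64-bit float at 1-based byte index, with explicit endianness."""
    fmt = ">d" if big_endian else "<d"
    struct.pack_into(fmt, buf, index - 1, value)
    return value

buf = bytearray(8)
float_at_put(buf, 1, 1.5, True)
```

Because the caller names the byte order on every access, nothing in the image depends on how the VM lays out a Float internally.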
On Sat, 2009-04-18 at 18:15 -0700, Eliot Miranda wrote:
> Hi All, I see that Float 32-bit word order is big-endian (PowerPC) on all platforms.
> [...]
I'd like to see Floats stored in native format too. Don't forget about the 32 bit floats in Float arrays.
Bryce
On Sun, Apr 19, 2009 at 6:43 AM, Bryce Kampjes <bryce@kampjes.demon.co.uk> wrote:
> I'd like to see Floats stored in native format too. Don't forget about the 32 bit floats in Float arrays.
Tell me more :) Are these in some funky order, or are they just IEEE single precision in platform order?
> Bryce
On Sun, Apr 19, 2009 at 07:57:20AM -0700, Eliot Miranda wrote:
> > I'd like to see Floats stored in native format too. Don't forget about the 32 bit floats in Float arrays.
> Tell me more :) Are these in some funky order, or are they just IEEE single precision in platform order?
The attached world.png is a screen shot of a 64-bit image running on an Intel box, with hex printouts of the contents of an IntegerArray and a FloatArray (note, OopPlugin is a utility that I use for accessing the internals of object memory slots in the real object memory). This shows the internal storage of float values in a FloatArray. I poked various values into the array so you can see where they are stored in the 64-bit object memory words.
The values in a FloatArray are 32-bit floats, packed into 64-bit slots in the object memory. There are no endian issues to worry about. On both 32-bit and 64-bit object memories, the values are arranged in the order of an (int *) access. In other words, they are arrays of 32-bit values that just happen to be stuffed into slots that the object memory thinks are 64-bit words.
Of course, storage of 32-bit floats in FloatArray is unrelated to the original topic of Float swizzling.
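The FloatArray layout described above can be sketched in Python. This is an illustration under the assumption that elements are IEEE singles in native byte order at consecutive 4-byte offsets; the helper names and the zero padding of an odd-length array to a whole 64-bit slot are my own.

```python
import struct

def pack_float_array(values):
    """Pack values as 32-bit floats in native byte order, padded to whole 64-bit slots."""
    body = struct.pack(f"={len(values)}f", *values)
    if len(body) % 8:
        body += b"\x00" * 4  # assumed padding so the body fills 64-bit slots
    return body

def element_at(body, index):
    """Read element `index` (1-based) the way an (int *)-style access would."""
    return struct.unpack_from("=f", body, 4 * (index - 1))[0]

def slots64(body):
    """View the same bytes as the 64-bit words the object memory sees."""
    return struct.unpack(f"={len(body) // 8}Q", body)

body = pack_float_array([1.0, 2.0, 0.5])
```

Element access only ever uses 4-byte offsets, so the 64-bit slot view is incidental and the element order is the same on 32-bit and 64-bit object memories.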
> (BTW, is reverseWordsFrom:to: broken for 64-bit images?)
As far as I know, there are no problems with this. The original 64-bit image was done on a big-endian box, and descendants of that image are running on my little-endian box today, so #reverseWordsFrom:to: must have worked.
Dave
On 19-Apr-09, at 8:37 AM, David T. Lewis wrote:
> The values in a FloatArray are 32-bit floats, packed into 64-bit slots in the object memory. There are no endian issues to worry about. On both 32-bit and 64-bit object memories, the values are arranged in the order of an (int *) access. In other words, they are arrays of 32-bit values that just happen to be stuffed onto slots that the object memory thinks are 64-bit words.
Well, that's not quite true; you have to be careful here because people might move data in and out of the FloatArray, but let's see...
    MatrixTransform2x3>>at: index put: value
        <primitive: 'primitiveAtPut' module: 'FloatArrayPlugin'>
        value isFloat
            ifTrue: [self basicAt: index put: value asIEEE32BitWord]
            ifFalse: [self at: index put: value asFloat].
        ^value

    CGPoint>>x: aValue
        self unsignedLongAt: 1 put: aValue asFloat asIEEE32BitWord bigEndian: SmalltalkImage current isBigEndian.
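For readers unfamiliar with the two selectors used here, a Python sketch of what they compute. The names are hypothetical analogues of asIEEE32BitWord and unsignedLongAt:put:bigEndian:, and unlike the Squeak method this sketch makes no attempt to handle values outside single-precision range.

```python
import struct

def as_ieee32_word(value):
    """Answer the 32-bit integer whose bits are the single-precision encoding of value."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

def unsigned_long_at_put(buf, index, word, big_endian):
    """Store a 32-bit unsigned integer at 1-based byte index in the given byte order."""
    order = "big" if big_endian else "little"
    buf[index - 1:index + 3] = word.to_bytes(4, order)
    return word

buf = bytearray(4)
unsigned_long_at_put(buf, 1, as_ieee32_word(1.0), True)
```

The word itself is an ordinary integer with no byte order; endianness only enters when that integer is written into a byte buffer, which is why the CGPoint accessor passes isBigEndian explicitly.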
OK, well, without looking I'll assume the reverseBytesInImage logic swaps the bytes in the FloatArray at load time, so that accessors use SmalltalkImage current isBigEndian to move data in/out in the proper form.
On Sun, Apr 19, 2009 at 11:08:14AM -0700, John M McIntosh wrote:
> Well, that's not quite true; you have to be careful here because people might move data in and out of the FloatArray, but let's see...
> [...]
As near as I can tell all accesses to FloatArray and IntegerArray are on 32 bit boundaries for both 32-bit and 64-bit images, and are not impacted by host endianness.
I should mention that I have not tried FloatArrayPlugin on 64-bit images; I should probably have a look at that one of these days.
> OK, well, without looking I'll assume the reverseBytesInImage logic swaps the bytes in the FloatArray at load time, so that accessors use SmalltalkImage current isBigEndian to move data in/out in the proper form.
Yes, the bytes in a FloatArray would be swapped at load time if moving from one endianness to another, but no, I don't think that #isBigEndian is required for accessing the ints or floats on 32-bit boundaries.
Also, a 64-bit image containing FloatArray or IntegerArray instances should be correctly byte swapped when moved from one endianness to another, although I have never actually tried it so I can't say for sure.
Bottom line: This stuff pretty much just works, no special cases to worry about.
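A small Python sketch of why no #isBigEndian check is needed for 32-bit-aligned access. It assumes the cross-endian load rewrites each 32-bit word so that its integer value is preserved; swap_image_words is a hypothetical stand-in for that load step.

```python
import struct

def swap_image_words(data):
    """Rewrite big-endian 32-bit words as little-endian words with the same values."""
    count = len(data) // 4
    return struct.pack(f"<{count}I", *struct.unpack(f">{count}I", data))

def read_after_load(values):
    """Save 32-bit floats on a big-endian host, load on a little-endian one,
    then read them back with plain little-endian 32-bit accesses."""
    saved = struct.pack(f">{len(values)}f", *values)
    loaded = swap_image_words(saved)
    return list(struct.unpack(f"<{len(values)}f", loaded))
```

Since every 32-bit word keeps its value across the load, accessors that work on 32-bit boundaries see the same elements on either endianness.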
Dave
vm-dev@lists.squeakfoundation.org