Hi All,<div><br></div><div>    I see that Float 32-bit word order is big-endian (PowerPC) on all platforms.  This is a pain for performance and a pain for code generation in Cog.  For example, with SSE2 it is trivial to swizzle a PowerPC-layout Float into an xmm register on read using the PSHUFD instruction, but tediously verbose to swizzle on write: the shuffle has to target an xmm register and is hence destructive, which means three instructions (shuffle, write, unshuffle) just to write a Float result.  Yes, OK, two extra instructions is small potatoes, but they&#39;re still starch.  So I wonder what the impact would be of maintaining Floats in platform order?  There are a number of possible solutions.</div>

<div><br></div><div>1. Floats are always in platform order and swizzled on image load when moving from little-endian to big-endian or vice versa.  Image code must be rewritten to take the platform&#39;s endianness into account. (requires an image rewrite)</div>

<div><br></div><div>2. As for 1, but the image is isolated from the change by two primitives, primitiveFloatAt and primitiveFloatAtPut, which implement the selectors at:, basicAt:, at:put: and basicAt:put: on Float.  These primitives map index 1 onto the most significant word and index 2 onto the least significant word.  (requires no image rewrite, but does require a file-in of the four implementations)</div>

<div><br></div><div>3. As for 1, but the image is isolated from the change by four primitives: primitiveFloatLowWord, primitiveFloatLowWordPut, primitiveFloatHighWord &amp; primitiveFloatHighWordPut (requires as much of a rewrite of image code as 1)</div>

<div><br></div><div>4. As for 1, but provide two primitives, primitiveFloatBits and primitiveFloatBitsPut, which answer or store 64-bit non-negative integers. (requires as much of a rewrite of image code as 1 but is cleaner and scales to 128-bit floats)</div>

<div><br></div><div>5. Modify the existing at:[put:] primitives to check for Float receivers, e.g. in commonVariable:at:cacheIndex: (in our Qwaq images Float has a compact class index of 6):</div><div><div><span class="Apple-tab-span" style="white-space:pre">                </span>fmt &lt; 8 ifTrue:  &quot;Bitmap (&amp; Float!!)&quot;</div>

<div><span class="Apple-tab-span" style="white-space:pre">                        </span>[(self compactClassIndexOf: oop) == ClassFloatCompactClassIndex</div><div>                            ifTrue: [result := self fetchLong32: 2 - index ofObject: rcvr]</div>

<div>                            ifFalse: [result := self fetchLong32: index - 1 ofObject: rcvr].</div><div><span class="Apple-tab-span" style="white-space:pre">                        </span> ^self positive32BitIntegerFor: result].</div><div>

This slows down at: access for Bitmap and complicates an already overcomplicated, and performance-critical, primitive.</div><div><br></div><div>6. Eat it: do the swizzling on every Float access.</div><div><br></div><div><br>
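<div>To make the index mapping of option 2 (and the 64-bit view of option 4) concrete, here is a minimal Python model. It is a sketch, not VM code: a Float body is modelled as a list of two 32-bit words in platform order, and the names platform_words, slot_for, float_at, float_at_put and float_bits are hypothetical stand-ins, not existing primitives.</div>

```python
import struct

WORD = 2 ** 32  # number of distinct values in one 32-bit word


def platform_words(value, little_endian):
    """Model a Float body: two 32-bit words in platform memory order."""
    bits = int.from_bytes(struct.pack("!d", value), "big")  # IEEE 754 bit pattern
    high, low = divmod(bits, WORD)
    return [low, high] if little_endian else [high, low]


def slot_for(index, little_endian):
    """Map a 1-based Smalltalk index to a 0-based word slot.

    Index 1 always names the most significant word, index 2 the least.
    """
    if index not in (1, 2):
        raise IndexError("Float index must be 1 or 2")
    high_slot = 1 if little_endian else 0
    return high_slot if index == 1 else 1 - high_slot


def float_at(words, index, little_endian):
    """Hypothetical stand-in for primitiveFloatAt (option 2)."""
    return words[slot_for(index, little_endian)]


def float_at_put(words, index, word, little_endian):
    """Hypothetical stand-in for primitiveFloatAtPut (option 2)."""
    words[slot_for(index, little_endian)] = word
    return word


def float_bits(words, little_endian):
    """Option 4 view: the whole Float as one non-negative 64-bit integer."""
    high = float_at(words, 1, little_endian)
    low = float_at(words, 2, little_endian)
    return high * WORD + low
```

<div>In this model, image code that reads the high word through index 1 gets the same answer whichever order the words are stored in, which is the isolation option 2 is after.</div>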

</div><div>6. is apparently painless but actually absurd, because we&#39;re throwing away performance for no good reason.</div><div><br></div><div>5. ditto, except that the cost falls on Bitmap access rather than on Float access (and Bitmap is used in the VM simulator ;) )</div>

<div><br></div><div>2. is my recommendation because, of the solutions that provide maximum performance, it requires the least effort from adopters.</div><div><br></div><div>Opinions &amp; alternatives?  Especially, what are the likely issues of moving to platform Float order?</div>
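<div>For reference, the load-time swizzle shared by options 1 through 4 amounts to exchanging the two words of every Float when the image was saved on a platform of the other endianness. A minimal Python sketch follows. It assumes each Float body is a two-element list of 32-bit words whose bytes are already in host order; the function name and parameters are hypothetical.</div>

```python
def swizzle_floats_on_load(float_bodies, image_little_endian, host_little_endian):
    """Put every Float body into host word order, in place.

    Each body is a two-element list of 32-bit words. Nothing is done
    when the image was saved on a platform of the same endianness.
    """
    if image_little_endian != host_little_endian:
        for words in float_bodies:
            # Exchange the most and least significant words.
            words[0], words[1] = words[1], words[0]
    return float_bodies
```

<div>The cost is one pass over the Floats at load time, and only when the endianness actually differs.</div>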

<div><br></div><div>Best</div><div>Eliot</div></div>