Floating point performance

Joshua Gargus schwa at fastmail.us
Thu Dec 14 06:16:36 UTC 2006


On Dec 13, 2006, at 4:33 PM, David Faught wrote:

> John M McIntosh wrote:
>> Could you share your MessageTally? If you are using FloatArray logic
>> then most of the math is done in the plugin. However,
>> the plugin does not take advantage of any vector processing hardware
>> you might have, so there is room for improvement.
>
> The MessageTally output is below.  Maybe "almost 80% of the time was
> spent in basic floating point array operations" is a little
> exaggerated, but not a lot.
> What vector processing hardware?

Like SSE or MMX on Intel, or Altivec on PowerPC.

> The
> only thing I know of would be trying to use the video card GPU, which
> could be lots of fun!
>
>> Also if you have say a+b*c-d in Smalltalk, where these are float
>> array objects, that would be three primitive interactions; converting
>> that to Slang would provide some performance improvements.
>
> I'm not sure I understand this statement.  Is there enough overhead in
> the plugin API to justify eliminating a couple of calls, or is there
> some data representation conversion involved that could be avoided?
>
> I haven't read Andrew Greenberg's chapter on "Extending the Squeak
> Virtual Machine" in detail yet.  I kind of skimmed over the sections
> "The Shape of a Smalltalk Object" and "The Anatomy of a Named
> Primitive", which I'm sure is where all the good stuff is.  Are you
> saying that some performance improvement in your sample expression
> could be gained by just coding it in Slang, without translating and
> compiling it, or have I gone one step too far?
>

I see about 43% on float array arithmetic and another 4% on regular  
float arithmetic.  The float array arithmetic is being done on very  
short arrays (3 elements), so the call overhead might be significant  
(i.e. John's suggestion re: a+b*c-d might pay dividends); if you're  
dealing with 10000-element float arrays, then the call overhead is  
negligible.

The big problem seems to be that you're spending a lot of time  
unpacking and repacking B3DVector3Arrays.  If you write a primitive  
to do the computation, then you can avoid all of the #at: and  
#at:put: overhead, the overhead for allocating and garbage collecting  
all of the intermediate B3DVectors, and much of the call overhead for  
*, +, and -.

The following experiment might be helpful:
a := B3DVector3Array new: 1000.
b := B3DVector3 new.
[1000 timesRepeat: [a + a]] timeToRun. "18ms"
[1000 timesRepeat: [a += a]] timeToRun. "5ms"
[1000000 timesRepeat: [b + b]] timeToRun. "481ms"
[1000000 timesRepeat: [b += b]] timeToRun. "258ms"

In all cases we're adding a million pairs of 3-element vectors.  The  
first and third are slower than the second and fourth because a new  
target array is allocated on each operation; this gives you an idea  
of the achievable gains if you can restructure your algorithms to  
reuse an intermediate target array.  The first two are faster than  
the last two because they do less work in Squeak: fewer iterations  
and fewer primitive calls.  Note that the last two still don't  
involve unpacking and packing arrays.  Your code seems to be doing  
something like:

[1000 timesRepeat: [a do: [:aa | aa + aa]]] timeToRun. "1346ms"
or perhaps
[1000 timesRepeat: [1 to: 1000 do: [:i | (a at: i) + (a at: i)]]] timeToRun. "1286ms"

In short, it looks like there is a lot of room for improvement.

Josh

>
> - 2441 tallies, 39083 msec.
>
> **Tree**
> 100.0% {39083ms} TClothOxe>>pulse
>  77.8% {30407ms} TClothOxe>>constrain
>    |77.8% {30407ms} TClothOxe>>constrain:
>    |  14.2% {5550ms} B3DVector3(FloatArray)>>*
>    |  13.9% {5433ms} B3DVector3(FloatArray)>>-
>    |  12.2% {4768ms} B3DVector3Array>>at:
>    |  9.7% {3791ms} TClothOxe>>collide
>    |    |9.7% {3791ms} TClothOxe>>collideSphere:
>    |    |  3.6% {1407ms} B3DVector3(FloatArray)>>length
>    |    |  3.0% {1172ms} B3DVector3(FloatArray)>>-
>    |    |  2.9% {1133ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
>    |    |    2.9% {1133ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
>    |  8.8% {3439ms} B3DVector3(FloatArray)>>+
>    |  6.3% {2462ms} B3DVector3Array>>at:put:
>    |  5.8% {2267ms} TClothOxe>>constrainGround
>    |    |3.2% {1251ms} B3DVector3Array(B3DInplaceArray)>>do:
>    |    |2.6% {1016ms} B3DVector3>>y
>    |  3.8% {1485ms} OrderedCollection>>do:
>    |  2.8% {1094ms} primitives
>  7.0% {2736ms} B3DVector3Array(SequenceableCollection)>>replaceFrom:to:with:
>    |7.0% {2736ms} B3DVector3Array(B3DInplaceArray)>>replaceFrom:to:with:startingAt:
>    |  2.7% {1055ms} B3DVector3Array>>at:put:
>    |  2.5% {977ms} B3DVector3Array>>at:
>  4.4% {1720ms} Float>>*
>    |2.4% {938ms} B3DVector3(Object)>>adaptToFloat:andSend:
>    |2.0% {782ms} primitives
>  3.2% {1251ms} B3DVector3(FloatArray)>>-
>  2.3% {899ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
>    2.3% {899ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
> **Leaves**
> 20.1% {7856ms} B3DVector3(FloatArray)>>-
> 19.8% {7738ms} B3DVector3Array>>at:
> 15.9% {6214ms} B3DVector3(FloatArray)>>*
> 11.8% {4612ms} B3DVector3Array>>at:put:
> 10.9% {4260ms} B3DVector3(FloatArray)>>+
> 3.8% {1485ms} OrderedCollection>>do:
> 2.8% {1094ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
> 2.8% {1094ms} TClothOxe>>constrain:
> 2.6% {1016ms} B3DVector3>>y
> 2.0% {782ms} Float>>*
>
> **Memory**
> 	old			+386,532 bytes
> 	young		-551,924 bytes
> 	used		-165,392 bytes
> 	free		+165,392 bytes
>
> **GCs**
> 	full			0 totalling 0ms (0.0% uptime)
> 	incr		7133 totalling 1,326ms (3.0% uptime), avg 0.0ms
> 	tenures		1 (avg 7133 GCs/tenure)
> 	root table	0 overflows
>

More information about the Squeak-dev mailing list