Floating point performance
Joshua Gargus
schwa at fastmail.us
Thu Dec 14 06:16:36 UTC 2006
On Dec 13, 2006, at 4:33 PM, David Faught wrote:
> John M McIntosh wrote:
>> Could you share your messageTally. If you are using floatarray logic
>> then most of the math is done in the plugin. However
>> the plugin does not take advantage of any vector processing hardware
>> you might have so there is room for improvement.
>
> The MessageTally output is below. Maybe "almost 80% of the time was
> spent in basic floating point array operations" is a little
> exaggerated, but not a lot.
> What vector processing hardware?
Like SSE or MMX on Intel, or Altivec on PowerPC.
> The
> only thing I know of would be trying to use the video card GPU, which
> could be lots of fun!
>
>> Also if you have say a+b*c-d in smalltalk where these are float
>> array objects, that would be three primitive invocations; converting
>> that to Slang would provide some performance improvement.
>
> I'm not sure I understand this statement. Is there enough overhead in
> the plugin API to justify eliminating a couple of calls, or is there
> some data representation conversion involved that could be avoided?
>
> I haven't read Andrew Greenberg's chapter on "Extending the Squeak
> Virtual Machine" in detail yet. I kind of skimmed over the sections
> "The Shape of a Smalltalk Object" and "The Anatomy of a Named
> Primitive", which I'm sure is where all the good stuff is. Are you
> saying that some performance improvement in your sample expression
> could be gained by just coding it in Slang, without translating and
> compiling it, or have I gone one step too far?
>
I see about 43% on float array arithmetic and another 4% on regular
float arithmetic. The float array arithmetic is being done on very
short arrays (3 elements), so the call overhead might be significant
(i.e. John's suggestion re: a+b*c-d might pay dividends); if you're
dealing with 10000-element float arrays, then the call overhead is
negligible.
The big problem seems to be that you're spending a lot of time
unpacking and repacking B3DVector3Arrays. If you write a primitive
to do the computation, then you can avoid all of the #at: and
#at:put: overhead, the overhead for allocating and garbage collecting
all of the intermediate B3DVectors, and much of the call overhead for
*, +, and -.
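To make that concrete, here is a minimal C sketch of what the inner loop of such a fused primitive might compute for John's a+b*c-d example. The function name and the plain-pointer interface are my invention, not the actual plugin API: a real Squeak named primitive would fetch these pointers from the indexable fields of the FloatArray arguments via interpreterProxy, and would validate the receiver and argument types first. Remember that Smalltalk binary operators associate strictly left to right, so a+b*c-d means ((a+b)*c)-d:

```c
#include <stddef.h>

/* Hypothetical fused kernel for the Smalltalk expression a + b * c - d,
   i.e. ((a + b) * c) - d, over packed float storage.  One pass, no
   intermediate arrays, one primitive call instead of three. */
static void fusedAddMulSub(float *result, const float *a, const float *b,
                           const float *c, const float *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        result[i] = (a[i] + b[i]) * c[i] - d[i];
}
```

The point of the fusion is that each intermediate value lives in a register rather than in a freshly allocated FloatArray, so the #at:/#at:put: traffic, the intermediate allocations and GC work, and two of the three primitive dispatches all disappear.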
The following experiment might be helpful:
a := B3DVector3Array new: 1000.
b := B3DVector3 new.
[1000 timesRepeat: [a + a]] timeToRun. "18ms"
[1000 timesRepeat: [a += a]] timeToRun. "5ms"
[1000000 timesRepeat: [b + b]] timeToRun. "481ms"
[1000000 timesRepeat: [b += b]] timeToRun. "258ms"
In all cases we're adding a million pairs of vector-3s. The first and
third are slower than the second and fourth because a new target
object is allocated on every #+ send; this gives you an idea of the
achievable gains if you can restructure your algorithms to reuse an
intermediate target array. The first two are faster than the last
two because they do less work in Squeak: fewer iterations
and fewer primitive calls. Note that the last two still don't
involve unpacking and packing arrays. Your code seems to be doing
something like:
[1000 timesRepeat: [a do: [:aa | aa + aa]]] timeToRun. "1346ms"
or perhaps
[1000 timesRepeat: [1 to: 1000 do: [:i | (a at: i) + (a at: i)]]] timeToRun. "1286ms"
In short, it looks like there is a lot of room for improvement.
Josh
>
> - 2441 tallies, 39083 msec.
>
> **Tree**
> 100.0% {39083ms} TClothOxe>>pulse
> 77.8% {30407ms} TClothOxe>>constrain
> |77.8% {30407ms} TClothOxe>>constrain:
> | 14.2% {5550ms} B3DVector3(FloatArray)>>*
> | 13.9% {5433ms} B3DVector3(FloatArray)>>-
> | 12.2% {4768ms} B3DVector3Array>>at:
> | 9.7% {3791ms} TClothOxe>>collide
> | |9.7% {3791ms} TClothOxe>>collideSphere:
> | | 3.6% {1407ms} B3DVector3(FloatArray)>>length
> | | 3.0% {1172ms} B3DVector3(FloatArray)>>-
> | | 2.9% {1133ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
> | | 2.9% {1133ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
> | 8.8% {3439ms} B3DVector3(FloatArray)>>+
> | 6.3% {2462ms} B3DVector3Array>>at:put:
> | 5.8% {2267ms} TClothOxe>>constrainGround
> | |3.2% {1251ms} B3DVector3Array(B3DInplaceArray)>>do:
> | |2.6% {1016ms} B3DVector3>>y
> | 3.8% {1485ms} OrderedCollection>>do:
> | 2.8% {1094ms} primitives
> 7.0% {2736ms} B3DVector3Array(SequenceableCollection)>>replaceFrom:to:with:
> |7.0% {2736ms} B3DVector3Array(B3DInplaceArray)>>replaceFrom:to:with:startingAt:
> | 2.7% {1055ms} B3DVector3Array>>at:put:
> | 2.5% {977ms} B3DVector3Array>>at:
> 4.4% {1720ms} Float>>*
> |2.4% {938ms} B3DVector3(Object)>>adaptToFloat:andSend:
> |2.0% {782ms} primitives
> 3.2% {1251ms} B3DVector3(FloatArray)>>-
> 2.3% {899ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
> 2.3% {899ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
> **Leaves**
> 20.1% {7856ms} B3DVector3(FloatArray)>>-
> 19.8% {7738ms} B3DVector3Array>>at:
> 15.9% {6214ms} B3DVector3(FloatArray)>>*
> 11.8% {4612ms} B3DVector3Array>>at:put:
> 10.9% {4260ms} B3DVector3(FloatArray)>>+
> 3.8% {1485ms} OrderedCollection>>do:
> 2.8% {1094ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
> 2.8% {1094ms} TClothOxe>>constrain:
> 2.6% {1016ms} B3DVector3>>y
> 2.0% {782ms} Float>>*
>
> **Memory**
> old +386,532 bytes
> young -551,924 bytes
> used -165,392 bytes
> free +165,392 bytes
>
> **GCs**
> full 0 totalling 0ms (0.0% uptime)
> incr 7133 totalling 1,326ms (3.0% uptime), avg 0.0ms
> tenures 1 (avg 7133 GCs/tenure)
> root table 0 overflows
>
More information about the Squeak-dev mailing list