[squeak-dev] Matrix Performance

Thu Jun 24 21:24:26 UTC 2010

2010/6/24 Jimmie Houchin <jdev at cyberhaus.us>:
> On 6/21/2010 5:34 PM, Levente Uzonyi wrote:
>>
>> On Mon, 21 Jun 2010, Jimmie Houchin wrote:
>>>
>>> On 6/21/2010 9:27 AM, Levente Uzonyi wrote:
>>>>
>>>> On Sun, 20 Jun 2010, Jimmie Houchin wrote:
>>>>>
>>>>> [snip]
>>>>> Here is some samples from my experimentation.
>>>>>
>>>>> Some of what I am doing is doing rolling calculations over my dataset.
>>>>>
>>>>> dataset is one weeks worth of OHLC data of a currency pair.
>>>>>
>>>>> In Squeak I have.
>>>>>
>>>>> ttr := [
>>>>>  1 to: ((m rowCount) -500) do: [:i || row rowSum rowMax rowMin
>>>>> rowMedian rowAverage |
>>>>>  row := (m atRows: i to: (499+i) columns: 5 to: 5).
>>>>>  rowSum := row sum.
>>>>>  rowMax := row max.
>>>>>  rowMin := row min.
>>>>>  rowMedian := row median.
>>>>>  rowAverage := row average.
>>>>>  omd add: {rowSum . rowMax . rowMin . rowMedian . rowAverage}]]
>>>>> timeToRun.
>>>>>
>>>>> Squeak:  17 seconds,  with Cog 4.2 seconds  (nice work guys
>>>>> (Eliot/Teleplace)
>>>>
>>>> This code can be implemented a lot more efficiently.
>>>> #atRows:to:columns:to: creates a new matrix, but that can be avoided.
>>>> #sum, #max, #min, #median and #average iterate over the row. What's
>>>> worse, #median sorts the row. These can be elimated too.
>>>> The total runtime cost is: O((r - w) * (w + w * log(w))), which is O(m *
>>>> w * log(w)) if m >> w, which is true in your case. (r is the number of rows,
>>>> w is the window size (500 in your example)).
>>>> This can be reduced to m*log(w) which is 500x speedup (in theory,
>>>> ignoring constatns) in your case.
>>>>
>>>> The idea is to store the intermediate results. The sum (and average)
>>>> only require a single variable which stores the sum of the window. Then
>>>> substract the element getting out of the window and add the new element
>>>> getting in the window and you got the new sum and average. Min, max and
>>>> median are a bit more tricky, but a balanced binary tree handle them.
>>>> Calculating min and max in a balanced tree requires O(log(n)) time (which is
>>>> O(log(w)) in your case). Adding and removing 1-1 elements also require
>>>> O(log(w)) time. For median, you have to find the node of the median of the
>>>> first 500 elements at the beginning. When an element is removed or added to
>>>> the tree, the median will be the same, the next or the previous element in
>>>> the tree, depending on the median, the added/removed element and the size of
>>>> the tree. This can be handled in O(log(w)) time.
>>>>
>>>> For a matrix with 10000 rows and 5 columns of random floats your code
>>>> takes 5.5 seconds with Cog. Using a temp for sum and average and a tree for
>>>> min and max (without the median, because it requires a special tree
>>>> implementation) it takes ~35ms. That's 157x speedup.
>>>>
>>>> Levente
>>>
>>> Hello Levente,
>>>
>>> I am not surprised that I may not have the most efficient implementation.
>>> I understand what you are saying in principle, but I don't understand how to
>>> implement what you are saying. Can you provide an example doing what I did
>>> in the manner you describe. I would greatly appreciate it. I would then run
>>> it against my data for  a test.
>>
>> Hi Jimmie,
>>
>> if you load http://leves.web.elte.hu/squeak/LLRBT-ul.7.mcz [1] then you
>> can evaluate the following example in a workspace. Just print it:
>>
>> | rng matrix windowSize columnIndex old new oldTime newTime |
>> "Create a matrix filled with random float values between 0 and 1 with
>> 10000 rows."
>> rng := Random new.
>> matrix := Matrix rows: 10000 columns: 5 tabulate: [ :i :j | rng next ].
>> windowSize := 500.
>> columnIndex := 5.
>> "The 'old' slow computation."
>> oldTime := [ old := Array new: matrix rowCount - windowSize
>> streamContents: [ :stream |
>>     1 to: matrix rowCount - windowSize + 1 do: [ :i |
>>        | row |
>>        row := (matrix atRows: i to: i + windowSize - 1 columns: 5 to: 5).
>>        stream nextPut: {
>>            row sum.
>>            row max.
>>            row min.
>>            "row median."
>>            row average } ] ] ] timeToRun.
>> "The 'new' fast computation."
>> newTime := [ new := Array new: matrix rowCount - windowSize
>> streamContents: [ :stream |
>>    | sum tree |
>>    "Our variables holding the intermediate results."
>>    sum := 0.
>>    tree := LLRBTree new.
>>    "Calculate the first values."
>>    1 to: windowSize do: [ :rowIndex |
>>        | element |
>>        element := matrix at: rowIndex at: columnIndex.
>>        sum := sum + element.
>>        tree at: element put: element ].
>>    stream nextPut: { sum. tree max. tree min. sum / windowSize }.
>>    "And the rest."
>>    windowSize + 1 to: matrix rowCount do: [ :rowIndex |
>>        | oldElement newElement |
>>        oldElement := matrix at: rowIndex - windowSize at: columnIndex.
>>        newElement := matrix at: rowIndex at: columnIndex.
>>        sum := sum - oldElement + newElement.
>>        "Housekeeping for median would happen here"
>>        tree
>>            deleteKey: oldElement;
>>            at: newElement put: newElement.
>>        stream nextPut: { sum. tree max. tree min. sum / windowSize } ] ] ]
>> timeToRun.
>> "Make sure the calculations give the same results."
>> old with: new do: [ :oldRow :newRow |
>>    oldRow with: newRow do: [ :oldValue :newValue |
>>        self assert: (oldValue closeTo: newValue) ] ].
>> "Print it to see the runtime"
>> { oldTime. newTime }
>>
>>
>> The "old" version is your original code with minor changes, the "new" is
>> the one that stores intermediate results. If you change the value of the
>> matrix variable, you can test it on your data.
>>
>>> The above was simply an example. I have many more methods which I've
>>> implemented which are doing a variety of moving averages and such. To my
>>> understanding, Squeak doesn't have the library of statistical methods at
>>> this time. That would be one nice thing that could be done when Numpy
>>> becomes a C lib and can be interfaced to from Squeak.
>>>
>>> I appreciate your comment above. I would really like to see Squeak out
>>> perform some of the alternatives. :)
>>
>> The same "trick" can be done in python, and I'm sure it has a sufficient
>> balanced binary tree implementation for median too.
>>
>> Levente
>
> I just recently had an opportunity to study your code to learn. I haven't
> tried it yet, but while exploring and studying I realized if I understand
> correctly a problem. Many/(most ?) of the uses of Matrix, (Numpy), etc. the
> order of the data is a part of the data. It isn't simply a set of numbers,
> but an ordered collection of numbers or data. Whether it is weather
> statistics, etc., or in my case financials which are associated to a precise
> point in time in the market.
>
> To my understanding using a tree based implementation would lose order of
> the data. I would imagine if you used keys which could be ordered and then
> order applied, you would lose the performance gains.
>
> In my simple example order had little meaning. But a common calculation is a
> weighted average which does associate a weighted value according to position
> in the array.
>
> While playing this is the implementation that performed the best for me.
> Going from 6200ms to 3600ms.
>
> m := Matrix rows: 6333 columns: 162.
> omd := OrderedCollection new.
> ttr := [| column start windowSize |
>    column := m atColumn: 5.
>    start := 1.
>    windowSize := 500.
>    1 to: ((column size) - windowSize) do: [:i || row rowSum rowMax rowMin
> rowAverage  rowMedian |
>    row := column copyFrom: i to: i+windowSize.
>    rowSum := row sum.
>    rowMax := row max.
>    rowMin := row min.
>    rowMedian := row median.
>    rowAverage := row average.
>    omd6 add: {rowSum . rowMax . rowMin . rowAverage}]] timeToRun.
> ttr 3600
>
> If I remove the median calculation  I get 594ms. You are quite correct, the
> median calculation is quite expensive. However it is quite light to the
> things I actually do. :)
>
> 594ms is quite an improvement over the original 6200ms. The original without
> median is now 2704.
> Numpy is still faster over the same data set and doing the same thing it is
> at 302ms. But Numpy is optimized C/Fortran with a Python face. It is a nice
> API but still dependent on someone doing the hard work on the  back end. The
> nice thing with the Squeak code is that it is reasonably performing and all
> Smalltalk. :)
>

For toying, educating, or implementing a few utilities maybe.
But for main algorithms (eigen decomposition, singular values etc..),
I much prefer a known to work solution, taking care both of numerical
stability and space/time efficiency, and scaling to high dimensions
whenever needed.
Redoing LAPACK is a HUGE work. The probability to fail is high.
Do you think matlab, scilab, octave, R-lab, etc... waste their time
writing a LAPACK clone ? I wouldn't.

Plus you must consider that using an optimized BLAS is the state of
the art in number crunching, maybe 3 orders of magnitude of efficiency
above a Smalltalk written loop. Even with an optimized future COG it
will still be 2...
The idea of BLAS is to get high efficiency in low level computations
if you can vector them. It's a perfect solution for implementing
higher level structures in Smalltalk while preserving decent
efficiency.
Also, Scalapack gives oppurtunity to profit by parallel architectures
for distributing load in case of large amount of computations...
and I trust BLAS will exploit GPU features when enough standardization
will be reached.

Sometimes, we just should open our eyes and take a look at external world
We don't have to throw away any code not written in Smalltalk.
If someone did some really hard work for free, then let's just grab it !
Let's be as clever as Python.
Let's open our closed world.
Let's promote FFI.
that would promote Smalltalk solutions more than a 100% Smalltalk
application would.

Well, just my POV :)

Nicolas

> Thanks for the education.
>
> Jimmie
>
>

[squeak-dev] Matrix Performance - was: Interesting survey about smalltalk