[squeak-dev] Interesting survey about smalltalk

Mon Jun 21 22:34:32 UTC 2010

On Mon, 21 Jun 2010, Jimmie Houchin wrote:

> On 6/21/2010 9:27 AM, Levente Uzonyi wrote:
>> On Sun, 20 Jun 2010, Jimmie Houchin wrote:
>>> Hello,
>>> 
>>> I would love to be using Squeak for my financial application. Numerical 
>>> performance isn't currently what is stopping me. My problem is that I 
>>> require interfacing with a Windows COM dll and in a future version with a 
>>> Java library. Hopefully at some point I will be able to port to Squeak. I 
>>> would much prefer it to using Python, which is what I am currently using.
>>> 
>>> I didn't even know Squeak was in the running until I discovered the Matrix 
>>> class. And for what I need to do it performs reasonably adequately. 
>>> However Squeak does not to my knowledge have a comprehensive collection of 
>>> mathematics methods to be able to be applied to a variety of data. 
>>> Currently I am using Python and Numpy which has a nicely optimized 
>>> Mathematics/Scientific set of functions using optimized C/Fortran 
>>> libraries. I would love to see Squeak compete in this area. In fact the 
>>> Numpy people are currently refactoring the library to turn it into a C 
>>> library usable by other languages.
>>> 
>>> Here is some samples from my experimentation.
>>> 
>>> Some of what I am doing is doing rolling calculations over my dataset.
>>> 
>>> dataset is one weeks worth of OHLC data of a currency pair.
>>> 
>>> In Squeak I have.
>>> 
>>> ttr := [
>>>  1 to: ((m rowCount) -500) do: [:i || row rowSum rowMax rowMin rowMedian 
>>> rowAverage |
>>>  row := (m atRows: i to: (499+i) columns: 5 to: 5).
>>>  rowSum := row sum.
>>>  rowMax := row max.
>>>  rowMin := row min.
>>>  rowMedian := row median.
>>>  rowAverage := row average.
>>>  omd add: {rowSum . rowMax . rowMin . rowMedian . rowAverage}]] timeToRun.
>>> 
>>> Squeak:  17 seconds,  with Cog 4.2 seconds  (nice work guys 
>>> (Eliot/Teleplace)
>> 
>> This code can be implemented a lot more efficiently. #atRows:to:columns:to: 
>> creates a new matrix, but that can be avoided.
>> #sum, #max, #min, #median and #average iterate over the row. What's worse, 
>> #median sorts the row. These can be elimated too.
>> The total runtime cost is: O((r - w) * (w + w * log(w))), which is O(m * w 
>> * log(w)) if m >> w, which is true in your case. (r is the number of rows, 
>> w is the window size (500 in your example)).
>> This can be reduced to m*log(w) which is 500x speedup (in theory, ignoring 
>> constatns) in your case.
>> 
>> The idea is to store the intermediate results. The sum (and average) only 
>> require a single variable which stores the sum of the window. Then 
>> substract the element getting out of the window and add the new element 
>> getting in the window and you got the new sum and average. Min, max and 
>> median are a bit more tricky, but a balanced binary tree handle them. 
>> Calculating min and max in a balanced tree requires O(log(n)) time (which 
>> is O(log(w)) in your case). Adding and removing 1-1 elements also require 
>> O(log(w)) time. For median, you have to find the node of the median of the 
>> first 500 elements at the beginning. When an element is removed or added to 
>> the tree, the median will be the same, the next or the previous element in 
>> the tree, depending on the median, the added/removed element and the size 
>> of the tree. This can be handled in O(log(w)) time.
>> 
>> For a matrix with 10000 rows and 5 columns of random floats your code takes 
>> 5.5 seconds with Cog. Using a temp for sum and average and a tree for min 
>> and max (without the median, because it requires a special tree 
>> implementation) it takes ~35ms. That's 157x speedup.
>> 
>> Levente
>
> Hello Levente,
>
> I am not surprised that I may not have the most efficient implementation. I 
> understand what you are saying in principle, but I don't understand how to 
> implement what you are saying. Can you provide an example doing what I did in 
> the manner you describe. I would greatly appreciate it. I would then run it 
> against my data for  a test.

Hi Jimmie,

if you load http://leves.web.elte.hu/squeak/LLRBT-ul.7.mcz [1] then you 
can evaluate the following example in a workspace. Just print it:

| rng matrix windowSize columnIndex old new oldTime newTime |
"Create a matrix filled with random float values between 0 and 1 with 10000 rows."
rng := Random new.
matrix := Matrix rows: 10000 columns: 5 tabulate: [ :i :j | rng next ].
windowSize := 500.
columnIndex := 5.
"The 'old' slow computation."
oldTime := [ old := Array new: matrix rowCount - windowSize streamContents: [ :stream |
  	1 to: matrix rowCount - windowSize + 1 do: [ :i |
 		| row |
 		row := (matrix atRows: i to: i + windowSize - 1 columns: 5 to: 5).
 		stream nextPut: {
 			row sum.
 			row max.
 			row min.
 			"row median."
 			row average } ] ] ] timeToRun.
"The 'new' fast computation."
newTime := [ new := Array new: matrix rowCount - windowSize streamContents: [ :stream |
 	| sum tree |
 	"Our variables holding the intermediate results."
 	sum := 0.
 	tree := LLRBTree new.
 	"Calculate the first values."
 	1 to: windowSize do: [ :rowIndex |
 		| element |
 		element := matrix at: rowIndex at: columnIndex.
 		sum := sum + element.
 		tree at: element put: element ].
 	stream nextPut: { sum. tree max. tree min. sum / windowSize }.
 	"And the rest."
 	windowSize + 1 to: matrix rowCount do: [ :rowIndex |
 		| oldElement newElement |
 		oldElement := matrix at: rowIndex - windowSize at: columnIndex.
 		newElement := matrix at: rowIndex at: columnIndex.
 		sum := sum - oldElement + newElement.
 		"Housekeeping for median would happen here"
 		tree
 			deleteKey: oldElement;
 			at: newElement put: newElement.
 		stream nextPut: { sum. tree max. tree min. sum / windowSize } ] ] ] timeToRun.
"Make sure the calculations give the same results."
old with: new do: [ :oldRow :newRow |
 	oldRow with: newRow do: [ :oldValue :newValue |
 		self assert: (oldValue closeTo: newValue) ] ].
"Print it to see the runtime"
{ oldTime. newTime }

The "old" version is your original code with minor changes, the "new" is 
the one that stores intermediate results. If you change the value of the 
matrix variable, you can test it on your data.

>
> The above was simply an example. I have many more methods which I've 
> implemented which are doing a variety of moving averages and such. To my 
> understanding, Squeak doesn't have the library of statistical methods at this 
> time. That would be one nice thing that could be done when Numpy becomes a C 
> lib and can be interfaced to from Squeak.
>
> I appreciate your comment above. I would really like to see Squeak out 
> perform some of the alternatives. :)

The same "trick" can be done in python, and I'm sure it has a sufficient 
balanced binary tree implementation for median too.

Levente

[1] The LLRBT package contains an implementation of Robert Sedgewick's 
Left-Leaning Red-Black Trees. The implementation lacks parent pointers for 
simplicity (and easy persistence adoption) which makes it a bit harder 
(but not impossible) to find the next and previous values of a node in 
O(log(n)) time. Also LLRBTree implements a dictionary-like protocol, so a 
tree node stores key-value pairs. Because of this, processing data with 
repeating values may result in data loss (since the tree can't hold the 
same key more than once).
If someone has a free (and complete) AVL or Red-Black tree implementation 
in Squeak, let me know. :)

>
> Thanks.
>
> Jimmie
>
>