Pipe syntax and the current methods

Michael Lucas-Smith mlucas-smith at cincom.com
Mon Aug 27 20:38:35 UTC 2007


> With pipes, this could be written as
>
>    highestNumberedChangeSet
>        "ChangeSorter highestNumberedChangeSet"
>        ^self allChangeSetNames
>            select:[:aString | aString startsWithDigit]  ;;
>            collect:[:aString | aString initialIntegerOrNil] ;;
>            ifNotEmpty:[:list | list max]
>
With pipe objects using standard Smalltalk syntax, this could be written as:

highestNumberedChangeSet
    "ChangeSorter highestNumberedChangeSet"
    ^self allChangeSetNames asPipe
        selecting: [:aString | aString startsWithDigit];
        collecting: [:aString | aString initialIntegerOrNil];
        ifNotEmpty: [:list | list max]

> which, with pipes, could be rewritten as...
>
>  self systemNavigation
>    allCallsOn: assoc ;;
>    collect: [:each | each classSymbol] ;;
>    asSet ;;
>    do: [:clsName | (Smalltalk at: clsName) replaceSilently: oldName to: aName].
>
And again:

(self systemNavigation allCallsOn: assoc) asPipe
   collecting: [:each | each classSymbol];
   selectingAsUniqueSet;
   do: [:clsName | (Smalltalk at: clsName) replaceSilently: oldName to: aName]
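
For the curious, here's roughly how such a pipe object could be built.
The selectors (asPipe, selecting:, collecting:, selectingAsUniqueSet,
ifNotEmpty:) are the ones used in the rewrites above; the
closure-composition implementation below is only my sketch, not code
from any shipping package. Each staging message wraps the enumeration
in another block, so nothing runs - and nothing is allocated - until a
terminal message like do: or ifNotEmpty: arrives.

Object subclass: #Pipe
    instanceVariableNames: 'enumerator'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Pipes-Sketch'

Pipe class >> on: aCollection
    "Start with a stage that just enumerates the source."
    ^self new setEnumerator: [:emit | aCollection do: emit]

Pipe >> setEnumerator: aBlock
    enumerator := aBlock

Pipe >> selecting: aBlock
    "Pass through only elements the block approves of."
    | previous |
    previous := enumerator.
    enumerator := [:emit |
        previous value: [:each |
            (aBlock value: each) ifTrue: [emit value: each]]]

Pipe >> collecting: aBlock
    "Transform each element as it flows past."
    | previous |
    previous := enumerator.
    enumerator := [:emit |
        previous value: [:each | emit value: (aBlock value: each)]]

Pipe >> selectingAsUniqueSet
    "Drop duplicates; only the Set of seen elements is kept."
    | previous seen |
    previous := enumerator.
    seen := Set new.
    enumerator := [:emit |
        previous value: [:each |
            (seen includes: each)
                ifFalse: [seen add: each. emit value: each]]]

Pipe >> do: aBlock
    "Terminal: one walk over the source, no intermediate collections."
    enumerator value: aBlock

Pipe >> ifNotEmpty: aBlock
    "Terminal: gather the survivors and hand them over, if any."
    | results |
    results := OrderedCollection new.
    self do: [:each | results add: each].
    ^results isEmpty ifFalse: [aBlock value: results]

Collection >> asPipe
    ^Pipe on: self

The cascades in the rewrites above send every staging message to the
same Pipe, and the final ifNotEmpty: or do: then runs the whole chain
in a single pass.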

I guess I should point out that work like this has been done in 
VisualWorks on two fronts - StreamWrappers and ComputingStreams. Both 
packages are available in the public Store repository.

Pipes are a great idea - streams talking to streams is the only way to 
do efficient large-data-set programming (e.g. Google's MapReduce technique).

I wish more of Smalltalk were written with this approach in mind; it 
would scale without effort then, and programmers wouldn't accidentally 
create memory-explosion bottlenecks without trying. Multiple select:, 
collect:, and reject: calls on large data sets will bring any image to 
its knees in seconds if more than one concurrent user invokes the same 
sort of operation at once.
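
Concretely, the pattern I mean looks like this - the domain (orders,
lineItems, price) is invented for the example:

    "Four chained operations; each one allocates a fresh result
    collection that becomes garbage the moment the next line runs."
    totals := (((orders
        collect: [:each | each lineItems])
        collect: [:items | items inject: 0
            into: [:sum :item | sum + item price]])
        collect: [:total | total roundTo: 0.01])
        reject: [:total | total isZero]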

The speed issue comes not from the time one call of this takes, but 
from what happens when multiple processes try to do the same thing 
(e.g. multiple users hitting your server at once). And it comes not 
from CPU cycles, but from memory allocation and garbage collection.

If you start with a collection of 100,000 things and do four operations 
on it - three collect:'s and a reject:, like the chain above - you'll 
allocate four result arrays of 100,000 slots each. At four bytes a 
slot, the three intermediate arrays alone are 1.2 MB of short-lived 
garbage. Now get 10 users running the same function at the same time 
and you've just churned through 12 MB. Scale that up to more elaborate 
chains of functions or more users and you have serious scalability issues.
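
For contrast, the same computation as a single pass over the same
invented domain - the only collection ever allocated is the result:

    totals := OrderedCollection new.
    orders do: [:each |
        | total |
        total := each lineItems inject: 0
            into: [:sum :item | sum + item price].
        total := total roundTo: 0.01.
        total isZero ifFalse: [totals add: total]]

This is exactly the loop a pipe object can package up behind the
friendlier selecting:/collecting: protocol.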

Now, to drive the point home: suppose you're generating web pages on 
the server. You start with a parse of a node tree that concatenates 
dozens of little strings to produce the page, which pushes the result 
through a zip library, then through a chunked-transfer stream, and 
perhaps a UTF-8 encoder too. Unless all those stages talk to each other 
stream-to-stream through a cyclic buffer, they're going to generate 
LOTS of small and large strings as each one builds up its own internal 
streams and buffers.
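
The wrapper pattern that avoids this is tiny. Here's a toy stage - the
class name and the transform are invented for the example - that
forwards each character downstream as it arrives instead of
accumulating its own copy of the page:

Object subclass: #UpcasingStage
    instanceVariableNames: 'downstream'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Pipes-Sketch'

UpcasingStage class >> on: aWriteStream
    ^self new setDownstream: aWriteStream

UpcasingStage >> setDownstream: aWriteStream
    downstream := aWriteStream

UpcasingStage >> nextPut: aCharacter
    "Transform one element and pass it straight through; the whole
    document is never held in this stage."
    downstream nextPut: aCharacter asUppercase

UpcasingStage >> nextPutAll: aString
    aString do: [:each | self nextPut: each]

Used like any write stream:

    | out |
    out := WriteStream on: String new.
    (UpcasingStage on: out) nextPutAll: 'hello, pipes'.
    out contents  "=> 'HELLO, PIPES'"

A real zip, chunking or UTF-8 stage has the same shape, just with a
small fixed-size buffer inside; stack them on top of the socket stream
and the page flows through in small pieces, never as one giant string.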

Anyway, just food for thought. At Wizard we spent a considerable amount 
of time optimizing our HTTP/1.1 server to deal with exactly this sort 
of thing. We also found we could use the same code for database 
operations (we were using BerkeleyDB as our database, so querying was 
done by us).

Cheers,
Michael


