[Challenge] large files smart compare (was: Re: Squeak for I/O and Memory Intensive tasks )

Bijan Parsia bparsia at email.unc.edu
Tue Jan 29 18:54:44 UTC 2002


On Tue, 29 Jan 2002, Scott A Crosby wrote:

> On Tue, 29 Jan 2002, Yoel Jacobsen wrote:
[snip]
> Ah, when building the data, String>>, is very expensive when doing
> multiple concatenations, try   String>>streamContents:
> for building. That may be where you got the 120 seconds from.
[snip]

Or using a WriteStream directly, which you can tune a bit better.
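
To make the difference concrete, here is a small workspace sketch (mine, not from the thread; the element count and variable names are arbitrary) that builds the same string by repeated concatenation, with streamContents:, and with an explicit WriteStream:

```smalltalk
| n slow viaStreamContents ws viaWriteStream |
n := 1000.  "arbitrary size, for illustration only"

"Repeated , : every step copies everything built so far."
slow := ''.
1 to: n do: [:i | slow := slow , i printString , ' '].

"streamContents: sets up the stream for you."
viaStreamContents := String streamContents: [:s |
	1 to: n do: [:i | s nextPutAll: i printString; space]].

"An explicit WriteStream lets you choose the starting collection yourself."
ws := WriteStream on: String new.
1 to: n do: [:i | ws nextPutAll: i printString; space].
viaWriteStream := ws contents.
```

All three produce the same string; only the amount of copying along the way differs.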

Studying OrderedCollection>>add: can be instructive. Remember that a lot
of this streaming-like stuff has to copy into a collection twice as big
if you hit the bounds. If you can presize your collections (or reuse
them) you should do better.

You can tune your stream by streaming onto a collection of roughly the
right size, or greater. Check out WriteStream>>pastEndPut: for what
happens if you try to write past the end of your collection, to wit:

...	collection := collection , (collection class new: 
			((collection size max: 20) min: 20000)).
...

So the best you can do here is add 20000 slots at a time, which
means that, every 20000 elements or so, you have:

	1) the old collection
	2) a new, empty 20000 entry collection
	3) then the concatenation of the two, 
	3.5) which includes all the copying that requires, if any
	4) then you have to gc 1, 2, and 3.5

If the old collection is very large already, but still growing, you'll
have to GC it a number of times. Yuck.
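
As a rough feel for the cost, the following workspace sketch simulates that growth rule without allocating anything. It assumes the stream starts on an empty collection and that every grow copies all existing elements; the target size is a made-up figure:

```smalltalk
| target size grows copied |
target := 1000000.  "hypothetical final element count"
size := 0.          "assumes streaming onto an initially empty collection"
grows := 0.
copied := 0.
[size < target] whileTrue: [
	copied := copied + size.  "the , copies every existing element"
	size := size + ((size max: 20) min: 20000).  "same rule as pastEndPut:"
	grows := grows + 1].
Transcript show: grows printString , ' grows, ' , copied printString , ' elements copied'; cr.
```

Once the collection passes 20000 elements, each further grow copies the whole thing to gain only 20000 slots, so the copying total climbs much faster than the data itself.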

Whereas, 
	WriteStream on: (YourTargetCollection new: fullsize) 

should allocate the TargetCollection once, at the start.
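
A minimal sketch of the presized version, assuming you know (or can generously estimate) the final size; Array stands in for YourTargetCollection and the numbers are arbitrary:

```smalltalk
| fullsize ws result |
fullsize := 100000.  "assumed known, or estimated on the high side"
ws := WriteStream on: (Array new: fullsize).
1 to: fullsize do: [:i | ws nextPut: i * i].
result := ws contents.  "copies out only the elements actually written"
```

If the estimate is too low, the stream simply falls back to growing as described above, so overestimating a little is the safer direction.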

(If you can/must use OrderedCollections,
	OrderedCollection new: fullsize
should work too. I *suspect* that streaming is a touch faster, but I
don't have any measurements.)
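
If you want to measure rather than guess, something like this workspace sketch would do. The size is arbitrary and no particular result is implied; run it several times, since GC activity will make individual timings jump around:

```smalltalk
| fullsize ws oc streamTime ocTime |
fullsize := 100000.  "arbitrary; pick something close to your real data"

streamTime := Time millisecondsToRun: [
	ws := WriteStream on: (Array new: fullsize).
	1 to: fullsize do: [:i | ws nextPut: i]].

ocTime := Time millisecondsToRun: [
	oc := OrderedCollection new: fullsize.
	1 to: fullsize do: [:i | oc add: i]].

Transcript show: 'stream: ' , streamTime printString , ' ms, OrderedCollection: ' , ocTime printString , ' ms'; cr.
```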

Of course, the streaming protocol works with FileStreams too, so if you
can write out your results as you go along, you can test with small sets
in memory and keep cycling out results in the real thing. (That would
work better if you were using buffered FileStreams, which I know are in
flow and aren't all that hard to build for your own stuff; but here, the
$100 of memory might do the trick :).)
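
For instance, a sketch along these lines keeps the writing code identical for both cases. The block, the data, and the file name 'results.txt' are placeholders of mine:

```smalltalk
| writeResults mem file |
"The same block works for any stream that understands nextPutAll: and cr."
writeResults := [:aStream |
	1 to: 10 do: [:i | aStream nextPutAll: i printString; cr]].

"Small test run, kept in memory."
mem := WriteStream on: String new.
writeResults value: mem.

"Real run, written out as you go. The file name is a placeholder."
file := FileStream forceNewFileNamed: 'results.txt'.
[writeResults value: file] ensure: [file close].
```

Inspect mem contents to check the output format before pointing the same block at a file.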

Cheers,
Bijan Parsia.



