[Challenge] large files smart compare (was: Re: Squeak for I/O and Memory Intensive tasks )

Tue Jan 29 20:02:27 UTC 2002

> 0. Any good idea about how to make it practical for 450K entries (18M 
> lines)? What should I  use for persistence?

Assuming two entries count as equal only when their strings are
identical (so equal entries are "not differences" and can be ignored):
1. Use a generic sorting utility such as Unix 'sort' to sort both
inputs. Such tools handle big files well.
2. Do something akin to the merge phase of a merge sort: read both files
in step. Ignore any line that appears in both files. Keep any line
without a match. If there are many unmatched lines, don't keep them in
memory; write them to a file.
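
The two steps above can be sketched roughly as follows. This is only an
illustration, not tested against the real data: the function name
diff_sorted and the sample lines are made up, and it assumes both inputs
were already sorted in plain byte order (for example with
'LC_ALL=C sort'), since that is the order Python uses when comparing
bytes. A locale-aware sort could order lines differently and break the
walk.

```python
import io

def diff_sorted(file_a, file_b, only_a, only_b):
    """Walk two sorted inputs in step; write unmatched lines to only_a / only_b.

    Assumes both inputs are sorted in plain byte order, the same order
    Python uses to compare bytes objects.
    """
    def lines(f):
        # Strip the newline so comparisons see the same key the sorter used.
        for raw in f:
            yield raw.rstrip(b"\n")

    it_a, it_b = lines(file_a), lines(file_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None and b is not None:
        if a == b:                      # present in both: not a difference
            a, b = next(it_a, None), next(it_b, None)
        elif a < b:                     # a cannot appear later in B
            only_a.write(a + b"\n")
            a = next(it_a, None)
        else:                           # b cannot appear later in A
            only_b.write(b + b"\n")
            b = next(it_b, None)
    # One input is exhausted; whatever remains in the other is unmatched.
    while a is not None:
        only_a.write(a + b"\n")
        a = next(it_a, None)
    while b is not None:
        only_b.write(b + b"\n")
        b = next(it_b, None)


# Tiny in-memory demonstration; real use would pass files opened in
# "rb" / "wb" mode instead of BytesIO objects.
old = io.BytesIO(b"apple\nbanana\ncherry\n")
new = io.BytesIO(b"banana\ncherry\ndate\n")
removed, added = io.BytesIO(), io.BytesIO()
diff_sorted(old, new, removed, added)
print(removed.getvalue(), added.getvalue())
```

Only one line from each input is held in memory at a time, and the
unmatched lines go straight to the output files. Duplicate lines are
paired one for one, so a line that occurs twice in one file and once in
the other is reported once as unmatched.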

This should be fast, and more useful than the Python code.

>     Thanks
>             Yoel

Daniel