[Challenge] large files smart compare (was: Re: Squeak for I/O and Memory Intensive tasks)
danielv at netvision.net.il
Tue Jan 29 20:02:27 UTC 2002
> 0. Any good idea about how to make it practical for 450K entries (18M
> lines)? What should I use for persistence?
Assuming that two entries must be string-equal to count as equal (and so
are not differences, and can be ignored):
1. Use a generic sorting utility such as Unix 'sort' to sort both
inputs. Such utilities handle big files well.
2. Do something akin to the merge phase of merge sort: read both files
in step. Ignore any line that appears in both files. Keep any line
that has no match. If there are many unmatched lines, don't keep them in
memory; write them to a file.
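
A minimal sketch of step 2 in Python, for illustration only. It assumes
both inputs are already sorted in plain byte order (for example with
`LC_ALL=C sort`), that one line is one entry, and that every line ends
with a newline. The function name diff_sorted and the file names in the
usage comment are hypothetical.

```python
import io


def diff_sorted(a_lines, b_lines, only_a, only_b):
    """Walk two sorted line sources in step, as in a merge.

    Lines present in both inputs are skipped. Lines found in only one
    input are written straight to only_a or only_b, so memory use stays
    constant no matter how many differences there are.
    """
    a_iter, b_iter = iter(a_lines), iter(b_lines)
    a = next(a_iter, None)
    b = next(b_iter, None)
    while a is not None and b is not None:
        if a == b:
            # Same line on both sides: not a difference, advance both.
            a = next(a_iter, None)
            b = next(b_iter, None)
        elif a < b:
            # a sorts first, so it cannot appear later in b.
            only_a.write(a)
            a = next(a_iter, None)
        else:
            only_b.write(b)
            b = next(b_iter, None)
    # One input is exhausted; whatever remains in the other is unmatched.
    while a is not None:
        only_a.write(a)
        a = next(a_iter, None)
    while b is not None:
        only_b.write(b)
        b = next(b_iter, None)


# Usage with hypothetical file names, opened in binary mode so that the
# comparison order agrees with byte-order sorting:
#
# with open("old.sorted", "rb") as a, open("new.sorted", "rb") as b, \
#      open("only_old.txt", "wb") as oa, open("only_new.txt", "wb") as ob:
#     diff_sorted(a, b, oa, ob)
```

The comparison order in the code has to agree with the order the sort
utility used; otherwise lines get reported as unmatched when they are
not. That is why the sketch compares raw bytes.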
This should be fast, and more useful than the Python code.
> Thanks
> Yoel
Daniel