[Challenge] large files smart compare (was: Re: Squeak for I/O and Memory Intensive tasks )

danielv at netvision.net.il
Fri Feb 1 13:55:37 UTC 2002


So merge whole records, not a problem. It doesn't even matter much if
you sort on only one field, because eliminating the many trivial matches
will shrink the problem dramatically. At that point, you can run the
existing, expensive, but precise algorithm on what remains.
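
For concreteness, a minimal Python sketch of that record-level
preparation (the separator byte and the blank-line-delimited entry
layout are my assumptions about the LDIF input; continuation lines are
ignored):

    # Flatten each LDIF entry (a blank-line-separated block of
    # "attribute: value" lines) into one canonical line, so that a
    # generic line sorter can order and compare whole records.
    def flatten(src_path, dst_path):
        with open(src_path) as src, open(dst_path, 'w') as dst:
            entry = []
            for line in src:
                line = line.rstrip('\n')
                if line:
                    entry.append(line)
                elif entry:
                    # Sorting the attributes makes equal entries
                    # produce byte-identical flattened lines.
                    dst.write('\x01'.join(sorted(entry)) + '\n')
                    entry = []
            if entry:
                dst.write('\x01'.join(sorted(entry)) + '\n')

Once both files are flattened like this, sorting them and dropping
identical lines eliminates the trivial matches in one linear pass.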

See Richard's very interesting description of this solution, though it
depends on being comfortable with the unix toolbox.
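
The comparison pass itself (the merge-sort-style synchronized scan
described in the quoted text below) is short. A sketch, assuming both
inputs have already been flattened and sorted as above:

    # Walk two sorted files in lockstep, like the merge phase of merge
    # sort: equal lines appear in both files and are dropped; anything
    # else is a real difference and goes straight to the output file.
    def diff_sorted(path_a, path_b, out_path):
        with open(path_a) as fa, open(path_b) as fb, \
             open(out_path, 'w') as out:
            a, b = fa.readline(), fb.readline()
            while a and b:
                if a == b:
                    a, b = fa.readline(), fb.readline()
                elif a < b:
                    out.write('< ' + a)   # only in the first file
                    a = fa.readline()
                else:
                    out.write('> ' + b)   # only in the second file
                    b = fb.readline()
            while a:                      # leftovers in the first file
                out.write('< ' + a)
                a = fa.readline()
            while b:                      # leftovers in the second file
                out.write('> ' + b)
                b = fb.readline()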

I think that for this particular problem, if you need to do the task
often, changing the algorithm will give you great results. As a general
rule, and as a better answer to your challenge, exploring the work that
has been done on using BDB (Berkeley DB) might be more broadly useful.
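
As a rough illustration of the persistence angle, here Python's
standard dbm module stands in for BDB, and keying each entry by its dn
attribute is my assumption, not something from the thread:

    import dbm

    # Keep the 450K entries on disk, keyed by DN, instead of holding
    # them all in memory; each lookup then costs one disk probe.
    def load(flat_path, db_path):
        db = dbm.open(db_path, 'n')   # 'n': create a new, empty database
        with open(flat_path) as src:
            for line in src:
                record = line.rstrip('\n')
                fields = record.split('\x01')
                # assumes exactly one "dn:" attribute per entry
                dn = next(f for f in fields if f.startswith('dn:'))
                db[dn] = record
        db.close()

With both datasets loaded this way, the unmatched entries left over
from the sorted pass can be looked up by DN and compared field by
field with the precise algorithm.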

Have fun.
Daniel

Yoel Jacobsen <yoel at emet.co.il> wrote:
> 
> This is not correct, since I need to compare entry to entry, not pair
> to pair. Sorting will only damage the LDIF files.
> 
>     Yoel
> 
> danielv at netvision.net.il wrote:
> 
> >>0. Any good idea about how to make it practical for 450K entries (18M
> >>lines)? What should I use for persistence?
> >>
> >
> >Assuming that entries have to be string-equal to count as equal, and
> >are thus "not differences" and thus boring:
> >1. Use a generic sorting utility like unix 'sort' to sort both
> >inputs. Such tools are pretty good at handling big files.
> >2. Do something akin to the merge phase of merge sort - read both
> >files in a synched manner. Any line that matches in both files,
> >ignore. Any line without a match, keep. If there are many unmatched
> >lines, don't hold them in memory; write them to a file.
> >
> >This should be fast, and more useful than the Python code.
> >
> >>    Thanks
> >>            Yoel
> >>
> >
> >Daniel


