[Challenge] large files smart compare (was: Re: Squeak for I/O and Memory Intensive tasks )

Scott A Crosby crosby at qwes.math.cmu.edu
Tue Jan 29 18:21:46 UTC 2002


On Tue, 29 Jan 2002, Yoel Jacobsen wrote:

>
> We suspected that the migration process had mistakes (in the analysis
> and implementation). We wanted to compare the current and old LDAP
> directories.
>

There seem to be at least a few algorithm-level mistakes in the approach.

> What I did with Python
> ================
> After I realized that analyzing it in memory is impractical (I had a
> machine with 4GB of RAM for this and it was not enough for holding the
> two databases 'in the air'), I was using Berkeley DB for storing databases.
>

What is the average size (in bytes) of an entry? And what is the average
size (in bytes) of the value (non-key) part of an entry?

If you can fit it in $100 worth of RAM, you're probably better off spending
the $100 than fooling with this for a day or two. For something like this,
you should be able to store it in less RAM than the textfile representation
itself would need. One trick: make sure duplicate keys are only held once
in RAM; if both keys and values are duplicated a lot, you can save a lot
more. (I'd need a realistic data sample to judge.)
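
For instance, something along these lines keeps each distinct key or value
string in RAM only once (just a sketch; 'internTable' and the sample DN are
made up):

  | internTable intern sharedValue |
  internTable := Dictionary new.
  "the block answers the one canonical copy of any string seen before"
  intern := [:aString | internTable at: aString ifAbsentPut: [aString]].
  sharedValue := intern value: 'ou=People,dc=example,dc=com'.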

So, before anything else, make sure that fitting it in RAM really *isn't*
possible.

> The algorithm in general:
>     1. Make a database from the two 18M lines LDIF files (resulting in
> about 800MB database files). What I did here is to create an
> LDAPObject instance (see attached squeak file) and store it serialized
> with its DN as the key.

Make sure the DNs are not random attributes that are uncorrelated between
the two datasets. When serializing data like this, write it out ordered by
some semi-unique key within the data, say by 'name' or 'email'.

Do this for both datasets; then they'll be in roughly the same order.

Then you can stream data from both files, cache in RAM any entry that you
haven't matched yet, and as soon as you find a match, 'forget' both entries
(and write a log line noting the match). Now you're basically doing an
optimized LCS-style match, and Squeak never needs to remember anything
except the additions/differences between the two datasets.
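
Roughly like this (only a sketch: oldStream and newStream are assumed to
yield DN -> entry Associations from the two dumps, written out in roughly
the same order, and every name here is invented). It reports differences
and forgets anything that matches:

  | pendingOld pendingNew report assoc match |
  pendingOld := Dictionary new.   "seen in the old dump, no match yet"
  pendingNew := Dictionary new.
  report := [:msg | Transcript show: msg; cr].
  [oldStream atEnd and: [newStream atEnd]] whileFalse: [
      oldStream atEnd ifFalse: [
          assoc := oldStream next.
          match := pendingNew removeKey: assoc key ifAbsent: [nil].
          match
              ifNil: [pendingOld at: assoc key put: assoc value]
              ifNotNil: [match = assoc value
                  ifFalse: [report value: 'changed: ', assoc key]]].
      newStream atEnd ifFalse: [
          assoc := newStream next.
          match := pendingOld removeKey: assoc key ifAbsent: [nil].
          match
              ifNil: [pendingNew at: assoc key put: assoc value]
              ifNotNil: [match = assoc value
                  ifFalse: [report value: 'changed: ', assoc key]]]].
  pendingOld keysDo: [:dn | report value: 'only in old: ', dn].
  pendingNew keysDo: [:dn | report value: 'only in new: ', dn].

As long as the two dumps really are in roughly the same order, the two
pending dictionaries stay small, which is the whole point.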

>     2. For each entry in the source, find the entry in the destination.
> If not exists - report and try with the next entry

Object persistence doesn't help you one whit here. If you actually run out
of memory and your cache hit rate is low, you're going to take a ~20ms disk
hit on the lookup for each entry, which comes out to about 3 hours of pure
disk latency (20ms * 500k lookups = 10,000 seconds).

Nothing is going to stop that.

You might as well load them all into memory and let the OS swapping daemon
do the disk traffic for you.

Which is why you should not be focusing on persistence; it won't help.
Instead, work on avoiding those disk hits in the first place, either with
more RAM or with better algorithms.

> My Squeak tests
> ============
>
> Parsing 10K lines took me about 12 minutes in which the image was
> working on this as a single task. Profiling shows the time is mostly
> spent on adding to collections.
>

Which part of collection? :)

Care to offer a bet that it was spending 90% of its time doing
String>>hash or String>>=? :)

With an extra 16 bytes per string, that cost can be dropped by about
80-98%, and it also lets equal strings be shared.
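
One way to get the extra bytes to pay off (purely illustrative; this class
and its selectors are invented, not something already in the image) is a
tiny key wrapper that hashes its string exactly once:

  Object subclass: #CachedKey
      instanceVariableNames: 'string cachedHash'
      classVariableNames: ''
      poolDictionaries: ''
      category: 'LDAP-Compare'

  "CachedKey >> setString: -- store the string and hash it exactly once"
  setString: aString
      string := aString.
      cachedHash := aString hash

  "CachedKey >> string -- accessor used by ="
  string
      ^ string

  "CachedKey >> hash -- Dictionary lookups reuse the cached value"
  hash
      ^ cachedHash

  "CachedKey >> = -- equal when the wrapped strings are equal"
  = other
      ^ (other isKindOf: CachedKey) and: [string = other string]

On a 32-bit image a two-slot wrapper like this costs roughly the 16 bytes
per string mentioned above, and interning the wrappers in one table is what
lets both datasets share them.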

Ah, and when building the data, String>>, is very expensive when you do
many concatenations in a row; try String>>streamContents: for building
instead. That may be where you got the 120 seconds from.
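
For example (a toy snippet with made-up data, not your parser):

  | lines slow fast |
  lines := #('dn: uid=yoel,ou=people' 'cn: Yoel' 'sn: Jacobsen').

  "Slow: every , copies everything built so far."
  slow := ''.
  lines do: [:line | slow := slow , line , (String with: Character cr)].

  "Fast: write into one buffer and take the finished String at the end."
  fast := String streamContents: [:stream |
      lines do: [:line | stream nextPutAll: line; cr]].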


> Questions
> =========
> 0. Any good idea about how to make it practical for 450K entries (18M
> lines)? What should I  use for persistence?

Avoid disk-based persistence when it won't help. There's nothing magical
about it that makes random key lookups any faster.

Scott



