[Challenge] large files smart compare (was: Re: Squeak for I/O and Memory Intensive tasks )

Richard A. O'Keefe ok at atlas.otago.ac.nz
Wed Jan 30 22:49:38 UTC 2002


Yoel Jacobsen <yoel at emet.co.il> wrote:
	    Here is a small explanation
	
	    A simple LDIF entry:
	       
	        dn: uid=yoel, ou=people, o=xyz.com
	        objectclass: top
	        objectclass: person
	        objectclass: organizationalPerson
	        objectclass: inetOrgPerson
	        uid: yoel
	        cn: yoel
	        sn: jacobsen
	        mail: yoel@emet.co.il
	
An entry has a key (o,ou,uid) and a value;
the value is a mapping from attribute names to strings or string lists.

A collection of LDIF entries represents a mapping from LDIF keys
to entire LDIF entries.

If you sort such a collection, it still represents the SAME mapping
from LDIF keys to entire LDIF entries.

So, in order to compare two LDIF collections A, B:
    (1) Sort A (on disc) using (o,ou,uid) as key, giving SA.
	You can use the UNIX sort(1) command to do this, although
	you might need to massage the data a little first to convert
	each entry to a single line and to bring a suitably transformed
	key to the beginning of the line.  You will want to ensure that
	the entry is in some normal form; if, as I suspect, the order
	of the attributes does not matter (except for multiple occurrences
	of the same attribute, like objectclass:), then a stable sort of
	the attribute: value pairs will do nicely.  If the order of
	different attributes DOES matter, then of course do nothing.

	Of course we don't actually _need_ SA as a disc file; reading
	the output of the sort command through a pipe will do.

    (2) Sort B (on disc) using (o,ou,uid) as key, giving SB.
	Once again, reading the output of sort(1) using a pipe will do.

    (3) Do a merge on (the streams reading from) SA and SB.
	You hold *ONE* entire entry (with key) from SA
	and hold *ONE* entire entry (with key) from SB
	in memory at a time.

	We're talking about a modified version of the comm(1) command
	(see the sketch after this list).
	If a key is present in SA but not SB, the entire entry must be
	    missing from SB.
	If a key is present in SB but not SA, the entire entry must be
	    missing from SA.
	If a key is present in both SA and SB,
	    if the entire entries are identical (one string comparison),
	    no worries;
	    if there is any difference, analyse the difference and report it.
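
For concreteness, here is a minimal Python sketch of such a modified
comm.  It assumes the massaged files hold one tab-separated entry per
line with the key first (the record format shown further below), and
that the files were sorted with LC_ALL=C so byte order matches Python's
string order; the script and function names are invented for
illustration.  Only two entries are ever held in memory.

    #!/usr/bin/env python
    # modified-comm: merge two sorted streams of massaged LDIF records,
    # reporting entries present in only one file and entries whose
    # attribute:value fields differ.  Holds one entry per stream.
    import sys

    def next_entry(f):
        line = f.readline()
        if not line:
            return None, None
        key, _, rest = line.rstrip('\n').partition('\t')
        return key, rest

    def report_difference(key, va, vb):
        # One entry changed: show which attribute:value fields differ.
        a, b = set(va.split('\t')), set(vb.split('\t'))
        print('changed:', key, '| A only:', sorted(a - b),
              '| B only:', sorted(b - a))

    def merge(fa, fb):
        ka, va = next_entry(fa)
        kb, vb = next_entry(fb)
        while ka is not None or kb is not None:
            if kb is None or (ka is not None and ka < kb):
                print('missing from B:', ka)
                ka, va = next_entry(fa)
            elif ka is None or kb < ka:
                print('missing from A:', kb)
                kb, vb = next_entry(fb)
            else:
                if va != vb:          # one string comparison
                    report_difference(ka, va, vb)
                ka, va = next_entry(fa)
                kb, vb = next_entry(fb)

    if __name__ == '__main__':
        with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
            merge(fa, fb)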

As an old AWK hacker, I'd have
	awk -f massage.awk A | sort >SA
	awk -f massage.awk B | sort >SB
	modified-comm SA SB

Note that the two awk processes each need only one LDIF entry in memory
at a time, and the modified-comm process needs only two LDIF entries in
memory at a time.  The only processes that can really take advantage of
lots of memory are the sort processes.

Ideally, there wouldn't be a "massage" step; the programs that generate
the LDIF files should be able to write out something like

    xyz.com people yoel<TAB>\
    cn:yoel<TAB>mail:yoel@emet.co.il<TAB>\
    objectclass:top person organizationalPerson inetOrgPerson<TAB>\
    sn:jacobsen<TAB>uid:yoel<NL>
    
where <TAB> represents a tab character, \ simply indicates continuation
on the next line in this message and neither it nor the following white
space would actually be in the record, and <NL> stands for the local end
of line convention.  Here I have assumed that the order of different
attributes doesn't matter, and that attribute values cannot contain spaces.
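
Failing that, the massage step is simple enough.  Here is a rough
Python stand-in for massage.awk, a sketch only, under the assumptions
that entries have the simple shape shown above (separated by blank
lines, no LDIF line continuations or base64-encoded values) and that
the key components are o, ou and uid:

    #!/usr/bin/env python
    # massage: read simple LDIF on stdin, write one sortable
    # tab-separated record per entry on stdout, attributes in a
    # normal form (sorted by name, repeated attributes folded into
    # one space-separated value list, key brought to the front).
    import sys

    def emit(entry):
        if not entry:
            return
        dn = entry.pop('dn')[0]
        # "uid=yoel, ou=people, o=xyz.com" -> key "xyz.com people yoel"
        parts = dict(p.strip().split('=', 1) for p in dn.split(','))
        key = ' '.join(parts.get(c, '') for c in ('o', 'ou', 'uid'))
        fields = ['%s:%s' % (name, ' '.join(values))
                  for name, values in sorted(entry.items())]
        print(key + '\t' + '\t'.join(fields))

    entry = {}
    for line in sys.stdin:
        line = line.strip()
        if not line:
            emit(entry)       # a blank line ends an entry
            entry = {}
        else:
            name, _, value = line.partition(':')
            entry.setdefault(name.strip(), []).append(value.strip())
    emit(entry)               # last entry may lack a trailing blank line

With that, the pipeline reads "python massage.py <A | sort >SA" and
likewise for B.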

Indeed, I would expect any decent database dumper to have an option to
dump the database in sorted order in the first place.

	    In real life, entries are much more complicated. In my case, each 
	entry is composed of about 40 such key-value pairs and such big entries 
	should be compared.
	
	    My Python program (and future Squeak program!) compares complete 
	entries. It finds whether the path changed, whether an attribute name or 
	value changed, or whether the entry does not exist at all.

Since the dn is called the 'key' of an entry, how can you tell if an
entry's path (dn) has changed without having some _other_ unique name
for an entry?  If there is some other unique name, sort using that as
a key instead.  If there isn't some other unique name for an entry,
then you can't tell if the dn has changed, or if one entry has disappeared
and an otherwise similar one been created.

	Since I need to find the 
	matching entry in the destination database for each entry of the source 
	database, I need to place these databases somewhere. I don't want to 
	search the destination database linearly for every source entry.

This is precisely what COBOL's "MERGE" verb was provided for many decades
ago.  Any good data structures and algorithms textbook will discuss the
"merge" design pattern.

Wild guesses on size:
    40 attribute: value pairs @ 60 characters each = 2 400 bytes
    450 000 entries at 2 400 bytes each = 1 080 000 000 bytes.
To hold A, B, SA, SB on disc will require about 4.3 GB of disc, which
is easily obtained these days; more workspace will be needed during
sorting, so 8 GB of space at the beginning would be nice.

Of course, stuff like this is screaming out for compression.
If you have a fixed set of attributes, you don't actually need to
express the attribute names.  Some of the values (top, person, and
perhaps organizationalPerson, and similar objectclass values) are
going to be repeated very very often.  There are other kinds of
redundancy visible as well.  This suggests

    massage-and-compress A | sort >SA
    massage-and-compress B | sort >SB
    merge-and-decompress SA SB

where we only really have to decompress the anomalous entries.
I'm talking about within-entry compression here, not whole-file compression,
so we wouldn't expect a factor of 10 space reduction, but a factor of 2 or 3
seems attainable.
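
To make the idea concrete, here is one crude way to do within-entry
compression while keeping the records as plain sortable text: replace
the most frequent attribute names and values by short codes.  This is
a sketch, not the only way to do it, and the code tables below are
invented for illustration; in practice you would build them from the
fixed attribute set of the real data.

    #!/usr/bin/env python
    # Within-entry compression by token substitution: frequent names
    # and values become short codes, so the records stay line-oriented
    # text and sort(1) still works on them.  The key is left alone.
    NAME_CODES  = {'objectclass': '@c', 'mail': '@m', 'uid': '@u'}
    VALUE_CODES = {'top': '%t', 'person': '%p',
                   'organizationalPerson': '%P', 'inetOrgPerson': '%i'}
    NAME_BACK  = dict((v, k) for k, v in NAME_CODES.items())
    VALUE_BACK = dict((v, k) for k, v in VALUE_CODES.items())

    def recode(record, names, values):
        fields = record.rstrip('\n').split('\t')
        out = [fields[0]]                 # the key stays uncompressed
        for field in fields[1:]:
            name, _, vals = field.partition(':')
            vals = ' '.join(values.get(v, v) for v in vals.split(' '))
            out.append(names.get(name, name) + ':' + vals)
        return '\t'.join(out)

    def squeeze(record):
        return recode(record, NAME_CODES, VALUE_CODES)

    def unsqueeze(record):
        return recode(record, NAME_BACK, VALUE_BACK)

The merge compares the squeezed strings directly, so only the
anomalous entries ever need to be unsqueezed.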

But I'd only worry about compression if I didn't have 8 GB handy.
The sort, sort, merge pattern is the trick.


