[Challenge] large files smart compare (was: Re: Squeak for I/O and Memory Intensive tasks )

Yoel Jacobsen yoel at emet.co.il
Wed Jan 30 07:51:35 UTC 2002


Scott,

    Here is a short explanation:

    A simple LDIF entry:
       
        dn: uid=yoel, ou=people, o=xyz.com
        objectclass: top
        objectclass: person
        objectclass: organizationalPerson
        objectclass: inetOrgPerson
        uid: yoel
        cn: yoel
        sn: jacobsen
        mail: yoel at emet.co.il

   This is one complete entry. Each entry is composed of multiple
key-value pairs (called the entry's attributes). The name dn stands for
'distinguished name'; it is the key of the entire entry in the LDAP
directory and also the path of the entry in the LDAP hierarchy, meaning
that uid=yoel (this entry) is placed under a branch point named
ou=people, which in turn is placed under a root entry named o=xyz.com.
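
    For illustration, here is a rough Squeak sketch of how such an
entry, once split into lines, could be parsed into a Dictionary mapping
attribute names to their values (a sketch only, not the attached
LDAPObject code):

        | entry |
        entry := Dictionary new.
        #('dn: uid=yoel, ou=people, o=xyz.com'
          'objectclass: top'
          'objectclass: person'
          'uid: yoel') do: [:line | | colon key value |
                colon := line indexOf: $:.
                key := (line copyFrom: 1 to: colon - 1) asLowercase.
                value := (line copyFrom: colon + 1 to: line size)
                                withBlanksTrimmed.
                (entry at: key ifAbsentPut: [OrderedCollection new])
                        add: value].
        "entry now maps 'dn' to one value and 'objectclass' to both values"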

    In real life, entries are much more complicated. In my case, each
entry is composed of about 40 such key-value pairs, and such big
entries have to be compared in full.

    My Python program (and future Squeak program!) compares complete
entries. It detects whether the path changed, whether an attribute name
or value changed, or whether the entry does not exist at all. Since I
need to find the matching entry in the destination database for each
entry of the source database, I need to place these databases
somewhere. I don't want to search the destination database linearly for
every source entry; that would not use much memory, but it would take
forever. This is why I wanted to build a memory- or disk-based database
which maps each dn to the complete pre-parsed entry.
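
    As a rough illustration of the comparison step, here is a Squeak
sketch with tiny in-memory stand-ins for the two databases (it is only
a sketch, not the real program; values are held as OrderedCollections
as in the parsing example above):

        | source destination sameEntry missing |
        "tiny stand-ins: each maps a dn String to a parsed entry
         (attribute name -> OrderedCollection of values)"
        source := Dictionary new.
        source at: 'uid=yoel, ou=people, o=xyz.com' put:
                (Dictionary new
                        at: 'sn' put: (OrderedCollection with: 'jacobsen');
                        yourself).
        destination := Dictionary new.
        destination at: 'uid=yoel, ou=people, o=xyz.com' put:
                (Dictionary new
                        at: 'sn' put: (OrderedCollection with: 'jacobson');
                        yourself).
        "compare two parsed entries by content"
        sameEntry := [:a :b |
                a size = b size and: [
                        (a keys
                                detect: [:k | (a at: k) ~= (b at: k ifAbsent: [nil])]
                                ifNone: [nil]) isNil]].
        missing := OrderedCollection new.
        source keysAndValuesDo: [:dn :srcEntry | | dstEntry |
                dstEntry := destination at: dn ifAbsent: [nil].
                dstEntry isNil
                        ifTrue: [missing add: dn]
                        ifFalse: [(sameEntry value: srcEntry value: dstEntry)
                                ifFalse: [Transcript show: 'changed: ', dn; cr]]].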

    Please complete your educated analysis from here.

Scott A Crosby wrote:

>On Tue, 29 Jan 2002, Yoel Jacobsen wrote:
>
>>We suspected that the migration process had mistakes (in the analysis
>>and implementation). We wanted to compare the current and old LDAP
>>directories.
>>
>
>There seem to be at least a few algorithm-related errors made.
>
>>What I did with Python
>>================
>>After I realized that analyzing it in memory is impractical (I had a
>>machine with 4GB of RAM for this and it was not enough for holding the
>>two databases 'in the air'), I used Berkeley DB for storing the databases.
>>
>
>What is the average size (in bytes) of an entry? What is the average
>size (in bytes) of the value (non-key) part of an entry?
>
>If you can fit it in $100 of RAM, you're probably in better shape spending
>the $100 than fooling with it for a day or two. For something like this,
>you should be able to store it in less RAM than a textfile representation
>itself would need. One trick: Make sure duplicate keys are only held
>once in RAM. If both keys and values are duplicated a lot, you can save a
>lot more. (I'd need a realistic data sample to judge.)
>
>So, make sure that that *isn't* possible.
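
    A minimal Squeak sketch of the 'hold duplicate keys only once'
trick, as I understand it: a side table that hands back one canonical
copy per distinct string (the names here are made up):

        | interned intern s1 s2 |
        interned := Dictionary new.
        "answer one shared copy per distinct string; the Dictionary lookup
         uses String>>= and String>>hash, so equal strings collapse to a
         single stored object"
        intern := [:aString | interned at: aString ifAbsentPut: [aString]].
        s1 := intern value: 'objectclass'.
        s2 := intern value: 'objectclass' copy.
        s1 == s2  "true: both now refer to the same String object"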
>
>>The algorithm in general:
>>    1. Make a database from the two 18M-line LDIF files (resulting in
>>database files of about 800MB). What I did here is to create an
>>LDAPObject instance (see attached squeak file) and store it serialized
>>with its DN as the key.
>>
>
>Make sure the DNs are not random attributes and are not uncorrelated between
>the datasets. When serializing data like this, serialize it based on some
>semi-unique key within the data. Say, by 'name' or 'email'.
>
>Do this for both datasets; then they'll be in roughly the same order.
>
>Then you can stream data from both files, cache any entry that you haven't
>used in RAM, and as soon as you find a match, 'forget' about both entries.
>(and write a log-line indicating the match). Now you're basically doing
>optimized LCS-style matching. Squeak never needs to remember anything
>except the additions/differences between both datasets.
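
    A rough Squeak sketch of that streaming comparison, simplified by
assuming both files were written out fully sorted by the chosen key, so
each side can be read as a stream of key -> entry associations (with
only 'roughly the same order', a small cache of unmatched entries would
be needed instead):

        | src dst srcStream dstStream onlyInSource onlyInDest a b |
        "two tiny sorted stand-ins; entries here are just strings, and
         real entries are assumed to compare by content with ="
        src := {'a' -> 'entry A'. 'b' -> 'entry B'. 'd' -> 'entry D'}.
        dst := {'a' -> 'entry A'. 'c' -> 'entry C'. 'd' -> 'entry D*'}.
        srcStream := ReadStream on: src.
        dstStream := ReadStream on: dst.
        onlyInSource := OrderedCollection new.
        onlyInDest := OrderedCollection new.
        [srcStream atEnd or: [dstStream atEnd]] whileFalse: [
                a := srcStream peek.
                b := dstStream peek.
                a key = b key
                        ifTrue: [
                                a value = b value ifFalse:
                                        [Transcript show: 'changed: ', a key; cr].
                                srcStream next.
                                dstStream next]
                        ifFalse: [a key < b key
                                ifTrue: [onlyInSource add: srcStream next key]
                                ifFalse: [onlyInDest add: dstStream next key]]].
        [srcStream atEnd] whileFalse: [onlyInSource add: srcStream next key].
        [dstStream atEnd] whileFalse: [onlyInDest add: dstStream next key].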
>
>>    2. For each entry in the source, find the matching entry in the
>>destination. If it does not exist, report it and move on to the next entry.
>>
>
>Object persistence doesn't help you one whit. If you actually are going to
>run out of memory and your cache hit rate is low, you're going to take a
>20ms disk hit doing a lookup on each entry, which comes out to about 3 hours
>worth of disk latency (20ms * 500k).
>
>Nothing is going to stop that.
>
>You might as well load them into memory and let the OS swapping daemon do
>that for you.
>
>Which is why you should not be focusing on persistence, which won't help.
>Instead, try to work out how to avoid that altogether, either with more RAM
>or better algorithms.
>
>>My Squeak tests
>>============
>>
>>Parsing 10K lines took me about 12 minutes, during which the image was
>>working on this as a single task. Profiling shows the time is mostly
>>spent on adding to collections.
>>
>
>Which part of collection? :)
>
>Care to offer a bet that it was spending 90% of its time doing
>String>>hash or String>>=? :)
>
>With an extra 16 bytes per string, that can be dropped by about 80-98%,
>and it also allows sharing.
>
>Ah, when building the data, String>>, is very expensive when doing
>multiple concatenations; try String>>streamContents: for building.
>That may be where you got the 120 seconds from.
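
    For concreteness, a tiny made-up example of the difference: repeated
String>>, recopies everything built so far, while streamContents:
appends into one growing buffer:

        | lines slow fast |
        lines := #('dn: uid=yoel, ou=people, o=xyz.com' 'uid: yoel' 'cn: yoel').
        "quadratic: every , copies the whole string built so far"
        slow := ''.
        lines do: [:each |
                slow := slow , each , (String with: Character cr)].
        "linear: append each piece into a single write stream"
        fast := String streamContents: [:aStream |
                lines do: [:each | aStream nextPutAll: each; cr]].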
>
>
>>Questions
>>=========
>>0. Any good idea about how to make it practical for 450K entries (18M
>>lines)? What should I use for persistence?
>>
>
>Avoid disk-based persistence when it won't help. There's nothing magical
>about it making random key lookups any faster.
>
>Scott
>
