[Challenge] large files smart compare (was: Re: Squeak for I/O and Memory Intensive tasks )
Bob Arning
arning at charm.net
Wed Jan 30 19:22:56 UTC 2002
Yoel,
Here is some code that will compare two 450k-object (83 MByte) files in 22 minutes in about 22 meg of ram.
Cheers,
Bob
'From Squeak3.2gamma of 17 January 2002 [latest update: #4653] on 30 January 2002 at 2:10:22 pm'!
"Change Set: ldapTest
Date: 30 January 2002
Author: Bob Arning
Comparison for LDAP files
To create 2 test files,
LDAPObject buildTestFiles: 450*1000
To compare those files,
LDAPObject test2
Results on 400MHz mac: 450k items -- space used: 22,481,296, time: 22 minutes
"!
Object subclass: #LDAPObject
instanceVariableNames: 'attr '
classVariableNames: ''
poolDictionaries: ''
category: 'YoelTests'!
!LDAPObject commentStamp: '<historical>' prior: 0!
"Use the following lines to create an example LDIF file. Use CrLfFileStream if u got the wrong OS"
f _ FileStream fileNamed: 'ggg'.
f ascii.
1 to: 10000 do: [ :num | ns _ num asString.
f nextPutAll: ('dn: uid=yoel' , ns , ', ou=people, o=emet.co.il
objectclass: top
objectclass: person
objectclass: organizationalPerson
objectclass: inetOrgPerson
uid: yoel' , ns , '
cn: yoel' , ns , '
sn: jacobsen
')].
f close.
f _ FileStream fileNamed: 'ggg'.
f ascii.
lines _ f contentsOfEntireFile.
f close.
crcr _ String cr , String cr.
blocks _ lines findBetweenSubStrs: { crcr }.
objs _ blocks collect: [ :bl | LDAPObject fromStr: bl ].
!
LDAPObject class
instanceVariableNames: 'attributes '!
!LDAPObject methodsFor: 'accessing' stamp: 'YJ 1/28/2002 22:47'!
attrs
^ attr! !
!LDAPObject methodsFor: 'accessing' stamp: 'YJ 1/28/2002 22:46'!
attrs: aDict
attr _ aDict.
^ self! !
!LDAPObject methodsFor: 'accessing' stamp: 'RAA 1/30/2002 10:12'!
getUID
^(attr at: 'uid') first! !
!LDAPObject methodsFor: 'accessing' stamp: 'RAA 1/30/2002 10:13'!
listOfDifferencesFrom: another
| otherAttrs answer lineBegin |
otherAttrs _ another attrs.
answer _ WriteStream on: String new.
lineBegin _ '(',self getUID,') '.
attr keysAndValuesDo: [ :k1 :v1 |
(otherAttrs includesKey: k1) ifTrue: [
(otherAttrs at: k1) = v1 ifFalse: [
answer nextPutAll: lineBegin,'attribute ',k1,' is different in the two items'; cr.
].
] ifFalse: [
answer nextPutAll: lineBegin,'attribute ',k1,' missing from second item'; cr.
].
].
otherAttrs keysDo: [ :k1 |
(attr includesKey: k1) ifFalse: [
answer nextPutAll: lineBegin,'attribute ',k1,' missing from first item'; cr.
].
].
^answer contents! !
!LDAPObject class methodsFor: 'class initialization' stamp: 'RAA 1/30/2002 10:04'!
fromStr: str
| obj dict pairsKey pairsVal point |
obj _ self new.
dict _ Dictionary new.
str linesDo: [ :line |
point _ line findString: ': '.
point > 0 ifTrue: [
pairsKey _ line copyFrom: 1 to: (point - 1).
pairsVal _ line copyFrom: (point + 2) to: (line size).
(dict at: pairsKey ifAbsentPut: [OrderedCollection new]) add: pairsVal.
].
].
^obj attrs: dict! !
!LDAPObject class methodsFor: 'comparing files' stamp: 'RAA 1/30/2002 10:14'!
buildIndexFor: aFileName
"
LDAPObject buildIndexFor: 'ggg'
"
| f answer |
f _ FileStream readOnlyFileNamed: aFileName.
f ascii.
[answer _ self buildIndexFrom: f] ensure: [f close].
^answer
! !
!LDAPObject class methodsFor: 'comparing files' stamp: 'RAA 1/30/2002 13:10'!
buildIndexFrom: aStream
| answer key interim |
answer _ Dictionary new.
interim _ OrderedCollection new.
self
eachRecordFrom: aStream
do: [ :str :start |
key _ self parseForUID: str.
"try to keep answer in old space as much as possible"
interim add: (Association key: key value: {start. str size}).
interim size > 5000 ifTrue: [
interim do: [ :each | answer add: each].
interim _ OrderedCollection new.
].
].
interim do: [ :each | answer add: each].
^answer! !
!LDAPObject class methodsFor: 'comparing files' stamp: 'RAA 1/30/2002 13:17'!
eachRecordFrom: aStream do: aBlock
| cr current ch sawCR gotOneItem temp start fillBuffer buffStart buffer count t1 t2 |
t1 _ Time millisecondClockValue.
count _ 0.
cr _ Character cr.
sawCR _ false.
current _ WriteStream on: (String new: 1000).
start _ aStream position.
fillBuffer _ [
buffStart _ aStream position.
buffer _ ReadStream on: (aStream next: 10000).
].
gotOneItem _ [
temp _ current contents.
temp isEmpty ifFalse: [aBlock value: temp value: start].
start _ buffer position + buffStart.
current reset.
(count _ count + 1) \\ 10000 = 0 ifTrue: [
t2 _ Time millisecondClockValue.
Transcript show: count printString,' - ',(t2 - t1) printString; cr.
t1 _ t2.
].
].
fillBuffer value.
[
ch _ buffer next ifNil: [
fillBuffer value.
ch _ buffer next ifNil: [^gotOneItem value].
].
ch == cr ifTrue: [
sawCR ifTrue: [
gotOneItem value.
] ifFalse: [
sawCR _ true.
current nextPut: ch.
].
] ifFalse: [
sawCR _ false.
current nextPut: ch.
].
true
] whileTrue.
gotOneItem value.! !
!LDAPObject class methodsFor: 'comparing files' stamp: 'RAA 1/30/2002 12:52'!
parseForUID: aString
| point pairsKey |
aString linesDo: [ :line |
point _ line findString: ': '.
point > 0 ifTrue: [
pairsKey _ line copyFrom: 1 to: (point - 1).
pairsKey = 'uid' ifTrue: [
^line copyFrom: (point + 2) to: (line size)
].
].
].
! !
!LDAPObject class methodsFor: 'comparing files' stamp: 'RAA 1/30/2002 13:25'!
test2
"
LDAPObject test2
MessageTally spyOn: [LDAPObject test2]
MessageTally spyOn: [LDAPObject buildIndexFor: 'ggg1']
"
| file1 file2 obj2 index1 errors uid2 obj1 comparison space1 space2 |
space1 _ Smalltalk garbageCollect.
index1 _ self buildIndexFor: 'ggg1'.
space2 _ Smalltalk garbageCollect. "ensure index in old space"
file1 _ FileStream readOnlyFileNamed: 'ggg1'.
file2 _ FileStream readOnlyFileNamed: 'ggg2'.
errors _ WriteStream on: String new.
errors nextPutAll: index1 size printString,' items, space used: ',
(space1 - space2) asStringWithCommas; cr; cr.
self
eachRecordFrom: file2
do: [ :str2 :start |
obj2 _ self fromStr: str2.
uid2 _ obj2 getUID.
obj1 _ nil.
index1
at: uid2
ifPresent: [ :tuple |
file1 position: tuple first.
obj1 _ self fromStr: (file1 next: tuple second).
index1 removeKey: uid2.
].
obj1 ifNil: [
errors nextPutAll: '(',uid2,') found in file2, but not in file1'; cr.
] ifNotNil: [
comparison _ obj1 listOfDifferencesFrom: obj2.
comparison isEmpty ifFalse: [
errors nextPutAll: comparison; cr.
].
].
].
index1 keysDo: [ :k |
errors nextPutAll: '(',k,') found in file1, but not in file2'; cr.
].
file1 close.
file2 close.
StringHolder new
contents: errors contents;
openLabel: 'error report'! !
!LDAPObject class methodsFor: 'creating sample files' stamp: 'RAA 1/30/2002 09:35'!
alternateStringFor: recNum
| ns |
ns _ recNum printString.
^'dn: uid=yoel' , ns , ', ou=people, o=emet.co.il
objectclass: top
objectclass: person
objectclass: organizationalPerson
objectclass: somethingElse
uid: yoel' , ns , '
cn: yoel' , ns , '
sn: jacobsen
'! !
!LDAPObject class methodsFor: 'creating sample files' stamp: 'RAA 1/30/2002 13:25'!
buildTestFiles: objectCount
"
LDAPObject buildTestFiles: 450*1000
LDAPObject buildTestFiles: 200*1000
LDAPObject buildTestFiles: 10*1000
"
| f |
1 to: 2 do: [ :fileNum |
f _ FileStream newFileNamed: 'ggg',fileNum printString.
f ascii.
1 to: objectCount do: [ :recNum |
fileNum = 1 ifTrue: [
f nextPutAll: (self sampleStringFor: recNum)
] ifFalse: [ "make 2 a bit different"
recNum \\ 711 = 1 ifTrue: [
"skip a few records"
] ifFalse: [
recNum \\ 1201 = 1 ifTrue: [
f nextPutAll: (self alternateStringFor: recNum)
] ifFalse: [
f nextPutAll: (self sampleStringFor: recNum)
].
].
].
].
f close.
].! !
!LDAPObject class methodsFor: 'creating sample files' stamp: 'RAA 1/30/2002 09:30'!
sampleStringFor: recNum
| ns |
ns _ recNum printString.
^'dn: uid=yoel' , ns , ', ou=people, o=emet.co.il
objectclass: top
objectclass: person
objectclass: organizationalPerson
objectclass: inetOrgPerson
uid: yoel' , ns , '
cn: yoel' , ns , '
sn: jacobsen
'! !
LDAPObject class removeSelector: #buildTestFiles!
LDAPObject class removeSelector: #parseForDistinguishedName:!
!LDAPObject class reorganize!
('class initialization' fromStr:)
('comparing files' buildIndexFor: buildIndexFrom: eachRecordFrom:do: parseForUID: test2)
('creating sample files' alternateStringFor: buildTestFiles: sampleStringFor:)
!
More information about the Squeak-dev
mailing list
|