[Challenge] large files smart compare (was: Re: Squeak for I/O and Memory Intensive tasks )

Bob Arning arning at charm.net
Wed Jan 30 19:22:56 UTC 2002


Yoel,

Here is some code that will compare two 450k-object (83 MByte) files in 22 minutes in about 22 meg of ram.

Cheers,
Bob

'From Squeak3.2gamma of 17 January 2002 [latest update: #4653] on 30 January 2002 at 2:10:22 pm'!
"Change Set:		ldapTest
Date:			30 January 2002
Author:			Bob Arning

Comparison for LDAP files

To create 2 test files,

	LDAPObject buildTestFiles: 450*1000

To compare those files,

	LDAPObject test2

Results on 400MHz mac: 450k items -- space used: 22,481,296, time: 22 minutes

"!

Object subclass: #LDAPObject
	instanceVariableNames: 'attr '
	classVariableNames: ''
	poolDictionaries: ''
	category: 'YoelTests'!

!LDAPObject commentStamp: '<historical>' prior: 0!
"Use the following lines to create an example LDIF file. Use CrLfFileStream if u got the wrong OS"

f _ FileStream fileNamed: 'ggg'.
f ascii.
1 to: 10000 do: [ :num | ns _ num asString.
f nextPutAll: ('dn: uid=yoel' , ns , ', ou=people, o=emet.co.il
objectclass: top
objectclass: person
objectclass: organizationalPerson
objectclass: inetOrgPerson
uid: yoel' , ns , '
cn: yoel' , ns , '
sn: jacobsen

')].
f close.


f _ FileStream fileNamed: 'ggg'.
f ascii.
lines _ f contentsOfEntireFile.
f close.
crcr _ String cr , String cr.
blocks _ lines findBetweenSubStrs: { crcr  }.
objs _ blocks collect: [ :bl | LDAPObject fromStr: bl ].
!

LDAPObject class
	instanceVariableNames: 'attributes '!

!LDAPObject methodsFor: 'accessing' stamp: 'YJ 1/28/2002 22:47'!
attrs
	
	^ attr! !

!LDAPObject methodsFor: 'accessing' stamp: 'YJ 1/28/2002 22:46'!
attrs: aDict

	attr _ aDict.
	^ self! !

!LDAPObject methodsFor: 'accessing' stamp: 'RAA 1/30/2002 10:12'!
getUID

	^(attr at: 'uid') first! !

!LDAPObject methodsFor: 'accessing' stamp: 'RAA 1/30/2002 10:13'!
listOfDifferencesFrom: another

	| otherAttrs answer lineBegin |

	otherAttrs _ another attrs.
	answer _ WriteStream on: String new.
	lineBegin _ '(',self getUID,') '.
	attr keysAndValuesDo: [ :k1 :v1 |
		(otherAttrs includesKey: k1) ifTrue: [
			(otherAttrs at: k1) = v1 ifFalse: [
				answer nextPutAll: lineBegin,'attribute ',k1,' is different in the two items'; cr.
			].
		] ifFalse: [
			answer nextPutAll: lineBegin,'attribute ',k1,' missing from second item'; cr.
		].
	].
	otherAttrs keysDo: [ :k1 |
		(attr includesKey: k1) ifFalse: [
			answer nextPutAll: lineBegin,'attribute ',k1,' missing from first item'; cr.
		].
	].
	^answer contents! !


!LDAPObject class methodsFor: 'class initialization' stamp: 'RAA 1/30/2002 10:04'!
fromStr: str 
	| obj dict pairsKey pairsVal point |

	obj _ self new.
	dict _ Dictionary new.
	str linesDo: [ :line |
		point _ line findString: ': '.
		point > 0 ifTrue: [
			pairsKey _ line copyFrom: 1 to: (point - 1).
			pairsVal _ line copyFrom: (point + 2) to: (line size).
			(dict at: pairsKey ifAbsentPut: [OrderedCollection new]) add: pairsVal.
		].
	].
	^obj attrs: dict! !

!LDAPObject class methodsFor: 'comparing files' stamp: 'RAA 1/30/2002 10:14'!
buildIndexFor: aFileName
"
LDAPObject buildIndexFor: 'ggg'
"
	| f answer |

	f _ FileStream readOnlyFileNamed: aFileName.
	f ascii.
	[answer _ self buildIndexFrom: f] ensure: [f close].
	^answer
! !

!LDAPObject class methodsFor: 'comparing files' stamp: 'RAA 1/30/2002 13:10'!
buildIndexFrom: aStream

	| answer key interim |

	answer _ Dictionary new.
	interim _ OrderedCollection new.
	self 
		eachRecordFrom: aStream 
		do: [ :str :start |
			key _ self parseForUID: str.

			"try to keep answer in old space as much as possible"
			interim add: (Association key: key value: {start. str size}).
			interim size > 5000 ifTrue: [
				interim do: [ :each | answer add: each].
				interim _ OrderedCollection new.
			].
		].
	interim do: [ :each | answer add: each].
	^answer! !

!LDAPObject class methodsFor: 'comparing files' stamp: 'RAA 1/30/2002 13:17'!
eachRecordFrom: aStream do: aBlock

	| cr current ch sawCR gotOneItem temp start fillBuffer buffStart buffer count t1 t2 |

	t1 _ Time millisecondClockValue.
	count _ 0.
	cr _ Character cr.
	sawCR _ false.
	current _ WriteStream on: (String new: 1000).
	start _ aStream position.
	fillBuffer _ [
		buffStart _ aStream position.
		buffer _ ReadStream on: (aStream next: 10000).
	].
	gotOneItem _ [
		temp _ current contents.
		temp isEmpty ifFalse: [aBlock value: temp value: start].
		start _ buffer position + buffStart.
		current reset.
		(count _ count + 1) \\ 10000 = 0 ifTrue: [
			t2 _ Time millisecondClockValue.
			Transcript show: count printString,' - ',(t2 - t1) printString; cr.
			t1 _ t2.
		].
	].
	fillBuffer value.
	[
		ch _ buffer next ifNil: [
			fillBuffer value.
			ch _ buffer next ifNil: [^gotOneItem value].
		].
		ch == cr ifTrue: [
			sawCR ifTrue: [
				gotOneItem value.
			] ifFalse: [
				sawCR _ true.
				current nextPut: ch.
			].
		] ifFalse: [
			sawCR _ false.
			current nextPut: ch.
		].
		true
	] whileTrue.
	gotOneItem value.! !

!LDAPObject class methodsFor: 'comparing files' stamp: 'RAA 1/30/2002 12:52'!
parseForUID: aString

	| point pairsKey |

	aString linesDo: [ :line |
		point _ line findString: ': '.
		point > 0 ifTrue: [
			pairsKey _ line copyFrom: 1 to: (point - 1).
			pairsKey = 'uid' ifTrue: [
				^line copyFrom: (point + 2) to: (line size)
			].
		].
	].
! !

!LDAPObject class methodsFor: 'comparing files' stamp: 'RAA 1/30/2002 13:25'!
test2
"
LDAPObject test2
MessageTally spyOn: [LDAPObject test2]
MessageTally spyOn: [LDAPObject buildIndexFor: 'ggg1']
"
	| file1 file2 obj2 index1 errors uid2 obj1 comparison space1 space2 |

	space1 _ Smalltalk garbageCollect.
	index1 _ self buildIndexFor: 'ggg1'.
	space2 _ Smalltalk garbageCollect.	"ensure index in old space"
	file1 _ FileStream readOnlyFileNamed: 'ggg1'.
	file2 _ FileStream readOnlyFileNamed: 'ggg2'.
	errors _ WriteStream on: String new.
	errors nextPutAll: index1 size printString,' items, space used: ',
				(space1 - space2) asStringWithCommas; cr; cr.
	self 
		eachRecordFrom: file2 
		do: [ :str2 :start |
			obj2 _ self fromStr: str2.
			uid2 _ obj2 getUID.
			obj1 _ nil.
			index1 
				at: uid2 
				ifPresent: [ :tuple |
					file1 position: tuple first.
					obj1 _ self fromStr: (file1 next: tuple second).
					index1 removeKey: uid2.
				].
			obj1 ifNil: [
				errors nextPutAll: '(',uid2,') found in file2, but not in file1'; cr.
			] ifNotNil: [
				comparison _ obj1 listOfDifferencesFrom: obj2.
				comparison isEmpty ifFalse: [
					errors nextPutAll: comparison; cr.
				].
			].
		].
	index1 keysDo: [ :k |
		errors nextPutAll: '(',k,') found in file1, but not in file2'; cr.
	].
	file1 close.
	file2 close.

	StringHolder new
		contents: errors contents;
		openLabel: 'error report'! !

!LDAPObject class methodsFor: 'creating sample files' stamp: 'RAA 1/30/2002 09:35'!
alternateStringFor: recNum

	| ns |
	ns _ recNum printString.
	^'dn: uid=yoel' , ns , ', ou=people, o=emet.co.il
objectclass: top
objectclass: person
objectclass: organizationalPerson
objectclass: somethingElse
uid: yoel' , ns , '
cn: yoel' , ns , '
sn: jacobsen

'! !

!LDAPObject class methodsFor: 'creating sample files' stamp: 'RAA 1/30/2002 13:25'!
buildTestFiles: objectCount
"
LDAPObject buildTestFiles: 450*1000
LDAPObject buildTestFiles: 200*1000
LDAPObject buildTestFiles: 10*1000
"
	| f  |

	1 to: 2 do: [ :fileNum |
		f _ FileStream newFileNamed: 'ggg',fileNum printString.
		f ascii.
		1 to: objectCount do: [ :recNum | 
			fileNum = 1 ifTrue: [
				f nextPutAll: (self sampleStringFor: recNum)
			] ifFalse: [	"make 2 a bit different"
				recNum \\ 711 = 1 ifTrue: [
					"skip a few records"
				] ifFalse: [
					recNum \\ 1201 = 1 ifTrue: [
						f nextPutAll: (self alternateStringFor: recNum)
					] ifFalse: [
						f nextPutAll: (self sampleStringFor: recNum)
					].
				].
			].
		].
		f close.
	].! !

!LDAPObject class methodsFor: 'creating sample files' stamp: 'RAA 1/30/2002 09:30'!
sampleStringFor: recNum

	| ns |
	ns _ recNum printString.
	^'dn: uid=yoel' , ns , ', ou=people, o=emet.co.il
objectclass: top
objectclass: person
objectclass: organizationalPerson
objectclass: inetOrgPerson
uid: yoel' , ns , '
cn: yoel' , ns , '
sn: jacobsen

'! !

LDAPObject class removeSelector: #buildTestFiles!
LDAPObject class removeSelector: #parseForDistinguishedName:!

!LDAPObject class reorganize!
('class initialization' fromStr:)
('comparing files' buildIndexFor: buildIndexFrom: eachRecordFrom:do: parseForUID: test2)
('creating sample files' alternateStringFor: buildTestFiles: sampleStringFor:)
!




More information about the Squeak-dev mailing list