faster Celeste IndexFile's

Lex Spoon lex at cc.gatech.edu
Fri Feb 12 04:16:31 UTC 1999


I've noticed Celeste gets quite slow reading and writing its index file when you have a large  mail database; further, it does this moderately frequently: at startup, shutdown, whenever you POP, and whenever you send out your mail.

Here is a simple changeset that makes this much faster (I'm seeing loads and saves in less than half the time than they used to take).  The main thing it does is add an instance variable to IndexFileEntry to cache the on-disk version of an IndexFileEntry; thus saves and loads can usually just read/write this cache, instead of actually having to encode or decode a bunch of integers.

This does require a slightly new on-disk format for the index file.  Loading an old-style index file will still work, but it won't be any faster than it was before.  The first time you save, however, the index file will be saved in the new format; subsequent loads and saves will get the speed boost from this changeset.

If you ever want to change back to using the old ASCII formatted messages, simply remove the first line from IndexFile>>writeOn: and save your index file again.


Cheers!

Lex


PS -- this would be a great application for a simple on-disk database that can handle deletions.  I just don't know enough about databases to set such a thing up....

PPS -- I think the final return value in PositionableStream>>match: is wrong.  Apparently no code in the image that used it actually checked the return value, so it didn't matter before.  This changeset fixes it.



'From Squeak 2.3 of January 14, 1999 on 11 February 1999 at 10:51:01 pm'!
"Change Set:		FasterCeleste
Date:			11 February 1999
Author:			Lex Spoon

Modifies the loading and saving of IndexFile's to be much faster, by not completely parsing IndexFileEntries until they are accessed.  Furthermore, the sorted list of the entire index file is no longer kept, so that entries indeed don't have to be accessed until the user visits a message category referencing that entry.  The sorting is now done in filteredMessagesIn: in Celeste.

Additionally, there are changes two things outside of Celeste:

	1) (fix) PositionableStream>>match: always returns true now if the match succeeds
	2) (speedup)  RWBinaryOrTextStream>>next: uses replaceFrom:... instead of a to:do: loop

"!

Object subclass: #IndexFileEntry
	instanceVariableNames: 'messageFile msgID location textLength time from to cc subject tocLineCache serializedCache '
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Interface-Mail Reader'!

!Celeste methodsFor: 'filtering' stamp: 'ls 2/6/1999 08:06'!
filteredMessagesIn: categoryName

	| msgList |
	msgList _ (mailDB messagesIn: categoryName) collect: [ :id | mailDB getTOCentry: id ].

	(customFilterBlock notNil) ifTrue:
		[msgList _ msgList select:
			[: msg | customFilterBlock value: msg ]].
	(fromFilter size > 0) ifTrue:
		[msgList _ msgList select:
			[: msg | msg from includesSubstring: fromFilter caseSensitive: false]].
	(subjectFilter size > 0) ifTrue:
		[msgList _ msgList select:
			[: msg | msg subject includesSubstring: subjectFilter caseSensitive: false]].
	
	msgList _ msgList asSortedCollection: [ :msg1 :msg2 | msg1 time < msg2 time ].

	^msgList asArray collect: [ :msg | msg msgID ]! !


!IndexFile methodsFor: 'file operations' stamp: 'ls 2/5/1999 15:07'!
openOn: aFileName messageFile: messageFile
	"Initialize myself from the file with the given name."

	| fileStream |
	filename _ aFileName.
	fileStream _ (FileStream fileNamed: aFileName) binary.
	self readFrom: (RWBinaryOrTextStream with: fileStream contentsOfEntireFile) reset messageFile: messageFile.

		"close and release the file stream"
	fileStream _ nil.! !

!IndexFile methodsFor: 'dictionary access' stamp: 'ls 2/6/1999 07:44'!
at: msgID put: anIndexFileEntry
	"Associate the given IndexFileEntry with the given message ID."

	"timeSortedEntries removeAllSuchThat:
		[: assoc | assoc key = msgID].	"  "don't duplicate the entry!!"
	msgDictionary at: msgID put: anIndexFileEntry.
	"timeSortedEntries add: (Association key: msgID value: anIndexFileEntry)."! !

!IndexFile methodsFor: 'dictionary access' stamp: 'ls 2/6/1999 07:52'!
keys
	"Answer a collection of message IDs for all the messages in this IndexFile."

	^msgDictionary keys
! !

!IndexFile methodsFor: 'dictionary access' stamp: 'ls 2/6/1999 07:45'!
remove: msgID
	"Remove the entry with the given ID from my Dictionary."

	"timeSortedEntries removeAllSuchThat: [: assoc | assoc key = msgID]."
	msgDictionary removeKey: msgID ifAbsent: [].! !

!IndexFile methodsFor: 'read-write' stamp: 'ls 2/6/1999 07:46'!
readBinaryFrom: aStream messageFile: messageFile
	"Initialize myself from the given text stream. It is assumed that the .index file was written in order of ascending message timestamps, although this method is only less efficient, not incorrect, if this is not the case."

	| sorted lastTime msgID entry |

	msgDictionary _ Dictionary new: 1000.
	"timeSortedEntries _ (SortedCollection new: 1000) sortBlock:
		[: m1 : m2 | m1 value time <= m2 value time]."
	sorted _ true.
	lastTime _ nil.
	[aStream atEnd] whileFalse:
		[aStream binary.
		(aStream next = $* asInteger) ifFalse: [ ^self error: 'error in index file -- delete it and do a compact operation' ].
		msgID _ aStream nextInt32.
		entry _ IndexFileEntry readBinaryFrom: aStream messageFile: messageFile  msgID: msgID.

		 msgDictionary at: msgID put: entry.
		" timeSortedEntries addLast: (Association key: msgID value: entry).
		 ((sorted & lastTime notNil) and: [lastTime > entry time]) ifTrue:
			[sorted _ false].
		 lastTime _ entry time"].
"	sorted ifFalse: [timeSortedEntries reSort]." 	"re-sort if necessary" ! !

!IndexFile methodsFor: 'read-write' stamp: 'ls 2/6/1999 07:47'!
readFrom: aStream messageFile: messageFile
	"Initialize myself from the given text stream. It is assumed that the .index file was written in order of ascending message timestamps, although this method is only less efficient, not incorrect, if this is not the case."

	| msgID entry |

	(aStream ascii match: 'CELESTE/IF/1.0') ifTrue: [
		^self readBinaryFrom: aStream binary messageFile: messageFile ].

	msgDictionary _ Dictionary new: 1000.


	[aStream atEnd] whileFalse:
		[msgID _ MailDB readIntegerLineFrom: aStream.
		 entry _ IndexFileEntry readFrom: aStream messageFile: messageFile msgID: msgID.
		 msgDictionary at: msgID put: entry.
	].! !

!IndexFile methodsFor: 'read-write' stamp: 'ls 2/6/1999 07:45'!
writeBinaryOn: aStream
	"Write my index entries to the given text stream in a binary form."
	"Note: For efficiency, this is done in order of increasing message timestamps, to save the cost of sorting when we read it back in. It is assumed that timeSortedEntries should contains exactly the same message ID's as msgDictionary."

	aStream binary.

	aStream nextPutAll: 'CELESTE/IF/1.0'.



	 msgDictionary associationsDo:
		[: assoc |
			"self assert: (assoc key class == SmallInteger)."
			aStream nextPut: $* asInteger.
			aStream nextInt32Put: assoc key.
		 (assoc value) writeBinaryOn: aStream].	"index entry"! !

!IndexFile methodsFor: 'read-write' stamp: 'ls 2/6/1999 07:45'!
writeOn: aStream
	"Write my index entries to the given text stream in human-readable form."
	"Note: For efficiency, this is done in order of increasing message timestamps, to save the cost of sorting when we read it back in. It is assumed that timeSortedEntries should contains exactly the same message ID's as msgDictionary."

	true ifTrue: [ ^self writeBinaryOn: aStream ].

	msgDictionary associationsDo:
		[: assoc |
		 (assoc key) printOn: aStream.		"message ID"
		 aStream cr.
		 (assoc value) writeOn: aStream].	"index entry"! !

!IndexFile methodsFor: 'private' stamp: 'ls 2/7/1999 12:09'!
uncachedCount
	^(msgDictionary select: [ :entry | entry isUncached]) size! !


!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:32'!
cc
	self unserialize.
	^cc! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:32'!
cc: aString
	self unserialize.
	serializedCache _ nil.
	cc _ aString.! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:32'!
date
	"Answer a date string for this index entry."
	self unserialize.
	^Date fromDays: (time + (Date newDay: 1 year: 1980) asSeconds) // 86400! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:32'!
from
	self unserialize.
	^from! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:32'!
from: aString
	self unserialize.
	serializedCache _ nil.
	from _ aString.! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:32'!
location
	self unserialize.
	^location! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:32'!
location: anInteger
	self unserialize.
	serializedCache _ nil.
	location _ anInteger.! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:33'!
subject
	self unserialize.
	^subject! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:33'!
subject: aString
	self unserialize.
	serializedCache _ nil.
	subject _ aString.! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:33'!
textLength
	self unserialize.
	^textLength! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:33'!
textLength: anInteger
	self unserialize.
	serializedCache _ nil.
	textLength _ anInteger.! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:33'!
time
	self unserialize.
	
	^time! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:33'!
time: anInteger
	self unserialize.
	serializedCache _ nil.
	time _ anInteger.! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:33'!
to
	self unserialize.
	^to! !

!IndexFileEntry methodsFor: 'access' stamp: 'ls 2/6/1999 07:33'!
to: aString
	self unserialize.
	serializedCache _ nil.
	to _ aString.! !

!IndexFileEntry methodsFor: 'printing' stamp: 'ls 2/6/1999 07:29'!
computeTOCString
	"Answer a string for the table of contents."
	"IndexFileEntry allInstancesDo: [: e | e flushTOCCache]"

	| fromFieldSize s |
	fromFieldSize _ 18.
	s _ WriteStream on: (String new: 200).
	s nextPutAll: self dateString.
	[s position < 9] whileTrue: [s space].
	s nextPutAll: (self fromStringLimit: fromFieldSize).
	[s position <= (9 + fromFieldSize + 2)] whileTrue: [s space].
	s nextPutAll: self subject.
	^ s contents
! !

!IndexFileEntry methodsFor: 'printing' stamp: 'ls 2/6/1999 07:29'!
fromStringLimit: limit
	"Answer a cleaned up 'from' field for the table of contents."

	| editedFrom s ch i |
	editedFrom _ WriteStream on: (String new: limit + 1).
	s _ ReadStream on: self from.
	s skipSeparators.
	('"<' includes: s peek) ifTrue: [s next].
	((i _ from indexOf: $() > 0) ifTrue: [s position: i].
	[s atEnd] whileFalse: [
		ch _ s next.
		(('@<>)$"' includes: ch) or: [editedFrom position >= limit])
			ifTrue: [^editedFrom contents]
			ifFalse: [editedFrom nextPut: ch]].
	^editedFrom contents
! !

!IndexFileEntry methodsFor: 'printing' stamp: 'ls 2/6/1999 07:30'!
printOn: aStream

	aStream nextPutAll: self dateString; cr.
	aStream nextPutAll: self from; cr.
	aStream nextPutAll: self to; cr.
	aStream nextPutAll: self cc; cr.
	aStream nextPutAll: self subject; cr.
	aStream nextPut: $(; nextPutAll: self location printString; space.
	aStream nextPutAll: self textLength printString; nextPut: $).
	aStream cr.! !

!IndexFileEntry methodsFor: 'read-write' stamp: 'ls 2/6/1999 07:23'!
readBinaryFrom: aStream
	"Read a binary representation of myself on the given text stream."

	| len |
	len _ aStream nextInt32.
	serializedCache _ aStream next: len.
! !

!IndexFileEntry methodsFor: 'read-write' stamp: 'ls 2/6/1999 07:22'!
writeBinaryOn: aStream
	"Write a binary representation of myself on the given text stream."

	serializedCache ifNil: [
		serializedCache _ self serialized ].

	aStream nextInt32Put: serializedCache size.
	aStream nextPutAll: serializedCache
! !

!IndexFileEntry methodsFor: 'read-write' stamp: 'ls 2/6/1999 07:30'!
writeOn: aStream
	"Write a human-readable representation of myself on the given text stream."
	| tempStream |
	tempStream _ RWBinaryOrTextStream on: (ByteArray new: 100).

	tempStream
		nextPutAll: self location printString; cr;
		nextPutAll: self textLength printString; cr;
		nextPutAll: self time printString; cr;
		nextPutAll: self from; cr;
		nextPutAll: self to; cr;
		nextPutAll: self cc; cr;
		nextPutAll: self subject; cr.

	aStream nextPutAll: (tempStream contents)! !

!IndexFileEntry methodsFor: 'filtering support' stamp: 'ls 2/6/1999 07:31'!
ccHas: stringOrList
	self unserialize.
	^ self field: cc has: stringOrList! !

!IndexFileEntry methodsFor: 'filtering support' stamp: 'jm 8/14/1998 13:39'!
field: field has: stringOrList
	"Return true if either the given field contains the argument string or, if the argument is a collection, return true if the given field contains any of the strings in that collection."

	| s |
	(stringOrList isKindOf: String) ifTrue: [
		^ field includesSubstring: stringOrList caseSensitive: false
	] ifFalse: [
		1 to: stringOrList size do: [ :i |
			s _ stringOrList at: i.
			s isNumber ifTrue: [s _ s printString].
			(field includesSubstring: s caseSensitive: false) ifTrue: [^ true].
		].
		^ false
	].! !

!IndexFileEntry methodsFor: 'filtering support' stamp: 'ls 2/6/1999 07:31'!
fromHas: stringOrList
	self unserialize.
	^ self field: from has: stringOrList! !

!IndexFileEntry methodsFor: 'filtering support' stamp: 'ls 2/6/1999 07:32'!
getMessage
	"Answer the MailMessage for this index file entry."
	self unserialize.
	^MailMessage from:
		(messageFile
			getMessage: msgID
			at: location
			textLength: textLength).
! !

!IndexFileEntry methodsFor: 'filtering support' stamp: 'ls 2/6/1999 07:32'!
participantHas: stringOrList
	self unserialize.
	^ (self field: from has: stringOrList) or:
	   [(self field: self to has: stringOrList) or:
	   [self field: self cc has: stringOrList]]! !

!IndexFileEntry methodsFor: 'filtering support' stamp: 'ls 2/6/1999 07:32'!
subjectHas: stringOrList
	self unserialize.
	^ self field: subject has: stringOrList! !

!IndexFileEntry methodsFor: 'filtering support' stamp: 'ls 2/6/1999 07:32'!
textHas: stringOrList
	self unserialize.
	^ self field: self getMessage text has: stringOrList
! !

!IndexFileEntry methodsFor: 'filtering support' stamp: 'ls 2/6/1999 07:32'!
toHas: stringOrList
	self unserialize.
	^ self field: to has: stringOrList! !

!IndexFileEntry methodsFor: 'private' stamp: 'ls 2/6/1999 07:37'!
isUncached
	"only used for stat taking"
	^cc isNil not! !

!IndexFileEntry methodsFor: 'private' stamp: 'ls 2/6/1999 07:23'!
serialized
	"return a serialized, binary version of this entry"
	
	| aStream |
	aStream _ RWBinaryOrTextStream on: (ByteArray new: 100).
	aStream
		nextInt32Put: location;
		nextInt32Put: textLength;
		nextInt32Put: time;
		nextStringPut: from;
		nextStringPut: to;
		nextStringPut: cc;
		nextStringPut: subject.


	^aStream contents! !

!IndexFileEntry methodsFor: 'private' stamp: 'ls 2/6/1999 07:36'!
serializedCache
	"only used for stat taking"
	^serializedCache! !

!IndexFileEntry methodsFor: 'private' stamp: 'ls 2/11/1999 22:16'!
unserialize
	"make sure our data has been retrieved from the serialized cache.  Should be called before any normal instance variables are accessed (location, textLength, time, from, to, cc, subject)"
	| aStream |
	location ifNotNil: [ "already unserialized" ^self ].

	serializedCache ifNil: [ 
		"no serialized cache to unserialize from -- receiver is probably being initialized"
		^self ].


	aStream _ (RWBinaryOrTextStream with: serializedCache) reset.

	location _ aStream nextInt32.
	textLength _ aStream nextInt32.
	time _ aStream nextInt32.
	from _ aStream nextString.
	to _ aStream nextString.
	cc _ aStream nextString.
	subject _ aStream nextString.
! !


!IndexFileEntry class methodsFor: 'instance creation' stamp: 'ls 2/5/1999 15:15'!
readBinaryFrom: aStream messageFile: aMessageFile msgID: msgID
	"Answer a new instance of me initialized from the given text stream."

	^(self new readBinaryFrom: aStream)
		messageFile: aMessageFile;
		msgID: msgID! !

!IndexFileEntry class methodsFor: 'instance creation'!
readFrom: aStream messageFile: aMessageFile msgID: msgID
	"Answer a new instance of me initialized from the given text stream."

	^(self new readFrom: aStream)
		messageFile: aMessageFile;
		msgID: msgID! !


!MailDB methodsFor: 'categories' stamp: 'ls 2/6/1999 07:59'!
messagesIn: categoryName
	"Answer a collection of message ID's for the messages in the given category, sorted in ascending time order.  If the category does not exist, answer an empty collection. The pseudo-categories '.all.' and '.unclassified.' are computed dynamically, which may take a little time."

	| msgList category |
	(categoryName = '.unclassified.') ifTrue:
		[Cursor execute showWhile:
			[msgList _ categoriesFile unclassifiedFrom: indexFile keys].
		 ^msgList].
	(categoryName = '.all.') ifTrue:
		[^indexFile keys].
	"otherwise, it is a real category"
	category _ categoriesFile messagesIn: categoryName.
	^category asOrderedCollection! !


!PositionableStream methodsFor: 'positioning' stamp: 'ls 2/5/1999 15:14'!
match: subCollection
	"Set the access position of the receiver to be past the next occurrence of the subCollection. Answer whether subCollection is found.  No wildcards, and case does matter."

	| pattern startMatch |
	pattern _ ReadStream on: subCollection.
	startMatch _ nil.
	[pattern atEnd] whileFalse: 
		[self atEnd ifTrue: [^false].
		(self next) = (pattern next) 
			ifTrue: [pattern position = 1 ifTrue: [startMatch _ self position]]
			ifFalse: [pattern position: 0.
					startMatch ifNotNil: [
						self position: startMatch + 1.
						startMatch _ nil]]].

	^true! !


!RWBinaryOrTextStream methodsFor: 'all' stamp: 'ls 2/7/1999 21:41'!
next: anInteger 
	"Answer the next anInteger elements of my collection. Must override to get class right."
	| endPosition newArray |
	endPosition _ position + anInteger min: readLimit.

	newArray _ (isBinary ifTrue: [ByteArray] ifFalse: [String]) new: anInteger.
	newArray replaceFrom: 1 to: anInteger with: collection startingAt: position+1.

	position _ endPosition.
	^newArray
! !


MailDBFile subclass: #IndexFile
	instanceVariableNames: 'msgDictionary '
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Interface-Mail Reader'!





More information about the Squeak-dev mailing list