[alpha-6][GOODIE] Full text indexing engine (fwd)

Scott A Crosby crosby at qwes.math.cmu.edu
Sat Apr 13 12:15:48 UTC 2002


A slightly newer version. I've filed off a small leftover wart from
early development.

Scott


---------- Forwarded message ----------
Date: Mon, 18 Feb 2002 02:24:36 -0500 (EST)
From: Scott A Crosby <crosby at qwes.math.cmu.edu>
Reply-To: squeak-dev at lists.squeakfoundation.org
To: Squeak List <squeak-dev at lists.squeakfoundation.org>
Subject: [alpha-5][GOODIE] Full text indexing engine

Full text indexing code:

   FullText-Engines        *required*
   FullText-DemoAdaptors   *suggested*
   FullText-MethodAdaptors   *demo*

The engines are required. The demoadaptors are new code and illustrate how
to use the full text engines. The methodadaptors code is demo and probably
the wrong way to hook into the system. (Nothing else came up in a search;
we *need* full text search on comments.)

-----
For quick-start, the required changesets other than these are:
   /tmp/final/MiscEnhancements.2.cs
   /tmp/final/SkipListRefactor.4.cs
   /tmp/final/String-Enhancements.1.cs

Which work if filed in in that order, then filing in these changesets.

-----

Here is the next version of the full indexing code. I'd like to think
Bijan for helping to make this release more featureful and helping me
explain it better.

I'll call this my alpha-5 version. The stuff works and has been tested. I
consider the engines mostly feature complete and stable. Bijan has been
building indices with this engine, so I would consider it usable for
integrating it into the rest of squeak. A lot of adaptors and
integration work still remain to be done.

This version is entirely self-contained, and does not update the test
versions formerly released.

I create/plan to create 3 new packages:
  Fulltext-Engine
  Fulltext-Adaptors
  Fulltext-Utilities

The two *Adaptor changesets are optional, but exist for demo purposes. The
MethodAdaptor changeset should be obsoleted by someone else who knows that
side of squeak better. MiscEnhancements must be filed in before
DemoAdaptors. (See that email for details.)

-----------------------------------------------------------------------
For demo, this code can do a full text index of an entire image of source
code in 12 minutes, creating a 10mb index. That index may then be
condensed to 6.5mb. It can also create a 3mb index in 12 seconds that
allows doing an implementorsOf searches for any selector in a few ms.

I do full text searches of all methods in seconds or less:

MessageTally spyOn: [di anyOf: #(isvalid #morph #true #self)]  "3000   ms"
MessageTally spyOn: [di anyOf: #(isvalid #morph #true)]         "251   ms"
MessageTally spyOn: [di anyOf: #(isvalid #morph)]                "36   ms"
MessageTally spyOn: [di anyOf: #(isvalid)]                       "10   ms"

I can also do combined searches, finding, say, implementors of #foo that
also contain the text "incomplete", also in under a second.

All in all, unless you have result sets containing tens of thousands of
matches, it'll be hard to find a search that takes even a second.
----------------------------------------------------------------------
Indexing is fairly fast. About 30k/sec on a stock VM, 90k/sec on my VM.

The space utilization is not unreasonable..

It will index 13mb of method text (40k methods), to build an index of 800k
entries in 10mb.  But, that index may be condensed to 6.5mb.

Condensed indexes use arrays instead of sets. Searching condensed indexes
is just as fast, but the arrays must be converted to Set's before updates
can be made. The current code does this automatically..  Recondensing must
be done manually.

There are no performance problems in the current code. When indexing all
methods, about 40% of the CPU time is spend getting the method source from
the source file.
--------------------------------------------------------------------

There are two engines, ExactDocumentIndex and OrderedDocumentIndex

Both index Documents based on the terms in them. An adaptor is used to
adapt a Documents by converting them into terms. (ex. extracting
substrings) To *any* adapt a class for indexing, just write an adaptor for
it.

Both engines allow #anyof: searches, substring searches, and dynamic
adding and removing of documents, Both also offer a termSubstring (find
all terms that have the given substring.)

The OrderedDocumentIndex offers everything ExactDocumentIndex offers (at
about half the performance), but it also allows additional queries like:

   di termPrefixOf: #isVali
   di prefixOf: #isVali
   di termRangeFrom: #is to: #is:before:
   di rangeFrom: #is to: #is:before:

Which are queries for matching all terms in the range, or, all terms that
have the requested prefix, or documents containing terms in the range or
all documents containing a term with the prefix.

I also allow the items to be stored in either Set or IdentitySet. WeakSets
should work (in theory, but untested).  IndentitySet has subtle semantics
that may cause problems, so may not necessarily be a good choice.

Sometimes, you may not want to index a document directly, you may instead
with to index a key that references a document. The add:withKey: and
remove:withKey: exists for this purpose. Then, if you query the index, it
will return those keys intead of documents.  The normal #add/#remove
methods index a document using the document itself as the key.

------------------------------------------------------------------
Although the indexing engines are fast, they may expose latent performance
problems Foo>>hash, and Foo>>=. This is because they build
hundreds of collections potentially containing thousands of elements...
Not something squeak typically faces.

In some cases, my #identityHash patches *may* help.

-----------------------------------------------------------
For the implements-X method index..

Rather than build another way to reference methods, I enhanced an existing
class. I have made minor modifications to MethodReference to suit me, the
modification to '=' disturbs me slightly, but can be easily removed.

Note that these MethodReferences are not built unique. To build a full set
of MethodReferences for 40k the methods in the system is about 1.3mb. To
reduce bloat, these should be shared between several indices.

----------------------------------------------------
For the future:

We need to integrate this into the other parts of squeak.

Basic use of the engines is trivial. Just write your adaptor and #add or
#remove documents.

See the demo text near the end, and the demo string adaptor class.
------------------------------------------------------------------
The size info of a fulltext and implementors index:

Full text index size is: 10925236
 828372 for the strings for the 51603 terms.
1123304 spent on MethodRefs (which may be shared)
5112484 on the indexing core. (800k refs, 40k documents,50k terms)
4208332 of slack space in the 51603 sets. (removed when condensing)

The index of implementors-of size is: 2741924
 701568 for the indexing core (40k refs, 40k documents,25k terms)
1123304 spent on MethodRefs (which may be shared)
 917052 on slack space in the sets. (removed when condensing)

------------------------ Sample code --------------------------
Smalltalk vmParameterAt:5 put: 10000000

di _ ExactDocumentIndex withAdaptor: (DemoStringAdaptor new) withSet: Set.

di add: 'hello, I am Scott Crosby'.
di add: 'Scott crosby is my name'.
di add: 'Scott is my name'.
di add: 'What is your name?'.
di add: '123456'. di dbase.

di anyOf: #(scott) asSet.

di remove: 'hello, I am Scott Crosby'.

di fastAllOf: #(Scott Crosby)
di allOf: #(Scott Crosby)
di anyOf: #(Scott Crosby)

di contains: 'scott'
di containsNoCopy: 'scott'

di condense.

"---"


" Which search engine and which adaptor do you want to use? "
" Certain queries are or are not possible for all engines. I also have
  two sample adaptors. Thus, any of these lines construct a valid index.
  Choose one and DoIt"

di _ ExactDocumentIndex withAdaptor: (ImplementorsMethodAdaptor new)
withSet: Set.
di _ ExactDocumentIndex withAdaptor: (FulltextMethodAdaptor new) withSet:
Set.
di _ OrderedDocumentIndex withAdaptor: (ImplementorsMethodAdaptor new)
withSet: Set.
di _ OrderedDocumentIndex withAdaptor: (FulltextMethodAdaptor new)
withSet: Set.

"Populate the index"
MessageTally spyOn: [Smalltalk allClassesDo: [ :aClass |
   aClass selectorsDo: [ :method |
       di add: (MethodReference new setStandardClass: aClass methodSymbol:
method)].
   aClass class selectorsDo: [ :method |
       di add: (MethodReference new setStandardClass: aClass class
methodSymbol: method)]]].

 "Queries: On either index type"
(di termSubStringOf: 'Valid') size.
(di subStringOf: 'Valid') size.
di anyOf: (Array with: 'isValid' asSymbol).
di anyOf: #(foo)


 "Queries: Using the ordered index."
di termPrefixOf: #isValid
di prefixOf: #isValid
di termRangeFrom: #is to: #is:before:
di rangeFrom: #is to: #is:before:










-------------- next part --------------
'From Squeak3.2alpha of 11 October 2001 [latest update: #4646] on 13 April 2002 at 8:03:34 am'!
"Change Set:		FullText-DemoAdaptors
Date:			24 January 2002
Author:			

<your descriptive text goes here>"!

Smalltalk renameClassNamed: #SimpleIndexAdaptor as: #DemoStringAdaptor!
IndexAdaptor subclass: #DemoStringAdaptor
	instanceVariableNames: ''
	classVariableNames: 'CSAlphabetic CSNonAlphabetic Ignore '
	poolDictionaries: ''
	category: 'FullText-Adaptors'!
Smalltalk renameClassNamed: #SimpleTextIndexAdaptor as: #DemoTextAdaptor!
DemoStringAdaptor subclass: #DemoTextAdaptor
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'FullText-Adaptors'!

!DemoStringAdaptor methodsFor: 'as yet unclassified' stamp: 'sac 1/22/2002 18:58'!
canonicalize: aString
	"Given a string or a term, canonicalize it for searching."
	^aString asLowercase
! !

!DemoStringAdaptor methodsFor: 'as yet unclassified' stamp: 'sac 1/23/2002 17:18'!
terms: aString
	"Given a document, give a list of the search terms occuring in it."
	| words |

	words _ aString substrings: CSAlphabetic complement: CSNonAlphabetic.
	words _ words select: [ :word | (word size > 3) & (Ignore includes: word) not].
	^words
! !


!DemoStringAdaptor class methodsFor: 'as yet unclassified' stamp: 'sac 1/23/2002 18:22'!
initialize
	super initialize.
	CSAlphabetic _ CharacterSet allAlphabetic.
	CSNonAlphabetic _ CSAlphabetic complement.
	Ignore _ #(iftrue iffalse self value do aBlock) asSet.! !


!DemoTextAdaptor methodsFor: 'as yet unclassified' stamp: 'sac 1/22/2002 19:32'!
canonicalize: aString
	"Given a string or a term, canonicalize it for searching."
	^aString asLowercase
! !

!DemoTextAdaptor methodsFor: 'as yet unclassified' stamp: 'sac 1/22/2002 19:33'!
terms: aText
	"Given a document, give a list of the search terms occuring in it."
	| |
	^super terms: aText asString.! !

DemoStringAdaptor initialize!
DemoStringAdaptor removeSelector: #condense:!
-------------- next part --------------
'From Squeak3.2alpha of 11 October 2001 [latest update: #4646] on 13 April 2002 at 8:03:38 am'!
Object subclass: #DocumentIndex
	instanceVariableNames: 'setClass indexAdaptor documentCount '
	classVariableNames: ''
	poolDictionaries: ''
	category: 'FullText-Engines'!

!DocumentIndex commentStamp: 'sac 1/24/2002 18:41' prior: 0!
The base class for document indexes. This supports both allocation of new sets, and expensive search algorithms like substring matching.

	setClass is the class that is to be used when constructing results. Weak Sets have not been tested, but should also work. I support adding and removing documents, substring searching, and any-of matching.

This is an abstract class.!

DocumentIndex subclass: #ExactDocumentIndex
	instanceVariableNames: 'dbase '
	classVariableNames: ''
	poolDictionaries: ''
	category: 'FullText-Engines'!

!ExactDocumentIndex commentStamp: '<historical>' prior: 0!
Implement a document index based on hash tables. 

Thus, this document index supports exact queries and expensive queries.!

Object subclass: #IndexAdaptor
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'FullText-Adaptors'!

!IndexAdaptor commentStamp: 'sac 1/23/2002 19:45' prior: 0!
An adaptor is something that adapts a document for indexing. A document may be any entitity, a String, a Text, a descriptor of a network or filesystem entity.

Given a document, the adaptor returns search terms that occur in the document. It also canonicalizes search terms.

Thus, to make any document indexable, just write an adaptor for it. 
!

DocumentIndex subclass: #OrderedDocumentIndex
	instanceVariableNames: 'dbase '
	classVariableNames: ''
	poolDictionaries: ''
	category: 'FullText-Engines'!

!OrderedDocumentIndex commentStamp: '<historical>' prior: 0!
Store search terms in skiplist. 

This supports ranged queries and prefix queries.!


!Collection methodsFor: 'enumerating' stamp: 'sac 1/28/2002 18:30'!
intersectAllInto: aClass
	"Intersect a collection of collections. For speed, the result collection should be a Set or IdentitySet."
	| sum bySize |
	self size = 0 ifTrue: [^aClass new].
	sum _ nil.
	bySize _ SkipList sortBlock: [:c1 :c2 | c1 size < c2 size].
	bySize addAll: self.
	bySize do: [ :each | 
				 sum ifNil: [sum _ each as: aClass]
                          ifNotNil: [sum _ sum intersection: each asSet]].
	^sum.
! !

!Collection methodsFor: 'enumerating' stamp: 'sac 1/28/2002 18:29'!
unionAllInto: aClass
	"Intersect a collection of collections. For speed, the result collection should be a Set or IdentitySet."
	| sum |
	self size = 0 ifTrue: [^aClass new].
	sum _ nil.
	self do: [ :each | 
				 sum ifNil: [sum _ aClass new addAll: each]
                          ifNotNil: [sum _ sum union: each]].
	^sum.
! !


!DocumentIndex methodsFor: 'updating' stamp: 'sac 1/25/2002 10:55'!
add: aDocument
	"Add a document into the database."
	^self add: aDocument withKey: aDocument! !

!DocumentIndex methodsFor: 'updating' stamp: 'sac 4/13/2002 07:58'!
condense
	"Condensing an index can make the index smaller (about 10-40%), But it disables all 
      future updates to the index. 

      Once condensed, you can neither add nor remove a document from the index. 
	 Furthermore, after being condensed, any search is allowed to return a non-Set Collection."

	"The operation it does, is it replaces each result set with an Array, and condenses the documents that are put into the array."
	self dbase associationsDo: [ :each | 
	      each value: each value asArray.].
	Smalltalk tenure "Insure the root table doesn't contain the dictionary."! !

!DocumentIndex methodsFor: 'updating' stamp: 'sac 1/25/2002 10:56'!
remove: aDocument
	"Remove a document into the database. Note that the document must be unchanged from when it was added in, to insure that it is correctly removed."
	^self remove: aDocument withKey: aDocument
! !

!DocumentIndex methodsFor: 'updating' stamp: 'sac 1/25/2002 19:49'!
removeStopTerm: aWord
	"Remove a particular term from the database entirely. For example, removing the most frequent terms from a database as a stoplist check."
	| canonword |
	canonword _ indexAdaptor canonicalize: aWord.
	self dbase at: canonword put: setClass new.! !

!DocumentIndex methodsFor: 'updating' stamp: 'sac 1/28/2002 18:41'!
uncondense
	"Uncondensing a condensed index allows changes to be made, but makes the condensed index bigger. Note that if the adaptor does anything special during the condense, that operation will not be undone. For an index that has been modified heavily over a long period of time, uncondensing may restore some space. uncondensing rebuilds each set in the index."

	self dbase associationsDo: [ :each | |temp|
		  temp _ self setClass new. 
		  temp addAll: each value.
	       each value: temp].
	Smalltalk tenure. "Ensure the root table doesn't contain the dictionary."! !

!DocumentIndex methodsFor: 'searching' stamp: 'sac 1/25/2002 12:47'!
allOf: aSet
	"Return the set of documents that matches all of the words in the given set."
	| sum |
	aSet size = 0 ifTrue: [^setClass new].
	sum _ nil.
	aSet do: [ :each | 
				 sum ifNil: [sum _ setClass new addAll: (self containsNoCopy: each)]
                          ifNotNil: [sum _ sum intersection: (self containsNoCopy: each)]].
	^sum.! !

!DocumentIndex methodsFor: 'searching' stamp: 'sac 1/25/2002 13:20'!
anyOf: aSet
	"Return the set of documents that matches any of the words in the given set."
	| sum |
	sum _ setClass new.
	"If the set is empty, return an empty set."
	"aSet size = 0 ifTrue: [^sum]."
	"If the set has one thing, return the set straight from the dbase without copying."
	"aSet size = 1 ifTrue: [^self dbase at: (indexAdaptor canonicalize: aSet anyOne) ifAbsent: [sum]]. DISABLED!!!! I wish to allow clients to destructively mutate the resultant set."

	"Otherwise, merge them together."	
	aSet do: [ :item |  
                  sum union: (self containsNoCopy: item)
           ].
	^sum
! !

!DocumentIndex methodsFor: 'searching' stamp: 'sac 1/24/2002 17:22'!
contains: aTerm
	"Find all documents that contain aTerm."
	^(self dbase at: (indexAdaptor canonicalize: aTerm) ifAbsent: [^self setClass new]) copy.! !

!DocumentIndex methodsFor: 'searching' stamp: 'sac 1/24/2002 17:21'!
containsNoCopy: aTerm
	"Find all documents that contain aTerm. The resultant set is a clone of the internal one. Thus, it MUST NOT BE MUTATED. The returned value should not be fed into any union: or intersect: or other set-operation."
	^self dbase at: (indexAdaptor canonicalize: aTerm) ifAbsent: [^self setClass new]! !

!DocumentIndex methodsFor: 'searching' stamp: 'sac 1/28/2002 18:31'!
fastAllOf: aSet
	"Return the set of documents that matches all of the words in the given set. This version reorders the intersections for speed, but saves intermediate results."
	| sum |
	sum _ aSet collect: [:each | self containsNoCopy: each ].
	^sum intersectAllInto: self setClass.
! !

!DocumentIndex methodsFor: 'searching' stamp: 'sac 1/24/2002 17:23'!
subStringOf: aString
	"Return all documents containing a substring of the requested string."

	| sum |

	sum _ self setClass new.
	self dbase associationsDo: [ :each | 
            (each key findString: (indexAdaptor canonicalize: aString)) = 0 
               ifFalse: [ sum union: each value] ].
	^sum 
		
		! !

!DocumentIndex methodsFor: 'searching' stamp: 'sac 1/24/2002 17:16'!
termSubStringOf: aString
	"Return all terms containing a substring of the requested string."

	| sum |

	sum _ Set new.
	self dbase associationsDo: [ :each | 
            (each key findString: (indexAdaptor canonicalize: aString)) = 0 
               ifFalse: [ sum add: each key] ].
	^sum 
		
		! !

!DocumentIndex methodsFor: 'private' stamp: 'sac 1/23/2002 15:10'!
dbase
	"Return the current database. It should be a collection that can be scanned/enumerated."
	self subclassResponsibility.! !

!DocumentIndex methodsFor: 'private' stamp: 'sac 1/25/2002 19:46'!
getSetFor: aWord
	"Get and return the set associated with a porticular (uncanonicalized) word. Automatically uncondense it if necessary."
	| canonword set |
	canonword _ indexAdaptor canonicalize: aWord.
	set _ self dbase at: canonword ifAbsent: [
               |x|   x _ setClass new. self dbase at: canonword put: x. ^x].
	
	set class == Array ifTrue: [set _ setClass new. 
                                    set addAll: set. 
                                    self dbase at: canonword put: set].
	^set
! !

!DocumentIndex methodsFor: 'private' stamp: 'sac 1/25/2002 01:15'!
initialize: anIndexAdaptor withSet: aSetClass
	"Initialize both the adaptor, and the type of Set that will be used internally."
	indexAdaptor _ anIndexAdaptor.
	setClass _ aSetClass.
	"To make sure that the class has been reinitialized."
	anIndexAdaptor class initialize.
	documentCount _ 0.
	! !

!DocumentIndex methodsFor: 'private' stamp: 'sac 1/23/2002 16:12'!
setClass
	"Return the class of sets that this index uses."
	^setClass! !

!DocumentIndex methodsFor: 'feedback' stamp: 'sac 1/24/2002 18:37'!
allTerms
	"Return a set of all terms in the dbase"
	^self dbase keys.! !

!DocumentIndex methodsFor: 'feedback' stamp: 'sac 1/25/2002 01:16'!
documentCount
	"Return a count of the number of documents indexed. Note that it is a little naive, but you shouldn't be adding documents more than once."
	^documentCount! !

!DocumentIndex methodsFor: 'feedback' stamp: 'sac 1/25/2002 00:58'!
termsByFrequency
	"Return a SortedCollection of all terms, sorted by their frequency of occurance. Output is an ordered collection of associations of term and count."

	| sorted |
	sorted _ SortedCollection sortBlock: [:x :y | x value > y value].
	self dbase keysAndValuesDo: [ :key :val | 
             sorted add: (Association key: key value: val size)].
	^sorted

! !

!DocumentIndex methodsFor: 'advanced updating' stamp: 'sac 1/25/2002 19:35'!
add: aDocument withKey: aKey
	"Add a document into the database under the given key. This is to make things that are stored by reference easier. When you search, you get the keys back. This method is to allow some pre-parsing of aKey when making the document. For simple indexing, use #add:"
	| words |
	words _ indexAdaptor terms: aDocument.
	words do: [ :word |
				|canonword set|
					"canonword _ (indexAdaptor canonicalize: word).
					set _ self dbase at: canonword ifAbsent: [setClass new]."
					set _ self getSetFor: word.
				     set add: aKey.
					"self dbase at: canonword put: set"].
	documentCount _ documentCount + 1.! !

!DocumentIndex methodsFor: 'advanced updating' stamp: 'sac 1/25/2002 19:29'!
remove: aDocument withKey: aKey
	"Remove a document into the database, where the document is indexed under the given key. Note that the document and key must be unchanged from when it was added in, to insure that it is correctly removed."
	| words |
	words _ indexAdaptor terms: aDocument.
	words do: [ :word |
				|set|
				"set _ self dbase at: (indexAdaptor canonicalize: word) ifAbsent: [setClass new]."
				 set _ self getSetFor: word.
                   set remove: aKey ifAbsent: []].
	documentCount _ documentCount - 1.
! !


!ExactDocumentIndex methodsFor: 'private' stamp: 'sac 1/22/2002 19:55'!
dbase
	^dbase.
! !

!ExactDocumentIndex methodsFor: 'private' stamp: 'sac 1/23/2002 19:48'!
initialize: anIndexAdaptor withSet: aSetClass
	"See superclass method with same name."
	super initialize: anIndexAdaptor withSet: aSetClass.
	dbase _ Dictionary new.
! !


!ExactDocumentIndex class methodsFor: 'as yet unclassified' stamp: 'sac 1/23/2002 15:35'!
withAdaptor: anIndexAdaptor withSet: aSetClass
	"Build an index over documents. IndexAdaptor is something responsible for interfacing documents to and from the index code, using aSetClass to store the individual sets."
	^self new initialize: anIndexAdaptor withSet: aSetClass.! !


!IndexAdaptor methodsFor: 'as yet unclassified' stamp: 'sac 1/23/2002 19:46'!
canonicalize: aString
	"Given a string or a term, canonicalize it for searching. Canonicalization is used to remove unimportant differences (case, capitalization, etc) in search terms."
	self subclassResponsibility! !

!IndexAdaptor methodsFor: 'as yet unclassified' stamp: 'sac 1/23/2002 19:46'!
terms: aDocument
	"Given a document, give a Set of the search terms occuring in it."
	self subclassResponsibility! !


!OrderedDocumentIndex methodsFor: 'private' stamp: 'sac 1/22/2002 20:12'!
dbase
	^dbase.
! !

!OrderedDocumentIndex methodsFor: 'private' stamp: 'sac 1/23/2002 19:48'!
initialize: anIndexAdaptor withSet: aSetClass
	"See superclass method with same name."
	super initialize: anIndexAdaptor withSet: aSetClass.
	dbase _ SkipList new.
! !

!OrderedDocumentIndex methodsFor: 'searching' stamp: 'sac 1/24/2002 17:25'!
prefixOf: aString
	"Return the set of documents that have a term that matches the prefix of the given string:"
	| sum node term |
	term _ (indexAdaptor canonicalize: aString).
	sum _ self setClass new.
	node _ dbase search: term.
	node ifNil: [^sum]. "No key exists."
	[node key beginsWith: term]
		whileTrue:  [sum union: node value.
					node _ node next.
					node ifNil: [^sum]].
	^sum! !

!OrderedDocumentIndex methodsFor: 'searching' stamp: 'sac 1/24/2002 17:26'!
rangeFrom: startString to: endString
	"Return the set of documents that contain any term in the range of [startString, endString]"
	| sum node startTerm endTerm |
	startTerm _ (indexAdaptor canonicalize: startString).
	endTerm _ (indexAdaptor canonicalize: endString).
	sum _ self setClass new.
	node _ dbase search: startTerm.
	node ifNil: [^sum]. "No key exists."
	[node key <= endTerm]
		whileTrue:  [sum union: node value.
					node _ node next.
					node ifNil: [^sum]].
	^sum! !

!OrderedDocumentIndex methodsFor: 'searching' stamp: 'sac 1/24/2002 17:27'!
termPrefixOf: aString
	"Return the set of terms that have a term that matches the prefix of the given string:"
	| sum node term |
	term _ (indexAdaptor canonicalize: aString).
	sum _ Set new.
	node _ dbase search: aString.
	node ifNil: [^sum]. "No key exists."
	[node key beginsWith: term]
		whileTrue:  [sum add: node key.
					node _ node next.
					node ifNil: [^sum]].
	^sum! !

!OrderedDocumentIndex methodsFor: 'searching' stamp: 'sac 1/24/2002 17:26'!
termRangeFrom: startString to: endString
	"Return the set of documents that contain any term in the range of [startString, endString]"
	| sum node startTerm endTerm |
	startTerm _ (indexAdaptor canonicalize: startString).
	endTerm _ (indexAdaptor canonicalize: endString).
	sum _ Set new.
	node _ dbase search: startTerm.
	node ifNil: [^sum]. "No key exists."
	[node key <= endTerm]
		whileTrue:  [sum add: node key.
					node _ node next.
					node ifNil: [^sum]].
	^sum! !


!OrderedDocumentIndex class methodsFor: 'as yet unclassified' stamp: 'sac 1/23/2002 15:35'!
withAdaptor: anIndexAdaptor withSet: aSetClass
	"Build an index over documents. IndexAdaptor is something responsible for interfacing documents to and from the index code, using aSetClass to store the individual sets."
	^self new initialize: anIndexAdaptor withSet: aSetClass.! !

IndexAdaptor removeSelector: #condense:!

!DocumentIndex reorganize!
('updating' add: condense remove: removeStopTerm: uncondense)
('searching' allOf: anyOf: contains: containsNoCopy: fastAllOf: subStringOf: termSubStringOf:)
('private' dbase getSetFor: initialize:withSet: setClass)
('feedback' allTerms documentCount termsByFrequency)
('advanced updating' add:withKey: remove:withKey:)
!

-------------- next part --------------
'From Squeak3.2alpha of 11 October 2001 [latest update: #4646] on 24 January 2002 at 5:34:39 pm'!
IndexAdaptor subclass: #FulltextMethodAdaptor
	instanceVariableNames: ''
	classVariableNames: 'CSAlphabetic CSNonAlphabetic '
	poolDictionaries: ''
	category: 'FullText-Adaptors'!

!FulltextMethodAdaptor commentStamp: 'sac 1/24/2002 17:34' prior: 0!
Do case-insensitive searching on the alphabetic text within the body of each method, as represented as a MethodReference.
!

IndexAdaptor subclass: #ImplementorsMethodAdaptor
	instanceVariableNames: ''
	classVariableNames: 'CSAlphabetic CSNonAlphabetic '
	poolDictionaries: ''
	category: 'FullText-Adaptors'!

!ImplementorsMethodAdaptor commentStamp: 'sac 1/24/2002 17:33' prior: 0!
Given a MethodReference, return the symbol representing the function that it implements. (For implementors-of searches)

!

Object subclass: #MethodReference
	instanceVariableNames: 'classSymbol classIsMeta methodSymbol stringVersion hash '
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Tools-Browser'!

!FulltextMethodAdaptor methodsFor: 'as yet unclassified' stamp: 'sac 1/24/2002 14:01'!
canonicalize: aString
	"Given a string or a term, canonicalize it for searching."
	^aString asLowercase
! !

!FulltextMethodAdaptor methodsFor: 'as yet unclassified' stamp: 'sac 1/24/2002 14:46'!
terms: aMethodReference
	"Given a document, give a list of the search terms occuring in it."
	| words |
	words _ (aMethodReference actualClass sourceCodeAt: aMethodReference methodSymbol) asString
         substrings: CSAlphabetic complement: CSNonAlphabetic.
	^words
! !


!FulltextMethodAdaptor class methodsFor: 'as yet unclassified' stamp: 'sac 1/24/2002 14:01'!
initialize
	super initialize.
	CSAlphabetic _ CharacterSet allAlphabetic.
	CSNonAlphabetic _ CSAlphabetic complement.! !


!ImplementorsMethodAdaptor methodsFor: 'as yet unclassified' stamp: 'sac 1/24/2002 14:03'!
canonicalize: aString
	"Given a string or a term, canonicalize it for searching."
	^aString
! !

!ImplementorsMethodAdaptor methodsFor: 'as yet unclassified' stamp: 'sac 1/24/2002 14:04'!
terms: aMethodReference
	"Given a document, give a list of the search terms occuring in it."
	| words |
	words _ Set with: (aMethodReference methodSymbol).
	^words
! !


!ImplementorsMethodAdaptor class methodsFor: 'as yet unclassified' stamp: 'sac 1/24/2002 14:03'!
initialize
	super initialize.
	CSAlphabetic _ CharacterSet allAlphabetic.
	CSNonAlphabetic _ CSAlphabetic complement.! !


!MethodReference methodsFor: 'queries' stamp: 'sac 1/24/2002 13:36'!
asStringOrText

	^self stringVersion! !

!MethodReference methodsFor: 'setting' stamp: 'sac 1/24/2002 13:34'!
setClass: aClass methodSymbol: methodSym stringVersion: aString 

	classSymbol _ aClass theNonMetaClass name.
	classIsMeta _ aClass isMeta.
	methodSymbol _ methodSym.
	stringVersion _ aString.
	hash _ nil.! !

!MethodReference methodsFor: 'setting' stamp: 'sac 1/24/2002 13:34'!
setClassSymbol: classSym classIsMeta: isMeta methodSymbol: methodSym stringVersion: aString 

	classSymbol _ classSym.
	classIsMeta _ isMeta.
	methodSymbol _ methodSym.
	stringVersion _ aString.
	hash _ nil.! !

!MethodReference methodsFor: 'setting' stamp: 'sac 1/24/2002 13:44'!
setStandardClass: aClass methodSymbol: methodSym

	classSymbol _ aClass theNonMetaClass name.
	classIsMeta _ aClass isMeta.
	methodSymbol _ methodSym.
	stringVersion _ nil.
	hash _ nil.! !

!MethodReference methodsFor: 'string version' stamp: 'sac 1/24/2002 13:50'!
stringVersion

	stringVersion ifNil:
	     [stringVersion _ self actualClass name, ' ' , self methodSymbol].
	^stringVersion! !

!MethodReference methodsFor: 'comparisons' stamp: 'sac 1/24/2002 17:06'!
= anotherMethodReference

	"self species == anotherMethodReference species ifFalse: [^ false]." 
	"Removing the line above gives an overall 30% boost to searching speed." 
	classSymbol == anotherMethodReference classSymbol ifFalse: [^false].
	classIsMeta == anotherMethodReference classIsMeta ifFalse: [^false].
	^methodSymbol == anotherMethodReference methodSymbol
! !

!MethodReference methodsFor: 'comparisons' stamp: 'sac 1/24/2002 13:41'!
hash
	hash ifNil: [
		hash _ classIsMeta hash hashMultiply.
		hash _ (classSymbol hash + hash) hashMultiply.
		hash _ (methodSymbol hash + hash).].
	^hash.! !

ImplementorsMethodAdaptor initialize!
FulltextMethodAdaptor initialize!


More information about the Squeak-dev mailing list