[BETA][DOCS] Design decisions for the full text indexers.

Scott A Crosby crosby at qwes.math.cmu.edu
Wed Jan 23 15:56:34 UTC 2002


I need to decide on how the indexers work.

So, I'm giving a summary of what adaptors are and how they are used. Then
I give the queries the core search engines currently support.

Next, I'll ask a couple of questions about 'the Smalltalk way' of
naming classes.

Finally, I'll offer a few hints for the future.

--

First, a summary of the interface. For any document type, you must supply
at least one adaptor, which does two things: it extracts strings[*] from
a document, and it canonicalizes searches. I.e., when matching, I don't
look for
  ``foo=bar'', but
  ``(adaptor canonicalize: foo) = (adaptor canonicalize: bar)''
One obvious canonicalizer I use is 'String>>lowercase'. Canonicalizers
keep irrelevant differences (such as case) out of the index.

A document may be anything: a String, a Text, a reference to method source
somewhere on the drive, a SmallInteger, anything. All you need is to
define an adaptor to read it and return the strings[*] in it. The result
of any search is a Set or IdentitySet of the matching documents.

A particular document type may want to be indexed in different ways. This
is why I have adaptors, which prepare documents for indexing. Thus, you
may want MethodFullSourceAdaptor, MethodAuthorAdaptor,
MethodCommentsAdaptor, MethodInvokesAdaptor, MessageSubjectAdaptor,
MessageBodyAdaptor, MessageHeadersAdaptor, etc.

I have a couple of simple, cheesy adaptors that just pull alphabetic
strings out of a Text or String and lowercase them.
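
A minimal sketch of such an adaptor, written in Python rather than Smalltalk
so it can be run standalone. The class name and the method names `extract`
and `canonicalize` are hypothetical; they only mirror the two jobs described
above and are not the actual adaptor protocol.

```python
import re


class LowercaseWordAdaptor:
    """Hypothetical adaptor: pulls alphabetic runs out of a string, lowercased."""

    def canonicalize(self, term):
        # Applied to indexed strings and to search terms alike,
        # so that 'Foo' and 'FOO' end up comparing equal.
        return term.lower()

    def extract(self, document):
        # Return the set of canonical terms to associate with this document.
        words = re.findall(r"[A-Za-z]+", document)
        return {self.canonicalize(word) for word in words}


adaptor = LowercaseWordAdaptor()
terms = adaptor.extract("Squeak is a Smalltalk; SQUEAK rocks.")
```

Here `terms` holds each distinct word once, in lowercase, which is what would
be handed to an engine for indexing.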

[*] By strings, I mean symbols, strings, tokens, whatever. Basically,
something I want associated with the document in the index. Furthermore,
not all search engines require them to be strings.

-- 

First, please do not futz with the actual engines, either directly or
through inheritance. Just write adaptors and use the engine via
containment. The engines are very incomplete and will be
changed/enhanced/refactored a lot.

Second, based on feedback from the naming advice I solicit below, names of
classes will change.


--
Ok, my indexers/engine so far.

I supplied one indexer that does equality matching against a set of
terms (it finds the documents containing any of the given strings):
   di anyOf: #(foo bar baz)
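
The idea behind an `anyOf:` query can be sketched as a small inverted index.
This is a Python illustration under my own assumptions, not the engine's
code: `ExactIndex`, `add`, and `any_of` are hypothetical names, and the
`extract` callable stands in for an adaptor.

```python
from collections import defaultdict


class ExactIndex:
    """Hypothetical stand-in for the equality-matching indexer."""

    def __init__(self, extract):
        self.extract = extract             # adaptor-like callable: document -> terms
        self.postings = defaultdict(set)   # term -> set of documents containing it

    def add(self, document):
        for term in self.extract(document):
            self.postings[term].add(document)

    def any_of(self, terms):
        # Union of the documents recorded under each requested term.
        result = set()
        for term in terms:
            result |= self.postings.get(term, set())
        return result


di = ExactIndex(lambda doc: doc.lower().split())
di.add("foo is here")
di.add("nothing relevant")
di.add("Bar and baz")
matches = di.any_of(["foo", "bar", "baz"])
```

The result is a set of documents, matching the description of search results
given earlier; a real query would also run the search terms through the
adaptor's canonicalizer first.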

I have another indexer that does prefix matching as well as equality
matching against a set of terms:
   pdi anyOf: #(foo bar baz)
   pdi prefixOf: 'squeak'

where the last matches any document containing a string that matches
'squeak*'.
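
One way a prefix query like this can be answered is to keep the terms
sorted, so that all terms sharing a prefix sit next to each other. The Python
sketch below assumes that layout; it is not necessarily how the actual engine
stores its terms, and `PrefixIndex`, `add`, and `prefix_of` are hypothetical
names.

```python
import bisect


class PrefixIndex:
    """Hypothetical prefix-matching index over a sorted term list."""

    def __init__(self):
        self.terms = []      # distinct terms, kept in sorted order
        self.postings = {}   # term -> set of documents containing it

    def add(self, document, terms):
        for term in terms:
            if term not in self.postings:
                bisect.insort(self.terms, term)
                self.postings[term] = set()
            self.postings[term].add(document)

    def prefix_of(self, prefix):
        # Terms starting with `prefix` form one contiguous run that
        # begins at the first term >= prefix.
        result = set()
        start = bisect.bisect_left(self.terms, prefix)
        for term in self.terms[start:]:
            if not term.startswith(prefix):
                break
            result |= self.postings[term]
        return result


pdi = PrefixIndex()
pdi.add("doc1", ["squeak", "smalltalk"])
pdi.add("doc2", ["squeaky", "wheel"])
pdi.add("doc3", ["squid"])
hits = pdi.prefix_of("squeak")
```

Only the terms 'squeak' and 'squeaky' match here, so `hits` contains doc1 and
doc2 but not doc3.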

--

Now for the questions:

I'm really unfamiliar with the Smalltalk world, so:

Any suggested naming scheme for the Adaptors?
  FooTextAdaptor, FooStringAdaptor  ('*Adaptor')
  FooTextSearchAdaptor, FooStringSearchAdaptor ('*SearchAdaptor')
or.. ??

Naming scheme for the search engines?
  PrefixFullTextSearch, ExactFullTextSearch ('*FullTextSearch')
  PrefixSearch, ExactSearch ('*Search')
  PrefixSearchEngine, ExactSearchEngine ('*SearchEngine')
  PrefixIndexer, ExactIndexer ('*Indexer')
or.. ??

Finally, what should the naming scheme be, if any, for people (like
Bijarn) who make indexers based on my engine? Should I use
('*SearchEngine') for the engines, to avoid name confusion with users of
the engines that are called ('*Indexer')?

--
Now for search engine features:

I plan on doing a substring-matching search engine (find any document
that contains a string that contains the search term). The catch is that
it'll be slowish: it has to do a full text search through the dictionary
of all search terms in all documents.
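
The cost described above shows up clearly in a sketch: every indexed term has
to be inspected. This is a Python illustration with a hypothetical
`substring_search` function and a plain dict standing in for the engine's
term dictionary.

```python
def substring_search(postings, needle):
    """Return documents having any indexed term that contains `needle`.

    `postings` maps each term to the set of documents containing it.
    Every term is scanned, so the cost grows with the number of distinct terms.
    """
    result = set()
    for term, documents in postings.items():
        if needle in term:
            result |= documents
    return result


postings = {
    "squeak": {"doc1"},
    "unsqueakable": {"doc2"},
    "smalltalk": {"doc1", "doc3"},
}
hits = substring_search(postings, "squeak")
```

The scan is over distinct terms rather than over the raw documents, which is
why it is slowish rather than hopeless.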

Because it's almost free, I'll support ranged queries: find me all
documents containing a string in a particular range.
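
A ranged query is cheap whenever the terms are already kept in sorted order,
which is an assumption about the engine on my part. Under that assumption it
is two binary searches plus a walk over the terms in between. The function
name `range_search` and the choice of inclusive bounds are hypothetical.

```python
import bisect


def range_search(sorted_terms, postings, low, high):
    """Return documents having any term t with low <= t <= high."""
    start = bisect.bisect_left(sorted_terms, low)    # first term >= low
    stop = bisect.bisect_right(sorted_terms, high)   # one past last term <= high
    result = set()
    for term in sorted_terms[start:stop]:
        result |= postings[term]
    return result


postings = {
    "apple": {"doc1"},
    "banana": {"doc2"},
    "cherry": {"doc1", "doc3"},
    "date": {"doc4"},
}
sorted_terms = sorted(postings)
hits = range_search(sorted_terms, postings, "b", "cz")
```

With these bounds only 'banana' and 'cherry' fall in range, so `hits` is the
union of their documents.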

I also plan on allowing the search engine to hold onto documents weakly,
and to compare documents by either hash or identityHash.
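
Both options can be illustrated with a rough Python analogue. This is only a
sketch of the two ideas, not the planned Smalltalk design; the `Document`
class is hypothetical, and the weak-reference behaviour shown relies on the
reference-counting collector in CPython.

```python
import gc
import weakref


class Document:
    """Hypothetical document whose equality is based on its text."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return isinstance(other, Document) and self.text == other.text

    def __hash__(self):
        return hash(self.text)


a = Document("same text")
b = Document("same text")

# Equality-based: a and b count as one document.
by_equality = {a, b}
# Identity-based: a and b stay distinct because they are different objects.
by_identity = {id(a): a, id(b): b}

# Weak holding: the collection does not keep the document alive.
c = Document("other text")
weak_docs = weakref.WeakSet([c])
count_before = len(weak_docs)
del c
gc.collect()
count_after = len(weak_docs)
```

The equality-based set holds one entry and the identity-based mapping two,
and the weakly held document disappears from the collection once nothing else
refers to it.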

Scott





