Subclassing Engines (was Re: Moore's law and why persistence may not be necessary. (fwd))

Scott A Crosby crosby at qwes.math.cmu.edu
Wed Jan 23 16:38:00 UTC 2002


On Wed, 23 Jan 2002, Bijan Parsia wrote:

> On Wed, 23 Jan 2002, Scott A Crosby wrote:
>
> > On Wed, 23 Jan 2002, Bijan Parsia wrote:
> >
> >
> > Wait! You're not supposed to be futzing with my indexers, ONLY the
> > adaptors. Cause the indexers are going to seriously change soon. :)
>
> I subclassed DocumentIndex so I could use the following #add: method:
>
> add: anIndexFileEntry
> 	"Add a document into the database."
> 	| words |
> 	words _ indexAdaptor terms: anIndexFileEntry.
> 	words do: [ :word |
> 		| canonword set |
> 		canonword _ indexAdaptor canonicalize: word.
> 		set _ dbase at: canonword ifAbsent: [IdentitySet new].
> 		set add: anIndexFileEntry.
> 		dbase at: canonword put: set]
>
> So, what you're looking up is an IndexFileEntry rather than the entire
> text. I won't mind rewriting :)
>

This method looks to be *identical* to the original, except for
changing the argument name from 'aDocument' to 'anIndexFileEntry'.

If you want to do what you're doing, just index descriptors of the
external entities, passing them in wherever the engine expects aDocument.

Then, subclass the SimpleIndexAdaptor with something like:

DescriptorAdaptor>>terms: aDocument
   "Answer the terms of the real text that the descriptor refers to."
   ^ super terms: aDocument getTheRealStringFromWherever

Thus, the search engine returns only descriptors, and does not store the
actual messages. (You could subclass String to make MessageID, add
#getTheRealStringFromWherever, and use that as a descriptor?)
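
A minimal sketch of that descriptor idea, written as methods only. Everything
here except #terms: and #getTheRealStringFromWherever is an assumption made
for illustration: MessageID is taken to be the descriptor class suggested
above, and MessageStore (with #default and #textFor:) is a hypothetical
object that can fetch a message's text from disk given its ID.

```smalltalk
"Assumption: MessageID is a descriptor class (for example a String subclass)
 whose contents identify one message stored outside the image."

MessageID>>getTheRealStringFromWherever
	"Answer the full text of the message this descriptor names.
	 MessageStore, #default and #textFor: are hypothetical names."
	^ MessageStore default textFor: self

DescriptorAdaptor>>terms: aDescriptor
	"Fetch the real text just long enough to extract its terms.
	 The text itself is never handed to the index."
	^ super terms: aDescriptor getTheRealStringFromWherever
```

With an adaptor like this, the unchanged #add: only ever puts the descriptor
into its sets, so the text is read once at indexing time and stays on disk.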


Subclassing the engines should never be necessary. I explicitly factored
out the 'how to get data and how to deal with it' into adaptors so that
the engines would be interchangeable, and so that updating the engine
updates *all* users of it.

>
> No, I'm subclassing. The key bit is to not have to store all the text in
> the db. I don't *want* all the text :) I want all the text to stay on
> disk.

You don't need to subclass for that, just use the technique above (or a
refactor that separates out the 'get the real text' from the 'what parts
of the real text are terms').
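
One way such a refactor could look, as a rough sketch rather than the actual
adaptor code. SplitAdaptor and LazyTextAdaptor are hypothetical class names,
#textFor: and #termsIn: are hypothetical selectors, and the whitespace
tokenizing is only a stand-in for whatever the real adaptor does.
LazyTextAdaptor is assumed to be a subclass of SplitAdaptor.

```smalltalk
SplitAdaptor>>terms: aDocument
	"Compose the two steps: get the real text, then pick its terms."
	^ self termsIn: (self textFor: aDocument)

SplitAdaptor>>textFor: aDocument
	"Default: the document is its own text."
	^ aDocument

SplitAdaptor>>termsIn: aString
	"Default (an assumption): whitespace-separated words."
	^ aString substrings

LazyTextAdaptor>>textFor: aDescriptor
	"Override only the retrieval step; term extraction is inherited."
	^ aDescriptor getTheRealStringFromWherever
```

The point of the split is that a descriptor-based adaptor overrides one
method and still picks up any later change to how terms are extracted.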

>
> Indexing takes a *while*, of course. And, on this Win2000 box, squeak
> *engulfs* all available CPU time, which makes it hard to do anything
> else. I'm going to try threading the indexing in squeak, and see if that
> helps.
>

As long as you do that outside of my code. :)


> > > substitute in an IndexFileEntry for the fulltext in the Index. Which
> > > means that this 1) can be used directly by Celeste with fairly minimal
> > > modifications and 2) it's not going to take a *hell* of a lot more memory
> > > than the current Celeste situation and 3) you get blazing full text
> > > searches for that extra memory.
> >
> > You *might* be surprised. I'm curious as to how big the index actually is.
> > I'll expect 30-60mb.
>
> Hmm. Y'know, I have little idea of how to measure space in Squeak :)
>

x _ Smalltalk bytesLeft.
di _ nil.
y _ Smalltalk bytesLeft.
y - x


> Here are my current VM statistics for memory:

I'd guess 20mb for the index or so.

--
No DVD movie will ever enter the public domain, nor will any CD. The last CD
and the last DVD will have moldered away decades before they leave copyright.
This is not encouraging the creation of knowledge in the public domain.



