[Seaside] Fast text search and proper design

Avi Bryant avi.bryant at gmail.com
Sun Dec 12 12:17:07 CET 2004


On Sat, 11 Dec 2004 19:50:59 -0500, Benjamin Pollack
<benjamin.pollack at gmail.com> wrote:

> For this particular project, I would like to use GOODS rather than an
> RDBMS, but I need the ability to search very quickly through large
> amounts of text and return appropriate matches. I cannot find any
> effective data structure to use for that. Avi's B-tree won't really
> help, as near as I can tell, because I want to search *within*
> strings, and the default string operators likewise seem to be too
> primitive to be effective for this.

Scott Crosby wrote a fulltext search engine for Squeak a while back; I
suspect it would take a little bit of tweaking to make it work with
recent Squeak versions, but it would be a lot less work than starting
from scratch.  AFAIK it was only ever posted as a series of
enhancements to the mailing list, starting with this one:

http://impara.de/~bert/swiki.gsug.org/sqfixes/2203.html

If you were rolling your own, you certainly wouldn't want to use BTree
for text; instead, use the TSTree that is included in the BTree
package.  This does allow simple prefix and near-match searches on
individual strings, but doesn't provide a fulltext facility out of the
box.  What I've done in the past is to tokenize the text, and store a
set of objects in the TSTree at each token - if you're searching for
multiple tokens, you just intersect their sets.  If you expect these
sets to be big, and they have some immutable attribute you can sort
them on, you may want to use a TreeSet (also in the BTree package) to
hold them.

For lots of good information on the basics behind fulltext searching,
see Tim Bray's "on search" series of articles:

http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC

HTH,
Avi


More information about the Seaside mailing list