[OT] substring indexing

Richard A. O'Keefe ok at cs.otago.ac.nz
Tue Sep 2 05:01:31 UTC 2003


Martin Drautzburg <martin.drautzburg at web.de> asked:
	Does anybody know a good way to index a large set of strings (so)
	searching for strings that contain a given substring can be done fast?
	
	Is there a way to map strings to integers (i.e. indexing without
	a tree) so that the substring search can be mapped to examining
	a range of these integers and their associated strings . Or is
	this theoretically impossible ?

If anyone answered this question, they did so privately.
My understanding of the question is this:

    Given a fixed or slowly changing set S of strings
    is there a data structure that can be used so that
    q(x) = { s in S | exist a,z such that s = a++x++z }
    can be computed cheaply?

There are several such data structures.  Look for "suffix tree",
"suffix array", and "directed acyclic word graph".  For example,
the Compact DAWG of a text with n characters can be stored in 6n
integers (even, if I've understood the relevant articles correctly,
for wide (Unicode) characters).  Not only is the space cost linear
in the amount of text to be stored, the time to construct the index
is too.

http://www.nist.gov/dads/HTML/directedAcyclicWordGraph.html
is one place to start.  There are several books that discuss these matters
as well.

By the way, this is not necessarily a good way to do Information
Retrieval:  these methods consider _all_ characters on an equal basis
and have no interest in or awareness of words.



More information about the Squeak-dev mailing list