Searching magma string indices
Brent Pinkney
brent at zamail.co.za
Wed Mar 7 16:10:36 UTC 2007
Hi Chris,
Thanks for your suggestions: I think we are getting closer. However:
> > Also, whilst you are hunting elephants, SQL has both % and ?
> > wildcards: % is any sequence of characters inlcuding the empty string
> > and ? is precicely one character.
> > So foo??bar would match fooABbar but not fooCDEbar and not fooFbar.
> >
> > Any chance you could bend Magma's indices into managing expressions
> > with a fixed number of ?s (e.g. ?foo????bar??baz???)
>
> Yes, there are several ways to do this but here's one. A new index
> type that simply indexes a unique value for each character at a
> particular position. Returning to the previous example, an object
> whose #familyName attribute is "fredfoon tobart", this index type would
> to index the object at the following values:
>
> (256 * 1) + $f asciiValue
> (256 * 2) + $r asciiValue
> (256 * 3) + $e asciiValue
> (256 * 4) + $d asciiValue
> (256 * 5) + $f asciiValue
> (256 * 6) + $o asciiValue
> (256 * 7) + $o asciiValue
> (256 * 8) + $n asciiValue
> (256 * 9) + Character space asciiValue
> (256 * 10) + $t asciiValue
> (256 * 11) + $o asciiValue
> (256 * 12) + $b asciiValue
> (256 * 13) + $a asciiValue
> (256 * 14) + $r asciiValue
> (256 * 15) + $t asciiValue
>
> Then, whenever a single-character-matching string is specified, every
> character other than the question-marks becomes a conjunction in the
> query. If the user is searching:
>
> where: [ : p | p familyName like: '????foon toba??' ]
>
> then it would have to be parsed into the following regular clause:
>
> (familyName = ((256*5) + $f asciiValue)
> & (familyName = ((256*6) + $o asciiValue)
> & (familyName = ((256*7) + $o asciiValue)
> & (familyName = ((256*8) + Character space asciiValue)
> & (familyName = ((256*9) + $t asciiValue)
> & (familyName = ((256*10) + $o asciiValue)
> & (familyName = ((256*11) + $b asciiValue)
> & (familyName = ((256*12) + $a asciiValue)
>
> Notice the question marks are the ones on which we are not qualifying,
> so any value can be placed in positions 1-4 and 13-14.
>
Note: SQL actually uses '_' not '?' for a single wild character. Sorry: I continue to use '?' in this thread for consistency.
This all works except for expressions of the form: '???' i.e. where we are searching for all strings with exactly three characters.
I have yet to think of how to include this boundary condition except that we could additionally index on the size of the strings. In this case the MaHashIndex
would include a record with key = 3 and values = 'all string of length three'.
Finally, we also need to support expressions like '%foo%bar?b??t?'. I am hoping your earlier suggestions on searching for %'s can be merged into the same index
as the ?s.
Any ideas ?
Brent
More information about the Magma
mailing list