Searching magma string indices

Brent Pinkney brent at zamail.co.za
Wed Mar 7 16:10:36 UTC 2007


Hi Chris,

Thanks for your suggestions: I think we are getting closer. However:

> > Also, whilst you are hunting elephants, SQL has both % and ?
> > wildcards: % is any sequence of characters inlcuding the empty string
> > and ? is precicely one character.
> > So foo??bar would match fooABbar but not fooCDEbar and not fooFbar.
> > 
> > Any chance you could bend Magma's indices into managing expressions
> > with a fixed number of ?s (e.g. ?foo????bar??baz???)
> 
> Yes, there are several ways to do this but here's one.  A new index
> type that simply indexes a unique value for each character at a
> particular position.  Returning to the previous example, an object
> whose #familyName attribute is "fredfoon tobart", this index type would
> to index the object at the following values:
> 
>   (256 * 1) + $f asciiValue
>   (256 * 2) + $r asciiValue
>   (256 * 3) + $e asciiValue
>   (256 * 4) + $d asciiValue
>   (256 * 5) + $f asciiValue
>   (256 * 6) + $o asciiValue
>   (256 * 7) + $o asciiValue
>   (256 * 8) + $n asciiValue
>   (256 * 9) + Character space asciiValue
>   (256 * 10) + $t asciiValue
>   (256 * 11) + $o asciiValue
>   (256 * 12) + $b asciiValue
>   (256 * 13) + $a asciiValue
>   (256 * 14) + $r asciiValue
>   (256 * 15) + $t asciiValue
> 
> Then, whenever a single-character-matching string is specified, every
> character other than the question-marks becomes a conjunction in the
> query.  If the user is searching:
> 
>   where: [ : p | p familyName like: '????foon toba??' ]
> 
> then it would have to be parsed into the following regular clause:
> 
>   (familyName = ((256*5) + $f asciiValue)
>   & (familyName = ((256*6) + $o asciiValue)
>   & (familyName = ((256*7) + $o asciiValue)
>   & (familyName = ((256*8) + Character space asciiValue)
>   & (familyName = ((256*9) + $t asciiValue)
>   & (familyName = ((256*10) + $o asciiValue)
>   & (familyName = ((256*11) + $b asciiValue)
>   & (familyName = ((256*12) + $a asciiValue)
> 
> Notice the question marks are the ones on which we are not qualifying,
> so any value can be placed in positions 1-4 and 13-14.  
> 
Note: SQL actually uses '_' not '?' for a single wild character. Sorry: I continue to use '?' in this thread for consistency.

This all works except for expressions of the form: '???' i.e. where we are searching for all strings with exactly three characters.
I have yet to think of how to include this boundary condition except that we could additionally index on the size of the strings. In this case the MaHashIndex
would include a record with key = 3 and values = 'all string of length three'.

Finally, we also need to support expressions like '%foo%bar?b??t?'. I am hoping your earlier suggestions on searching for %'s can be merged into the same index
as the ?s.


Any ideas ?

Brent







More information about the Magma mailing list