Hi Brent!
I think you've cleverly solved your own question.. By indexing the reversed value, its no different than a keyword index where you can already query on the presence of multiple keywords, as in:
myArticles where: [ : each | (each keywords equals: 'Squeak') & (each keywords equals: 'Magma') ]
So you would just have to convert the wildcards to multiple conjunctions, for example, 'foo%bar' would just have to be translated to:
self people where: [ : each | (each familyName startsWith: 'foo') & (each familyName endsWith: 'bar') ]
and the index would have something like:
startsWith: aString self equals: aString
endsWith: aString self equals: aString reversed
You don't even need a separate "reversed" index, just add two entries (the forward and reversed) for each object added to the collection.
Now, the expression '%foo%bar%' goes beyond the first requirement of just searching on a prefix and/or a suffix.. For this a different type of index would be needed that adds every possible value for "Fredfoon Tobart", the following values:
fredfoon tobart redfoon tobart edfoon tobart dfoon tobart foon tobart oon tobart on tobart n tobart tobart tobart obart bart art rt t
Your '%foo%bar%' expression would then need to translate to
(familyName from: 'foo' to: 'foo' maAlphabeticalNext) & (familyName from: 'bar' to: 'bar' maAlphabeticalNext)
to find the object(s) that had "Fredfoon Tobart".
...
The standard indexes included with Magma don't begin to scratch the surface of their real potential. I find it fascinating to to talk about this stuff, thanks!
It's great to "see" you again buddy, welcome back, Chris
----- Original Message ---- From: Brent Pinkney brent@zamail.co.za To: magma@lists.squeakfoundation.org Sent: Thursday, March 1, 2007 12:15:07 AM Subject: Searching magma string indices
Hi,
I have been thinking about how one could search Magma String indices for a given prefix or suffix. i.e. find all words starting with 'foo' or ending with 'bar'.
One can add SQL-ish "like" functionality for a prefix seach quite easily with the method:
MaClause >> #like: aString "aString is a SQL-ish term and may include a trailing %"
self from: aString upTo: (aString copyWithout: $%) maAlphabeticalNext !
For example: find all the people whose name starts with 'Jo':
self people where: [ :p | p familyName like: 'Jo%' ].
This executes as a very efficient indixed search and will be debuting (in some form) in the next release of Lava.
However, finding all the strings ending with 'bar' is more troublesome as there is no way to map this to a #from:to:# expression on the Magma index.
UNLESS of course, we maintain another, hidden, String index on the collection with the hash of the string reversed. That is
self people where: [ :p | p familyName like '%Jo' ].
becomes: self people read: #familyName_reversed from: 'oJ' to: ...
It is very late here, so I have not been able to try it out, but I welcome comments.
What would be really nice is an index on expressions of the form '%foo%bar%'. Somehow I doubt this.
Cheers
Brent
_______________________________________________ Magma mailing list Magma@lists.squeakfoundation.org http://lists.squeakfoundation.org/mailman/listinfo/magma
Hi Chris,
Hi Brent!
I think you've cleverly solved your own question.. By indexing the reversed value, its no different than a keyword index where you can already query on the presence of multiple keywords, as in:
myArticles where: [ : each | (each keywords equals: 'Squeak') & (each keywords equals: 'Magma') ]
So you would just have to convert the wildcards to multiple conjunctions, for example, 'foo%bar' would just have to be translated to:
self people where: [ : each | (each familyName startsWith: 'foo') & (each familyName endsWith: 'bar') ]
and the index would have something like:
startsWith: aString self equals: aString endsWith: aString self equals: aString reversed
You don't even need a separate "reversed" index, just add two entries (the forward and reversed) for each object added to the collection.
Domo arigato gozaimasu Sensei
Now, the expression '%foo%bar%' goes beyond the first requirement of just searching on a prefix and/or a suffix.. For this a different type of index would be needed that adds every possible value for "Fredfoon Tobart", the following values:
fredfoon tobart redfoon tobart edfoon tobart dfoon tobart foon tobart oon tobart on tobart n tobart tobart tobart obart bart art rt t
Your '%foo%bar%' expression would then need to translate to
(familyName from: 'foo' to: 'foo' maAlphabeticalNext) & (familyName from: 'bar' to: 'bar' maAlphabeticalNext)
to find the object(s) that had "Fredfoon Tobart".
I must admit I am not following you here. Is this one of those Magma keyword indices which I have never managed to grok ?
Also, whilst you are hunting elephants, SQL has both % and ? wildcards: % is any sequence of characters inlcuding the empty string and ? is precicely one character. So foo??bar would match fooABbar but not fooCDEbar and not fooFbar.
Any chance you could bend Magma's indices into managing expressions with a fixed number of ?s (e.g. ?foo????bar??baz???)
It is to darn hot here to think.
Brent
Hi Brent,
Now, the expression '%foo%bar%' goes beyond the first requirement
of just searching on a prefix and/or a suffix.. For this a different type of index would be needed that adds every possible value for "Fredfoon Tobart", the following values:
fredfoon tobart redfoon tobart edfoon tobart dfoon tobart foon tobart oon tobart on tobart n tobart tobart tobart obart bart art rt t
Your '%foo%bar%' expression would then need to translate to
(familyName from: 'foo' to: 'foo' maAlphabeticalNext) & (familyName from: 'bar' to: 'bar' maAlphabeticalNext)
to find the object(s) that had "Fredfoon Tobart".
I must admit I am not following you here. Is this one of those Magma keyword indices which I have never managed to grok ?
Yes. As you know, even though many indexes only map one value per indexed attribute, any index can just as easily map multiple values per object, per indexed attribute.
Have a look at MaClause>>#shouldInclude:using:, you can very simply how it treats multiple values; if any one of the (multiple) indexed values for a particular object are "in range", then the entire object qualifies for that clause.
It's more generic than a "keyword" index, but a keyword index is an easy way to grok it, because you know when you search for a keyword, any of the keywords assigned to an object will cause that object to match.
So in the above example, searching the first clause
(familyName from: 'foo' to: 'foo' maAlphabeticalNext)
is qualified on the 5th element
foon tobart
and the second clause
& (familyName from: 'bar' to: 'bar' maAlphabeticalNext)
is qualifie by the 12th element:
bart
therefore the entire expression is satisfied and the object qualifies.
Also, whilst you are hunting elephants, SQL has both % and ? wildcards: % is any sequence of characters inlcuding the empty string and ? is precicely one character. So foo??bar would match fooABbar but not fooCDEbar and not fooFbar.
Any chance you could bend Magma's indices into managing expressions with a fixed number of ?s (e.g. ?foo????bar??baz???)
Yes, there are several ways to do this but here's one. A new index type that simply indexes a unique value for each character at a particular position. Returning to the previous example, an object whose #familyName attribute is "fredfoon tobart", this index type would to index the object at the following values:
(256 * 1) + $f asciiValue (256 * 2) + $r asciiValue (256 * 3) + $e asciiValue (256 * 4) + $d asciiValue (256 * 5) + $f asciiValue (256 * 6) + $o asciiValue (256 * 7) + $o asciiValue (256 * 8) + $n asciiValue (256 * 9) + Character space asciiValue (256 * 10) + $t asciiValue (256 * 11) + $o asciiValue (256 * 12) + $b asciiValue (256 * 13) + $a asciiValue (256 * 14) + $r asciiValue (256 * 15) + $t asciiValue
Then, whenever a single-character-matching string is specified, every character other than the question-marks becomes a conjunction in the query. If the user is searching:
where: [ : p | p familyName like: '????foon toba??' ]
then it would have to be parsed into the following regular clause:
(familyName = ((256*5) + $f asciiValue) & (familyName = ((256*6) + $o asciiValue) & (familyName = ((256*7) + $o asciiValue) & (familyName = ((256*8) + Character space asciiValue) & (familyName = ((256*9) + $t asciiValue) & (familyName = ((256*10) + $o asciiValue) & (familyName = ((256*11) + $b asciiValue) & (familyName = ((256*12) + $a asciiValue)
Notice the question marks are the ones on which we are not qualifying, so any value can be placed in positions 1-4 and 13-14.
Does this all make sense?
It is to darn hot here to think.
Yeah, you're probably just already missing skiing down fresh powder.. :)
Hi Chris,
Thanks for your suggestions: I think we are getting closer. However:
Also, whilst you are hunting elephants, SQL has both % and ? wildcards: % is any sequence of characters inlcuding the empty string and ? is precicely one character. So foo??bar would match fooABbar but not fooCDEbar and not fooFbar.
Any chance you could bend Magma's indices into managing expressions with a fixed number of ?s (e.g. ?foo????bar??baz???)
Yes, there are several ways to do this but here's one. A new index type that simply indexes a unique value for each character at a particular position. Returning to the previous example, an object whose #familyName attribute is "fredfoon tobart", this index type would to index the object at the following values:
(256 * 1) + $f asciiValue (256 * 2) + $r asciiValue (256 * 3) + $e asciiValue (256 * 4) + $d asciiValue (256 * 5) + $f asciiValue (256 * 6) + $o asciiValue (256 * 7) + $o asciiValue (256 * 8) + $n asciiValue (256 * 9) + Character space asciiValue (256 * 10) + $t asciiValue (256 * 11) + $o asciiValue (256 * 12) + $b asciiValue (256 * 13) + $a asciiValue (256 * 14) + $r asciiValue (256 * 15) + $t asciiValue
Then, whenever a single-character-matching string is specified, every character other than the question-marks becomes a conjunction in the query. If the user is searching:
where: [ : p | p familyName like: '????foon toba??' ]
then it would have to be parsed into the following regular clause:
(familyName = ((256*5) + $f asciiValue) & (familyName = ((256*6) + $o asciiValue) & (familyName = ((256*7) + $o asciiValue) & (familyName = ((256*8) + Character space asciiValue) & (familyName = ((256*9) + $t asciiValue) & (familyName = ((256*10) + $o asciiValue) & (familyName = ((256*11) + $b asciiValue) & (familyName = ((256*12) + $a asciiValue)
Notice the question marks are the ones on which we are not qualifying, so any value can be placed in positions 1-4 and 13-14.
Note: SQL actually uses '_' not '?' for a single wild character. Sorry: I continue to use '?' in this thread for consistency.
This all works except for expressions of the form: '???' i.e. where we are searching for all strings with exactly three characters. I have yet to think of how to include this boundary condition except that we could additionally index on the size of the strings. In this case the MaHashIndex would include a record with key = 3 and values = 'all string of length three'.
Finally, we also need to support expressions like '%foo%bar?b??t?'. I am hoping your earlier suggestions on searching for %'s can be merged into the same index as the ?s.
Any ideas ?
Brent
Hi Chris,
The last time we spoke we were trying to support the equivent of String >> #match: in Magma. This is equivalent to 'like' expression in Lava/SQL.
We made the following observations. I describe the problem I think we will have and possible solution is suggested afterwards with my issues.
1. Proper Prefixes ------------------- Expressions of the sort: where: [ :p | p familyName match: 'foo*' ] can be supported by the existing MaSearchStringIndex by mapping the clause to
(familyName from: 'foo' to: 'foo' maAlphabeticalNext).
2. Proper Suffixes ------------------- Expressions of the sort: where: [ :p | p familyName match: '*bar' ] can be supported a subclass of MaSearchStringIndex, which adds two hashes for each string in the MagmaCollection: the normal hash for the string, and an addtional one for the string reversed. The query would be mapped to
(familyName from: 'foo' reversed to: 'foo' reversed maAlphabeticalNext).
3. Complex Wildcards ------------------------ Expressions of the sort: where: [ :p | p familyName match: '*foo*bar*' ] can be supported by adding hashes for each proper substring of the entry. e.g. To hash 'foo to bart' we would add hashes for: 'foo to bart' 'oo to bart' 'o to bart' ' to bart' 'to bart' 'o bart' ' bart' 'bart' 'art' 'rt' 't'
The query '*foo*bar*' would be tranformed into the intersection of:
(familyName from: 'foo' to: 'foo' maAlphabeticalNext) & (familyName from: 'bar' to: 'bar' maAlphabeticalNext)
4. Single Character Wildcards --------------------------------- Expressions of the sort: where: [ :p | p familyName match: 'foo#bar' ] can be supported by adding hashes which are a function of each characters value and position in the string. e.g. To hash 'foo' we would add hashes for: (256 * 1) + $f asciiValue (256 * 1) + $f asciiValue (256 * 1) + $f asciiValue
Whenever a single-character-matching string is specified, every character other than the $# characgters becomes a conjunction in the query.
The query '*foo*bar*' would be tranformed into the intersection of
(familyName = ((256*1) + $f asciiValue) & (familyName = ((256*2) + $o asciiValue) & (familyName = ((256*3) + $o asciiValue) & (familyName = ((256*5) + $b asciiValue) & (familyName = ((256*5) + $a asciiValue) & (familyName = ((256*5) + $r asciiValue)
THE PROBLEM ------------------ The naive solution to this is to create a new subclass of MaSearchStringIndex (say MaMatchingStringIndex) which addes hashes for all the scenarios above into one MaHashIndex.
If the collection with this index has two elements 'footobar' and 'bartobaz' then the expression 'bar*' will incorrectly match to BOTH elements!
There will be a hash entry for the prefix 'bar' --> 'footobar' by 3) above in addition to the normal hash entry for 'bartobaz'.
THE SOLUTION ------------------ My current solution is for the new MaMatchingStringIndex class to add THREE indexes to a collection instead of the normal one index one would expect from the sample:
myCollection addIndex: (MaMatchingStringIndex new attribute: #familyName) ....
The first index would add two hashes for each string in the collection: one for the string and one for the reversed string. This would support 1) and 2) above. The second index would add hashes for the complex wildcard patterns in 3) above. The third index would add hashes for the single character wildcard patterns in 4) above.
ISSUES --------- 1) Is there no simpler solution I am missing. 2) If not, Magma would have to change #addIndex: to double dispatch back to the MaCollectionIndex so it can add itself (or three indexes) to the collection. 3) Are there issues with the -keys indexes that Magma addes by itself for each index ? 4) Is this stuff you would concider adding to Magma or is it for Lava only ?
If I am reincarnated, I am going to do something easy like brain surgery.
Brent
Hey!
- Complex Wildcards
Expressions of the sort: where: [ :p | p familyName match: '*foo*bar*' ] can be supported by adding hashes for each proper substring of the entry. e.g. To hash 'foo to bart' we would add hashes for: 'foo to bart' 'oo to bart' 'o to bart' ' to bart' 'to bart' 'o bart' ' bart' 'bart' 'art' 'rt' 't'
The query '*foo*bar*' would be tranformed into the intersection of:
(familyName from: 'foo' to: 'foo' maAlphabeticalNext) & (familyName from: 'bar' to: 'bar' maAlphabeticalNext)
I assume you want the above 'foo to bart' not be found with "*bar", because it doesn't END WITH "bar", it ends with "bart". BUT, searching for suffixes with index-type #3 could translate the wildcard slighlty differently, instead of the translation for "*bar*",
(familyName from: 'bar' to: 'bar' maAlphabeticalNext)
the translation for "*bar" would have to use an equals:
(familyName equals: 'bar')
So we may not need solution #2 at all..?
- Single Character Wildcards
Expressions of the sort: where: [ :p | p familyName match: 'foo#bar' ] can be supported by ...
(familyName = ((256*1) + $f asciiValue) & (familyName = ((256*2) + $o asciiValue) & (familyName = ((256*3) + $o asciiValue) & (familyName = ((256*5) + $b asciiValue) & (familyName = ((256*5) + $a asciiValue) & (familyName = ((256*5) + $r asciiValue)
Yeah, but you pointed out last time that you needed
to support expressions like '%foo%bar?b??t?'
so this index-type #4 seems insufficient to meet that requirement because the % wildcards matching multiple characters will throw off the entire positioning. Frankly, that sort of a query really stretches my mental capacity, I don't know whether I'd attempt a query like that.. Still, a different approach to single-character wildcard matching might help. By further enhancing the "wildcard translation" of solution #3 we might be able to better support single-character wildcard matching. Consider we're looking with 'foo t? b?rt':
(thinking out loud here)
- scan through the "like" string, get all the pieces between wildcards: #('foo t' ' b' 'rt')
- for all but the right-most, use the standard wildcard range:
& (familyName between: 'foo t' and: 'foo t' alphabeticalNext)
- but for the right-most one, use equals:
& (familyName equals: 'rt')
Sigh.. This obviously isn't water tight.. It doesn't account for the *order* in which the elements appear, only that they all appear somewhere. Still, that index-type #3 really offers a lot of bang for the buck. I'm sure we will indeed end up needing more than one underlying index-type, I'm just not sure what.. and I have no problem double-dispatching to the index, back to the collection to add itself (themselves).
I'd be surprised and disappointed if it is proven "impossible" given the considerable flexibility of Magma's indexing. We probably just need a stroke of creativity or genius to figure it out.
ISSUES
- Is there no simpler solution I am missing.
If there is I'm not seeing it right now..
- If not, Magma would have to change #addIndex: to double dispatch back to the MaCollectionIndex so it can add itself (or three indexes) to the collection.
No problem. But lets first figure out exactly what indexes we need..
- Are there issues with the -keys indexes that Magma addes by itself for each index ?
We'll have to make sure each underlying "sub-index" is by a different "private" attribute. It might be tricky, hopefully not.
- Is this stuff you would concider adding to Magma or is it for Lava only ?
If its general, absolutely.
- Chris
Hi!
I haven't read or followed the discussion, but...
I'd be surprised and disappointed if it is proven "impossible" given the considerable flexibility of Magma's indexing. We probably just need a stroke of creativity or genius to figure it out.
...what about taking a look at say Xapian (seems to be the leading indexing engine) and take a hint from them? Here is one pointer to their new "Flint" backend:
http://wiki.xapian.org/FlintBackend_2fStructure
regards, Göran
PS. Swish-e which we use in Gjallar points to Xapian as indexing engine for the future Swish-e version 3.
On 3/28/07, Chris Muller asqueaker@gmail.com wrote:
Consider we're looking with 'foo t? b?rt':
(thinking out loud here)
- scan through the "like" string, get all the pieces between
wildcards: #('foo t' ' b' 'rt')
for all but the right-most, use the standard wildcard range:
& (familyName between: 'foo t' and: 'foo t' alphabeticalNext)
but for the right-most one, use equals:
& (familyName equals: 'rt')
Sigh.. This obviously isn't water tight.. It doesn't account for the *order* in which the elements appear, only that they all appear somewhere. Still, that index-type #3 really offers a lot of bang for the buck. I'm sure we will indeed end up needing more than one underlying index-type, I'm just not sure what.. and I have no problem double-dispatching to the index, back to the collection to add itself (themselves).
Actually, if you used that scheme to get the candidates that could be valid, and then ran one last check using the equivalent of #match: (beefed up for single character wildcards - need to write that) on the candidates, that would be water tight. That is, pick up the possible candidates, get the actual value form the object, and see if that attirbute of the object actually matches the pattern as passed in. This way you get most of the benefit of the index AND you can make sure it is right.
At least, that should work. I haven't delved into the innards of Magma to even know if that is possible.
-ChrisC
On 3/1/07, Chris Muller chris@funkyobjects.org wrote:
Hi Brent!
I think you've cleverly solved your own question.. By indexing the reversed value, its no different than a keyword index where you can already query on the presence of multiple keywords, as in:
...
You don't even need a separate "reversed" index, just add two entries (the forward and reversed) for each object added to the collection.
Actually, wouldn't this give you wrong results? That is, if in the normal collection you had "foobar" and "rabbit" and searched for '%bar', I think you would get back both "foobar" and "rabbit" (since it would match on raboof and rabbit, right?).
So, for safety, I would still suggest having 2 collections - or maybe tagging the collection with a notice if it is a reversed value or not.
-ChrisC
Yes, you are right. But the "reversed" index is not needed anyway, since Brents solution 3 "complex wildcards" would fit the bill for finding suffixes..
Also, you really said it all right here:
Actually, if you used that scheme to get the candidates that could be valid, and then ran one last check using the equivalent of #match: (beefed up for single character wildcards - need to write that) on the candidates, that would be water tight. That is, pick up the possible candidates, get the actual value form the object, and see if that attirbute of the object actually matches the pattern as passed in. This way you get most of the benefit of the index AND you can make sure it is right.
This is something that everyone has asked for and what I would really like for Magma to have myself.. Unfortunately, I did not find a way to implement it within the Readers; their paging information maintained in the client image represents the pages as seen by the server.. And the server doesn't (necessarily) have the domain classes present in the image to be able to perform the final select block filtering..
Thanks, Chris
On 3/30/07, Chris Cunningham cunningham.cb@gmail.com wrote:
On 3/1/07, Chris Muller chris@funkyobjects.org wrote:
Hi Brent!
I think you've cleverly solved your own question.. By indexing the
reversed value, its no different than a keyword index where you can already query on the presence of multiple keywords, as in:
...
You don't even need a separate "reversed" index, just add two entries (the
forward and reversed) for each object added to the collection.
Actually, wouldn't this give you wrong results? That is, if in the normal collection you had "foobar" and "rabbit" and searched for '%bar', I think you would get back both "foobar" and "rabbit" (since it would match on raboof and rabbit, right?).
So, for safety, I would still suggest having 2 collections - or maybe tagging the collection with a notice if it is a reversed value or not.
-ChrisC
Magma mailing list Magma@lists.squeakfoundation.org http://lists.squeakfoundation.org/mailman/listinfo/magma
magma@lists.squeakfoundation.org