Hi,
Concider a MagmaCollection with Person instances. The collection is indexed on #familyName and #dateOfBirth.
Magma currently supported very efficient indexed queries on these such indices, but only one index may be queried at a time.
Would it be possible to support complex queries on multiple indices ?
e.g. a) Find all people named 'Smith' and who were born in 1980 ? b) Find all people named 'Smith' or named 'Jones'
To my mind, this should be possible and quite efficient as the current MaCollectionReader code calculates the minimum and maximum hash values for each index. The Magma server then walks the index from the minimum value to the maximum value extracting the oids of each object.
I am not sure of the impact the readSize would have on this idea - at first glance it may be necessary to extract all the oids from each clause and cache the appropriate intersaction or union of the oids. The query could then return the pages from this cache.
Comments welcomed
Brent
Would it be possible to support complex queries on multiple indices ?
I have long been interested in how to solve this problem, but am not smart enough to do it myself w/o help. Thanks for provoking this discussion!
The way I envision this working from a user perspective is, the user "Reads" a collection on one index (and thus obtains a Reader), then can "intersect:" that Reader with another Reader to build yet a third Reader, which can be further intersected with other Readers as necessary.
There are three parts to the problem; 1) calculating the intersection/union, 2) creating a consistent Reader "view" of this intersection and 3) keeping the "view" maintained as various items eligibility change.
Perhaps if (1) and (2) can be done quickly, (3) may not be necessary.
Let me expound my thoughts.
1) Calculating the intersection
At the MaHashIndex level, calculating an intersection is very easy to implement using the "Bit Vectors" approach described at:
http://www2.toki.or.id/book/AlgDesignManual/BOOK/BOOK3/NODE133.HTM
I was barely able to understand this, but the idea I got from it was each oid could represent a bit-position of a LargePositiveInteger. As we enumerate each index, we simply setBit: oid of our number. Yes, it will be a VERY large number but Squeak seems to handle it just fine (try it!). So we do this for each index and then simply bitAnd: the numbers together to get a single LargePositiveInteger which describes the intersection, nice!
It is potentially problematic, though, if someone wants to intersect two very huge ranges, because calculating the Integers requires enumerating the entire ranges.
.. Another idea involves building a second mapping going the other way; i.e., instead of just hashValue->oid, also insert oid->hashValue into a counterpart index.. I'll have to think more about this..
2) Creating a consistent Reader "view" of this intersection
The easiest way to tackle this is to use the existing Reader logic that's already built-in; therefore we must have the "intersected MaHashIndex" (or equivalent) as its back-end. From my view, this is the best and only way to leverage all the code that's there, which was painstakingly implemented to correctness.
The challenge is how do we know the user is done with the temporary, intersected MaHashIndex? The only answer I can think of at the moment is to require them to #recycle it or something (yuck).
3) Keeping it up to date Don't even want to worry about this right now unless we can solve 1 & 2...
Comments welcomed
Consider also that we have the ability to write a custom index that indexes on multiple attributes simultaneously. For example, for each person, you could insert two hash values into one index, one for the familyName and one for the dateOfBirth.
This does not solve the requirements, but I thought it worth mentioning to help us think "outside the box"; since relational-style indexing has it so ingrained in us that indexes must be uniform on one attribute..
More comments welcomed please..
The way I envision this working from a user perspective is, the user "Reads" a collection on one index (and thus obtains a Reader), then can "intersect:" that Reader with another Reader to build yet a third Reader, which can be further intersected with other Readers as necessary.
I think we need to concentrate on the elegance of the programming api: normal AND and ORs.
Something like the 'brush' model used by Seaside 2.6
| reader | reader := myCollection reader. "answer a MaCollectionReader" reader read: #familyName at: 'Smith'; and: #dateOfBirth from: 1972 asYear to: 1975 asYear; or: #gender at: 'female'.
reader do: [ :p | p doSomething ].
The idea is that each send of #read:at:, #and:from:to:, #or:at: to the reader contucts a tree of clauses. These clauses are then used to determine the intersection and union of oids.
I have some experience from Lava with conjuntive and dijunctive clauses but we would need to find a way to express precedence or the and: or: clauses.
But this kind of interface is my general idea.
At the MaHashIndex level, calculating an intersection is very easy to implement using the "Bit Vectors" approach described at:
http://www2.toki.or.id/book/AlgDesignManual/BOOK/BOOK3/NODE133.HTM
Really light reading.
- Creating a consistent Reader "view" of this intersection
The easiest way to tackle this is to use the existing Reader logic that's already built-in; therefore we must have the "intersected MaHashIndex" (or equivalent) as its back-end. From my view, this is the best and only way to leverage all the code that's there, which was painstakingly implemented to correctness.
Agreed, see above.
The challenge is howp do we know the user is done with the temporary, intersected MaHashIndex? The only answer I can think of at the moment is to require them to #recycle it or something (yuck).
No idea yet.
- Keeping it up to date
Don't even want to worry about this right now unless we can solve 1 & 2...
No idea.
Comments welcomed
Consider also that we have the ability to write a custom index that indexes on multiple attributes simultaneously. For example, for each person, you could insert two hash values into one index, one for the familyName and one for the dateOfBirth.
Agreed, such synthesised hashes are a workaround.
Cheers
Brent
--- Brent Pinkney brent.pinkney@aircom.co.za wrote:
The way I envision this working from a user
perspective is, the user "Reads" a
collection on one index (and thus obtains a
Reader), then can "intersect:" that
Reader with another Reader to build yet a third
Reader, which can be further
intersected with other Readers as necessary.
I think we need to concentrate on the elegance of the programming api: normal AND and ORs.
Well yes, of course it would be great to have that.. Are you serious? Not only is that more familiar and elegant to programmers, it doesn't waste cycles like my approach which performs but a single intersection for each "condition".
I like the ambition, just a bit short on clarity about,
The idea is that each send of #read:at:, #and:from:to:, #or:at: to the reader contucts a tree of clauses. These clauses are then used to determine the intersection and union of oids.
particularly that last sentence. If you are able to do that, I am all for it.
No work needed to add an arbitrary-complex tree to the client server protocol beyond just specifying the new class in the #protocol..
I have some experience from Lava with conjuntive and dijunctive clauses but we would need to find a way to express precedence or the and: or: clauses.
Don't forget the joins.. ;)
- Chris
magma@lists.squeakfoundation.org