Hi Chris and Brent!
Ok, so our project is moving along fine - it will most probably become open source. I intend to prepare a demonstration for OOPSLA and see if it gets accepted (and tell everyone about Seaside/Magma of course).
In the app we are building we intend to offer the users advanced filtering of "cases". This boils down to more or less what you have been discussing about indices and Lava (though I am at this point not interested in Lava - but will definitely be later on).
So my question is: What is the current status of this work? And where can I get it? :)
I have read through your posts on the subject and it looks great to me.
regards, Göran
Hi Goran,
I am still working on this with Chris and was just this very week going to solicit your input on my current thoughts but managed to concuss myself yesterday!
It is both Chris and my opinion that this functionality is critical to Magma.
The issue with reading multiple indices actually has nothing, I mean nothing, to do with Lava. I think the connection has been made because a) I am the creator of Lava, b) I am investigating queries over multiple indices, and c) the technology will obviously be very useful to Lava.
The project actually arose for a pure Seaside + Magma project.
The state of the problem is that Chris and I have agreed on the client interface which uses a DNU to generate a tree of clauses and have made some early attempts.
e.g.
query := people where: [ :p | (p familyName = 'Foo') & (p age between: 30 and: 40) ] orderBy: [ :p | p familyName , p gender ].
query do: [ :p | self doSomethingWith: p ].
As for the server side, I made a naive implementation which will not scale but does work. I can forward this to you with instructions if you wish.
Chris then made some considerable progress on a real implementation. His insight was to add a reverse or 'census' index for each Magma index.
A census index maps oid -> key. With this in place, and given that a traditional Magma index knows how many oids satisfy a key range (e.g. age between: 30 and: 40), Magma can parse the tree of query clauses and trivially determine which clause maps to the fewest oids. Each of these oids can then be looked up in the census indices of all the other indices mentioned in the query to determine which oids survive.
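To make the narrowing step concrete, here is a small language-neutral sketch in Python (the data structures and names are mine for illustration, not Magma's): each forward index maps key -> set of oids, each census index maps oid -> key, the cheapest clause drives the scan, and the census indices answer the remaining clauses per candidate.

```python
# Illustrative sketch of census-index narrowing (not Magma code).
# forward[attr] : key -> set of oids; census[attr] : oid -> key.

def narrow(forward, census, clauses):
    """clauses: list of (attr, predicate-on-key). Returns surviving oids."""
    # Count how many oids each clause matches using the forward indices.
    def count(attr, pred):
        return sum(len(oids) for key, oids in forward[attr].items() if pred(key))

    # Pick the clause with the fewest matching oids as the driver.
    clauses = sorted(clauses, key=lambda c: count(*c))
    (attr0, pred0), rest = clauses[0], clauses[1:]
    candidates = {oid for key, oids in forward[attr0].items() if pred0(key)
                  for oid in oids}

    # For each candidate, look up its key in the census index of every
    # other clause and keep it only if all predicates hold.
    return {oid for oid in candidates
            if all(pred(census[attr][oid]) for attr, pred in rest)}

forward = {
    'familyName': {'Foo': {1, 2}, 'Bar': {3}},
    'age': {25: {3}, 35: {1}, 50: {2}},
}
census = {
    'familyName': {1: 'Foo', 2: 'Foo', 3: 'Bar'},
    'age': {1: 35, 2: 50, 3: 25},
}
result = narrow(forward, census,
                [('familyName', lambda k: k == 'Foo'),
                 ('age', lambda k: 30 <= k <= 40)])
print(sorted(result))  # only oid 1 is 'Foo' and aged 30-40
```

The age clause matches one oid versus two for the name clause, so it drives the scan and only one census lookup is needed.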
This solution also preserves the Magma feature where the size of the result set is known in advance of iterating all the results. Chris and I call this feature 'laziness'. This is very handy for batched lists in Seaside.
For purely conjunctive queries without search clauses, this solution is complete. Chris made the Monticellos available to me which demonstrate this.
However, the problem is a lot harder for queries with disjunctions and/or sort criteria. Because an oid can occur in more than one clause, it is not possible to trivially know the size of the result set. Obviously #maximumPossibleSize is known, and this may be acceptable in some cases. Further, such an object will occur twice when the result set is enumerated.
The obvious solution is some bitmap index which is constructed by iterating over the results for each clause and and-ing and or-ing them appropriately. For very large collections where one or more clauses read a large portion of the keyspace (e.g. familyName from: 'a' to: 'z'), this solution becomes very, very slow as the results cannot be retrieved a page at a time.
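The AND/OR combination can be sketched like this (a toy Python illustration under my own assumptions, not Magma's API; integer bitmasks stand in for the bitmap index, and each clause is assumed to be able to enumerate its matching oids):

```python
# Toy bitmap combination for a query tree (illustration only).
# An oid's bit is set in a clause's bitmap iff the clause matches that oid.

def bitmap_for(matching_oids):
    bm = 0
    for oid in matching_oids:
        bm |= 1 << oid
    return bm

def oids_in(bm):
    return [i for i in range(bm.bit_length()) if bm >> i & 1]

# (familyName = 'Foo' AND age in 30..40) OR familyName = 'Mandela'
foo      = bitmap_for([1, 2, 5])   # familyName = 'Foo'
mid_aged = bitmap_for([2, 3])      # age between 30 and 40
mandela  = bitmap_for([7])         # familyName = 'Mandela'

result = (foo & mid_aged) | mandela
print(oids_in(result))             # [2, 7]
```

Note that even if an oid satisfies several disjuncts it has only one bit, so the bitmap also solves the duplicate-enumeration problem mentioned above - at the cost of materialising every clause's results up front.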
Sort criteria further complicate the problem, as a simple induction proof will show that the entire result set must be interrogated at least once to find the order.
--------------------------------------------
This was the state when Chris and I last spoke. I believe the aforementioned issues define the problem and are therefore unavoidable. However, I think/hope that the problem can be managed by adding two properties to the solution:
1. a timeout
2. a new Magma index defining the position of an oid in each of the other Magma indices (not the census indices).
1. Timeout
--------------
Squeak has valueWithin:onTimeout: and nice Notifications/resumable exceptions.
Moreover, the Magma server will know an upper bound on how long it will take to populate the bitmap, since we know the number of oids in each term. Chris also showed in the message below how a given Magma server can get some idea of how long it will take.
See http://lists.squeakfoundation.org/pipermail/magma/2005-December/000102.html
IMHO the applications I write are either batch jobs which have time to burn (within reason) or interactive applications where the user can be asked to refine the query or agree to wait the estimated time it will take Magma to prepare the query.
2. A Sorting Index
-----------------------
I think I have a solution to sorting which will be very quick in an interactive application and at least decent in the general case.
It turns out that a Magma index can hold entries which map an oid to more than one key, i.e. oid -> { key1, key2, ... }. We can exploit this to add a 'sorting index' to each collection. The oid of each object in the collection is the key, and the value is a collection of integers. Each integer is the position of that oid (object) in one of the other indices' sort orders.
That being the case, when Magma is constructing the bitmap index, it can hold an ordered collection of position -> oid associations.
Now the cunning part is that we do not have to keep an association for every oid in the result, only the first few pages. This is useful in an interactive app, as I have heard it said that most folk only look at the first page or so. It works because a Magma index is maintained in sort order: if we want two pages of 10 results, and the lowest oid was at position 512 in the sort index and the 20th was at position 14567, then we know that the 21st must be at position 14568 or higher.
For these later pages we can just read the sort index and check whether each oid is in the bitmap; if it is, then it is the 'next' element.
If this takes too long then we time out and let the application decide what to do. Smalltalk will let us continue where we left off if we want.
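The page-at-a-time walk over the sort index can be sketched as follows (an illustrative Python model under my own naming, not Magma code; the sort index is modelled as a list of oids in sort order and the result set as a bitmap):

```python
# Sketch of the proposed sorting-index walk (names and shapes are
# illustrative).  sort_index: position -> oid, already in sort order.
# result_bm : bitmap of oids that satisfy the query.

def first_page(sort_index, result_bm, page_size):
    """Collect the first page of (position, oid) pairs in sort order."""
    page, pos = [], 0
    while len(page) < page_size and pos < len(sort_index):
        oid = sort_index[pos]
        if result_bm >> oid & 1:
            page.append((pos, oid))
        pos += 1
    return page

def next_after(sort_index, result_bm, last_pos):
    """The next result must sit at a position > last_pos in the sort index."""
    for pos in range(last_pos + 1, len(sort_index)):
        oid = sort_index[pos]
        if result_bm >> oid & 1:
            return pos, oid
    return None

sort_index = [9, 4, 7, 2, 5, 8]             # oids in sort order
result_bm = (1 << 4) | (1 << 2) | (1 << 8)  # oids 4, 2, 8 match the query
page = first_page(sort_index, result_bm, 2)
print(page)                                 # [(1, 4), (3, 2)]
print(next_after(sort_index, result_bm, page[-1][0]))  # (5, 8)
```

Only the visited pages' position -> oid associations are retained; later results are found by scanning the sort index forward from the last known position, which is exactly where a timeout can interrupt and later resume.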
Sample Code:
-----------------
Example 1 - bitmap required:
| query | query := people where: [ :p | (p familyName = 'Pinkney') & (p age < 60) ] orderBy: [ :p | p age descending ].
[ query prepareWithin: 10 seconds ] on: MaQueryTakesTooLong do: [ .... ].
query do: [ :p | self doSomethingWith: p ].
Example 2 - no bitmap required as duplicates are allowed:
| query | query := people where: [ :p | (p familyName between: 'Man' and: 'Pinkney') | (p familyName = 'Mandela') ].
[ query allowDuplicates; prepare ] on: MaQueryTakesTooLong do: [ ... ].
query size.
Example 3 - bitmap required as duplicates are prohibited:
| query | query := people where: [ :p | (p familyName between: 'Man' and: 'Pinkney') | (p familyName = 'Mandela') ].
[ query prepare ] on: MaQueryTakesTooLong do: [ ... ].
query size.
Example 4 - implicit preparation for convenience, also ignoring timeout warnings:
( people where: [ :p | (p familyName between: 'Man' and: 'Pinkney') | (p familyName = 'Mandela') ] ) do: [ :p | self doSomethingWith: p ]
Conclusion
--------------
I was going to polish my ideas up, but the sore head and Goran's (I cannot get my keyboard to do an umlaut, help) sudden interest has encouraged me to publish my intentions on the list.
All comments are welcome as I REALLY need this functionality!
Cheers
Brent
Hi Brent and Chris! (and all others)
Again, this all sounds dandy to me. :) Can you guys point me to the appropriate no-promises-MCs for me to play with?
Yesterday I upgraded our dev image to MagmaServerLoader-cmm.16, and AFAICT our app still works. :)
Since I need to give our "filters" a real stab in the next two weeks, I will take anything you have so far.
regards, Göran
Hi,
Again, this all sounds dandy to me. :) Can you guys point me to the appropriate no-promises-MCs for me to play with?
Take a basic 3.8 image and load the following SqueakSource packages:
Magma1.0-cmm.4 (or higher, ymmv)
Lava-brp.28
Lava testing-brp.23
After this load the attached Magma research-brp.12 archive.
The Lava packages are only required to provide some test data.
1. Open a workspace and create the test repository:
LavaTestResource current.
2. Explore the following query expression:
query := LavaTestResource current people where: [ :p | (p familyName equals: 'Pinkney') & (p age between: 30 and: 60) ] orderBy: [ :p | p age ].
3. Execute the query by asserting the following:
query size = 2
Explore the query to see the simple bitmap used to hold the results of the query. This naive implementation will not scale but it is the departure point for a real solution.
I will forward Chris's mail on how to play with his partial solution, which I have described elsewhere in this thread.
Please feel free to badger me on this problem - we need a good solution.
Brent
Hi Goran,
This was Chris's message to me (I hope Chris does not mind) on how to try out his first pass at a query engine which uses the census indices to narrow the list of possible oids.
This solution is prone to the limitations I described elsewhere in this thread.
---------- Forwarded Message ----------
Subject: multiple indices - first optimized querying !!
Date: Saturday 15 April 2006 00:06
From: Chris Muller chris@funkyobjects.org
To: Brent Pinkney brent.pinkney@aircom.co.za
I've finally caught up to you Brent. I am fully into this iteration and have therefore merged all of the Magma-research package into the appropriate real Magma packages.
Please load the following from SqueakSource on top of a Magma 1.0 image:
Ma base additions-cmm.63
Ma special collections-cmm.63
Magma client-cmm.180
Magma server-cmm.156
Then you (should) be able to execute the same where: expressions, except that it will now use an optimized algorithm that calculates the minimum "QueryTrunk" that must be enumerated.
Note I have only performed very minimal testing; two or three queries that seem to work.
To support optimized queries, we now must maintain TWO MaHashIndexes for each index instead of just one. Regrettably, this will slow down the insert rate, already one of the slowest things Magma does (despite our hard work to optimize it last year). Many of the changes in MagmaCollectionManager are to account for the new oid -> key index mapping. I nevertheless think the reward of supporting complex queries will make it worth it.
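The cost Chris mentions comes from keeping the forward index and its census twin in step on every insert and remove. A minimal sketch of that bookkeeping, in Python with invented names (this is not MagmaCollectionManager's actual code):

```python
# Sketch of maintaining a forward index and its census (reverse) index
# together; the second write per insert is the extra cost discussed above.

class DualIndex:
    def __init__(self):
        self.forward = {}   # key -> set of oids
        self.census = {}    # oid -> key

    def add(self, oid, key):
        self.forward.setdefault(key, set()).add(oid)
        self.census[oid] = key          # second index write per insert

    def remove(self, oid):
        key = self.census.pop(oid)      # census tells us which key to clean
        self.forward[key].discard(oid)
        if not self.forward[key]:
            del self.forward[key]

idx = DualIndex()
idx.add(1, 'Foo'); idx.add(2, 'Foo'); idx.add(3, 'Bar')
print(idx.census[2])                # 'Foo'
idx.remove(2)
print(sorted(idx.forward['Foo']))   # [1]
```

The census index also pays for itself on removal: without it, deleting an object would require scanning the forward index to find its key.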
Please note that, so far, this implementation dynamically calculates a query result; it does NOT require ANY bitmap as yet, but at the expense of some side-effects:
- sorting is not provided (yet)
- the paging cannot be perfect; there will be some overlap between the pages (note this could happen anyway because the result set can change while the user is paging). This happens because we only start at a specified page/startIndex of the *trunk*, not the actual query result.
If you happen to see how it works in the debugger, you will doubtless understand new possibilities for our "bitmap" solution. I think we could possibly get away with simply recording result dispersion for the pages the user has visited.
One interesting thing to look at is 'partial ordering' as used by Google. They find the 'best' results and just order the first two pages. Just a thought though.
Well, I think this is a great thought, because that is exactly what we can do right now; we cannot provide sorted results like a relational database, where page one contains the lowest key in the entire result set. If we can sort just the page that we have, the *client* could do that work. If we can figure out how to do it relational style I'm all for it, but do you think we can maintain this dynamism we have now?
Enjoy! Chris
-------------------------------------------------------
Brent
magma@lists.squeakfoundation.org