Multiple indices on MagmaCollections?

Brent Pinkney brent.pinkney at aircom.co.za
Thu May 18 17:05:18 UTC 2006


Hi Goran,

I am still working on this with Chris and was just this very week going to solicit your input on my current thoughts but managed to concuss myself yesterday!

It is both Chris and my opinion that this functionality is critical to Magma.

The issue with reading multiple indices actually has nothing, I mean nothing, to do with Lava. 
I think the connection has been made bacause a) I am the creator of Lava b) am investigating queries over multiple indices 
and c) the technology will obviously be very useful to Lava.

The project actually arose for a pure Seaside + Magma project.

The state of the problem is that Chris and I have agreed on the client interface which uses a DNU to generate a tree of clauses and have made some early attempts.

e.g.

	query := people
				where: [ :p | p familyName = 'Foo' & p age between: 30 and: 40 ]
				orderBy: [ :p familyName, gender ].

	query do: [ :p | self doSomethingWith: p ].


As for the server side, I made a naive implementation which will not scale but does work. I can forward this to you with instructions if you wish.

Chris then made some conciderable progress on a real implementation. His insight was to add a reverse or 'census' index for each Magma index.

A census index maps oid -> key. With this in place and the fact that a traditional Magma index knows how many oids satisfy a key range (eg. age between: 30 and: 40),
Magma can parse the tree of query  clauses and determine (trivially) which clauses maps to the fewest oids. Each of these oid can then be looked up in the census indices
for all the other indices mentioned in the query and determine which oids will survive.

This solution also preserves the Magma feature where the size of the result set is known in advance of iterating all the results. Chris and I call this feature 'laziness'. 
This is very handy for batched lists in Seaside.

For purely conjunctive queries without search clauses, this solution is complete. Chris made the Monticellos available to me which demonstrate this.

However, the problem is a lot harder for queries with disjunctions and/or sort criteria. Becuase an oid can occur in more than one clause, it is not possible to trivially know 
the size of the result set. Obviously #maximumPossibleSize is known and this may be acceptable in some cases. Further, the object will occur twice when the result set is
enumerated.

The obvious solution is some bitmap index which is constructed by iterating over the results for each clause and anding and or-ing them appropraitely. 
For very large collections where one or more clauses read a large portiion of the keyspace (eg. familyName from: 'a' to 'z'), the solution becomes very, very, slow as the 
results cannot be retrieved a page at a time.

Sort criteria further complicate the problem, as a simple induction proof will show that the entire result set must be interrogated at least once to find the order.

--------------------------------------------
This was the state when Chris and I last spoke. I believe the aforementioned issues define the problem and are therefore unavoidable. However I think/hope that the 
problem can be managed by adding two properties to the solution:

	1. a timeout
	2. a new Magma index defining the position of an oid in each of the other Magma indices (not the census indices).

1. Timeout
--------------
Squeak has valueWithin:onTimeout: and nice Notifications/resumable exceptions.

In addition, the Magma server will know the upper bound on how long it will take to populate the bitmap as we know the number of oids in each term. In addition Chris
showed in this message how a given Magma server get some idea of long it will take.

See
	http://lists.squeakfoundation.org/pipermail/magma/2005-December/000102.html

IMHO the appliations I write are either all batch jobs which have time to burn (within reason) or are interctive applications where the user can be interrogated to refine the 
query or agree to wait the estimated time it will take Magma to prepare the query.

2. A Sorting Index
-----------------------
I think I have a solution to sorting which will be very quick in an interactive application and at least decent in the general sense.

It turns out that a Magma index hold entries which map an oid to more than one key. ie. oid -> { key1, key2, ... }. We can exploit this to add a 'sorting index' to each collection.
The oid of each object in the collection is the key and the value is a collection of integers. Each integer is the position of that oid (object) in the other indices search order.

That being the case, when Magma is constructing the bitmap index, it can hold an odered collection of position -> oid associations. 

Now the cunning part is that we do not have to keep an association for every oid in the result, only the first few pages. This is useful for in interactive app as I have heard it 
said that most folk only look at the first page or so. This is because a Magma index is maintained in sort order, so if we want two pages of 10 results  and the lowest oid was at 
position 512 in the sort index and the 20th was in position 14567, then we know that the 21st must be in position 14568 or higher.

For these latter pages we can can just read the sort index and chech whether each oid is in the bitmap, if it is, then it is the 'next' element.

If this takes to long then we timeout and let the application decide what to do. Smalltalk will let us continue where we left off if we want.

Sample Code:
-----------------

Example 1 - bitmap required:

        | query |
        query := people where: [ :p | p familyName = 'Pinkney' & p age < 60 ] orderBy: [ p age descending ].

        [ query prepareWithin: 10 seconds ]
                on: MaQueryTakesTooLong
                do: [ .... ].

        query do: [ :p | self doSomeThingWith: p ].

Example 2 - no bitmap required as duplicates are allowed

        | query |
        query := people where: [ :p | p familyName between 'Man' and: 'Pinkney' | p familyName = 'Mandela' ].

        [ query allowDuplicates; prepare ]
                on: MaQueryTakesTooLong
                do: [ ... ].

        query size.

Example 3 - bitmap required as duplicates are prohibited:

        | query |
        query := people where: [ :p | p familyName between 'Man' and: 'Pinkney' | p familyName = 'Mandela' ].

        [ query prepare ]
                on: MaQueryTakesTooLong
                do: [ ... ].

        query size.

Example 4 - implicit preparation for convenience, also ignore timeout warnings

        ( people where: [ :p | p familyName between 'Man' and: 'Pinkney' | p familyName = 'Mandela' ] )
                do:
                [ :p |
                self doSomethingWith: p ] ]

Conclusion
--------------
I was going to posh my ideas up but the sore head and Goran's (I cannot get my keyboard to do an umlaut, help) sudden interest has encouraged 
me to sent publish my intentions on the list.

All comments are welcome as I REALLY need this functionality!

Cheers

Brent















More information about the Magma mailing list