Hi,
This email exchange about indexing is really worthwhile for all of us who use Magma and are unsure about the various implications of indexes. The discussion is also archived, so it could be of good use to future Magma users.
To be sure we are talking about the same things (please excuse any misconceptions from a non-native English reader), I would like to reword the issues to check that I understand them correctly.
Of course, no problemo!
By two duplicate keys, do you mean two different index objects (in my example lexical term objects) producing the same hash value from indexHashForIndexObject: ?
Yes, that would be an example of a duplicate key. But hopefully, if you have two different lexical term objects with different values, your custom index definition would cause them to produce *different* hash values (i.e., different outputs from indexHashForIndexObject:).
Or do you mean two different objects indexed with an identical object? If so, why is it a problem? (I see it as a desirable feature.) And then I don't see how to avoid it.
This would be a duplicate key in the index as well. If, as you say, two different objects are indexed with an identical object, then there is, by definition, ambiguity between those two objects w.r.t. that indexed attribute.
The same thing would happen if your "index" was just a simple Dictionary of OrderedCollections.
(myIndex at: Color blue ifAbsentPut: [ OrderedCollection new ]) add: myObject
"Color blue" is the "index object" in this case, but as the OC at that index position grows and grows you can see the linear degradation.
Now, when executing a where: query, Magma will always search the smallest key-range of the conjoined clauses automatically. But for *adding* new "blue" objects, the end of the OC must first be "found" before the new object can be appended, and after 1M duplicates that can really begin to degrade.
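To make the Dictionary analogy a bit more concrete, here is a small workspace sketch in plain Squeak. It is only an illustration: no Magma is involved, the variable names are made up, and it says nothing about how Magma actually lays out an index. It just shows how a two-valued key piles every entry into two long buckets of duplicates.

```smalltalk
"Workspace sketch: a plain Dictionary standing in for an index.
 Not Magma code; 'index' and 'key' are illustrative names only."
| index |
index := Dictionary new.
1 to: 1000 do: [ :i |
    | key |
    "Only two possible keys, like a yes/no attribute."
    key := i odd ifTrue: [ 'yes' ] ifFalse: [ 'no' ].
    (index at: key ifAbsentPut: [ OrderedCollection new ]) add: i ].
"All 1000 entries end up under just two keys."
index keys size.          "2"
(index at: 'yes') size.   "500"
```

With many distinct keys the same 1000 entries would be spread over many short buckets, which is the situation an index handles well.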
So if I understand correctly, it is a problem when the range of index values is small relative to the size of the indexed object population. Frankly, I think I do have this exact problem with indexes defined over a limited set of words. The objects I am indexing are LOM metadata of pedagogical resources. The LOM standard comes with attributes whose values are drawn from a fixed vocabulary. For example, 'cost' is a two-valued attribute ('yes' or 'no'). In your scenario it is a worst case, but from the user's point of view it is interesting to query only commercial (value 'yes') resources.
Yes, it certainly is interesting and useful to be able to query those things, and it *is* supported, just that it may eventually pose a scaling problem if 1M objects are all indexed at the hash value 'yes'...
An alternative would be to put the cost='yes' objects into one Collection and the cost='no' objects into another. I have been asked by someone else for Magma to do this transparently to the Magma user (you are not the first to try to index on one of two values!), but other priorities have taken precedence thus far.
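As a rough sketch of that alternative, again in plain Squeak rather than Magma: the associations below are hypothetical stand-ins (title -> cost) for real LOM resource objects, and the two OrderedCollections stand in for whatever two persistent collections you would actually keep in the repository.

```smalltalk
"Sketch only: partition by the two-valued attribute instead of indexing on it.
 Each association is title -> cost, a made-up stand-in for a resource object."
| resources commercial free |
resources := { 'Course A' -> 'yes'. 'Course B' -> 'no'. 'Course C' -> 'yes' }.
commercial := OrderedCollection new.
free := OrderedCollection new.
resources do: [ :each |
    "Pick the target collection from the cost value, then add to it."
    (each value = 'yes'
        ifTrue: [ commercial ]
        ifFalse: [ free ]) add: each ].
commercial size.   "2"
free size.         "1"
```

A query for commercial resources then starts from the 'commercial' collection directly, so no index entry for 'yes' is needed at all.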
Is it a dead end?
I don't think so, but just be aware of it and of the volume of duplicates you will have: a few is fine; millions, or even tens of thousands, generally are not.
Regards, Chris