Chris,
This email exchange about indexing is really worthwhile for all of us using Magma and unsure about the various implications with indexes, not to mention the discussion is archived and could be of good use for any new future user of Magma.
To be sure we talk about the same things (excuse the misconceptions from a non native English reader), I would like to reword to be sure I understand correctly the issues.
Le mardi 19 mai 2009 à 17:32 -0500, Chris Muller a écrit :
I only meant to caution about the limited scale associated with duplicate keys in a MaHashIndexes. As duplicate keys accumulate it's a worst-case, O(n) linear search for adds, removes AND reads.
By two duplicate keys, do you mean two different index objects (in my example lexical term objects) producing the same hash value from indexHashForIndexObject: ?
Or do you mean two different objects indexed with an identical object? If so, why is it a problem? (I see it as a wished feature) And then I don't see how to avoid it.
It may be tempting to add index on the #color of millions of objects. Which might be worthwhile there are 24-million possible colors, but if only 8 of those colors are used then, as one-million objects are added to that collection, the minimum number of collisions at each key is 1M / 8.. Access to this index is now slow and will continue to worsen as more duplicates are added..
So if I understand correctly, it is a problem when the index range is limited in regard to the indexed object population extend. Frankly I think I do have this exact problem with indexes defined over a limit set of word. The object I am indexing are LOM metadata of pedagogical resources. The LOM norm comes with attributes defined from a fixed set of vocabulary. For example 'cost' is a two values attribute ('yes' or 'no'). In your scenario it is a worst case, but from the user point of view it is interesting to query only commercial (value 'yes') resources. Is it a dead end?
Best regards,
Hilaire