Fwd: Undirect index update

Wed May 20 14:58:23 UTC 2009

Hi,

> This email exchange about indexing is really worthwhile for all of us
> using Magma and unsure about the various implications with indexes, not
> to mention the discussion is archived and could be of good use for any
> new future user of Magma.
>
> To be sure we talk about the same things (excuse the misconceptions from
> a non native English reader), I would like to reword to be sure I
> understand correctly the issues.

Of course, no problemo!

> By two duplicate keys, do you mean two different index objects (in my
> example lexical term objects) producing the same hash value from
> indexHashForIndexObject: ?

Yes, that would be an example of a duplicate key.  But hopefully, if
you have two different lexical term objects with different values,
your custom index definition would cause them to produce *different*
hash values (i.e., the output from indexHashForIndexObject:).

> Or do you mean two different objects indexed with an identical object?
> If so, why is it a problem? (I see it as a wished feature)
> And then I don't see how to avoid it.

This would be a duplicate key in the index as well.  If, as you say,
"two objects are indexed with an identical object" then there is, by
definition, ambiguity between those two objects w.r.t. that indexed
attribute.

The same thing would happen if your "index" was just a simple
Dictionary of OrderedCollections.

  (myIndex
    at: Color blue
    ifAbsentPut: [ OrderedCollection new ]) add: myObject

"Color blue" is the "index object" in this case, but as the OC at that
index position grows and grows, you can see the linear degradation.

Now, when executing a where: query, Magma will automatically search
the smallest key-range among the conjoined clauses.  But for *adding*
new "blue" objects, the end of the OC must be "found" and added to,
which, after 1M duplicates, can really begin to degrade.

> So if I understand correctly, it is a problem when the index range is
> limited in regard to the indexed object population extend.
> Frankly I think I do have this exact problem with indexes defined over a
> limited set of words.
> The object I am indexing are LOM metadata of pedagogical resources.
> The LOM norm comes with attributes defined from a fixed set of
> vocabulary. For example 'cost' is a two values attribute ('yes' or
> 'no'). In your scenario it is a worst case, but from the user point of
> view it is interesting to query only commercial (value 'yes') resources.

Yes, it certainly is interesting and useful to be able to query those
things, and it *is* supported, just that it may eventually pose a
scaling problem if 1M objects are all indexed at the hash value
'yes'...

An alternative would be to put the cost='yes' objects into one
Collection and the cost='no' objects into another.  I have been asked
by someone else for Magma to do this transparently to the Magma user
(you are not the first to try to index on one of two values!), but
other priorities have taken precedence thus far.
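
As a rough, untested sketch of doing that split by hand: the variable
names and the #cost accessor below are only assumptions about your
model, not anything Magma provides.  Each collection could still
carry whatever other indexes you need, and both would have to be
reachable from your repository root like any other persistent object.

  commercialResources := MagmaCollection new.
  freeResources := MagmaCollection new.

  "Route each resource by its cost value instead of indexing on it.
   #cost is a hypothetical accessor on your LOM metadata object."
  (aResource cost = 'yes'
    ifTrue: [ commercialResources ]
    ifFalse: [ freeResources ]) add: aResource

Querying only the commercial resources then means starting from
commercialResources, with no two-valued index involved at all.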

> Is it a dead end?

I don't think so, but just be aware of it and of the volume of
duplicates you will have.  A few are fine; millions, or even tens of
thousands, generally are not.
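
If it helps to gauge that volume before defining an index, something
along these lines would count how many objects share each key.  This
is only a sketch using a plain Squeak Bag; "resources" and #cost are
placeholders for your own collection and attribute.

  counts := Bag new.

  "Tally one occurrence per resource under its would-be index key."
  resources do: [ :each | counts add: each cost ].

  "Number of duplicates that would pile up under the key 'yes'."
  counts occurrencesOf: 'yes'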

Regards,
  Chris