MagmaCollectionReader behavior

Miguel Enrique Cobá Martínez miguel.coba at gmail.com
Wed Apr 1 06:06:23 UTC 2009


Chris Muller wrote:

First, I didn't mean to be rude. I hold Magma in very high regard; it is
one of the most interesting projects and codebases I have ever learned
from by reading the code. I just think that the example on the cited
wiki page can lead users to expect behavior quite different from the
actual one.

See it this way. Most developers, including me, have used RDBMSs for
almost all of our professional work. The change to an ODB can be very
appealing at first and, at the same time, very difficult to grasp.

To the point. When we hear about indexes we, without exception, think
of RDBMS indexes: we create one for a column in a table and then, when
querying the table, if there is only one record with the value we are
looking for, the query returns only that record. There is no need to
specify other parameters for this to work. This is the way we expect
them to work. Call it inertia, if you want.

Now, suppose that one of these developers (think of me again) wants to
know what this ODB thing is about, finds the wiki page mentioned,
follows it and reads that the collection has an index over the #email
instance variable. Without knowing the conceptual differences between
indexes in RDBMSs and in Magma, he (me) will assume that, for a value
that occurs only once in the collection, a search for that value will
give a single result.

That is not the case, and that is the source of my confusion. I think
that the wiki page (which is surely one of the most visited of the
Magma-related ones) should warn the developer that:

- he needs to create the index with a larger keySize to alleviate the
problem of the finite key space, which a low keySize makes worse.
- he needs to read the index wiki page in more depth to understand the
features and the differences from RDBMS indexes.
- he needs to think about queries in a very different way when dealing
with ODBs than with RDBMSs, and/or he needs to massage or prepare the
data to index in order to get correct results (like the query for the
3-part email index you propose below)
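
To make the finite key space concrete, here is a small workspace sketch.
It uses no Magma classes at all: keyFor is a hypothetical stand-in that
keeps only the first 4 characters of a string (what 32 bits hold at 8
bits per character), and the Dictionary is only my model of an index,
not a description of how Magma stores anything.

```smalltalk
"Model only: plain Squeak collections, no Magma classes involved."
| keyFor emails index candidates exact |

"Hypothetical key function: keep the first 4 characters
 (32 bits / 8 bits per character)."
keyFor := [:string | string first: (4 min: string size)].

emails := #('miguel@domain1.com' 'miguel@domain2.com' 'miguel.coba@domain3.com').

"Build the model index: truncated key -> every email that maps to it."
index := Dictionary new.
emails do: [:each |
    (index at: (keyFor value: each) ifAbsentPut: [OrderedCollection new])
        add: each].

"A lookup by key answers every candidate that shares the key..."
candidates := index at: (keyFor value: 'miguel@domain2.com') ifAbsent: [#()].

"...so an exact comparison is still needed to pick the right one."
exact := candidates
    detect: [:each | each = 'miguel@domain2.com']
    ifNone: [nil].
```

In this model, candidates holds all three emails because they all start
with 'migu', and candidates first is the domain1 address, the same wrong
answer that firstOrNil produces in the wiki example. Only exact is the
domain2 address that was asked for.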


> Hi Miguel, MagmaCollections can absolutely give accurate and exact
> results and by itself

I don't doubt it, but in this case it depends on the keySize.

> , *can* be used for finding objects.  You *don't* always have to apply
> some kind of searching over the already reduced collection represented
> by MagmaCollectionReader in order to find the *exact* match you are
> trying to locate.
>
> The key concept you've stumbled on here is that MagmaCollectionIndexes
> do only provide a *finite* key space.  But how you decide to utilize
> that key-space, e.g., convert your objects into an integral key value,
> as well as the size of the key-space in bits, determines whether
> duplicate keys will occur or not.

I think this is a point where the requirements on developers can be a
little too much. See it this way: we are only indexing a string. It
shouldn't be necessary to implement a hash function to convert strings
to numbers. Besides, most of us, or at least me, are not experts in
hashing techniques, and we certainly can't be sure that the hash is
correct or that it doesn't produce clashes.

> 
> If you want to use a big, fat, e-mail address as a "unique
> identifier", you will be better served to use most of a 64 or 128-bit
> key-space than a small percentage of a 400-bit key-space.  Using only
> the alpha range:
> 
>     (MaSearchStringIndex attribute: #email) keySize: 128; beAlpha; yourself

This line is perfect for the wiki page, in place of the current one,
because it will give the expected results most of the time. Of course,
a note about the limit cases should be included.
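
For the wiki page, something like the following could replace the
current example. It is only a sketch: it combines the index line you
give above with the detect:ifNone: filter from my first message, uses
no selectors beyond those already shown in this thread, and assumes (as
the wiki example does) that the receiver answers the collection from
self users. It needs a Magma image to run.

```smalltalk
initialize
    | users |
    users := MagmaCollection new.
    "128-bit alpha key, as suggested above, instead of the default key size."
    users addIndex:
        ((MaSearchStringIndex attribute: #email) keySize: 128; beAlpha; yourself).
    self at: #users put: users

findUserByEmail: anEmail
    "The index narrows the candidates; detect:ifNone: then keeps
     only the user whose email matches exactly."
    ^ (self users where: [:each | each email equals: anEmail])
        detect: [:each | each email = anEmail]
        ifNone: [nil]
```

The larger key should make duplicate keys rare, and the final exact
comparison covers the limit cases where two addresses still share a key.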

> 
> provides 27 meaningful characters, enough for probably 99% of e-mail
> addresses.  A 256-bit alpha index would provide 54 meaningful
> characters, but using the post-detect: on a 128-bit index is a better
> choice, since even that will probably only detect through one element
> 99% of the time.
> 
> Please don't say "but alpha does not support the @ or . character."
> To maximize efficiency, you really need to make your own
> MagmaEmailIndex subclass which defines its own character map and uses
> it appropriately and efficiently.
> 
Same case here. If you are only searching on an attribute, the code
should be easy. I agree that there will be cases where a subclass of
MagmaIndex is unavoidable, but not in the common use case.

> Or, another thing you could do is break apart the email into three
> separate entries and small key-space indexes for all three:
> 
>   miguel at gmail.com
> 
> becomes the entries #('miguel' 'gmail' 'com') and index each user at
> all three.  Then, to find your user you could simply perform an
> appropriately conjuncted where:
> 
>   myUsersMagmaCollection where: [ : each | (each first = 'miguel') &
> (each second = 'gmail') & (each third = 'com') ]
> 
> There are other solutions to be sure...
> 
>   - Chris
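
A quick workspace check of how an address would split into those three
entries. findTokens: is plain Squeak String protocol, not part of Magma,
and using '@' and '.' as the delimiters is my assumption about how the
split would be done:

```smalltalk
| parts dotted |

"Split on either delimiter character."
parts := 'miguel@gmail.com' findTokens: '@.'.
"parts holds 'miguel', 'gmail' and 'com', in that order."

"An address with a dot in the local part yields four tokens, so a
 three-index scheme would need a rule for which pieces to keep."
dotted := 'miguel.coba@gmail.com' findTokens: '@.'.
```

So the split itself is easy, but addresses like my own show that the
data has to be prepared with some care before it is indexed this way.
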
Thank you very much for your explanation. It has enlightened me a lot,
and I think it will do the same for others.
Again, your work is amazing and I will be using it without a single
doubt. So far it has been a pleasure, and I haven't had this kind of
development boost with any other database.

Miguel Cobá

> 
> 
> 2009/3/31 Miguel Enrique Cobá Martínez <miguel.coba at gmail.com>:
>> It is not clear from the magmaseaside tutorial, but the code from
>> http://wiki.squeak.org/squeak/6021:
>>
>> initialize
>>        | users |
>>        users := MagmaCollection new.
>>        users addIndex: (MaSearchStringIndex attribute: #email) beAscii.
>>       self at: #users put: users
>>
>> findUserByEmail: anEmail
>>        ^ (self users where: [ :each | each email equals: anEmail ] )
>> firstOrNil
>>
>> without any doubt suggests that the where: method and the
>>
>> each email equals: anEmail
>>
>> give an *exact* or *equal* match, but that is not the case.
>> In fact, the where: send returns a MagmaCollectionReader that stands for the
>> *set* or *collection* of objects that matched the equals: method in direct
>> relation with the index created for the MagmaCollection.
>>
>> In this example, the index is created with the default key size (no keySize:
>> specified) of 32 bits, which gives you merely 4 meaningful characters when
>> searching for a string, i.e. if you have users with emails like:
>>
>> user   email
>> 1    'miguel at domain1.com'
>> 2    'miguel at domain2.com'
>> 3    'miguel.coba at domain3.com'
>>
>> a message send like:
>>
>> findUserByEmail: 'miguel at domain1.com'
>>
>> will give you a MagmaCollectionReader that represents the 3 users in the
>> database, because they all share the same 4 initial characters. After that,
>> the firstOrNil message ensures that user #1 will *always* be returned, no
>> matter what argument you pass to findUserByEmail:. So, the answer from
>>
>> findUserByEmail: 'miguel at domain1.com'
>> findUserByEmail: 'miguel at domain2.com'
>> findUserByEmail: 'miguel.coba at domain3.com'
>>
>> will always be user #1.
>>
>> In summary, the method doesn't behave correctly, because it can't be used
>> to find a specific user, which is its intended purpose.
>>
>> After reading the Index documentation from the magma site, it was clear that
>> this MagmaCollectionReader can't give accurate and exact results and by
>> itself it can't be used for finding objects. You *always* have to apply some
>> kind of searching over the already reduced collection represented by
>> MagmaCollectionReader in order to find the *exact* match you are trying to
>> locate.
>>
>> So the code should be something like:
>>
>> findUserByEmail: anEmail
>>
>>  | user |
>>        "Here you are working over the entire magma repo"
>>  user := (self users where: [:each | each email equals: anEmail])
>>           "Here you are working over the reduced set
>>            returned by the where and represented by a
>>            MagmaCollectionReader"
>>              detect: [:each |
>>                "Here you are working on a plain Collection"
>>                each email = anEmail ]
>>              ifNone: [nil].
>>        ^ user
>>
>>
>> After changing the code this way, the example can correctly find the users
>> with emails 'miguel at domain1.com', 'miguel at domain2.com' and
>> 'miguel.coba at domain3.com'.
>>
>> Can someone confirm that this is the correct way to use a
>> MagmaCollectionReader?
>>
>> P.S. I tried a larger keySize: at index creation (I even tried 400 bits)
>> but this only postponed the point where the string matching stops working.
>> Also, it is not efficient, and with 400 Squeak throws an error.
>> So that was not the way to go.
>>
>> Thanks for your comments,
>> Miguel Cobá
> 