MagmaCollectionReader behavior

Wed Apr 1 15:58:22 UTC 2009

Hi again!  I'd just like to say one more thing: in Smalltalk, Strings
do not have any upper-bound size limit, so they are as flexible as
CLOBs in an RDBMS.  The premise of your indexing comparison would seem
to require an RDBMS to index and search entire CLOB values, which I
don't think they can do without application-developer help.  (Please
correct me if I'm wrong, I haven't used an RDBMS in years.)

OTOH, if you define an indexed VARCHAR column, you are specifying an
upper bound on the size of the strings it can index (not to mention,
store!).  To me, this is analogous to specifying a key-size.

If you are willing in your application, to restrict the length of
e-mail addresses to the length of whatever you would make the VARCHAR
column, then you can choose an appropriate key-size to handle that
length and not have to do the post-detect.
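
As a rough way to size the key for a given maximum length, here is a
small workspace sketch.  It assumes (this is an assumption, not taken
from the Magma source) that n characters fit in the key when
alphabetSize^n <= 2^keyBits.  charsFor is a hypothetical helper block,
not a Magma API.  Under that assumption a 26-letter alphabet gives 27
characters for 128 bits and 54 for 256 bits, and 8-bit characters give
4 characters for 32 bits, which matches the figures quoted later in
this thread.

```smalltalk
"Workspace sketch; charsFor is a hypothetical helper, not part of Magma.
 Assumption: n characters fit when alphabetSize^n <= 2^keyBits."
| charsFor |
charsFor := [:keyBits :alphabetSize | | n |
    n := 0.
    [(alphabetSize raisedTo: n + 1) <= (2 raisedTo: keyBits)]
        whileTrue: [n := n + 1].
    n].
charsFor value: 128 value: 26.   "alpha-only alphabet, 128-bit key"
charsFor value: 256 value: 26.   "alpha-only alphabet, 256-bit key"
charsFor value: 32 value: 256.   "8-bit characters, 32-bit key"
```

If the longest string you are willing to accept is no longer than the
number this returns, the key alone should distinguish the values and
the post-detect becomes optional.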

Finally,

> I think that this is a point where the requirements for the developers can
> be a little too much. See it this way, we are only indexing a string. It
> shouldn't be necessary to implement a hash code for converting them to
> numbers.

As you get deeper into it, I hope you'll find the MaByteSequenceIndex
hierarchy included in the base Magma package is flexible enough for
use "out of the box" (but if not, it isn't hard to define your own
custom index type).

But yes, the key point of this whole thread is that a post-detect: is
required to index "CLOB" values (read: Smalltalk Strings), which is
what I would do rather than restricting the user.
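
To make the post-detect concrete, here is a plain-workspace model of
it.  No Magma classes are used: whereLike is a hypothetical stand-in
for a where: query against an index whose key is assumed to be just the
first 4 characters of the string, and the three addresses are the ones
from the original report.

```smalltalk
"Workspace model only; no Magma classes are used.
 Assumption: the index key is simply the first 4 characters."
| emails keyOf whereLike target candidates naive exact |
emails := #('miguel at domain1.com'
            'miguel at domain2.com'
            'miguel.coba at domain3.com').
keyOf := [:aString | aString first: (4 min: aString size)].
whereLike := [:anEmail |
    emails select: [:each | (keyOf value: each) = (keyOf value: anEmail)]].
target := 'miguel at domain2.com'.
"All three addresses share the key 'migu', so all three come back."
candidates := whereLike value: target.
"Taking the first candidate always yields user 1, whatever the target."
naive := candidates first.
"The post-detect compares the full strings and finds the exact match."
exact := candidates
    detect: [:each | each = target]
    ifNone: [nil].
```

The index only narrows the search to the candidates that share a key;
the detect: is what turns that into an exact answer for strings longer
than the key can represent.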

Regards,
  Chris

2009/4/1 Miguel Enrique Cobá Martínez <miguel.coba at gmail.com>:
> Chris Muller wrote:
>
> First, I didn't mean to be rude. I hold Magma in very high regard and it
> is one of the most interesting projects and codebases I have ever learned
> from by reading the code. I just think that the example in the cited wiki
> page can make users expect behavior very different from the actual one.
>
> See it this way. Most developers, including me, have used RDBMSs for
> almost all our professional work. The change to an ODB can be at first very
> appealing and, at the same time, very difficult to grasp.
>
> To the point. When we hear about indexes we, without exception, think about
> RDBMS indexes, which we create for a column in a table. Then, when
> querying the table, if there is only one record with the value that we are
> looking for, the query returns only that record. No need to specify other
> parameters for this to work. This is the way that we expect them to work.
> Call it inertia, if you want.
>
> Now, suppose that one of these developers, think of me again, wants to know
> what this thing about ODBs is, finds the wiki page mentioned, follows it,
> and reads that the collection has an index over the #email instance
> variable. Without knowing the conceptual differences between the indexes in
> an RDBMS and in Magma, he (me) will assume that for a single value in the
> collection, a search for that value will give a single result.
>
> That is not the case, and that is the source of my confusion. I think that
> in the wiki page (which is surely one of the most visited of the
> Magma-related ones) a warning should be given to the developer pointing
> out that:
>
> - he needs to create the index with a greater keySize in order to alleviate
> the problem with the finite key space and the impact that a low keySize has
> on it.
> - he needs to read the index wiki page in more depth to understand the
> features and the differences from RDBMS indexes.
> - he needs to think about queries in a very different way when dealing with
> ODBs than with an RDBMS, and/or he needs to massage or prepare the data to
> index in order to get the correct results (like the query for the 3-part
> email index you propose below)
>
>
>> Hi Miguel, MagmaCollections can absolutely give accurate and exact
>> results and by itself
>
> I don't doubt it, but in this case it depends on the keySize.
>
>> , *can* be used for finding objects.  You *don't* always have to apply
>> some kind of searching over the already reduced collection represented
>> by MagmaCollectionReader in order to find the *exact* match you are
>> trying to locate.
>>
>> The key concept you've stumbled on here is that MagmaCollectionIndexes
>> do only provide a *finite* key space.  But how you decide to utilize
>> that key-space, e.g., convert your objects into an integral key value,
>> as well as the size of the key-space in bits, determines whether
>> duplicate keys will occur or not.
>
> I think that this is a point where the requirements for the developers can
> be a little too much. See it this way, we are only indexing a string. It
> shouldn't be necessary to implement a hash code for converting them to
> numbers. Besides, most of us, or at least me, are not experts in hashing
> techniques, and we can't be certain that the hash is correct or that it
> doesn't collide.
>
>>
>> If you want to use a big, fat, e-mail address as a "unique
>> identifier", you will be better served to use most of a 64 or 128-bit
>> key-space than a small percentage of a 400-bit key-space.  Using only
>> the alpha range:
>>
>>    (MaSearchStringIndex attribute: #email) keySize: 128; beAlpha; yourself
>
> This line is perfect for the wiki page, instead of the current one,
> because it will give the expected results most of the time. Of course the
> note about the limit cases should be present.
>
>>
>> provides 27 meaningful characters, enough for probably 99% of e-mail
>> addresses.  A 256-bit alpha index would provide 54 meaningful
>> characters, but using the post-detect: on a 128-bit index is a better
>> choice, since even that will probably only detect through one element
>> 99% of the time.
>>
>> Please don't say "but alpha does not support the @ or . character."
>> To maximize efficiency, you really need to make your own
>> MagmaEmailIndex subclass which defines its own character map and uses
>> it appropriately and efficiently.
>>
> Same case here. If you are only searching on an attribute, the code should
> be easy. I agree that there will be cases when a subclass of MagmaIndex is
> unavoidable, but not in the common use case.
>
>> Or, another thing you could do is break apart the email into three
>> separate entries and small key-space indexes for all three:
>>
>>  miguel at gmail.com
>>
>> becomes the entries #('miguel' 'gmail' 'com') and index each user at
>> all three.  Then, to find your user you could simply perform an
>> appropriately conjuncted where:
>>
>>  myUsersMagmaCollection where: [ : each | (each first = 'miguel') &
>> (each second = 'gmail') & (each third = 'com') ]
>>
>> There are other solutions to be sure...
>>
>>  - Chris
>
> Thank you very much for your explanation; it has enlightened me a lot and,
> I think, others too.
> Again, your work is amazing and I will be using it without a single
> doubt. So far it has been a pleasure, and I haven't achieved this kind of
> development boost with any other database.
>
> Miguel Cobá
>
>>
>>
>> 2009/3/31 Miguel Enrique Cobá Martínez <miguel.coba at gmail.com>:
>>>
>>> It is not clear from the magmaseaside tutorial, but the code from
>>> http://wiki.squeak.org/squeak/6021:
>>>
>>> initialize
>>>       | users |
>>>       users := MagmaCollection new.
>>>       users addIndex: (MaSearchStringIndex attribute: #email) beAscii.
>>>       self at: #users put: users
>>>
>>> findUserByEmail: anEmail
>>>       ^ (self users where: [ :each | each email equals: anEmail ] ) firstOrNil
>>>
>>> without any doubt suggests that the where: method and the
>>>
>>> each email equals: anEmail
>>>
>>> give an *exact* or *equal* match, but that is not the case.
>>> In fact, the where: send returns a MagmaCollectionReader that stands for
>>> the *set* or *collection* of objects that matched the equals: method in
>>> direct relation with the index created for the MagmaCollection.
>>>
>>> In this example, the index is created with the default (no keySize:
>>> specified) of 32 bits, which gives you merely 4 meaningful characters
>>> when searching for a string, i.e. if you have users with emails like:
>>>
>>> user   email
>>> 1    'miguel at domain1.com'
>>> 2    'miguel at domain2.com'
>>> 3    'miguel.coba at domain3.com'
>>>
>>> a message send like:
>>>
>>> findUserByEmail: 'miguel at domain1.com'
>>>
>>> will give you a MagmaReader that represents the 3 users in the database,
>>> because they all share the same 4 initial characters. After that, the
>>> firstOrNil message ensures that user #1 will *always* be returned, no
>>> matter what argument you pass to findUserByEmail:. So, the answer from
>>>
>>> findUserByEmail: 'miguel at domain1.com'
>>> findUserByEmail: 'miguel at domain2.com'
>>> findUserByEmail: 'miguel.coba at domain3.com'
>>>
>>> will always be user #1.
>>>
>>> In summary, the method doesn't behave correctly, because it can't be
>>> used for finding a specific user, which is the intended action.
>>>
>>> After reading the Index documentation from the magma site, it was clear
>>> that this MagmaCollectionReader can't give accurate and exact results
>>> and by itself it can't be used for finding objects. You *always* have to
>>> apply some kind of searching over the already reduced collection
>>> represented by MagmaCollectionReader in order to find the *exact* match
>>> you are trying to locate.
>>>
>>> So the code should be something like:
>>>
>>> findUserByEmail: anEmail
>>>
>>>  | user |
>>>       "Here you are working over the entire magma repo"
>>>  user := (self users where: [:each | each email equals: anEmail])
>>>          "Here you are working over the reduced set
>>>           returned by the where and represented by a
>>>           MagmaCollectionReader"
>>>             detect: [:each |
>>>               "Here you are working on a plain Collection"
>>>               each email = anEmail ]
>>>             ifNone: [nil].
>>>       ^ user
>>>
>>>
>>> After changing the code this way, the example can correctly find the
>>> users with emails 'miguel at domain1.com', 'miguel at domain2.com' and
>>> 'miguel.coba at domain3.com'.
>>>
>>> Can someone confirm that this is the correct way to use a
>>> MagmaCollectionReader?
>>>
>>> P.S. I tried a larger keySize: at index creation (I even tried 400
>>> bits), but this only postponed the point where the string matching
>>> stops working. Also, it is not efficient, and with 400, Squeak throws
>>> an error. So that was not the way to go.
>>>
>>> Thanks for your comments,
>>> Miguel Cobá