Searching magma string indices

Brent Pinkney brent at zamail.co.za
Tue Mar 27 23:19:15 UTC 2007


Hi Chris,

The last time we spoke we were trying to support the equivalent of String >> #match: in Magma. This corresponds to the 'like' expression in Lava/SQL.

We made the following observations. After them I describe the problem I think we will have, then a possible solution, and finally my open issues.

1. Proper Prefixes
-------------------
Expressions of the sort: where: [ :p | p familyName match:  'foo*' ] can be supported by the existing MaSearchStringIndex by mapping the clause to 

	(familyName from: 'foo' to: 'foo' maAlphabeticalNext).
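
A minimal sketch of this mapping, written in Python rather than Smalltalk so it can be run without a Magma image. It assumes maAlphabeticalNext answers the string with its last character incremented ('foo' -> 'fop') and that the upper bound is exclusive; alphabetical_next and prefix_match are hypothetical names, not Magma API.

```python
import bisect

def alphabetical_next(s):
    # Assumed stand-in for maAlphabeticalNext: bump the last character by one.
    return s[:-1] + chr(ord(s[-1]) + 1)

def prefix_match(sorted_keys, prefix):
    # Every key k with prefix <= k < alphabetical_next(prefix) starts with prefix.
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, alphabetical_next(prefix))
    return sorted_keys[lo:hi]

names = sorted(['bar', 'fon', 'foo', 'foobar', 'fop', 'football'])
matches = prefix_match(names, 'foo')  # the 'foo*' query
print(matches)
```

The point is only that a prefix pattern becomes a single range scan over sorted keys.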

2. Proper Suffixes
-------------------
Expressions of the sort: where: [ :p | p familyName match:  '*bar' ] can be supported by a subclass of MaSearchStringIndex, which adds two hashes for each string in the MagmaCollection: the normal hash for the string, and an additional one for the string reversed. The query would be mapped to 

	(familyName from: 'bar' reversed to: 'bar' reversed maAlphabeticalNext).
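
A rough Python sketch of the reversed-string idea for a pattern such as '*bar'. The names suffix_match and alphabetical_next are hypothetical, and alphabetical_next is only an assumed stand-in for maAlphabeticalNext.

```python
import bisect

def alphabetical_next(s):
    # Assumed stand-in for maAlphabeticalNext: bump the last character by one.
    return s[:-1] + chr(ord(s[-1]) + 1)

def suffix_match(names, suffix):
    # Index the reversed strings, so a suffix test becomes a prefix range scan.
    reversed_index = sorted((n[::-1], n) for n in names)
    keys = [key for key, _ in reversed_index]
    low_key = suffix[::-1]
    lo = bisect.bisect_left(keys, low_key)
    hi = bisect.bisect_left(keys, alphabetical_next(low_key))
    return sorted(original for _, original in reversed_index[lo:hi])

hits = suffix_match(['foobar', 'crowbar', 'barfoo', 'bart'], 'bar')
print(hits)
```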

3. Complex Wildcards
------------------------
Expressions of the sort: where: [ :p | p familyName match:  '*foo*bar*' ] can be supported by adding hashes for each suffix of the entry (including the entry itself).
e.g.
	To hash 'foo to bart' we would add hashes for:
		'foo to bart' 
		'oo to bart' 
		'o to bart' 
		' to bart' 
		'to bart' 
		'o bart' 
		' bart' 
		'bart' 
		'art' 
		'rt' 
		't' 

The query '*foo*bar*' would be transformed into the intersection of:

	(familyName from: 'foo' to: 'foo' maAlphabeticalNext)
	& (familyName from: 'bar' to: 'bar' maAlphabeticalNext)
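
A small Python sketch of the suffix entries and the intersection, under the assumption that each suffix is stored as a key pointing back at the original string. All function names are hypothetical, and the linear scan stands in for a real range query.

```python
from fnmatch import fnmatchcase

def suffixes(s):
    # 'bart' -> ['bart', 'art', 'rt', 't']
    return [s[i:] for i in range(len(s))]

def build_suffix_index(names):
    # One (suffix, original) entry per suffix of every name.
    return sorted((suf, n) for n in names for suf in suffixes(n))

def containing(index, fragment):
    # A suffix that starts with the fragment means the original contains it.
    return {n for suf, n in index if suf.startswith(fragment)}

index = build_suffix_index(['foo to bart', 'bar to foo', 'foo only'])
candidates = containing(index, 'foo') & containing(index, 'bar')
# The intersection ignores order, so re-check the candidates against the pattern.
matches = sorted(n for n in candidates if fnmatchcase(n, '*foo*bar*'))
print(sorted(candidates), matches)
```

In this sketch the intersection only yields candidates: 'bar to foo' contains both fragments but in the wrong order, so a final filter on the pattern is still needed.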

4. Single Character Wildcards
---------------------------------
Expressions of the sort: where: [ :p | p familyName match:  'foo#bar' ] can be supported by adding hashes which are a function of each character's value and position in the string.
e.g.
	To hash 'foo' we would add hashes for:
		(256 * 1) + $f asciiValue
		(256 * 2) + $o asciiValue
		(256 * 3) + $o asciiValue

Whenever a single-character-matching string is specified, every character other than the $# characters becomes a conjunction in the query.

The query 'foo#bar' would be transformed into the intersection of

	(familyName = ((256*1) + $f asciiValue))
	& (familyName = ((256*2) + $o asciiValue))
	& (familyName = ((256*3) + $o asciiValue))
	& (familyName = ((256*5) + $b asciiValue))
	& (familyName = ((256*6) + $a asciiValue))
	& (familyName = ((256*7) + $r asciiValue))
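
A Python sketch of the positional hashes for a pattern such as 'foo#bar', assuming 1-based positions and the (256 * position) + character value formula. The function names are hypothetical.

```python
def positional_hashes(s):
    # One hash per character: (256 * position) + character code, 1-based.
    return {256 * pos + ord(ch) for pos, ch in enumerate(s, start=1)}

def query_hashes(pattern):
    # '#' matches any single character, so it contributes no hash.
    return {256 * pos + ord(ch)
            for pos, ch in enumerate(pattern, start=1) if ch != '#'}

def candidates(names, pattern):
    wanted = query_hashes(pattern)
    # A name qualifies when it carries every hash the pattern requires.
    return [n for n in names if wanted <= positional_hashes(n)]

hits = candidates(['fooxbar', 'foo-bar', 'fooxbaz', 'fooxbarn'], 'foo#bar')
print(hits)
```

Note that 'fooxbarn' is still a candidate here: the hashes do not encode the string's length, so a length check (or a final match: filter) would have to follow.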


THE PROBLEM
------------------
The naive solution to this is to create a new subclass of MaSearchStringIndex (say MaMatchingStringIndex) which adds hashes for all the scenarios above into one MaHashIndex.

If the collection with this index has two elements 'footobar' and 'bartobaz' then the expression 'bar*' will incorrectly match BOTH elements!

There will be a hash entry for the suffix 'bar' --> 'footobar' from 3) above, in addition to the normal hash entry for 'bartobaz'.
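
The collision can be reproduced with a small Python sketch, assuming whole-string keys and suffix keys share one sorted index. The names are hypothetical and the linear scan stands in for a range query.

```python
def suffixes(s):
    return [s[i:] for i in range(len(s))]

def build_combined_index(names):
    # One shared index: a whole-string key plus a key for every suffix.
    entries = set()
    for n in names:
        entries.add((n, n))
        for suf in suffixes(n):
            entries.add((suf, n))
    return sorted(entries)

def prefix_query(index, prefix):
    # Stand-in for the range scan used for 'bar*'.
    return sorted({n for key, n in index if key.startswith(prefix)})

index = build_combined_index(['footobar', 'bartobaz'])
hits = prefix_query(index, 'bar')
print(hits)  # both names come back, although only 'bartobaz' starts with 'bar'
```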



THE SOLUTION
------------------
My current solution is for the new MaMatchingStringIndex class to add THREE indexes to a collection, instead of the single index one would expect from an expression such as:

	myCollection addIndex: (MaMatchingStringIndex new attribute: #familyName) ....

The first index would add two hashes for each string in the collection: one for the string and one for the reversed string. This would support 1) and 2) above.
The second index would add hashes for the complex wildcard patterns in 3) above.
The third index would add hashes for the single character wildcard patterns in 4) above.
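
A Python sketch of why separate indexes avoid the false match, covering only the first two indexes (the positional one is left out for brevity). The names and the tuple layout are assumptions for illustration, not Magma's structures.

```python
def suffixes(s):
    return [s[i:] for i in range(len(s))]

def build_indexes(names):
    # Index 1: the string and its reversal (cases 1 and 2).
    ends = sorted({(n, n) for n in names} | {(n[::-1], n) for n in names})
    # Index 2: every suffix of the string (case 3).
    subs = sorted({(suf, n) for n in names for suf in suffixes(n)})
    return ends, subs

def range_query(index, prefix):
    # Stand-in for a from:to: range scan over one index.
    return sorted({n for key, n in index if key.startswith(prefix)})

ends, subs = build_indexes(['footobar', 'bartobaz'])
prefix_hits = range_query(ends, 'bar')    # 'bar*' uses the first index only
contains_hits = range_query(subs, 'bar')  # '*bar*' uses the second index only
print(prefix_hits, contains_hits)
```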

ISSUES
---------
1) Is there a simpler solution I am missing?
2) If not, Magma would have to change #addIndex: to double dispatch back to the MaCollectionIndex so it can add itself (or three indexes) to the collection.
3) Are there issues with the -keys indexes that Magma adds by itself for each index?
4) Is this stuff you would consider adding to Magma or is it for Lava only?


If I am reincarnated, I am going to do something easy like brain surgery.


Brent