Finding and indexing 'similar' string

Julian Fitzell julian at beta4.com
Tue Aug 26 16:54:54 UTC 2003


Jim Menard wrote:
> Chris,
> 
> On Tuesday, August 26, 2003, at 12:01  PM, Chris Muller wrote:
> 
>> Jim Menard wrote:
>>
>>> How about using the Soundex algorithm? A quick Google search found this
>>> brief explanation <http://www.frontiernet.net/~rjacob/soundex.htm>
>>
>>
>> Ohhh!  Thank you Jim!  What a simple, well-explained method for a
>> sounds-like index.  This would be a great new index type for
>> MagmaCollections.
>>
>> Do you know whether it works for other keywords?  Or just surnames?  I
>> would think it would, since some people's surnames are regular words
>> anyway.
> 
> 
> It works for any words because it is based on how they sound. I have 
> read about one problem with the algorithm, though: you need different 
> sets of characters and weightings for different languages. For example, 
> I think you would want "j" and "h" to map to the same sound in Mexican 
> Spanish. (Forgive me if that's a bad example. The only Spanish I've ever 
> learned was "May I have another beer, please?" and "Where is the 
> bathroom?")
> 
> Jim

The other problem with it, as I recall, is that the first letter 
needs to be the same.  So a name/word that starts with 'ph' won't ever 
match a word that starts with 'f', for example, even if they sound the 
same.  Other than that, though, it works great: we used it for a sales 
system and it allowed users to stop asking people to spell their names 
over the phone.  I've tried typing in every convoluted spelling of my 
name I can think of and it always finds me :)
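
For reference, here is a minimal sketch of the classic American (English) 
Soundex rules in Python.  It is not the code from the sales system 
mentioned above; the names soundex and CODES are purely illustrative, and 
the letter table is the usual English one, so another language would 
probably want a different table.  It shows both the matching of variant 
spellings and the first-letter limitation:

```python
# Illustrative sketch of American Soundex; names here are hypothetical.
# Letters that sound alike in English share a digit.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character Soundex code for word ('' if it has no letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()          # the first letter is always kept as-is
    prev = CODES.get(first, "")     # its code still suppresses a following duplicate
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code:
            if code != prev:        # collapse runs of the same code
                result += code
            prev = code
        elif ch not in "hw":        # vowels (and y) break a run; h and w do not
            prev = ""
        if len(result) == 4:
            break
    return result.ljust(4, "0")     # pad short codes with zeros


print(soundex("Fitzell"), soundex("Fitsel"))   # F324 F324
print(soundex("Phitzell"))                     # P324
```

The two 'F' spellings collide as intended, but the 'Ph' spelling lands 
under a different code only because its first letter differs, which is 
the limitation described above.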

Julian



More information about the Squeak-dev mailing list