[squeak-dev] [slightly OT] Searching List of items to find abbreviations

David Zmick dz0004455 at gmail.com
Sat Jul 11 03:40:37 UTC 2009


I am doing this in Python, but I would like your advice on it.

I am given a list like:
1: JCT BOX JS27 T 88-F-1
2: JCT BOX JS-2713 EE-15.2
3: JCT BOX JS32 H 116 A C 14.5
4: JCT BOX JS28 T 120-N-11
5: JCT BOX JS28 T-120-N-11
6: JCT BOX JS32 H 116 A C-14.5
7: JUNCTION BOX JS32 H 116 A C-14.5
and I need to find "similar" items. I have already written the part of my
script that finds them; for example, lines 4 and 5 are similar. I
used a simple regular expression generated from each line.

So say your line was:
ASD 123
the regex would be
A[!-/\s]?S[!-/\s]?D[!-/\s]?1[!-/\s]?2[!-/\s]?3[!-/\s]?
This finds anything that differs from the line only in punctuation or spacing.
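
A minimal sketch of how such a pattern could be generated, for illustration only. The names SEP, punctuation_pattern and similar_by_punctuation are hypothetical, and matching the whole candidate line with re.fullmatch is an assumption, not necessarily what the script does:

```python
import re

# Optional separator: one ASCII character in the range '!'..'/' or one whitespace.
SEP = r"[!-/\s]?"

def punctuation_pattern(line):
    """Build a regex that tolerates one optional separator after each character."""
    # Drop separators already in the line, so "ASD 123" and "ASD-123"
    # produce the same pattern.
    core = re.sub(r"[!-/\s]", "", line)
    return "".join(re.escape(ch) + SEP for ch in core)

def similar_by_punctuation(a, b):
    """True if b matches the pattern generated from a (assumed: whole-line match)."""
    return re.fullmatch(punctuation_pattern(a), b) is not None

print(punctuation_pattern("ASD 123"))
print(similar_by_punctuation("JCT BOX JS28 T 120-N-11",
                             "JCT BOX JS28 T-120-N-11"))
```

Note that the pattern allows at most one separator between characters, so a line with two consecutive spaces or a " - " sequence would not match.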

The next step is to find lines that are similar based on abbreviations, so that I
can match lines with JUNCTION and JCT, then check the results
from that match against the results from the first match and find the most
likely candidates for similarity. I have tried this:

Use a regular expression built from the letters in an abbreviation; e.g., JCT
would look like J.*[C]?.*[T]? so that the expression would find anything
that had those letters in it in that order, with anything in between. But
that does not work. Any ideas?
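
One likely reason the pattern above fails: every letter after the J is optional and .* matches anything, so J.*[C]?.*[T]? matches any text that merely contains a J. A possible sketch, not a definitive fix, is to require every letter and anchor the match at the start of the word. The names abbreviation_pattern, is_abbreviation and lines_match_by_abbreviation are hypothetical, and comparing lines word by word after splitting on whitespace is an assumption:

```python
import re

def abbreviation_pattern(abbr):
    """JCT -> J.*C.*T : every letter required, in order, anything in between."""
    return ".*".join(re.escape(ch) for ch in abbr)

def is_abbreviation(abbr, word):
    """True if abbr's letters appear in word in order, starting at word's first letter."""
    if not abbr or len(abbr) > len(word):
        return False
    # re.match anchors at the start of word, so JCT cannot match "ADJACENT".
    return re.match(abbreviation_pattern(abbr), word, re.IGNORECASE) is not None

def lines_match_by_abbreviation(a, b):
    """Assumed comparison: same word count, each pair equal or one abbreviating the other."""
    wa, wb = a.split(), b.split()
    if len(wa) != len(wb):
        return False
    return all(x == y or is_abbreviation(x, y) or is_abbreviation(y, x)
               for x, y in zip(wa, wb))

print(is_abbreviation("JCT", "JUNCTION"))
print(lines_match_by_abbreviation("JCT BOX JS32 H 116 A C-14.5",
                                  "JUNCTION BOX JS32 H 116 A C-14.5"))
```

This still gives false positives (JCT also passes as an abbreviation of JACKET), which is why checking the candidates against the results of the punctuation match remains useful.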

-- 
David Zmick
/dz0004455\
http://david-zmick.co.cc