Regex review

Mon Aug 30 14:37:06 UTC 2004

Hi all
I always was frustated a bit by absence of regex capabilites in Squeak 
and I hope that finally one of libraries will finaly get into the image.
So, I've read related discussions and tried to figure out what folks are 
most interested in and what I can do myself.   It seems I can play with 
libraries a little and evaluate
    1)Speed  
    2)General ease of use
    3)Feature lists
for some extent as first iteration input to the list.
I'm not Smalltalk or Regex expert, but anyway - little better than nothing.
I've loaded both libs (they collided in String a bit, but stayed usable) 
and started writing methods for simple tasks in parallel.
System:
    Celeron 1.7MHz 384DRAM running Unix VM 3.6-3, Squeak 3.7 #5976 on 
FreeBSD 4.10,
Input file:
    August-2004 squeak-dev mail archive size ~1M.

Both suits are extensivly documented. At first glance it seems that RX 
(Vassily's version) is more friendly, RE (A.Greenberg's one) - less, 
maybe as it faced to external library. Both suits seem to be 'feature 
complete' and common in regexp sense -  RX claims that it passes 
H.Spencer test suit and RE is based on standard PCRE (and as such has 
some search flags, importance of which i didn't investigate too hard). 
Converting between two should't be difficult in most cases.
I'll present and comment essential blocks of code I've come up solving 
simple tasks (#millisecondsToRun: #result):
1) counting  words in file (pattern: '\w+')
    RX: self testInFileDo: [:fstream |  matcher matchesOnStream: 
fstream  do: [:s | num := num + 1]].   #(23296 277863)
    RE: self testInFileLinesDo: [:str |     num := num +  (matcher 
collectFrom: str) size].                                   #(11166 277909)
        Well, first impressions.
    RX has very nice enumerating protocol with stream support, I solved 
task in a minute or two.
    RE support of streams is limited (line-by-line grep) and  I didn't 
even bother with it. Iterating through patterns in RE is much less 
powerfull - RE can only collect subgroups. I've spent quite a bit time 
to find the shortest  way to count words. It turned out that with 
choosen file my #testFileLinesDo: gives only one line due to inproper 
line endings (the same with Re>>grepFrom:to:onMatch:onNonMatch: btw). 
Thus the whole thing is slow. Ok, further I'll switch to custom version:

    RE: self testInFileLinesLFDo: [:str |     num := num +  (matcher 
collectFrom: str) size].   Walkback. Failed on empty line - it returned 
nil instead collection.

    Sorry, and now:

    RE: self testInFileLinesLFDo: [:str |     (matcher collectFrom: str) 
ifNotNilDo: [:col | num := num + col size]].    #(5715 277909)

New version is four times faster than native Smalltalk implementation 
(not so much IMHO). There is some difference in result, it seems that 
one should tweak that mysterious search engine flags (and may be I was 
wrong stating that code between libs is easy to port).
2) Let's take slightly meaningful example - collecting email-like 
strings from file (pattern:  '\w+@(\w+\.)+\w+'):

    RX: self testInFileDo: [:fstream |     matcher matchesOnStream: 
fstream  do: [:s | mails add: s]].                      #(98460 aSet(483))
    RE: self testInFileLinesLFDo: [:str |     (matcher collectFrom: str) 
ifNotNilDo: [:col | mails addAll: col]].          #(2029 aSet(483))

    RE violently outperforms RX on such task (50 times). It tends that 
RE wins in  processing itself and looses a bit during massive exchanges 
from/in primitives (just guess). Result set is mainly garbage - if one 
needs to collect addresses (to offer something invaluable to all 
dwellers) correct pattern looks like '\w+ at (\w+\.)+\w+' -  RX: 
#(113588 aSet(143)) RE: #(1961 aSet(143))

3) Lets change this allready well recognized pattern to something else
      say   'squeaker at squeak.org'    -> 'squeaker dog squeak point 
org' ('dog' is Russian alias for @). Sorry for a bit crazy engineering 
style:

    RX:
    | matcher |
    matcher  :=   RxMatcher forString: '(\w+)( at )(\w+\.)+(\w+)'.
    self testInOutFilesDo: [:in :out |
             matcher copyStream: in to: out
                translatingMatchesUsing: [:weWillUseMatcherCacheInstead |
                    String streamContents: [:stream|
                        stream
                            nextPutAll: (matcher subexpression: 2);
                            nextPutAll: ' dog ';
                            nextPutAll: ((matcher subexpression: 4) 
copyReplaceAll: '.' with: ' point ');
                            nextPutAll: (matcher subexpression: 5)   
 ]]].                 

    Result:      157837ms md5:c33f76ca533ccaba3c486add18331b9e

    RE:
    | matcher str|
    matcher  :=   Re on: '(\w+)( at )(\w+\.)+(\w+)'.
    self testInOutFilesDo: [:in :out |
            [in atEnd] whileFalse:
                [ str := in upTo: Character lf.
                   out nextPutAll: (matcher search: str andReplace: [:m |
                    String streamContents: [:stream|
                        stream
                            nextPutAll: (m matchAt: 1);
                            nextPutAll: ' dog ';
                            nextPutAll: ((m matchAt: 3) copyReplaceAll: 
'.' with: ' point ');
                            nextPutAll: (m matchAt: 4)    ]]).
                    out nextPut: Character lf.]].

    Result: 3056ms md5:c33f76ca533ccaba3c486add18331b9e

*************************************************************************************************
Conclusion.
I like Vassily's library a lot - it is very, very nice and pleasant in 
use (comments everywhere, carefully designed interface). But it is slow  
and maybe it can slow down Squeak if heavily used (although I don't know 
how effective string processing in Squeak core is now). It is unsuitable 
for extensive processing and good for episodical use (validate user 
input for instance).
A.Greenberg's library of course is not bad at all - it simply follows 
external library specifics. For string manipulation it appears to be as 
handy as Vassily's.
And it is very, very fast.  

        Danil