Regex review

Sat Aug 28 15:32:42 UTC 2004

Thanks for your comparison this is really instructive.

Stef

On 30 août 04, at 16:37, danil a. osipchuk wrote:

> Hi all
> I always was frustated a bit by absence of regex capabilites in Squeak  
> and I hope that finally one of libraries will finaly get into the  
> image.
> So, I've read related discussions and tried to figure out what folks  
> are most interested in and what I can do myself.   It seems I can play  
> with libraries a little and evaluate
>    1)Speed     2)General ease of use
>    3)Feature lists
> for some extent as first iteration input to the list.
> I'm not Smalltalk or Regex expert, but anyway - little better than  
> nothing.
> I've loaded both libs (they collided in String a bit, but stayed  
> usable) and started writing methods for simple tasks in parallel.
> System:
>    Celeron 1.7MHz 384DRAM running Unix VM 3.6-3, Squeak 3.7 #5976 on  
> FreeBSD 4.10,
> Input file:
>    August-2004 squeak-dev mail archive size ~1M.
>
> Both suits are extensivly documented. At first glance it seems that RX  
> (Vassily's version) is more friendly, RE (A.Greenberg's one) - less,  
> maybe as it faced to external library. Both suits seem to be 'feature  
> complete' and common in regexp sense -  RX claims that it passes  
> H.Spencer test suit and RE is based on standard PCRE (and as such has  
> some search flags, importance of which i didn't investigate too hard).  
> Converting between two should't be difficult in most cases.
> I'll present and comment essential blocks of code I've come up solving  
> simple tasks (#millisecondsToRun: #result):
> 1) counting  words in file (pattern: '\w+')
>    RX: self testInFileDo: [:fstream |  matcher matchesOnStream:  
> fstream  do: [:s | num := num + 1]].   #(23296 277863)
>    RE: self testInFileLinesDo: [:str |     num := num +  (matcher  
> collectFrom: str) size].                                   #(11166  
> 277909)
>        Well, first impressions.
>    RX has very nice enumerating protocol with stream support, I solved  
> task in a minute or two.
>    RE support of streams is limited (line-by-line grep) and  I didn't  
> even bother with it. Iterating through patterns in RE is much less  
> powerfull - RE can only collect subgroups. I've spent quite a bit time  
> to find the shortest  way to count words. It turned out that with  
> choosen file my #testFileLinesDo: gives only one line due to inproper  
> line endings (the same with Re>>grepFrom:to:onMatch:onNonMatch: btw).  
> Thus the whole thing is slow. Ok, further I'll switch to custom  
> version:
>       RE: self testInFileLinesLFDo: [:str |     num := num +  (matcher  
> collectFrom: str) size].   Walkback. Failed on empty line - it  
> returned nil instead collection.
>       Sorry, and now:
>           RE: self testInFileLinesLFDo: [:str |     (matcher  
> collectFrom: str) ifNotNilDo: [:col | num := num + col size]].     
> #(5715 277909)
>        New version is four times faster than native Smalltalk  
> implementation (not so much IMHO). There is some difference in result,  
> it seems that one should tweak that mysterious search engine flags  
> (and may be I was wrong stating that code between libs is easy to  
> port).
> 2) Let's take slightly meaningful example - collecting email-like  
> strings from file (pattern:  '\w+@(\w+\.)+\w+'):
>
>    RX: self testInFileDo: [:fstream |     matcher matchesOnStream:  
> fstream  do: [:s | mails add: s]].                      #(98460  
> aSet(483))
>    RE: self testInFileLinesLFDo: [:str |     (matcher collectFrom:  
> str) ifNotNilDo: [:col | mails addAll: col]].          #(2029  
> aSet(483))
>           RE violently outperforms RX on such task (50 times). It  
> tends that RE wins in  processing itself and looses a bit during  
> massive exchanges from/in primitives (just guess). Result set is  
> mainly garbage - if one needs to collect addresses (to offer something  
> invaluable to all dwellers) correct pattern looks like '\w+ at  
> (\w+\.)+\w+' -  RX: #(113588 aSet(143)) RE: #(1961 aSet(143))
>
> 3) Lets change this allready well recognized pattern to something else
>      say   'squeaker at squeak.org'    -> 'squeaker dog squeak point  
> org' ('dog' is Russian alias for @). Sorry for a bit crazy engineering  
> style:
>
>    RX:
>    | matcher |
>    matcher  :=   RxMatcher forString: '(\w+)( at )(\w+\.)+(\w+)'.
>    self testInOutFilesDo: [:in :out |
>             matcher copyStream: in to: out
>                translatingMatchesUsing: [:weWillUseMatcherCacheInstead  
> |
>                    String streamContents: [:stream|
>                        stream
>                            nextPutAll: (matcher subexpression: 2);
>                            nextPutAll: ' dog ';
>                            nextPutAll: ((matcher subexpression: 4)  
> copyReplaceAll: '.' with: ' point ');
>                            nextPutAll: (matcher subexpression: 5)    
> ]]].                                       Result:      157837ms  
> md5:c33f76ca533ccaba3c486add18331b9e
>
>    RE:
>    | matcher str|
>    matcher  :=   Re on: '(\w+)( at )(\w+\.)+(\w+)'.
>    self testInOutFilesDo: [:in :out |
>            [in atEnd] whileFalse:
>                [ str := in upTo: Character lf.
>                   out nextPutAll: (matcher search: str andReplace: [:m  
> |
>                    String streamContents: [:stream|
>                        stream
>                            nextPutAll: (m matchAt: 1);
>                            nextPutAll: ' dog ';
>                            nextPutAll: ((m matchAt: 3) copyReplaceAll:  
> '.' with: ' point ');
>                            nextPutAll: (m matchAt: 4)    ]]).
>                    out nextPut: Character lf.]].
>       Result: 3056ms md5:c33f76ca533ccaba3c486add18331b9e
>
> *********************************************************************** 
> **************************
> Conclusion.
> I like Vassily's library a lot - it is very, very nice and pleasant in  
> use (comments everywhere, carefully designed interface). But it is  
> slow  and maybe it can slow down Squeak if heavily used (although I  
> don't know how effective string processing in Squeak core is now). It  
> is unsuitable for extensive processing and good for episodical use  
> (validate user input for instance).
> A.Greenberg's library of course is not bad at all - it simply follows  
> external library specifics. For string manipulation it appears to be  
> as handy as Vassily's.
> And it is very, very fast.            Danil
>