Regex review

Sun Aug 29 10:20:25 UTC 2004

On 30/08/04 11:37, "danil a. osipchuk" <danil at tsnet.ru> wrote:

> Hi all
> I always was frustated a bit by absence of regex capabilites in Squeak
> and I hope that finally one of libraries will finaly get into the image.
> So, I've read related discussions and tried to figure out what folks are
> most interested in and what I can do myself.   It seems I can play with
> libraries a little and evaluate
>   1)Speed  
>   2)General ease of use
>   3)Feature lists
> for some extent as first iteration input to the list.
> I'm not Smalltalk or Regex expert, but anyway - little better than nothing.
> I've loaded both libs (they collided in String a bit, but stayed usable)
> and started writing methods for simple tasks in parallel.
> System:
>   Celeron 1.7MHz 384DRAM running Unix VM 3.6-3, Squeak 3.7 #5976 on
> FreeBSD 4.10,
> Input file:
>   August-2004 squeak-dev mail archive size ~1M.
> 
> Both suits are extensivly documented. At first glance it seems that RX
> (Vassily's version) is more friendly, RE (A.Greenberg's one) - less,
> maybe as it faced to external library. Both suits seem to be 'feature
> complete' and common in regexp sense -  RX claims that it passes
> H.Spencer test suit and RE is based on standard PCRE (and as such has
> some search flags, importance of which i didn't investigate too hard).
> Converting between two should't be difficult in most cases.
> I'll present and comment essential blocks of code I've come up solving
> simple tasks (#millisecondsToRun: #result):
> 1) counting  words in file (pattern: '\w+')
>   RX: self testInFileDo: [:fstream |  matcher matchesOnStream:
> fstream  do: [:s | num := num + 1]].   #(23296 277863)
>   RE: self testInFileLinesDo: [:str |     num := num +  (matcher
> collectFrom: str) size].                                   #(11166 277909)
>       Well, first impressions.
>   RX has very nice enumerating protocol with stream support, I solved
> task in a minute or two.
>   RE support of streams is limited (line-by-line grep) and  I didn't
> even bother with it. Iterating through patterns in RE is much less
> powerfull - RE can only collect subgroups. I've spent quite a bit time
> to find the shortest  way to count words. It turned out that with
> choosen file my #testFileLinesDo: gives only one line due to inproper
> line endings (the same with Re>>grepFrom:to:onMatch:onNonMatch: btw).
> Thus the whole thing is slow. Ok, further I'll switch to custom version:
>   
>   RE: self testInFileLinesLFDo: [:str |     num := num +  (matcher
> collectFrom: str) size].   Walkback. Failed on empty line - it returned
> nil instead collection.
>   
>   Sorry, and now:
>       
>   RE: self testInFileLinesLFDo: [:str |     (matcher collectFrom: str)
> ifNotNilDo: [:col | num := num + col size]].    #(5715 277909)
>       
> New version is four times faster than native Smalltalk implementation
> (not so much IMHO). There is some difference in result, it seems that
> one should tweak that mysterious search engine flags (and may be I was
> wrong stating that code between libs is easy to port).
> 2) Let's take slightly meaningful example - collecting email-like
> strings from file (pattern:  '\w+@(\w+\.)+\w+'):
> 
>   RX: self testInFileDo: [:fstream |     matcher matchesOnStream:
> fstream  do: [:s | mails add: s]].                      #(98460 aSet(483))
>   RE: self testInFileLinesLFDo: [:str |     (matcher collectFrom: str)
> ifNotNilDo: [:col | mails addAll: col]].          #(2029 aSet(483))
>       
>   RE violently outperforms RX on such task (50 times). It tends that
> RE wins in  processing itself and looses a bit during massive exchanges
> from/in primitives (just guess). Result set is mainly garbage - if one
> needs to collect addresses (to offer something invaluable to all
> dwellers) correct pattern looks like '\w+ at (\w+\.)+\w+' -  RX:
> #(113588 aSet(143)) RE: #(1961 aSet(143))
> 
> 3) Lets change this allready well recognized pattern to something else
>     say   'squeaker at squeak.org'    -> 'squeaker dog squeak point
> org' ('dog' is Russian alias for @). Sorry for a bit crazy engineering
> style:
> 
>   RX:
>   | matcher |
>   matcher  :=   RxMatcher forString: '(\w+)( at )(\w+\.)+(\w+)'.
>   self testInOutFilesDo: [:in :out |
>            matcher copyStream: in to: out
>               translatingMatchesUsing: [:weWillUseMatcherCacheInstead |
>                   String streamContents: [:stream|
>                       stream
>                           nextPutAll: (matcher subexpression: 2);
>                           nextPutAll: ' dog ';
>                           nextPutAll: ((matcher subexpression: 4)
> copyReplaceAll: '.' with: ' point ');
>                           nextPutAll: (matcher subexpression: 5)
> ]]].             
>                  
>   Result:      157837ms md5:c33f76ca533ccaba3c486add18331b9e
> 
>   RE:
>   | matcher str|
>   matcher  :=   Re on: '(\w+)( at )(\w+\.)+(\w+)'.
>   self testInOutFilesDo: [:in :out |
>           [in atEnd] whileFalse:
>               [ str := in upTo: Character lf.
>                  out nextPutAll: (matcher search: str andReplace: [:m |
>                   String streamContents: [:stream|
>                       stream
>                           nextPutAll: (m matchAt: 1);
>                           nextPutAll: ' dog ';
>                           nextPutAll: ((m matchAt: 3) copyReplaceAll:
> '.' with: ' point ');
>                           nextPutAll: (m matchAt: 4)    ]]).
>                   out nextPut: Character lf.]].
>   
>   Result: 3056ms md5:c33f76ca533ccaba3c486add18331b9e
> 
> ******************************************************************************
> *******************
> Conclusion.
> I like Vassily's library a lot - it is very, very nice and pleasant in
> use (comments everywhere, carefully designed interface). But it is slow
> and maybe it can slow down Squeak if heavily used (although I don't know
> how effective string processing in Squeak core is now). It is unsuitable
> for extensive processing and good for episodical use (validate user
> input for instance).
> A.Greenberg's library of course is not bad at all - it simply follows
> external library specifics. For string manipulation it appears to be as
> handy as Vassily's.
> And it is very, very fast.
>   
>       Danil
> 
Danil:

I raise a BRAVO for you.
So , as I said before , now we have data for opinions about two regex.

And maybe the next champion could combine this two useful pieces in a gold
medal winner.

Edgar