Regex review
Lic. Edgar J. De Cleene
edgardec2001 at yahoo.com.ar
Sun Aug 29 10:20:25 UTC 2004
On 30/08/04 11:37, "danil a. osipchuk" <danil at tsnet.ru> wrote:
> Hi all
> I always was frustated a bit by absence of regex capabilites in Squeak
> and I hope that finally one of libraries will finaly get into the image.
> So, I've read related discussions and tried to figure out what folks are
> most interested in and what I can do myself. It seems I can play with
> libraries a little and evaluate
> 1)Speed
> 2)General ease of use
> 3)Feature lists
> for some extent as first iteration input to the list.
> I'm not Smalltalk or Regex expert, but anyway - little better than nothing.
> I've loaded both libs (they collided in String a bit, but stayed usable)
> and started writing methods for simple tasks in parallel.
> System:
> Celeron 1.7MHz 384DRAM running Unix VM 3.6-3, Squeak 3.7 #5976 on
> FreeBSD 4.10,
> Input file:
> August-2004 squeak-dev mail archive size ~1M.
>
> Both suits are extensivly documented. At first glance it seems that RX
> (Vassily's version) is more friendly, RE (A.Greenberg's one) - less,
> maybe as it faced to external library. Both suits seem to be 'feature
> complete' and common in regexp sense - RX claims that it passes
> H.Spencer test suit and RE is based on standard PCRE (and as such has
> some search flags, importance of which i didn't investigate too hard).
> Converting between two should't be difficult in most cases.
> I'll present and comment essential blocks of code I've come up solving
> simple tasks (#millisecondsToRun: #result):
> 1) counting words in file (pattern: '\w+')
> RX: self testInFileDo: [:fstream | matcher matchesOnStream:
> fstream do: [:s | num := num + 1]]. #(23296 277863)
> RE: self testInFileLinesDo: [:str | num := num + (matcher
> collectFrom: str) size]. #(11166 277909)
> Well, first impressions.
> RX has very nice enumerating protocol with stream support, I solved
> task in a minute or two.
> RE support of streams is limited (line-by-line grep) and I didn't
> even bother with it. Iterating through patterns in RE is much less
> powerfull - RE can only collect subgroups. I've spent quite a bit time
> to find the shortest way to count words. It turned out that with
> choosen file my #testFileLinesDo: gives only one line due to inproper
> line endings (the same with Re>>grepFrom:to:onMatch:onNonMatch: btw).
> Thus the whole thing is slow. Ok, further I'll switch to custom version:
>
> RE: self testInFileLinesLFDo: [:str | num := num + (matcher
> collectFrom: str) size]. Walkback. Failed on empty line - it returned
> nil instead collection.
>
> Sorry, and now:
>
> RE: self testInFileLinesLFDo: [:str | (matcher collectFrom: str)
> ifNotNilDo: [:col | num := num + col size]]. #(5715 277909)
>
> New version is four times faster than native Smalltalk implementation
> (not so much IMHO). There is some difference in result, it seems that
> one should tweak that mysterious search engine flags (and may be I was
> wrong stating that code between libs is easy to port).
> 2) Let's take slightly meaningful example - collecting email-like
> strings from file (pattern: '\w+@(\w+\.)+\w+'):
>
> RX: self testInFileDo: [:fstream | matcher matchesOnStream:
> fstream do: [:s | mails add: s]]. #(98460 aSet(483))
> RE: self testInFileLinesLFDo: [:str | (matcher collectFrom: str)
> ifNotNilDo: [:col | mails addAll: col]]. #(2029 aSet(483))
>
> RE violently outperforms RX on such task (50 times). It tends that
> RE wins in processing itself and looses a bit during massive exchanges
> from/in primitives (just guess). Result set is mainly garbage - if one
> needs to collect addresses (to offer something invaluable to all
> dwellers) correct pattern looks like '\w+ at (\w+\.)+\w+' - RX:
> #(113588 aSet(143)) RE: #(1961 aSet(143))
>
> 3) Lets change this allready well recognized pattern to something else
> say 'squeaker at squeak.org' -> 'squeaker dog squeak point
> org' ('dog' is Russian alias for @). Sorry for a bit crazy engineering
> style:
>
> RX:
> | matcher |
> matcher := RxMatcher forString: '(\w+)( at )(\w+\.)+(\w+)'.
> self testInOutFilesDo: [:in :out |
> matcher copyStream: in to: out
> translatingMatchesUsing: [:weWillUseMatcherCacheInstead |
> String streamContents: [:stream|
> stream
> nextPutAll: (matcher subexpression: 2);
> nextPutAll: ' dog ';
> nextPutAll: ((matcher subexpression: 4)
> copyReplaceAll: '.' with: ' point ');
> nextPutAll: (matcher subexpression: 5)
> ]]].
>
> Result: 157837ms md5:c33f76ca533ccaba3c486add18331b9e
>
> RE:
> | matcher str|
> matcher := Re on: '(\w+)( at )(\w+\.)+(\w+)'.
> self testInOutFilesDo: [:in :out |
> [in atEnd] whileFalse:
> [ str := in upTo: Character lf.
> out nextPutAll: (matcher search: str andReplace: [:m |
> String streamContents: [:stream|
> stream
> nextPutAll: (m matchAt: 1);
> nextPutAll: ' dog ';
> nextPutAll: ((m matchAt: 3) copyReplaceAll:
> '.' with: ' point ');
> nextPutAll: (m matchAt: 4) ]]).
> out nextPut: Character lf.]].
>
> Result: 3056ms md5:c33f76ca533ccaba3c486add18331b9e
>
> ******************************************************************************
> *******************
> Conclusion.
> I like Vassily's library a lot - it is very, very nice and pleasant in
> use (comments everywhere, carefully designed interface). But it is slow
> and maybe it can slow down Squeak if heavily used (although I don't know
> how effective string processing in Squeak core is now). It is unsuitable
> for extensive processing and good for episodical use (validate user
> input for instance).
> A.Greenberg's library of course is not bad at all - it simply follows
> external library specifics. For string manipulation it appears to be as
> handy as Vassily's.
> And it is very, very fast.
>
> Danil
>
Danil:
I raise a BRAVO for you.
So , as I said before , now we have data for opinions about two regex.
And maybe the next champion could combine this two useful pieces in a gold
medal winner.
Edgar
More information about the Squeak-dev
mailing list
|