Regex review
stéphane ducasse
ducasse at iam.unibe.ch
Sat Aug 28 15:32:42 UTC 2004
Thanks for your comparison this is really instructive.
Stef
On 30 août 04, at 16:37, danil a. osipchuk wrote:
> Hi all
> I always was frustated a bit by absence of regex capabilites in Squeak
> and I hope that finally one of libraries will finaly get into the
> image.
> So, I've read related discussions and tried to figure out what folks
> are most interested in and what I can do myself. It seems I can play
> with libraries a little and evaluate
> 1)Speed 2)General ease of use
> 3)Feature lists
> for some extent as first iteration input to the list.
> I'm not Smalltalk or Regex expert, but anyway - little better than
> nothing.
> I've loaded both libs (they collided in String a bit, but stayed
> usable) and started writing methods for simple tasks in parallel.
> System:
> Celeron 1.7MHz 384DRAM running Unix VM 3.6-3, Squeak 3.7 #5976 on
> FreeBSD 4.10,
> Input file:
> August-2004 squeak-dev mail archive size ~1M.
>
> Both suits are extensivly documented. At first glance it seems that RX
> (Vassily's version) is more friendly, RE (A.Greenberg's one) - less,
> maybe as it faced to external library. Both suits seem to be 'feature
> complete' and common in regexp sense - RX claims that it passes
> H.Spencer test suit and RE is based on standard PCRE (and as such has
> some search flags, importance of which i didn't investigate too hard).
> Converting between two should't be difficult in most cases.
> I'll present and comment essential blocks of code I've come up solving
> simple tasks (#millisecondsToRun: #result):
> 1) counting words in file (pattern: '\w+')
> RX: self testInFileDo: [:fstream | matcher matchesOnStream:
> fstream do: [:s | num := num + 1]]. #(23296 277863)
> RE: self testInFileLinesDo: [:str | num := num + (matcher
> collectFrom: str) size]. #(11166
> 277909)
> Well, first impressions.
> RX has very nice enumerating protocol with stream support, I solved
> task in a minute or two.
> RE support of streams is limited (line-by-line grep) and I didn't
> even bother with it. Iterating through patterns in RE is much less
> powerfull - RE can only collect subgroups. I've spent quite a bit time
> to find the shortest way to count words. It turned out that with
> choosen file my #testFileLinesDo: gives only one line due to inproper
> line endings (the same with Re>>grepFrom:to:onMatch:onNonMatch: btw).
> Thus the whole thing is slow. Ok, further I'll switch to custom
> version:
> RE: self testInFileLinesLFDo: [:str | num := num + (matcher
> collectFrom: str) size]. Walkback. Failed on empty line - it
> returned nil instead collection.
> Sorry, and now:
> RE: self testInFileLinesLFDo: [:str | (matcher
> collectFrom: str) ifNotNilDo: [:col | num := num + col size]].
> #(5715 277909)
> New version is four times faster than native Smalltalk
> implementation (not so much IMHO). There is some difference in result,
> it seems that one should tweak that mysterious search engine flags
> (and may be I was wrong stating that code between libs is easy to
> port).
> 2) Let's take slightly meaningful example - collecting email-like
> strings from file (pattern: '\w+@(\w+\.)+\w+'):
>
> RX: self testInFileDo: [:fstream | matcher matchesOnStream:
> fstream do: [:s | mails add: s]]. #(98460
> aSet(483))
> RE: self testInFileLinesLFDo: [:str | (matcher collectFrom:
> str) ifNotNilDo: [:col | mails addAll: col]]. #(2029
> aSet(483))
> RE violently outperforms RX on such task (50 times). It
> tends that RE wins in processing itself and looses a bit during
> massive exchanges from/in primitives (just guess). Result set is
> mainly garbage - if one needs to collect addresses (to offer something
> invaluable to all dwellers) correct pattern looks like '\w+ at
> (\w+\.)+\w+' - RX: #(113588 aSet(143)) RE: #(1961 aSet(143))
>
> 3) Lets change this allready well recognized pattern to something else
> say 'squeaker at squeak.org' -> 'squeaker dog squeak point
> org' ('dog' is Russian alias for @). Sorry for a bit crazy engineering
> style:
>
> RX:
> | matcher |
> matcher := RxMatcher forString: '(\w+)( at )(\w+\.)+(\w+)'.
> self testInOutFilesDo: [:in :out |
> matcher copyStream: in to: out
> translatingMatchesUsing: [:weWillUseMatcherCacheInstead
> |
> String streamContents: [:stream|
> stream
> nextPutAll: (matcher subexpression: 2);
> nextPutAll: ' dog ';
> nextPutAll: ((matcher subexpression: 4)
> copyReplaceAll: '.' with: ' point ');
> nextPutAll: (matcher subexpression: 5)
> ]]]. Result: 157837ms
> md5:c33f76ca533ccaba3c486add18331b9e
>
> RE:
> | matcher str|
> matcher := Re on: '(\w+)( at )(\w+\.)+(\w+)'.
> self testInOutFilesDo: [:in :out |
> [in atEnd] whileFalse:
> [ str := in upTo: Character lf.
> out nextPutAll: (matcher search: str andReplace: [:m
> |
> String streamContents: [:stream|
> stream
> nextPutAll: (m matchAt: 1);
> nextPutAll: ' dog ';
> nextPutAll: ((m matchAt: 3) copyReplaceAll:
> '.' with: ' point ');
> nextPutAll: (m matchAt: 4) ]]).
> out nextPut: Character lf.]].
> Result: 3056ms md5:c33f76ca533ccaba3c486add18331b9e
>
> ***********************************************************************
> **************************
> Conclusion.
> I like Vassily's library a lot - it is very, very nice and pleasant in
> use (comments everywhere, carefully designed interface). But it is
> slow and maybe it can slow down Squeak if heavily used (although I
> don't know how effective string processing in Squeak core is now). It
> is unsuitable for extensive processing and good for episodical use
> (validate user input for instance).
> A.Greenberg's library of course is not bad at all - it simply follows
> external library specifics. For string manipulation it appears to be
> as handy as Vassily's.
> And it is very, very fast. Danil
>
More information about the Squeak-dev
mailing list
|