Regex review
danil a. osipchuk
danil at tsnet.ru
Mon Aug 30 14:37:06 UTC 2004
Hi all
I always was frustated a bit by absence of regex capabilites in Squeak
and I hope that finally one of libraries will finaly get into the image.
So, I've read related discussions and tried to figure out what folks are
most interested in and what I can do myself. It seems I can play with
libraries a little and evaluate
1)Speed
2)General ease of use
3)Feature lists
for some extent as first iteration input to the list.
I'm not Smalltalk or Regex expert, but anyway - little better than nothing.
I've loaded both libs (they collided in String a bit, but stayed usable)
and started writing methods for simple tasks in parallel.
System:
Celeron 1.7MHz 384DRAM running Unix VM 3.6-3, Squeak 3.7 #5976 on
FreeBSD 4.10,
Input file:
August-2004 squeak-dev mail archive size ~1M.
Both suits are extensivly documented. At first glance it seems that RX
(Vassily's version) is more friendly, RE (A.Greenberg's one) - less,
maybe as it faced to external library. Both suits seem to be 'feature
complete' and common in regexp sense - RX claims that it passes
H.Spencer test suit and RE is based on standard PCRE (and as such has
some search flags, importance of which i didn't investigate too hard).
Converting between two should't be difficult in most cases.
I'll present and comment essential blocks of code I've come up solving
simple tasks (#millisecondsToRun: #result):
1) counting words in file (pattern: '\w+')
RX: self testInFileDo: [:fstream | matcher matchesOnStream:
fstream do: [:s | num := num + 1]]. #(23296 277863)
RE: self testInFileLinesDo: [:str | num := num + (matcher
collectFrom: str) size]. #(11166 277909)
Well, first impressions.
RX has very nice enumerating protocol with stream support, I solved
task in a minute or two.
RE support of streams is limited (line-by-line grep) and I didn't
even bother with it. Iterating through patterns in RE is much less
powerfull - RE can only collect subgroups. I've spent quite a bit time
to find the shortest way to count words. It turned out that with
choosen file my #testFileLinesDo: gives only one line due to inproper
line endings (the same with Re>>grepFrom:to:onMatch:onNonMatch: btw).
Thus the whole thing is slow. Ok, further I'll switch to custom version:
RE: self testInFileLinesLFDo: [:str | num := num + (matcher
collectFrom: str) size]. Walkback. Failed on empty line - it returned
nil instead collection.
Sorry, and now:
RE: self testInFileLinesLFDo: [:str | (matcher collectFrom: str)
ifNotNilDo: [:col | num := num + col size]]. #(5715 277909)
New version is four times faster than native Smalltalk implementation
(not so much IMHO). There is some difference in result, it seems that
one should tweak that mysterious search engine flags (and may be I was
wrong stating that code between libs is easy to port).
2) Let's take slightly meaningful example - collecting email-like
strings from file (pattern: '\w+@(\w+\.)+\w+'):
RX: self testInFileDo: [:fstream | matcher matchesOnStream:
fstream do: [:s | mails add: s]]. #(98460 aSet(483))
RE: self testInFileLinesLFDo: [:str | (matcher collectFrom: str)
ifNotNilDo: [:col | mails addAll: col]]. #(2029 aSet(483))
RE violently outperforms RX on such task (50 times). It tends that
RE wins in processing itself and looses a bit during massive exchanges
from/in primitives (just guess). Result set is mainly garbage - if one
needs to collect addresses (to offer something invaluable to all
dwellers) correct pattern looks like '\w+ at (\w+\.)+\w+' - RX:
#(113588 aSet(143)) RE: #(1961 aSet(143))
3) Lets change this allready well recognized pattern to something else
say 'squeaker at squeak.org' -> 'squeaker dog squeak point
org' ('dog' is Russian alias for @). Sorry for a bit crazy engineering
style:
RX:
| matcher |
matcher := RxMatcher forString: '(\w+)( at )(\w+\.)+(\w+)'.
self testInOutFilesDo: [:in :out |
matcher copyStream: in to: out
translatingMatchesUsing: [:weWillUseMatcherCacheInstead |
String streamContents: [:stream|
stream
nextPutAll: (matcher subexpression: 2);
nextPutAll: ' dog ';
nextPutAll: ((matcher subexpression: 4)
copyReplaceAll: '.' with: ' point ');
nextPutAll: (matcher subexpression: 5)
]]].
Result: 157837ms md5:c33f76ca533ccaba3c486add18331b9e
RE:
| matcher str|
matcher := Re on: '(\w+)( at )(\w+\.)+(\w+)'.
self testInOutFilesDo: [:in :out |
[in atEnd] whileFalse:
[ str := in upTo: Character lf.
out nextPutAll: (matcher search: str andReplace: [:m |
String streamContents: [:stream|
stream
nextPutAll: (m matchAt: 1);
nextPutAll: ' dog ';
nextPutAll: ((m matchAt: 3) copyReplaceAll:
'.' with: ' point ');
nextPutAll: (m matchAt: 4) ]]).
out nextPut: Character lf.]].
Result: 3056ms md5:c33f76ca533ccaba3c486add18331b9e
*************************************************************************************************
Conclusion.
I like Vassily's library a lot - it is very, very nice and pleasant in
use (comments everywhere, carefully designed interface). But it is slow
and maybe it can slow down Squeak if heavily used (although I don't know
how effective string processing in Squeak core is now). It is unsuitable
for extensive processing and good for episodical use (validate user
input for instance).
A.Greenberg's library of course is not bad at all - it simply follows
external library specifics. For string manipulation it appears to be as
handy as Vassily's.
And it is very, very fast.
Danil
More information about the Squeak-dev
mailing list
|